How does QueryPerformanceCounter measure time?

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

in the previous video I introduced the rdtsc instruction this instruction has been with us since the Pentium and as its name implies read timestamp counter it's something that gets us the value of this thing called a timestamp counter which is something that measures how many cycles have elapsed now unfortunately for us when they move to the invariant TSC era it meant that this no longer meant literal core cycles for a particular core we were running on and the reason for that is because they decided that rdtsc should return a consistent value across cores for whatever reason that's what they decided so when cores boost up to higher frequencies or when they Throttle Down to lower frequencies to conserve power we no longer have a TSC value that measures how many cycles of particular core has executed this creates problems for benchmarking and performance analysis which is what we're mostly concerned about in this class and of course what we'll be learning as we go forward is that there's other ways we can interrogate the chip to find out more accurate measures of things like Cycles but before we do that I would like to underscore some important parts about Rd TSC and that's what I'd like to do today first off Rd TSC is still very useful and that's why I taught it in the previous video it's still very useful because it's available basically everywhere you can always count on rdtsc being something that a processor supports if it's an x86 processor at all that's because it's been there since the Pentium and it's been working the entire time so while there are all these subtleties about it you have to learn and the behavior of the timestamp counter may change depending on the platform you still know you can call it so it's one of the few high frequency timing things you can actually do that you know will always be there everything else we're going to learn requires drivers or special tooling or maybe only is there on certain processors there's always a little catch there's a caveat for every other thing we're going to learn that might be better than RD TSC for a particular purpose but the other big reason to learn it is because Rd TSC now actually forms the high Precision timer for most operating system things this is something that's very poorly understood because a lot of information on the internet was written before that was true there was a thing called an hpet which was a thing that was used for operating system time beforehand and when you called things like query performance counter which again we saw in the previous video you are actually asking about some other timer that had nothing to do with rdtsc for example but now in the invariant TSC era it's important to understand that a lot of people who think they're calling something that isn't rdtsc are actually calling rdtsc so what I'd like to do in this video is show you what that means and to use the Assembly Language knowledge that you learned in part one of this course to show how you don't have to take anyone's word for it to find out what is actually the time basis in a particular operating system call on your platform let's take a look at what happens if we just inspect a call to query performance counter because I don't have the source code to Microsoft Windows what I have to do is use a debugger which can show me the disassembly of the windows functions that I call now not all windows functions in their entirety will be able to be disassembled because some of them will involve syscalls and if we wanted to debug actual sys calls which are things that transition down to ring 0 in the operating system we would need a more complicated setup for debugging but because query performance counter does not do a syscall we can use just regular user level debugging to inspect the disassembly of that call what I'm going to use is listing 71 the same thing we used last week for looking at query performance counter and I'm just going to go ahead and step into the read OS timer call that we wrote which just calls query performance counter now instead of stepping over this call like we normally would what I'm going to do is switch to the disassembly view here in my debugger and I'm just going to go ahead and step to the call instruction which of course we already know what that does we learned that in part one of the class so now I'm going to step into this call and what we can see is that I end up in the windows function query performance counter here in the actual disassembly now we don't know where we are other than that you can see in the call stack we're just in ntdll that's because I don't have any symbols installed for this part of the operating system so the debugger doesn't actually know where we are but that's okay because we don't really care we just want to look at what the call is doing in Assembly Language now if we look we can see that the beginning of the function does basic things we would expect like setting up the stack there just like we looked at it part one and then it does some testing none of which actually causes any branches now you know what all this stuff is Right test Jay Z test Jay Z right is doing jumpus zeros loading some things and doing some jump if zeros all stuff we already know now because none of these jumps seem to get taken we don't really have to care that much about what they are for our purposes if we were doing really hardcore reverse engine nearing we might want to spend some time looking at what those jumps actually do you know maybe we're investigating the security of this function or other things like that we'd have to look at them but we just want to know where the time is coming from and what we can see here is right away we get to a very simple sequence that pretty much tells us where the time is coming from our dtscp followed by shl followed by or that actually gives us a very good picture already of where query performance counter is getting its time from so there you can see without having to ask any questions or posts on stock overflow or do any other kind of guesswork you can see for yourself just by stepping into the call exactly what query performance counter is really measuring it's just measuring the timestamp counter it's issuing an RD TSC P instruction which is exactly the same as an rdtsc instruction with two slight caveats and let's explain what those are now first of all exactly like rdtsc it's going to put into eax the low part of the TSC and then into edx it's going to put the high part of the timestamp counter this is exactly the same as the rdtsc instruction that we looked at last time however the P part actually refers to another thing it's going to do into the ECX register it's going to put something called the PID or processor ID processor core ID might be a better way to think about it and what it does is it says which of the cores in the CPU was actually running this particular rdtscp meaning what core was the thread assigned to when it actually executed this instruction now this actually has its own instruction there is an instruction called RDP ID you can use just to get that particular piece of information but because it's the kind of thing that you might want to know when you're writing something like a profiler right you want to know which core is running what well why not get the timestamp counter and the processor ID in one convenient instruction and that is rdtscp which again has not existed for as long as rdtsc so it's not always available but it is available in most processors certainly any processor is released in the recent past now there's one other thing that rdtscp does that we won't really talk about much here but that's that it has some different characteristics with respect to the out of order execution of the processor that's something we'll cover a little later in the course so I'll just briefly mention that it exists here as a thing but once we get to the out of order window it will make more sense so we'll explain it more there for now we only really care about this part because it's the only part that the query performance counter timing code actually needs in order for us to understand it so what happened right after the Rd TSC P well it did the following instruction sequence if you recall it doesn't shl of RDX with Ox 20 20x well what's 20x it's 32 right so 1 2 4 8 in one hex digit right 16 32 2 in the second hex digit so this is a shift left by 32 of the 64-bit register that it put the high value in right now this makes perfect sense because the high value was put in the low 32 bits so we're shifting it up to put it in the high slot what does it do after that or RDX and Rax fuse together the two parts Rax is the 64-bit register that the low part was put into this is the low 32 bits in eax Rex is a 64 with one it does a 64-bit Fusion of those by doing an or this combines them together right so what are we doing there in this particular one we have a register we've got 32 bits in the bottom that's the high 32 bits of the value we want we're shifting it up to put it here then we've got another one which is going to have the low 32 bits in here already we're ordering these two together to get something that has the high and the low in one register we're just putting the original 64-bit TSC value back together again because it came into halves so we can see exactly exactly where query performance counter is getting its time Source from it's just the timestamp counter so when people are wondering should I call query performance counter or should I call rdtsc well the answer depends on what you're expecting to happen if you are trying to have the operating system just pick a good timing source for you because you don't really know maybe you've made an executable that's supposed to run on random end user machines and you just don't know you just want it to give you back some reliable time Source well query performance counter makes perfect sense it will use something like an hpet on older systems and it will use rdtscp on newer systems that's what you wanted if on the other hand what you're trying to do is profile things there's no point in calling query performance counter right because all it is is a call to rdtsc which you could have done yourself that as we'll see in a second throws away Precision to drop the resolution of the timer down to 10 megahertz now let's look at how that part happens we already know it does this because if you remember last week we did query performance frequency and it consistently returned on multiple machines 10 megahertz now we know that rdtsc's frequency is going to be up in things like three four five billion right three four five gigahertz because it runs CPU frequencies so how does it go from calling rdtscp which we know will be at a frequency in gigahertz how does it drop that down to 10 megahertz which it must if that's what it's actually going to start returning let's take a look at the rest of the assembly after reading the timestamp counter and reassembling the 64-bit value what you can see is query performance counter now loads Rax and rcx with two constants that we don't really know where they're coming from there are just some values that it's going to use and then it issues a mull RDX now the mole instruction as we know is an unsigned integer multiply and we haven't looked at it that much in the class but it's going to do the multiply with another implicit register that's not listed which is Rax so it's going to multiply the 64-bit timestamp counter we reconstructed here in RDX buy some magical constant that was loaded into Rax it's then going to load something into eax which it doesn't really look like this matters much to us because we then do an ad of RDX to rcx which is the other constant we loaded then we do some stuff with the x that doesn't really affect us because we don't do this jump we move another constant into CL and then we add R11 another constant we loaded into RDX and then we shift right RDX by by CL so in terms of what we actually did to RDX we multiplied it by a constant we added a constant to it we added another constant to it and then we shifted it by a constant so it's basically mole add shift that's what we're doing here now if we look at the end of the function that's it we never touch RDX again and in fact we write RDX out right here into that parameter that was passed in so really RDX holds the value we're going to hand back the entire time we construct the timestamp counter in it we do a mull against it we add to it twice and then we shift it and that's the thing we actually write back so what does that do so this sequence of operations is much less easy to understand we learned all of the instructions in part one of the course nothing is a surprise here like I said almost all x64 code you will read is something you could have understood even if you only knew x86 Assembly Language they're that similar so none of the instructions are weird but it's kind of an unusual sequence now if we think about what we would expect to see right like what would we be doing if we ourselves were supposed to write some code that was going to transform something from a 4.2 gigahertz timer to a 10 megahertz timer what would we want to write well we would assume that we would take our clock value right and we would multiply it by the ratio between the clock value we want and the clock value we're getting so we're basically saying all right we want a 10 megahertz clock and let's go ahead and divide that by what we're actually getting which is a 2.4 gigahertz clock yeah which is just equivalent to you know 10 over 4200 right that's that's really all we're doing here 1 over 420 might be the way to say it Elon Musk would be proud so effectively what we see here is this equation is what we actually would expect but what we're seeing instead is it loading effectively three constants actually Four constants but two of them are used for an ad so they could have actually been fused together one shift constant and one multiply constant effectively what we see is mole add shift and the shift is right so it's a divide effectively right it's it's reducing the number so we're loading a constant to multiply by and this multiply is a is a mull RDX right we are adding to that same value that RDX and we're shifting that RDX now we actually do two separate ads but that's neither here nor there because two separate ads could have been fused into one doesn't really do anything with the value in the interim so this is effectively the sequence we're executing now we haven't really studied the mall instruction very much right I talked about the fact that we would be seeing it and we sort of understand it but we haven't really studied it in detail so it's worth talking about one more time exactly what this is going to do a mall RDX now RDX that's the 64-bit register and mole on Intel Hardware is actually defined to do the full width multiply what this means is that this will multiply by the implicit other register which is Rax this will do RDX times Rex and it will produce the full 128-bit result the high result will go into RDX and the low result will go back into Rax it it overwrites both of those things with the full result so we're doing 64 times 64 and producing the full 128. so what we're seeing here is we're multiplying 64 by 64 keeping the high adding something to the high and then Shifting the high right we shift it by CL which we loaded what is this doing how could that possibly be Computing this we know it's Computing this because we know it's giving us back a 10 megahertz timer and we know the source is some gigahertz timer 4.2 something like that has to plug the book I plug all the time it's one of John blow's favorite books too I learned about this book from him and it is fantastic I use it all the time it's like the only programming book that's actually next to my desk that I read all the time and that is Hacker's Delight because this course is about performance aware programming I don't cover in detail all the little tricks you might use to shave an instruction off of an Assembly Language sequence that's a cool thing and you know we probably will have some stuff where I talk about things like that but overall our goal here is to learn about performance and how it works not to become expert instruction level optimizers but for those of you who want those skills hackers Delight has just a wealth of information and this little trick right here is an entire section of hackers Delight what is that trick the trick is doing integer divides by doing integer multiplies followed by a shift how does this work well read Hacker's Delight for the full version because since it is discrete math meaning these are not continuous numbers they are literally integer numbers there's a lot of subtlety to it and you need the full mole add shift in order to actually get precise results if that's what you wanted but I can show you the basis of the technique in continuous numbers and it's very easy to understand so let me explain basically how that works if I wanted to do a multiply that produced something like this right we know that in continuous math where we can just do floating Point uh values or you know if we have infinite Precision any of that stuff then we can just compute this ratio 10 you know we can do 1 over 420 compute that as a floating Point number and just do the multiply so it's easy to see how we would turn a divide into a multiply it's so simple we don't have to divide by 420 we multiply by one over for 20. problem solved well if you think about how this integer multiply works we're getting back the high 64 bits what are the high 64 bits of 128-bit number it's the 128 bit number divided by 2 to the 64th right we removed the whole bottom 64 bits of the number so this instruction implicitly multiplies by Rax and then divides by 2 to the 64. right so we're already doing a divide we're doing a divide by 2 to the 64. so all we really need to do is say well we want this ratio to be the same as the ratio we actually wanted right we want this value we want to solve for what that would be in order to produce the ratio we actually wanted so what would that value be well just do you know your basic math homework here right and cross multiply right if we load Rex with a value derived from this particular equation then when we do the mole High we will actually get the answer that is dividing by this 420. we can divide with a multiply because there's already a divisor built in and we just have to produce the numerator that accounts for when divided it becomes that ratio now again it really is that simple conceptually but in order to get precise results in discrete math if you actually wanted them to be precise which in this case I don't really think you would have needed it to be but that's separate story in order to get precise results you have to do a little bit more work you need to be able to vary this denominator so instead of 2 to the 64 you want to have everything between 2 to 64 and 2 to the 127 basically you want to be able to use an additional shift so after divide by 2 to 64 you divide by more that's what this shift right does so this CL here is if you needed to shift it more in order to get an accurate result the ad again is because for unsigned multipliers you have to have an adjustment in there sometimes again not going to go into this stuff because it's just a simple trick to turn divs into moles when you need this and that is often something people were trying to do especially when divides were even more expensive related to 125 than they are now but you know in general they're always moles are always cheaper so this you see this pattern fairly frequently so if you want those details go read hackers Delight it's got entire sections on this for signed and unsigned uh division by constants it's got all that it also talks about a whole bunch of related things with how to do divides faster divides what you know the remainder is zero for example which is very common when you're compiling the C language for example et cetera et cetera so that's actually what's going on here and just to prove it to you we can actually go in when we're stepping through the assembly we can see what value they loaded into Rax and if we want to we could verify that it actually is that ratio we can verify that Rax over the total shift we're going to have to look at the CL remember as well but we can look at what the CL value is and if it's non-zero we have to count for it look at what the Rax value is and reproduce the correct ratio and C does it equal one well does it equal 10 megahertz over our actual TSC frequency we expect which would be like the base frequency of the processor roughly or something like that so let's take a look is that what actually happens are we right that this is the technique they're using and that's why those instructions show up inquiry performance counter so first things first if I go ahead and run system info and I look to see what the actual speed of the processor I'm running on is to give us a sense of what we'd expect the time stamp counter to be we get that 4.2 gigahertz number which is what I would expect and furthermore if I run listing 73 from last time we can verify what the timestamp counter frequency appears to be again 4.2 gigahertz so we're pretty sure that the timestamp counter on this machine should be running at 4.2 gigahertz which means if we do that ratio of 10 megahertz versus 4.2 gigahertz we see a value of 0.00238 blah blah blah blah blah and so what we want to see now is does the value that we load into Rax for that query performance counter assembly divided by 2 to the 64 plus whatever the CL shift is does that give us the same ratio if it does we're at least pretty sure that what Curry performance counter is doing is exactly what we you think it's doing if it's something else then maybe we got it wrong I mean we are doing reverse engineering here so we're really just guessing as to what the code does based on our own kind of pattern matching ability right if I go ahead and repeat the process of running this program and stepping back into the Assembly Language for query performance counter now as we go we can watch all of these things happen and see what the values actually are first of all just for curiosity's sake if you'd like to see the reassembly of the 64-bit number we can look at the hexadecimal versions of RDX and Rex which is where the timestamp counter values got placed and you can watch it get reconstructed here is the high 32 bits of the value and watch it's going to shift it up to put them in the high 32 bits of the register there you go now it's going to take the low 32 bits of the timestamp counter and or those in see so just like we thought it's reconstructing the timestamp counter as a 64-bit value now it's going to load the multiply constant and the add con constant and we can look at what both of those are here is the multiply constant and here is the add constant we can see it's not actually using an add constant here but that's the multiply constant now we're going to have to copy this value because it's going to overwrite it right after it does the multiply so we have to copy this value out and put it into our calculator so we can use it in a second now let's double check with the other constants are at the end we are going to do an add with R11 what's that value and we're going to do a shift with cl what's that value so as we can see we don't have any additional shift so it's just 2 to the 64 is our denominator and then R11 to be honest don't have any idea what this value is it looks like some kind of offset value fortunately for us the ads don't really matter to us because if we want to see if we're right about the timestamp counter if we take two query performance counter measurements we're just going to subtract them right we do like 2 over 1 second and we'd subtract the two to see what the frequency was well when we subtract the two any constant that we are adding will just appear right it doesn't actually do anything because we're just subtracting it from itself so we really only care about the Rax value and the CL value in terms of our understanding of the function the add offset well we could try to reverse engineer what that value actually is for like why is it trying to offset these numbers but for our purposes just to see what the time base actually is it doesn't matter to us because it's just a fixed offset so back in our calculator we know that what we're comparing here is that ratio that we think it should be versus the value we saw in Rax divided by 2 to the 64. no additional shift because CL was Zero when we asked the calculator for that ratio lo and behold we get almost the exact same number you have to go out to the seventh significant digit before they actually vary and we would expect them to vary because again we're reproducing some exact ratio with just a fixed integer over 2 to 64. so there's a limit to how exact it can actually represent the original ratio but that's it that pretty much proves definitively that this this is what they're doing but of course we don't know everything about the function like we don't know how it's setting that large offset or why it's doing it because it wouldn't matter for most uses of great performance counter and also we don't know what all those things were that it was testing right it's obviously got a preamble where it's doing a bunch of checks to see something but we don't really know what those are either so there's plenty more reverse engineering you could do and I've never looked to see if someone on the internet perhaps has done all that reverse engineering work but it's an interesting thing to ask like what's all the other stuff it's doing and why is it doing it but for our purposes we simply don't care we got the answer we wanted and now we're fairly confident we know what query performance counter is doing to produce its time values so there you have it now you know exactly what career performance counter does under the hood and you can understand what all the things that people have said about it or what was true at times or not true at times why all of that it's important to sort of say okay take everything with a grain of salt let's understand what this actually does so we can make informed decisions about it and what you can see here is that well if what what you want is to just call the operating system and get a reliable timer that's relatively low Precision like 10 megahertz in this case then query performance counter might be a good bet for example you may not want to use something like rdtsc when you just ship a game or ship an application who's going to measure wall clock time because you don't know if sometime in the future Intel might change how rdtsc works on one of their processors or AMD will change how rdtsc works on their processors and now you're not insulated from that you will have to ship a patch to account for that If instead you just call query performance counter you can count on hopefully fingers crossed Microsoft to update the version of query performance counter that they use in the operating system on those chips to account for this so what you're doing when you call query performance counter instead of rdtsc is you're asking for that layer of insulation and hoping that Microsoft provides it for you now you know and you can make that decision on the other hand if what you're doing is collecting performance measurements there's no reason to ever call query performance counter it's completely useless because all it's doing is doing the exact same rdtsc call you could have done directly in your code and then stripping it of all its Precision by doing an unnecessary multiply and a ton of extra work so it's basically taking more time to issue the same instruction and then getting rid of all the Precision that instruction offers so you'd be crazy to call query performance counter when you are taking performance measurements on your own machine because it's just a strictly worse version than the rdtsc instruction is when used directly now if you're taking the course my performance will wear a programming course then your homework is very straightforward this week I want you to do the same thing that I just did on your machine you need to get practice stepping through Assembly Language because we're going to do that more and more you've learned it now in part one in part two we're going to be looking at Assembly Language fairly often to understand why the things that we see happening in performance are happening so this is a great opportunity to do that you've seen how I did it now you do the same thing does your operating system on your processor do the same thing that create performance counter is doing on mine is it just a thinly wrapped rdtsc instruction that throws away Precision or rdtscp instruction I should say or are you running on an older chip that does something else or maybe completely different Hardware entirely you should be able to make that determination now if you see something very much like what I saw you can also take the next step and say can you verify the Rax equation to verify that you're actually going to get back your time step counter frequency and remember we found that time step counter frequency last week by using that correlation to query performance frequency if you're not taking the course and you like this level of understanding you like figuring out what's really going on and not just having to take some random post on the internet's word for it then perhaps you would like to take the course and if you would go ahead to computerenhance.com and check it out there's a link in the description below to the table of contents and you can see everything we've done in the course so far you can go through it at your own rate and of course you will receive in your inbox all of the new stuff when it comes out so please check that out if you're interested for now that's it for rdtsc and query performance counter next week we'll be looking at some more of the subtleties here and we'll be actually going into what we're going to start on for our next big thing which is starting to do individual measurements and figuring out how to use rdtsc and its friends to look at smaller Snippets of code which of course is harder than what we did last time when we're just timing Big Blocks hope to see you back here for that until then have fun programming

Info

Channel: Molly Rocket

Views: 10,399

Rating: undefined out of 5

Keywords: Performance-Aware Programming

Id: pZ0MF1q_LUE

Channel Id: undefined

Length: 31min 44sec (1904 seconds)

Published: Mon Jun 19 2023