REVEALED: The Quake III Secret Algorithm! Part 2

Video Statistics and Information

Captions
Hey, I'm Dave, welcome to my shop! I'm Dave Plummer, a retired operating systems engineer from Microsoft going back to the MS-DOS and Win95 days, and today we're wrapping up the story of the amazing Fast Inverse Square Root algorithm that was discovered in the Quake III Arena source code when it was released to the public back in 2005. We'll find out precisely just how fast it is later in the episode when we drag race four variations of inverse square root heads-up in a sudden-death, no-holds-barred cage match of inverse square root fury. Watch all your favorites, such as the C runtime and the SSE intrinsic, as they battle the FISR and a classically optimized integer version for ultimate square root speed supremacy.

What did we learn last episode? Well, in brief, we covered how everything in computer graphics is really based around processing mathematical vectors, and we looked at how to normalize vectors to make their length one without changing their direction. In so doing we learned that the fundamental operation we need to perform is dividing each of the X, Y, and Z components by the vector's overall magnitude (there's a quick code sketch of that right after this intro). Because in America, first you normalize the vectors, then you get the money, and then you get the power. And, well, you know the rest of that one.

The Quake III Arena FISR, or Fast Inverse Square Root, is a confounding, confusing, and complicated algorithm that solves square roots in a few quick steps with no loops or branches. In the code we can see that it twice does what is actually an undefined operation in C. It takes the raw storage data for an IEEE-754 floating-point number and converts it directly to a long value, then just starts using it. It later stuffs the resulting bits back into the floating-point memory slot as though that were all perfectly normal, although it's most certainly anything but!

We ran out of time last episode, which means we have a few items to wrap up today, and those are:
- How does the floating-point to integer and back part of the algorithm work, and why?
- Who is ultimately responsible for the algorithm and the magic constant in it, and where did it all come from?
- And finally, just how fast is the FISR algorithm in real-life testing?

We'll cover all three points in today's episode. To test the speed, I've written a benchmark that does billions of inverse square roots per second to compare four different approaches. The first is the compiler's provided square root function. Next is the intrinsic for the estimated reciprocal square root instruction that is now built into the processor's specialized floating-point unit. Third is an optimized C version of an integer square root, and finally we have the Quake III FISR itself. After all, we have the actual source code to it, so why not put it to the test?

If you're a fan of this channel, then you've no doubt seen the Fastest Computer Language Drag Racing series where we test the speed of most every computer language ever invented. If not, it's my most popular series, so do make sure you check it out next! To compare the speed of the languages, we rely on prime number benchmarks submitted by the community and operated by our tame racing programmer, The Stack. Some say he codes with just the ones and doesn't even need the zeros, and others say he's the illegitimate love child of Grace Hopper and Ray Noorda. All we know is he's called The Stack, and he does all the code racing for us. Predictably, those fancy results and charts and graphs are coming up later in the episode.
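Before we dig into the bits, here's that normalization step from the recap expressed in code. This is just my own minimal C++ sketch, not anything from the Quake source, but it shows exactly where a fast reciprocal square root pays off: one rsqrt and three multiplies instead of a square root and three divides.

    // A minimal, illustrative sketch of vector normalization (my names, not Quake's).
    // Normalizing divides each component by the vector's magnitude; written as a
    // multiply by 1/sqrt(len^2), it's one reciprocal square root and three multiplies.
    #include <cmath>

    struct Vec3 { float x, y, z; };

    Vec3 Normalize(Vec3 v)
    {
        float invLen = 1.0f / std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
        return { v.x * invLen, v.y * invLen, v.z * invLen };
    }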
In the FISR code we see a float get stuffed into an integer, operated on as if it were meaningful to do so, and then the result is stuffed back into a float. I'll tell you right up front, that doesn't work, at least not the way you think it might. To appreciate why that is, we need to understand what a floating-point number is and how it's stored and manipulated. Then we'll not only get a sense for how alien the notion of smashing the bits back and forth is, but we'll also be able to appreciate how and why they're actually doing it.

A normal 32-bit integer is stored in memory much as I suspect you think it is. Imagine the lowest bit on the far right. It has a value of 1 if it's set. Moving left, each bit position doubles in value: 2, 4, 8, 16, and so on. The topmost bit, however, is reserved as the sign bit. When it's 0, the number is positive. When it's 1, the number is negative.

Yes, you in the front row, what's up? Ah, the smart kid wants to know how it handles having both a positive and a negative zero. Are they equal when all the bits are zero even if the sign bit is different? Well, the trick is that almost all modern computers use a binary representation known as two's complement. When the sign bit is set, you obtain the real value by negating all the other bits and adding one. We don't want to drift too far out of our lane with the how and why, but it avoids having two representations for zero and it means that you can take advantage of the full range of numbers. For example, in an 8-bit value, if you just used a naïve sign bit, you'd have the range from negative 127 to positive 127, but that's only 255 values. With two's complement, the values start at negative 128 and run up to the same positive 127, so all 256 values get made use of. Better yet, addition and subtraction can be performed blindly without worrying about the sign bit, and it all "just works".

When we last left floating point, we learned that Bill Gates and Paul Allen hired a Harvard guy named Monte Davidoff to write the floating-point implementation for their Altair BASIC. It was completely up to him to pick a format and figure out how to do it, and the format he came up with is quite like what is now enshrined as the official spec in IEEE standard number 754. Rather than speculate, I directly asked Monte where he got the inspiration. The story he told me updates the commonly held history with an important new detail as well: Monte had never actually written a floating-point implementation before. As I've done so many times in my own life when an opportunity presented itself, he eagerly pledged to do something that he knew he'd have to figure out how to do later. Here he is in his own words:

"I learned about floating point from the implementation of the FOCAL language on a PDP-8. I disassembled it by hand to understand how it worked. I had not actually implemented floating point before the 8080 BASIC: my claim to Bill Gates that I had was a bit of an exaggeration. I understood it well enough that I was confident I could do it. At the time I was writing the 8080 floating point, I didn't have access to any of the PDP-8 FOCAL material."

Thus, Monte learned how to do floating point from his experience with FOCAL on the PDP-8, but he didn't refer directly to it while writing his own. He recalls his disassembly as being done sometime in 1972 or 1973, which would be a number of years prior to his work on Altair BASIC. If Monte's floating point was inspired by the FOCAL language, where did FOCAL come from?
Since it all came out of Digital Equipment, I first tried Dave Cutler himself, who pointed me at a fellow DEC alumnus named Richard Lary. Mr. Lary was also the author of a 1970s BASIC interpreter that had floating-point support. He told me that a fellow named Rick Merrill was the sole creator of FOCAL, but that the floating-point format likely didn't start with him, either. In fact, it all likely goes back to the system math routines on the PDP-8 itself, and Richard kindly sent me a copy of the PDP-8 Floating Point Manual from DEC itself, dated 1968, the year of my birth! I traced the lineage of those routines back to the PDP-5 floating-point package; they're basically an update of those.

Thus, the Quake III FISR depends on IEEE-754, which was the standardized replacement for Microsoft Binary Format, which was designed by Monte Davidoff based on having seen the FOCAL language by Rick Merrill, which in turn was likely inspired by the PDP-8 floating-point package that was an update of the PDP-5 code from the early 1960s. [It's the circle of life!] But who wrote the DEC code? Well, I've made some calls, and I'm working on it, but for now that's the end of the line. If I find out, I'll be sure to let you know in some future video, so if you're curious for more history, do make sure you're subscribed to the channel!

Now that we know the lineage of this vaunted floating-point format, what is it? How does it work? How do computers internally represent fractional numbers with decimal points in them when all they have to work with is whole binary numbers? If I were new at this and you asked me how to represent floating decimal point numbers on a processor that deals exclusively with integers, I might decide that 8 bits of precision after the decimal point is "good enough for me". So, I'd break up a 32-bit word into a 24-bit section of whole numbers, known as the mantissa, and the other 8 bits would represent the fraction after the decimal point. Remember, we're in binary, so our bits after the decimal point are fractional powers of two. In other words, after the decimal point we have digits worth 1/2, 1/4, 1/8, 1/16, and so on. You make up whatever number you need by combining those fractional bits to get as close as possible to the base-10 number that you're trying to represent. The 24-bit mantissa is then just a normal integer number. To represent the number 10.75, we place 10 in the mantissa and the bits for one half and one quarter in the fractional section, making it .75. Taken together, it's 10.75.

There are a few problems with this approach. First off, it's not a floating-point system at all; it's fixed point, locked at a fraction size of 8 bits. Second, we've lost much of the range, as with only 24 bits it now tops out around 16 million. A much better system would be to represent everything in scientific notation, with a fraction and an exponent. That way, the precision is always the same no matter where the decimal point is. We shift the decimal place and update the exponent until our mantissa is between zero and one. Our 10.75 number would become .1075 x 10^2. To represent this we would put 1075 in the mantissa and 2 in the exponent. We're going to want to deal with negative numbers as well, so we'll need a sign bit up top. The next question is how many bits we should allocate to the exponent. We'll use 8, which as we learned gives us the range negative 128 to positive 127. These are powers of 2, not 10, however.
But 2 to the 127th still works out to about a 1 followed by 38 zeros in base 10, plenty big for most 32-bit purposes. That leaves us 23 bits for the fraction. Thus, we have one sign bit, 8 exponent bits, and 23 fraction bits. It doesn't matter how big your number is; you get 23 bits of precision independent of the exponent.

There are only two nuances you really need to learn from there. The first is that we always shift the fraction left to get rid of any leading zeros, adjusting the exponent by one each time we do so, of course. The value stays the same, but in so doing we always have a one in the topmost bit of the fraction. And in one of the cleverer moments of computer science, someone realized that if it's always one, why not just make it implicit, throw it away, and gain one more bit of precision seemingly for free! I always thought that was pretty genius. Remember, it's not magic or cheating; the information about where that topmost bit lives is fully preserved in the exponent. It's just eliminating redundancy.

Now, it's not used by the FISR, but if you're curious, a double-precision float is also available at 64 bits. It carries the sign bit, 11 exponent bits, and 52 fraction bits. That's a huge amount of range and precision. With an exponent up to 2 to the 1023rd power, its range works out to about 10 to the 307th. Last I counted there were only about 10 to the 80th particles in the universe, so let's call it jumbo for sure.

Now that you have a general idea how a floating-point number is represented, I should clarify that when I was speaking about the lineage of the format earlier, I don't mean that it remained entirely identical across the ages. A few viewers pointed out that I was playing a little fast and loose with the precise details as I wrapped up the previous episode, which is true. But the basics of it, the sign bit, the 8-bit biased exponent, the 23-bit fraction, are all there going back to the PDP-8. The bias differs by one between MBF and IEEE-754, which is just one reason they aren't directly binary compatible, but they're at least cousins. Whether reinvented multiple times or by direct inspiration, many other formats, like that of the venerable IBM 360 mainframe, are also quite similar in layout.

And now that you've seen how a floating-point number is stored in a register or in memory, you can appreciate how little sense it makes for the FISR algorithm to interchange raw floating-point data with integer numbers. When the FISR first extracts the integer value of the floating-point number, remember that the exponent is up there, and any bits that were turned on in the exponent will now be high bits set in the integer value. That's the least of our problems, though. Remember that step of shifting everything left all the way? That means that any bits set in the fraction are going to be way up top, most likely. I don't think it'll ever be intuitively obvious to me what precisely happens when you smash bits back and forth between floating-point and integer formats, at least in terms of the impact it has on the value of the number, and why. Even if that's never clear, there is, however, as they say online, "one weird trick you absolutely must know". And that is that when you do that process of aliasing an IEEE-754 floating-point number as an integer, you get an important side effect: it yields something very close to the log2 of x. Once we have the log2, we can use it to help solve the square root using various identities. The resulting value is perfect at 0 and 1.
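If you'd like to see both of those things for yourself, the field layout and the accidental logarithm, here's a minimal sketch. This is my own code, not the FISR; it uses std::memcpy, which is the well-defined way to do the aliasing that the Quake code does with pointer casts:

    // My own illustrative sketch: pull an IEEE-754 single apart into its fields, then
    // show that the raw bit pattern, scaled by 2^23 and un-biased, approximates log2(x).
    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main()
    {
        float x = 10.75f;
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof bits);            // the legal way to alias the storage

        unsigned sign     = bits >> 31;                 // 1 bit
        unsigned exponent = (bits >> 23) & 0xFF;        // 8 bits, stored with a bias of 127
        unsigned fraction = bits & 0x7FFFFF;            // 23 bits, implicit leading 1 not stored

        double approxLog2 = bits / 8388608.0 - 127.0;   // bits / 2^23 - bias, close to log2(x)
        printf("sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
               sign, exponent, (int)exponent - 127, fraction);
        printf("approx log2 = %f, actual log2 = %f\n", approxLog2, std::log2(x));
        return 0;
    }

For 10.75 the approximation comes out around 3.34 against a true log2 of about 3.43, and the error never gets much worse than about a tenth anywhere in the range.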
For everything in between, the result is simply scaled and offset. But given that's the case, we can fix it by removing that offset and scale; we just need to know by how much. It turns out that once you've done that floating-point to integer aliasing, you can fix up the scale and offset by subtracting half of your result from a starting point. And that starting point is the magic constant 0x5F3759DF. Here's a graph of the results, and as you can see, it's impressively accurate for solving everything without a branch or a loop!

As to how that magic constant is derived, I'll show you a bit of the math. Now, if you look at the top right-hand corner, you'll see the most important number on the page, which is 9. That's the page number, which means this is the ninth page of this stuff. I'll include a link to this paper in the video description, but here's the thing: after all that, they come up with a different number, and no one actually seems to know how the FISR magic constant was discovered originally. People have since tried every possible 32-bit constant exhaustively, and the original turns out to be remarkably close to the best. But who came up with it originally?

At the time, with it being made famous by the release of the Quake III Arena code, most people assumed that John Carmack was responsible. When someone finally thought to actually press him on it, though, he demurred that in fact it was not him, but perhaps it could have been Terje Mathisen? My Scandinavian ancestors are rolling in their graves, I'm sure, at how wildly off my pronunciation is, so I'll stick with Mr. Mathisen from here on out! Mr. Mathisen is one of the very few people who might challenge Michael Abrash when it comes to hand-optimizing x86 assembly. If his name isn't quite a household word, you might know him as one of the folks who did the public analysis of the Pentium FDIV bug! He chimed in on the comments section of the first episode, so if you've got math questions about the FISR that are way above my pay grade, perhaps he'll spot them if you ask nicely in the comments here!

It turned out that Mr. Mathisen had indeed written something similar for a computational fluid chemistry problem. In his case, however, he started with a table lookup followed by the Newtonian approach we discussed in the first part. According to a thread at the Beyond3D website, the trail leads from there to NVIDIA, and then to Gary Tarolli, one of the original founders of 3dfx. When he was shown the FISR, he called it "a pretty great piece of code", but as to why the magic constant works, even he could not recall. It wasn't originally his. He was familiar with it and had done an implementation of it for the SGI Indigo that may have influenced the final form of the FISR in Quake, but he was not prepared to take credit as its creator.

From there the trail went largely cold until an article on Slashdot brought the original author forward on his own. His name is Greg Walsh, and his resume goes back to include things like the first WYSIWYG word processor at Xerox PARC while he was at Stanford. At the time, Greg was working with Cleve Moler, perhaps best known to us as the author of MATLAB and the founder and chief scientist at MathWorks. Greg says he took his initial seed of inspiration from Cleve. To date, no one has put a firm stake in the ground as to precisely how, but as best we can determine today, Greg Walsh was the author, and the magic constant came out of a collaboration between him and Cleve Moler.
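And since we've spent the better part of two episodes talking around it, here is the routine itself as it appears in the released Quake III Arena source, lightly reformatted and with my own comments standing in for the more colorful originals:

    float Q_rsqrt(float number)
    {
        long i;
        float x2, y;
        const float threehalfs = 1.5F;

        x2 = number * 0.5F;
        y  = number;
        i  = *(long *)&y;                       // alias the float's bits as an integer (undefined in C, but it shipped)
        i  = 0x5f3759df - (i >> 1);             // the magic constant minus half the result: the log2 trick
        y  = *(float *)&i;                      // put the adjusted bits back into a float
        y  = y * (threehalfs - (x2 * y * y));   // one Newton-Raphson iteration to refine the estimate
    //  y  = y * (threehalfs - (x2 * y * y));   // second iteration, left commented out in the original
        return y;
    }

That commented-out second iteration is the line of Newtonian refinement I'll mention again in a moment when we look at accuracy.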
Lest you think it was perhaps handed down to Greg by divine inspiration, or perhaps aliens, rest assured that even the FISR algorithm isn't perfect. There's room for some small improvement in the constants used by the Newtonian iteration.

The last question we want to ask ourselves is, with all the glorious ability of modern CPUs and GPUs, is the FISR algorithm even still relevant? That question is best answered by its performance, which is why I developed a C++ benchmark to test four variations of inverse square root, including the FISR. We'll race them heads-up for a billion iterations and see which is the fastest. I've kept the benchmark as simple as I could. First, it calls each of the four methods to calculate the square root of one million and hopes it comes out at one thousand. And sure enough, each one passes the test, more or less: the FISR actually returns 998, not 1000. That's well under one percent error, but if it bothers you, we can uncomment that second line of Newtonian refinement in the FISR code. Its speed is little changed, but the accuracy improves significantly, such that it confidently returns 1000. Once each method is shown to work properly, we call it repeatedly as fast as possible for five seconds and count how many million passes we can make per second.

There are four functions here, and each measures a different approach. The first, CountDivisions, just uses the sqrt() function from the compiler. The next, CountIDivisions, calls the integer square root implementation. The third, CountFIDivisions, uses the FISR. The final one, CountIntrDivisions, uses the specialized AMD/Intel SIMD/FPU instruction intended specifically for this purpose: it returns an estimate of the reciprocal square root. If you're ready, I'll hand the keys to the Stack here and he'll take each one for a test drive and report back with the results.

I don't think there are too many surprises here, except that I was impressed both by the FISR itself and especially by the new rsqrt instruction. The integer square root is the slowest by far, coming in at 111 million operations per second. The processor's floating-point square root instruction comes in next and is a full order of magnitude faster. We call it by simply writing 1/sqrt of x, and at 1.4 billion ops per second it lags behind the FISR's 2.8 billion by almost half. The FISR, then, is at least twice as fast as the fastest modern FPU.

But wait, there's more to the story. Today's Intel chips have the Streaming SIMD Extensions, or SSE. These were added by Intel around the year 2000 to enhance graphics capabilities and as a response to AMD's 3DNow! architecture. SSE includes an instruction specifically to estimate the reciprocal square root of a number, just as we'd like. It renders the older methods largely obsolete; as you can see, it turns in nearly 4 billion operations per second, and that's with us wasting resources by only using 32 bits of the register's 128-bit width. What we don't know, however, is whether Intel used the FISR approach, or something similar to it, in the microcode for the reciprocal square root instruction. Perhaps it lives on in silicon, or perhaps the FISR is now entirely a historical curiosity. Only the chip engineers know for sure. In the 1980s and 90s when it was conceived, however, it was revolutionary in its speed, coming in almost 30 times faster than our optimized integer square root. The FISR, then, is something like an Amiga computer: way ahead of its time, but long, long ago.
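If you'd like to experiment along at home, here's a rough sketch of how two of the contenders and a simple timing harness fit together. To be clear, this is my own minimal illustration and not the actual benchmark; the function names are mine, and the Q_rsqrt routine above and an integer square root would slot into the same harness:

    // A minimal sketch of the shoot-out (my names, not the benchmark's): the C runtime's
    // 1/sqrt versus the SSE reciprocal square root estimate, timed over a fixed loop.
    #include <chrono>
    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <xmmintrin.h>   // SSE intrinsics: _mm_rsqrt_ss and friends

    static float rsqrt_crt(float x) { return 1.0f / std::sqrt(x); }

    static float rsqrt_sse(float x)
    {
        // The estimated reciprocal square root instruction, using one lane of the 128-bit register.
        return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    }

    template <typename F>
    static double opsPerSecond(F f, uint64_t iterations)
    {
        volatile float sink = 0.0f;                     // keep the optimizer from deleting the loop
        auto start = std::chrono::steady_clock::now();
        for (uint64_t i = 1; i <= iterations; ++i)
            sink = f(static_cast<float>(i));
        std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
        (void)sink;
        return iterations / elapsed.count();
    }

    int main()
    {
        const uint64_t n = 100'000'000;
        printf("1/sqrt : %.0f ops/sec\n", opsPerSecond(rsqrt_crt, n));
        printf("SSE    : %.0f ops/sec\n", opsPerSecond(rsqrt_sse, n));
        return 0;
    }

One practical note: the volatile sink is there so the optimizer can't throw the whole loop away, which would otherwise make every contender look infinitely fast.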
I hope you enjoyed this random tale of old code as much as I enjoyed bringing it to you. If you did find it entertaining, educational, or even just enjoyable, please consider subscribing to my channel so that you get more like it! And if you'd consider leaving me a like on the video, I'd sure appreciate that as well! If you have feedback on this video or topic suggestions for a future episode, please let me know in the comments! This video came about directly as a suggestion from a subscriber named Brandon, and I'm always eager to know what it is you'd like to know more about. Thanks for joining me out here in the shop today! In the meantime, and in between time, I hope to see you next time, right here in Dave's Garage.
Info
Channel: Dave's Garage
Views: 138,993
Keywords: fast inverse square root, quake 3 arena, quake iii, algorithm, quake 3, story, john carmack, mike abrash, c++, programming, id software, gaming, quake, history, math, mathematics, fun, Michael abrash, floating point, how floating point works, how does floating point work
Id: Fm0vygzVXeE
Length: 20min 12sec (1212 seconds)
Published: Sat Aug 14 2021