Hey, I'm Dave, welcome to my shop! I'm Dave Plummer, retired operating systems engineer from Microsoft going back to the MS-DOS and Win95 days, and today we're wrapping up the story of the amazing Fast Inverse Square Root algorithm that was discovered in the Quake III Arena source code when it was released to the public back in 2005. We'll find out precisely just how fast it is later in the episode when we heads-up drag race four variations of inverse square root in a sudden-death, no-holds-barred cage match of inverse square root fury. Watch all your favorites, such as the C Runtime and the SSE intrinsic, as they battle the FISR and a classically optimized integer version for ultimate square root speed supremacy.
What did we learn last episode? Well, in brief, we covered how everything in computer graphics is really based around processing mathematical vectors, and we looked at how to normalize vectors to make their length one without changing their direction. In so doing we learned that the fundamental operation we need to perform is dividing each of the X, Y, and Z components by the vector's overall magnitude. Because in America, first you normalize the vectors, then you get the money, and then you get the power.
And, well, you know the rest of that one.
The Quake III Arena FISR, or Fast Inverse Square Root, is a confounding, confusing, and complicated algorithm that solves square roots in a few quick steps with no loops or branches. In the code we can see that it twice does what is actually an undefined operation in C. It takes the raw storage data for an IEEE-754 floating-point number and converts it directly to a long value, then just starts using it. It later stuffs the resulting bits back into the floating-point memory slot as though that were all perfectly normal, although it's most certainly anything but!
We ran out of time last episode, which means we have a few items to wrap up today, and those are:
- How does the floating-point to integer and back part of the algorithm work, and why?
- Who is ultimately responsible for the algorithm and the magic constant in it, and where did it all come from?
- And finally, just how fast is the FISR algorithm in real-life testing?
We'll cover all three points in today's episode. To test the speed, I've written a benchmark that does billions of inverse square roots per second to compare four different approaches. The first is the compiler's provided square root function. Next is the intrinsic for the estimated reciprocal square root instruction that is now built into the processor's specialized floating-point unit. Third is an optimized C version of an integer square root, and finally we have the Quake III FISR itself. After all, we have the actual source code to it, so why not put it to the test?
If you're a fan of this channel, then you've no doubt seen the Fastest Computer Language Drag Racing series, where we test the speed of most every computer language ever invented. If not, it's my most popular series, so do make sure you check it out next! To compare the speed of the languages, we rely on prime number benchmarks submitted by the community and operated by our tame racing programmer, The Stack. Some say he codes with just the ones and doesn't even need the zeros, and others say he's the illegitimate love child of Grace Hopper and Ray Noorda. All we know is he's called The Stack, and he does all the code racing for us. Predictably, those fancy results and charts and graphs are coming up later in the episode.
In the code we see a float get stuffed into an integer, operated on as if that were meaningful to do, and then the result is stuffed back into a float. I'll tell you right up front, that doesn't work, at least not the way you think it might.
To appreciate why that is, we need to understand what a floating-point number is and how it's stored and manipulated. Then we'll not only get a sense for how alien the notion of smashing the bits back and forth is, but we'll also be able to appreciate how and why they're actually doing it.
A normal 32-bit integer is stored in memory
much as I suspect you think it is. Imagine the lowest bit on the far right. It has a value of 1 if it's set. Moving left, each bit position doubles in value: 2, 4, 8, 16, and so on. The topmost bit, however, is reserved as the sign bit. When it's 0, the number is positive. When it's 1, the number is negative. Yes, you in the front row, what's up? Ah, the smart kid wants to know how it handles having both a positive and a negative zero. Are they equal when all the bits are zero even if the sign bit is different?
Well, the trick is that almost all modern computers use a binary representation known as two's complement. When the sign bit is set, you obtain the real value by negating all the other bits and adding one. We don't want to drift too far out of our lane with the how and why, but it avoids having two representations for zero, and it means that you can take advantage of the full range of numbers. For example, in an 8-bit value, if you just used a naïve sign bit, you'd have the range from negative 127 to positive 127, but that's only 255 values. With two's complement, the values start at negative 128 and run up to the same positive 127, so all 256 values get made use of. Better yet, addition and subtraction can be performed blindly without worrying about the sign bit, and it all just works.
When we last left floating point, we learned
that Bill Gates and Paul Allen hired a Harvard guy named Monte Davidoff to write the floating-point implementation for their Altair BASIC. It was completely up to him to pick a format and figure out how to do it, and the format he came up with is quite like what is now enshrined as the official spec in IEEE standard number 754.
Rather than speculate, I directly asked Monte where he got the inspiration. The story he told me updates the commonly held history with an important new detail as well: Monte had never actually written a floating-point implementation before. As I've done so many times in my own life when an opportunity presented itself, he eagerly pledged to do something that he knew he'd have to figure out how to do later. Here he is in his own words:
I learned about floating point from the implementation of the FOCAL language on a PDP-8. I disassembled
it by hand to understand how it worked. I had not actually implemented floating point
before the 8080 BASIC: my claim to Bill Gates that I had was a bit of an exaggeration. I
understood it well enough that I was confident I could do it. At the time I was writing the
8080 floating point, I didn't have access to any of the PDP-8 FOCAL material.
Thus, Monte learned how to do floating point from his experience with FOCAL on the PDP-8, but he didn't refer directly to it while writing his own. He recalls his disassembly as being done sometime in 1972 or 1973, which would be a number of years prior to his work on Altair BASIC.
If Monte's floating point was inspired by
the FOCAL language, where did FOCAL come from? Since it all came out of Digital Equipment, I first tried Dave Cutler himself, who pointed me at a fellow DEC alumnus named Richard Lary. Mr. Lary was also the author of a 1970s BASIC interpreter that had floating-point support. He told me that a fellow named Rick Merrill was the sole creator of FOCAL, but that the floating-point format didn't likely start with him, either. In fact, it all likely goes back to the system math routines on the PDP-8 itself, and Richard kindly sent me a copy of the PDP-8 Floating Point Manual from DEC itself, dated 1968, the year of my birth! I traced the lineage of those routines back to the PDP-5 floating-point package; they're basically an update of those.
Thus, the Quake III FISR depends on IEEE-754, which was the standardized successor to Microsoft Binary Format, which was designed by Monte Davidoff based on having seen the FOCAL language by Rick Merrill, which in turn was likely inspired by the PDP-8 floating-point package, itself an update of the PDP-5 code from the early 1960s. [It's the circle of life!]
But who wrote the DEC code? Well, I've made some calls, and I'm working on it, but for now that's the end of the line. If I find out, I'll be sure to let you know in some future video, so if you're curious for more history, do make sure you're subscribed to the channel!
Now that we know the lineage of this vaunted floating-point format, what is it? How does it work? How do computers internally represent fractional numbers with decimal points in them when all they have to work with is whole binary numbers?
If I were new at this and you asked me how to represent numbers with decimal points on a processor that deals exclusively with integers, I might decide that 8 bits of precision after the decimal point is "good enough for me." So, I'd break up a 32-bit word into a 24-bit section for the whole number, known as the mantissa, and the other 8 bits would represent the fraction after the decimal point. Remember, we're in binary, so our bits after the decimal point are fractional powers of two. In other words, after the decimal point we have digits worth 1/2, 1/4, 1/8, 1/16, and so on. You make up whatever number you need by combining those fractional bits to get as close as possible to the base-10 number that you're trying to represent.
The 24-bit mantissa is then just a normal integer. To represent the number 10.75, we place 10 in the mantissa and the bits for one half and one quarter in the fractional section, making it .75. Taken together, it's 10.75. There are a few problems with this approach.
First off, it's not a floating-point system at all; it's fixed point, locked at a fraction size of 8 bits. Second, we've lost much of the range, as with only 24 bits it now tops out around 16 million. A much better system would be to represent everything in scientific notation, with a fraction and an exponent. That way, the precision is always the same no matter where the decimal point is. We shift the decimal place and update the exponent until our mantissa is between zero and one. Our 10.75 number would become .1075 x 10^2.
To represent this we would put 1075 in the mantissa and 2 in the exponent. We're going to want to deal with negative numbers as well, so we'll need a sign bit up top. The next question is how many bits we should allocate to the exponent. We'll use 8, which as we learned gives us the range negative 128 to positive 127. These are powers of 2, not 10, however. But 2 to the 127th still works out to roughly 10 to the 38th in base 10, plenty big for most 32-bit purposes.
That leaves us 23 bits for the fraction. Thus, we have one sign bit, 8 exponent bits, and 23 fraction bits. It doesn't matter how big your number is; you get 23 bits of precision independent of the exponent. There are only two nuances you really need to learn from there. The first is that we always shift the fraction left to get rid of any leading zeros, adjusting the exponent by one each time we do so, of course. The value stays the same, but in so doing we always have a one in the topmost bit of the fraction. And in one of the cleverer moments of computer science, someone realized that if it's always one, why not just make it implicit, throw it away, and gain one more bit of precision seemingly for free! I always thought that was pretty genius. Remember, it's not magic or cheating: the information about where the topmost bit lives is fully preserved in the exponent. It's just eliminating redundancy.
Now, it's not used by the FISR, but if you're curious, a double-precision float is also available at 64 bits. It carries the sign bit, 11 exponent bits, and 52 fraction bits. That's a huge amount of range and precision. With an exponent up to 2 to the 1023rd power, its range works out to about 10 to the 307th. Last I counted there were only 10 to the 80th particles in the universe, so let's call it jumbo for sure.
Now that you have a general idea how a floating-point
number is represented, I should clarify that when I was speaking about the lineage of the format earlier, I don't mean that it remained entirely identical across the ages. A few viewers pointed out that I was playing a little fast and loose with the precise details as I wrapped up the previous episode, which is true. But the basics of it, the sign bit, the 8-bit biased exponent, the 23-bit fraction: that's all there going back to the PDP-8. The bias differs by one between MBF and IEEE-754, which is just one reason they aren't directly binary compatible, but they're at least cousins. Whether reinvented multiple times or by direct inspiration, many other formats, like that of the venerable IBM 360 mainframe, are also quite similar in layout.
And now that you've seen how a floating-point number is stored in a register or in memory, you can appreciate how little sense it makes for the FISR algorithm to interchange raw floating-point data with integer numbers.
When the FISR first extracts the integer value of the floating-point number, remember that the exponent is up there, and any bits that were turned on in the exponent will now be high bits set in the integer value. That's the least of our problems, though. Remember that step of shifting everything left all the way? That means that any bits set in the fraction are going to be way up top, most likely. I don't think it'll ever be intuitively obvious to me what precisely happens when you smash bits back and forth between floating-point and integer formats, at least in terms of the impact it has on the value of the number, and why. Even if that's never clear, there is, however, as they say online, "one weird trick you absolutely must know." And that is that when you do that process of aliasing an IEEE-754 floating-point number as an integer, you get an important side effect: it yields something very close to a scaled log2 of x. Once we have the log2, we can use it to help solve the square root using various identities. The approximation is perfect at 0 and 1 and only slightly off in between, and the whole result is simply scaled and offset. Given that's the case, we can fix it by removing that offset and scale. We just need to know by how much to
do so.
It turns out that once you've done that floating-point to integer aliasing, you can fix up the scale and offset by subtracting half of your result from a starting point. And that starting point is the magical constant 0x5F3759DF. Here's a graph of the results, and as you can see, it's impressively accurate for solving everything without a branch or loop! As to how that magic constant is derived, I'll show you a bit of the math. Now, if you look at the top right-hand corner, you'll see the most important number, which is 9. That's the page number, which means this is the ninth page of this stuff. I'll include a link to this paper in the video description, but here's the thing: after all that, they come up with a different number, and no one actually seems to know how the FISR magic constant was discovered originally. People have since literally tried every possible 32-bit constant exhaustively to prove which one is best. But who did it originally?
At the time, with it being made famous by the release of the Quake III Arena code, most people assumed that John Carmack was responsible. When someone finally thought to actually press him on it, though, he demurred that in fact it was not him, but perhaps it could have been Terje Mathisen? And my Scandinavian ancestors are rolling in their graves, as I'm sure my pronunciation is wildly off, so I'll stick with Mr. Mathisen from here on out!
Mr. Mathisen is one of the very few people that might challenge Michael Abrash when it comes to hand-optimizing x86 assembly. If his name isn't quite a household word, you might know him as one of the folks that did the public analysis of the Pentium FDIV bug! He chimed in on the comments section of the first episode, so if you've got math questions about the FISR that are way above my pay grade, perhaps he'll spot it if you ask nicely in the comments here!
It turned out that Mr. Mathisen had indeed written something similar for a computational fluid chemistry problem. In his case, however, he started with a table lookup followed by the Newtonian approach we discussed in the first part.
According to a thread at the Beyond3D website, the trail leads from there to NVIDIA, to Gary Tarolli, one of the original founders of 3dfx. When he was shown the FISR, he called it "a pretty great piece of code," but as to why the magic constant works, even he could not recall. It wasn't originally his. He was familiar with it and had done an implementation of it for the SGI Indigo that may have influenced the final form of the FISR in Quake, but he was not prepared to take credit as its creator.
From there the trail went largely cold until
an article on Slashdot brought the original author forward on his own. His name is Greg Walsh, and his résumé goes back to include things like the first WYSIWYG word processor at Xerox PARC while he was at Stanford. At the time, Greg was working with Cleve Moler, perhaps best known to us as the author of MATLAB and the founder and chief scientist at MathWorks. Greg says he took his initial seed of inspiration from Cleve. To date, no one has put a firm stake in the ground as to precisely how, but as best we can determine today, Greg Walsh was the author, and the magic constant came out of a collaboration between himself and Cleve Moler.
Lest you think it was perhaps handed down to him by divine inspiration, or perhaps aliens, rest assured that even the FISR algorithm isn't perfect. There's room for some small improvement in the constants used by the Newtonian iteration.
The last question we want to ask ourselves
is, with all the glorious ability of modern CPUs and GPUs, is the FISR algorithm even still relevant? That question can only be answered by its performance. That's why I developed a C++ benchmark to test four variations of inverse square root, including the FISR. We'll race them heads-up for a billion iterations and see which is the fastest.
I've kept the benchmark as simple as I could. First, it calls each of the four methods to calculate the square root of one million and hopes it comes out at one thousand. And sure enough, of course, each one passes the test. The FISR actually returns 998, not 1000. That's well under one percent error, but if it bothers you, we can uncomment that second line of Newtonian refinement in the FISR code. Its speed is little changed, but the accuracy improves significantly, such that it confidently returns 1000.
Once each method is shown to work properly, we call it repeatedly as fast as possible for five seconds, and we count how many million passes we can make per second.
There are four functions here, and each measures a different approach. The first, CountDivisions, just uses the sqrt() function from the compiler. The next, CountIDivisions, calls the integer square root implementation. The third, CountFIDivisions, uses the FISR. The final one, CountIntrDivisions, uses the specialized AMD/Intel SIMD/FPU instruction intended specifically for this purpose: it returns an estimate of the reciprocal square root.
If you're ready, I'll hand the keys to The Stack here, and he'll take each one for a test drive and report back with the results.
I don't think there are too many surprises here, except that I was impressed both by the FISR itself and especially by the new rsqrt instruction. The integer square root is the slowest by far, coming in at 111 million operations per second.
The processor's floating-point square root instruction comes in next and is a full order of magnitude faster. We call it by simply writing 1/sqrt of x, and at 1.4 billion ops per second it trails the FISR's 2.8 billion by about half. The FISR, then, is at least twice as fast as the fastest modern FPU square root.
But wait, there's more to the story. Today, Intel chips have the Streaming SIMD Extensions, or SSE. These were added by Intel around the year 2000 to enhance graphics capabilities and as a response to AMD's 3DNow! architecture. SSE includes an instruction specifically to estimate the reciprocal square root of a number, just as we'd like. It renders the older methods largely obsolete: as you can see, it turns in nearly 4 billion operations per second, and that's with us wasting resources by only using 32 bits of the register's 128-bit width.
What we don't know, however, is whether Intel
used the FISR approach, or something similar to it, in the microcode for the reciprocal square root instruction. Perhaps it lives on in silicon, or perhaps the FISR is now entirely a historical curiosity. Only the chip engineers can know for sure.
In the 1980s and 90s when the FISR was conceived, however, it was revolutionary in its speed, coming in almost 30 times faster than our optimized integer square root.
The FISR, then, is something like an Amiga computer: way ahead of its time, but long, long ago.
I hope you enjoyed this random tale of old code as much as I enjoyed bringing it to you. If you did find it entertaining, educational, or even just enjoyable, please consider subscribing to my channel so that you get more like it! And if you'd consider leaving me a like on the video, I'd sure appreciate that as well! If you have feedback on this video or topic suggestions for a future episode, please let me know in the comments! This video came about directly as a suggestion from a subscriber named Brandon, and I'm always eager to know what it is you'd like to know more about.
Thanks for joining me out here in the shop today! In the meantime, and in between time, I hope to see you next time, right here in Dave's Garage.