The Truth about the Fast Inverse Square Root on the N64

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
whenever I upload a video about my Mario 64 optimization project I get asked about implementing the famous fast inware square root function made famous by Quake 3 this algorithm computes one divided by the square root of x only with simple operations by exploiting need properties of how the computer represents floating Point values this allows you to skip Computing a square root directly which is usually very expensive and Old Computers often had run much more expensive approximation algorithms like this one here I will explain to you why most of the time this is a bad idea but I will also show you when it is a good idea and I will even show you a little improvement over the algorithm listed on Wikipedia that works faster and is more accurate for most use cases as well as a completely new spin on this algorithm that might come in handy the problem with using the fast and West gared algorithm is not that it is a bad algorithm but rather that most of the time on the N64 we have a much better option available to us the N64 has special hardware parts that let it compute the square root in a single instruction that takes 29 Cycles this is the exact same cost that doing a divide instruction costs so if we compute 1 divided by the fast inverse square root of x instead of the square root of x we really just TCH the cost and accuracy loss of this function on top which is all around terrible here is a table comparing the cycle costs for using the build-in square root function and using the fast inward square root with Z One Newton iterations don't worry about what a Newton iteration is just yet it's just a cheap little thing we can do to increase accuracy for this reason it only ever makes sense to use the first inverse square root function if we actually want to compute one divided by the square root of x and never if we need the actual square root of x additionally there is another issue we we are not CPU bound on the N64 at all the N64 is possibly the machine with the worst memory to CPU speed ratio of all time to render at 30 FPS we need to finish a frame in 33,000 microc seconds in this scene you can see that the CPU doesn't even use half of the time while the renderer needs all the time it can get the renderer and the CPU share the same memory so any optimization we make should aim to improve memory throughput to make the rambas go vroom vroom while using the square root and divide instruction to get one divided by the square root of x takes 58 whole Cycles it also Only Takes Two instructions loading eight instructions takes 60 CPU cycles and it starts the rendering for a bit which is a whole lot worse than just running the CPU a little slower here is a comparison for computing one ided by the root of x if you look at the graph you can see that the instruction count for the fast in square root is higher regardless of how inaccurate we go this is because while the fast inw Square rot uses a bunch of quick to run instructions that just move numbers around in memory it has to use a lot more than just divide and a square root instruction there is one saving rce here and that is cach code the cache is like a memory that the CPU can use all by itself very quickly without disturbing the renderer it can hold 16 kiloby of information and it's our best friend as far as optimizations go if our code is already cached we don't have to pay any cost per instruction beyond the first caching making this negligable I've Rewritten Mario 64's graph rler engine to fit all into one block of code cache meaning if we use this somewhere in engine we can avoid this cost and make this whole column irrelevant but even then there's another issue with the fast inverse square root function the accuracy is reduced we can use this for any type of of code that values the accuracy more than the tiny performance benefit that this offers in this video I've used the fast and square root to normalize The View Matrix and while you might not be able to see the accuracy loss right away I think the accuracy loss is more noticeable than the 0.5 microc CPU gain that this algorithm offers per use all of this means it's looking very Grim for the first and with square root function ever being useful with all these factors working against us is there ever a situation where it's faster Yes actually we just need to fulfill all three of these conditions one we want the actual inverse of the square root and not the square root itself two the code we are running is already residing in instruction cache three we don't need very high accuracy this leaves only a single use case that I can think of right now normalizing vectors for unimportant graphic iCal effects which is coincidentally the exact same use case Doom had to make this algorithm as famous as it is normalizing vectors works by multiplying with one divided by the size of the vector and the size of the vector is the square root of all the elements added together meaning we can substitute 1 / the root of x with the fast inverse square root function here and our first condition is passed I managed to compress the code for the graph render into just 12 kiloby which is less than the 16 kiloby of instruction cach that we have so our second condition is passed as well and additionally it is not important that the vector is perfectly normalized for unimportant effects like Shadows so we passed all three conditions additionally we have a little Advantage here because the inverse square root function has inaccuracies we don't have to worry about dividing by zero in this spot anymore and end up saving an additional fre cycles for a total of 58 Cycles saved this is another another small Advantage for the F iners sare algorithm and it ends up making it just a tiny bit more valuable I went with the lowest accuracy version of The Fast and with square rout which saved around 80 microc per frame on the CPU which means data is fed into the GPU more efficiently where it saved around 20 microc per frame which is a much bigger Improvement than I initially expected from investigating this and the inaccuracy is at most around 3% which is the difference between these two Circles of course course the average difference is much lower and I doubt anyone would actually notice the inaccuracy in gameplay but the fast in squar algorithm as shown on Wikipedia was just a 35-year-old algorithm that wasn't fully cooked back then let's see if we can improve on it there are a few obvious accuracy improvements to this like optimizing all three constants instead of just this one here and there are a lot of scientific papers on a subject already that we could simply copy paste if you've been counting Cycles throughout this video you might have thought that I was off by one on all the new T iterations but that's actually because there's one cycle Improvement you can make over the algorithm listed on Wikipedia you can actually just bitag even harder and subtract one from the exponent instead of multiplying by 05 in a normal inverse Square algorithm this says one cycle on the N64 and I've been comparing with this algorithm the whole time throughout the video it only works if your number is bigger than this tiny number shown on screen that I'm not going to pronounce but like come on your number won't be that small I can't imagine I'm the first to think of this but I figured I'd mention it before you guys cook me in the comments but since we actually use the simplest and least accurate version we only want to optimize the version of this algorithm with no Newton iterations which somehow doesn't seem to have any research done on it so we can get to be the first to optimize it optimizing this is surprisingly very simple because all I had to do for it is ride a brute forcer and let it run on my Nintendo 64 over night since all our Vector nums are going to be in the range 0.03 to1 we can simply iterate over all floats in this range and try all the constants and then use whichever one ended up having the lowest maximum relative error running the brute forer directly on the same Hardware that this algorithm will run on also accommodates for all kinds of Hardware quirks that we might not know of after letting my N64 run overnight the best constant and found was ZX 5 f37 8171 which is not far off for Doom us at all and improved accuracy by just 1.9% one other idea that syus lock had was to not compute the first in square root but instead compute two separate approximation for the fourth inverse square root using two different magic constants picked to cancel each other's error out and then to multiply them together to get the inverse square root what a cool idea this gives us a 33% reduced error over the fast iners gr algorithm but it does cost nine Cycles more to give you a comparison the first Newton iteration cost 24 cycles and it has a 90% increased accuracy I think all three of these approaches are valid and this new idea provides a pretty nice intermediate step I'm sure there are some cases where each of these three functions is your best choice and I bet at least one programmer watching this just went yo because that 33% accuracy gain was all they needed but this concludes my research on this topic I hope you learned something and I'll see you in the next video bye-bye [Music] wow here we go
Info
Channel: Kaze Emanuar
Views: 234,241
Rating: undefined out of 5
Keywords: super mario 64, kaze emanuar, n64, nintendo, mod, download, mario, rom, emulator, retro
Id: tmb6bLbxd08
Channel Id: undefined
Length: 10min 0sec (600 seconds)
Published: Sat Nov 18 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.