Finding the BEST sine function for Nintendo 64

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
back in the day of 2D gaming things were pretty straightforward to move your character around you just add the horizontal speed to the X position and the vertical speed to their y position it was all about programming logic and not so much about the math plus you didn't have to deal with the nightmares of compilers the good old days of programming and pure machine codes every program has a dream when we shifted from 2D games to 3D games the game physics and rendering got a whole lot trickier now we have to deal with an additional Dimension which will require trigonometry to achieve let's take an example imagine you have a character in a game this character can move forward and rotate in any direction now the direction the character is facing is represented by an angle which were called a face angle when the character moves forward they're moving in a direction determined by the phase angle and this is where our friends sign and cosine come into play to move the character we change its X position by multiplying the sign of the face angle with the forward velocity but same as stainlessly we change its Z position by multiplying the cosine of the face angle with its forward velocity similarly in 3D games we have 3D animated characters and each character is made up of various bones that can move and rotate independently in most N64 games each of these bones requires to use a free sign and three cosine functions to calculate the rotation this is because each bone can rotate along three separate axes with each axis needing one sign and one cosine to be calculated and it's not just animated characters all objects have to use these functions to calculate the visual rotation so whether it's Mario turning his head or the spinning platform trigonometry is at work here's the tricky part accurately Computing the sine or the cosine of a value is not easy sine is what we call a transcendental function this is similar to how the value of pi goes on forever without ever repeating silly can write down the exact value of the sign of most numbers but don't worry while we can't get the exact value we can get as close as we want to the real result it's just that every additional digit of accuracy is going to cost exponentially more resources so because it's so tricky and expensive to calculate the exact sign or cosine of a value computers have to use an approximation algorithm instead Mari 64 was one of the first games to come across this issue the way they handled it or variation of the method became the standard for most N64 games but what if I told you there's room for improvement I bet you want to see how you are going to learn a lot of math today and you will like it before we dive into the math I have to clarify something about Mario 64. it has a limited number of angles that it can represent to help you visualize this imagine if you had only four angles to work with now what if we had eight angles and what about 16 angles you get the idea so how many angles are we actually working with on Mario 64. a whooping 65 536 and here's something you might not have noticed while playing in gameplay Mario only ever uses 4096 out of the 65 536 angles shocking right but even with this limited number it's more than enough for his players to not notice any restrictions after all the controller itself can't even handle all 65 536 possible angles mari's anger also Loops back to zero once it hits 65 536 this means that our sine function only needs to work within a fairly small range so how does Mario 64 compute its 4096 sign and cosine values the answer is it doesn't instead of calculating each value Mari 64 uses a clever shortcut and has a large table that stores every 16th sign value when it needs to calculate the sign of an angle it rounds the angle to the multiple of 16 and then looks up the corresponding value in the table thanks to our limit of 4096 angles this covers every angle that can be reached in the game perfectly however this method comes with a cost to use this table you need to store 5096 different sine values this isn't that much luckily since it only takes 4 kilobytes of memory or about 0.1 percent of the n64's total memory without the expansion pack when we're working with 65 536 angles the cosine of an angle is equal to the sign of that angle plus 0x400 what does this mean for Mario 64 well it means that they could reuse the sign table as a Cosi table by simply attaching the last 1024 cosine values to the table and that is exactly what they did this band up the total table size to 5 kilobytes you might be thinking hey that's a pretty smart move it's super fast and efficient surely it can't get any better than this well hold on to your hats because it's about to get a whole lot more exciting than this and a whole lot faster let's take a step back and analyze how expensive it is to compute the sine and cosine using this method on the N64 what relay matters isn't the complexity of the computations but the amount of the data that we transfer that's because the whole system is typically bottlenecked by its slow memory if you look up the N64 on Wikipedia you'll see it claims the console has an astonishing memory speed of 565 megabytes per second I don't know how they got that number but in the real world it's more like 75 megabytes per second my 61st compiler settings when exactly top of the line and you compile the sign and goes encode into eight instructions but with some common sense optimizations we can get this down to 5 instructions five instructions translates to only 20 bytes of memory being attached when you run this course in terms of function size that's practically A Drop in the Ocean it's as fast as you can hope to be with these calculations but while retrieving data from the table seems super fast and isolation there's a hidden cost that we need to consider every time we read from the table we trigger a data transfer to the data cache and since the N64 caches data in chunks of 16 bytes we're effectively paying the cost of a 16 byte data transfer every time there's also a small delay when starting a cash request Molly can't exactly quantify this delay in terms of data transfers it's roughly equivalent to caching an additional 60 bytes so when we add it all up we find that a total cost for each sign and cosine calculation in Mar 64 comes to about 84 bytes let's take this a step further in a typical game frame we might run this function about a hundred times for each subsequent call we don't need to re-cache the function however we'll still need to Cache the new data so our grand total for 100 sign calls and 100 is 5 kilobytes of memory and 6420 bytes of data transfer costs let's see how much we can get this down to let's take a look at one of Nintendo's later N64 games Ocarina of Time they managed to make a significant Improvement in this algorithm by shortening the table if we take a moment to consider design function we can see that the function graph can be divided into four distinct Parts with some clever manipulation each of these sections can be recreated by simply mirroring or flipping the graph around imagine taking the first quarter of the graph and mirroring it then you flip the whole graph not set and voila we've just recreated the entire sine function with only a quarter of its original graph this clever trick allows us to get away with storing only a fourth of the original values this approach does come with a bit of trade-off it adds quite a few of instructions to the sign and cosine functions in my own implementation of this I ended up with about 180 bytes of code it significantly reduces the amount of memory you need just one kilobyte that's the fifth of what we previously needed with fewer values in the mix the likelihood of hitting a previously cached value increases tremendously for simplicity's sake let's assume that every angle is equally likely after making 200 sign or cosine calls the chance of not having to recache the value is going to be significant easily over 20 remember the N64 caches memory and multiples of 16 bytes and each sign or cosine value is a four byte size number so reading a single value will cause four values to be cached I run some simulations and I found that after 100 sign and 100 cosine values we've typically only cost 180 separate data cache transfers of course in a real game scenario some of these cache values will be invalidated and some angles will be more common than others but for the sake of our discussion this approximation should suffice using the same cost analysis we did earlier in terms of data transfer we're looking at 180 times 32 plus an additional 180. this method is definitely a game changer when it comes to memory savings and it even offers a slight reduction in data transfer costs the trade-off is that the larger code means we are performing significantly more computations to derive these values overall I would consider this approach Improvement but only in scenarios where constrained by memory given that the N64 only had 4 megabytes of memory without the expansion pack I can imagine that many N64 games would prefer a solution like this over Mario 64s however I had the crazy idea to make this lookup table concept even better instead of storing a quarter of a sign table what if we only needed an eighth we can actually get away with just using an eighth of a sign table but there's a catch we also need to add an eighth of a cosine table to it imagine the N64 is trying to read a value in the table right here it will cache this entire line of numbers but only one of the four numbers is actually being read if we have the intervolved in sine and cosine table like I suggest we can guarantee that by the time we need to compute the cosine of a number it will already be in the cache ready and waiting I'm not sure if anyone else has ever used this approach but I was thrilled when I came up with this you might be wondering but why is an eighth of a side table enough I thought we could only simplify the sine function to a quarter before great observation here's the trick pull the missing eighth of the sign table from the cosine table I've run simulations on this new algorithm and the results are great it only requires an average of 92 cash transfers that's just over half as many as the previous method of over 100 chords and this algorithm uses 16 times 92 plus 180 bytes of data transfers it also needs a table of just one kilobyte put it all together and you've got an 80 Improvement in memory usage and a 50 Improvement in data transfers compared to the original Mario 64 algorithm yes it's slightly marker heavy computation wise but the trade-off is more than worth it however we are dealing with the N64 here in this context our CPU is like an underused star player but my optimized source code we can typically finish processing a frame in a speedy 16 milliseconds or less but that's not even half the time we have per frame so our CPU is just launching around idly for around half the game's runtime that is not good we want to harness all our resources to their maximum potential the challenge now is to find a solution that uses even less data than a quarter sign table without turning our beloved N64 into Smoky Paperwhite get ready for some math it's time to dive into the thrilling world of mathematical approximations the most straightforward way to avoid having the game cache so much data is to Simply calculate the sign on the Fly I mean why not it can't possibly be that expensive to get a reasonably accurate result right well the most tempting solution here might seem to be the Taylor series you might think why not just add the terms of the actual definition of design and keep adding terms until it's accurate enough to even come close to accurately calculating the sign we need to take our input to the power of 5. that is an expensive operation and even after all that we're still left with an unsightly error around pi divided by 2. so what does an error around pi divided by 2 really mean for us and what cause actors to not walk in a straight line when you hold forward which is really really bad using using the Taylor series in this way would also mean that all the sine and cosine values come out slightly too big which looks very goofy on the animation side on top of all that the sheer computational cost of the sine function starts to become a real issue at this many computations so where does that leave us what other options do we have one unique method for approximating sine and cosine was actually conceived in the 7th Century by an Indian mathematician and astronomer named bhaskara the first this formula has an intriguing characteristic if you plug in x equals pi divided by 2 this result is zero and for x equals zero you get one this matches the cosine function exact and prevents the kind of peculiar scaling anomalies we are counted with the Taylor series pascara's formula is also pretty cheap to compute once we have the cosine we can easily derive the sign as well we know that the sum of squares of sine and cosine is always equal to what this is a fundamental identity and trigonometry so if we have the cosine we can calculate the sign by finding the square root of 1 minus cos sine squared this is straightforward because the sign of the sign is easily determined you might be thinking that calculating a square root could be a costly operation but that's not the case with the N64 Computing a square root on the N64 is a relatively inexpensive it's not more expensive than the division operation and it's only around 6 times as expensive as a multiplication the maximum error of this approximation is acceptable for our purposes such as animation rendering with this function we avoid the need to load data into the cache this eliminates the associated data transfer costs while there's a computational cost to running these calculations on a CPU it's significantly cheaper than the data lighting process this function runs in only 28 assembly instructions to compute both the sine and the cosine of a number equivalent to just 112 bytes of code and since there's no more data loading required that is the entire data loading costs that we have to pay to run this function a hundred times from that angle this is a huge Improvement unfortunately buscaris formula requires us to use a division as opposed to the polynomials which can get away with only multiplication divisions are about six times as expensive as multiplication so this kind of sucks but it is still a lot better than reading from a table technically this function could be optimized down to 27 instructions but even modern compilers are not that great at optimization so if we went with that I'd have to re-implement this whole function in assembly if you're curious here's what the full bhaskara approximation function looks like implemented in C at the start we perform some bit manipulation to determine whether we need to mirror our input value explaining this is gonna go a bit beyond the scope of the video but have fun discussing in the comments as to why I have to do this next we employ an assembly macro to construct our compiler to construct a value directly in the code our compiler can be a bit inefficient and might store and load values from memory if we don't do this which as we've discussed is quite expensive after that we simply apply the mascara formula lastly we apply the sign from the earlier mirroring or flipping operation we simply get the sign from the square root of 1 minus cosine squared as I've discussed before and lastly we use another cool compiler trick to do a macro that enables us to return two values simultaneously instead of just one this marker was generously provided by easiest pie huge shout out to her I would never have figured out how construct a smarter on my own this macron actually utilizes gcc's imaginary number extension which I think is hilarious because sine and cosine are related to the imaginary numbers in the same way that our output values are related to image name numbers it's just a random coincidence that this happened to be the best way of doing this I want to underscore that returning two values with this method is not the same as returning a struct that holds two values or anything similar this macu actually places the value into Hardware registers this means we avoid the costly process of storing and retrieving anything from the stack moreover it saves a significant number of instructions I absolutely love it before we proceed from rather it's important to note that executing cash code is not without costs there's an overhead associated with having the CPU process all this information the longer the CPU is running the less efficient the memory read pipeline becomes so while this method of computing sine values is considerably more cost effective than the previous one we can't merely look at the size of the function and equate it to the cost of loading 112 bytes given that data transfers have now been reduced to negligible amounts from this point forward but primarily be focusing on the number of instructions our code contains and the number of cycles each instruction in our code takes the next approximation formula I want to talk about was shared to me by easiest pie so a big shout out to her for this as well how does this formula work out remarkably well we can execute this formula using just 20 free instructions I was initially very confused as to where this formula came from but then it hit me this is the only third order odd polynomial that fulfills the conditions F from pi divided by 2 equals 1 and the derivative of F and pi divided by 2 equals zero this was actually just polynomial spline interpolation from high school and it's the best approach so far this ensures that we hit one at 90 degrees and we had zero at zero and moreover it gives the curve that nice pleasant sine Fade Out at pi divided by 2 without ever exceeding one avoiding odd scaling issues as the universe would just slap and have it the coefficients for these polynomials happen to be very close to those of the Taylor series this means that our approximation will be accurate around zero even though we've designed our values to be act around pi divided by 2. the actual code for this approximation is actually quite similar to that of the vascara approximation the main difference lies in the formula we plug into the center of it unfortunately this function is only a third as precise as poss but it removes five instructions from the functions and more importantly it eliminates the need for an expensive division but this level of accuracyly is perfectly adequate for graphical rendering and might not be sufficient for gameplay mechanics there's a risk that the direction you're gonna be holding is not gonna be the same that you're walking and taking that risk is not worth it for a micro optimization like this initially I thought we could enhance this function by minimizing the discrepancy between this and the actual sine function this idea isn't new in fact it's a common approach known as the min max approximation algorithm what's interesting is that when I delved into Unreal Engine 5's source code out of curiosity I discovered notes by developers stating that they used a min max approximation algorithm for similar purposes they used an 11th order polynomial which is significantly more complex than our measly third Auto polynomial but the core idea is the same I decided to take a dive into the world of numerix to find a third other polynomial that minimizes the maximum error while still satisfying the condition of sine of pi divided by 2 equals 1 and by dive into nomorex I mean around Italian get something that is anything more accurate because that's what numeric's really just is after a fair amount of mathematical detective work and the best fit I managed to find was the following on paper or rather on the graph this looks incredibly promising excitingly this new formula trims the maximum error down to just 0.007 which makes this approximation just as accurate as pascaras but it's more efficient requiring fewer instructions and most importantly no division however there's a catch using this function causes the local minimum to occur before pi divided by 2. because of the shifted local minimum the value of sine X can exceed 1. this access could cause some animation bones to scale improperly which would be pretty noticeable the error for each bone would grow exponentially with each child bond which looks hilarious but it's really not what we want and this is why I decided to just steer clear of the min max approximation altogether the results are just not good at these or polynomials given how great Pi's third order polynomial worked it seemed worth exploring a fifth order polynomial in this case we had the condition that the first derivative of the function at 0 equals 1 to our spline interpolation calculating the polynomial is actually quite straightforward in Germany we actually learned as a ninth grade in school and I never expected anything I've learned in school to actually be a useful life skill but here we are we know our function is of the form F from x equals ax to the power of 5 plus c x to the power 3 plus e x we want our function to only meet three conditions which gives us these three equations at pi divided by two our function should be one and x equals pi divided by 2 the derivative should be zero and at x equals zero derivative should be one which gives us this last equation which easily simplifies to E equals one after solving these equations the actual polynomials we get from this process looks like this this polynomial is very close to the true sine function without any of the odd Behavior we saw with the Taylor series or the min max approximation approaches the maximum era for this polynomial is only 0.00004 which is 1 500 of the error that the third order function had so let the error being that significantly smaller how does the computation cost look at first glance it appears to require only 28 instructions which is the same amount as the buscara implementation and it only adds one more multiplication and one more addition over the third order polynomial but dear sketch the coefficients for the polynomials are so small that they introduce an extraordinary amount of inaccuracy the animations generated using this approach end up looking peculiar because the extremely small numbers cause problems we can address this issue by resolving the polynomial and adding a scaling Vector although this does add two extra instructions the result is a 30 instruction function yes it's two more instructions than the mascara approximation and it doesn't require any division so this ends up being faster and more accurate while I'm satisfied with the accuracy of the previous implementation it irks me that vascara's approximation still holds the title for the most accurate function within 28 instructions or less there's no way that a seemingly arbitrary arrangement of mathematical operations should yield the best function sodak Depot considering the error of the third order and the fifth order design functions the fourth order cosine function should be somewhere in between so how did that turn out the implementation requires only 27 instructions saving on one multiplication with the coefficients of this polynomial we can take advantage of the fact that one is always stored in a hardware register in this repository and requires few instruction than both pascara and the fifth order sine function it might also even be more accurate than the sine function because we have to take the polynomial to a lesser degree and there will be less exponentially accumulating floating point in accuracies the maximum error of dysfunction on the graph is only 0.0028 which is absolutely fantastic I found myself oscillating between all three functions the lookout table the fourth order polynomial and the third order polynomial as I managed to shave off couple of instructions on each one I settled on easy as pies math functions photographic thread and the fourth order polynomial for the gameplay thread easiest price math formula is accurate enough for its inaccuracies to be imperceptible in the graphics rendering and the fourth order causing function provides the best mix of accuracy and performance for game mechanics that require precise calculations if we need perfect accuracy or we only need to call the function once at a time the lookup table will be the fastest option however no such situation exists in a game like Mario 64. if we consider data transfers the fourth order polynomial only costs 1 64th of what the original sine and cosine function costs for 100 calls additionally we can eliminate the sine and cosine table all together this way saving a whole kilobyte of memory however it's crucial to keep in mind that we are running a lot more CPU instructions which leads to less efficient memory pipelining and as a result in the actual Factor by which we are faster is going to be much much smaller than 64. so after all this how much time does this actually save in a frame around 200 microseconds most of the time just a fraction of a millisecond this is 0.6 percent of the time that the frame takes it might sound pitiful but in such an optimized game every additional percent is hard earned and I'm actually quite happy with the results and to answer the question in the title what is the best sine function on the N64 it depends between easy as Pi's polynomial my fourth or a cosine and my interval and lot I believe we cover every use case easy as price solution is great for repeated calls that don't require super high accuracy the fourth order cosine suits repeated calls that need high accuracy on the left is ideal for either hyper accurate signs or one-off function holds I hope you enjoyed this and a shout out to my sponsor brilliant.org would you like to learn more about these kinds of topics brilliant.org is a great way to learn math and computer science with tons of graphics and animations to help you along brilliant has thousands of lessons including some of the similar topics to this video and new lessons are added monthly for example you could take a look at the calculus fundamentals course that will explain the basics of sine and cosine and how to get the derivative of a function to try everything brilliant has to offer free for a full 30 days visit brilliant.org Kaza Emmanuel or click the link in description the first 200 of you will get 20 of brilliant's annual premium subscription [Music] oh yeah [Music]
Info
Channel: Kaze Emanuar
Views: 286,411
Rating: undefined out of 5
Keywords: super mario 64, kaze emanuar, n64, nintendo, mod, download, mario, rom, emulator, retro
Id: xFKFoGiGlXQ
Channel Id: undefined
Length: 26min 40sec (1600 seconds)
Published: Sat Jun 24 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.