120x Faster Algorithm By Nested Loops

Video Statistics and Information

Captions
I have not actually... I have actually not seen this video. In my last video I introduced the two major bottlenecks that slow a program down, compute bound and memory bound, okay, and at the end of that video I mentioned a specific compute-bound task, which is general matrix-matrix multiplication. I remember this video, I think we watched it. This operation, despite its apparent simplicity, can be remarkably challenging to optimize efficiently. In today's episode we're going to walk through some genius algorithms to make it over 100 times faster, all achieved through pure CPU optimization techniques, leveraging SIMD and cache strategies. Okay. Some of these techniques might seem counterintuitive at first glance, but they'll start to make perfect sense as we delve deeper into the underlying reasons. We'll begin by implementing a vanilla GEMM algorithm: the straightforward approach involves a triple for-loop structure to compute the dot product for each entry in the matrices. This is code that I'd let Copilot write now. What was it, it must have been, gosh, two years ago I wrote matrix multiplication on stream; writing this thing just takes so much, it's just the world's worst thing to write.

Here we simplify the input to be square matrices of size MAT_SIZE. Yep. The performance of this basic implementation is notably slow, taking approximately 2 seconds to process matrices of size 1K by 1K. Of course, we're not even using threading right now. I mean, it seems right, because it's n cubed, and being a square matrix that'd be a thousand times a thousand times a thousand. Oh, that's a billion. That's a billion. I'm pretty good at math, okay, I'm really good at the maths. "Just use an npm package to do that." "SIMD's nuts." "Matrix multiplication npm package, just install it." With quick maths we've already accelerated the algorithm by a factor of 10. Oh. This is because my CPU boasts an 8-core, 16-thread configuration. However, despite this improvement, the important question right now is: is it really fast enough? Yeah. How far are we going to optimize, and when should we be satisfied? To understand this question, we have to do the math first. Okay.

According to the specifications, my CPU has a maximum compute bandwidth of 600 GFLOPS, while the memory bandwidth limitation stands (that's a lot of flops, I'm not going to lie to you, I don't think I have that many flops; does an average man have this many flops, or is this an unusual amount of gigaflops?) at approximately 46 gigabytes per second. However, keep in mind that memory bandwidth isn't solely dictated by your CPU; peripheral factors, including lower memory frequencies and underutilized memory channels, can also decrease your practical memory bandwidth. To test the memory bandwidth in the production environment, we turn to the STREAM benchmark. I'm not going to lie to you guys, I've never been to this level of optimization, you know what I mean? I've never had to optimize down here; I've never hit a memory bus problem. "Skill issue." Hard skill issue, okay, my flops are way too floppy for this. "You never used npm." Fair, okay, maybe I have; maybe I've only been subjected to the memory problem, but I've never actually done the memory problem. According to this benchmark result, my system's practical memory bandwidth ceiling hovers around 21 GB per second, equivalent to 5.25 GFLOPS. It's a lot. The next question is: relating memory to gigaflops, is that normal? That seems confusing, because a flop is a floating-point operation, so what does that mean? Is that just saying it can only send through the results of so many flops? It's not really a comparison.
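For reference, here is a minimal sketch of the vanilla triple-loop GEMM being described, assuming square, row-major matrices of a hypothetical size MAT_SIZE (these names are assumptions, not the video's exact code). The commented-out OpenMP pragma is the kind of one-line threading that gives roughly the core-count speedup mentioned above.

    #include <vector>

    constexpr int MAT_SIZE = 1024;   // 1K x 1K, as in the video

    // Vanilla GEMM: one dot product per output entry, everything row-major.
    // Roughly 2 * MAT_SIZE^3 floating-point operations (about 2 billion here).
    void gemm_naive(const std::vector<float>& A,
                    const std::vector<float>& B,
                    std::vector<float>& C) {
        // #pragma omp parallel for   // thread the outer loop (compile with -fopenmp)
        for (int i = 0; i < MAT_SIZE; ++i)
            for (int j = 0; j < MAT_SIZE; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < MAT_SIZE; ++k)
                    acc += A[i * MAT_SIZE + k] * B[k * MAT_SIZE + j];
                C[i * MAT_SIZE + j] = acc;
            }
    }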
Am I the only one? "Bandwidth limits G floppies." I can't believe I just read that phrase out loud, but I guess we're gonna be calling them G floppies from here on out. Yes, the G does stand for giga, chat.

We know from the last video that GEMM will eventually become compute bound, okay, but when? The answer theoretically isn't overly complex: since the total number of memory accesses works out to 3n² and the total number of floating-point operations required is 2n³, GEMM should shift from memory bound to compute bound when n exceeds 380. But here's the problem: at a resolution of 1K by 1K, both the compute bandwidth and the memory bandwidth fail to reach their full potential. This is because we have a relatively low cache hit rate in our naive implementation. Note that each entry in matrices A and B needs to be accessed multiple times, but if they don't reside in the cache, our memory access count can get close to 3n³ rather than the expected 3n². This becomes problematic when n is large, as the cache is less likely to still hold the data when the same entry is accessed the next time. However, if you have a relatively small matrix, there's even a pretty high chance the entire matrix can be accommodated within a large cache.

So, increasing the cache hit rate: where should we start? Introducing the first trick: transposing one of the matrices from row-major to column-major. Okay, I've heard about this. Do you actually transpose the matrix, or do you just walk it differently? Because is actually copying the matrix over to a new piece of memory really the way you do this, or is it just walking it differently? You're saying yes, transpose. So do you do the transposition on the fly, or do you actually copy it? Yes to the copy, okay.

This small adjustment makes our algorithm run three times faster. But you might be wondering: aren't we doing more computation by transposing the matrix? Well, this brings us to the two principles of cache optimization: maximizing spatial locality and temporal locality. In our original GEMM implementation, both of these aspects are suboptimal. Yeah. You see, in the vanilla GEMM both matrices A and B are stored in row-major order; however, when performing the matrix multiplication, matrix B is accessed along its columns. This means the algorithm needs to skip an entire row to access the next value. Yeah. But wait a second, it's not truly random memory access, so a modern cache system can handle this pattern, right? And you'd be correct: your CPU likely employs prefetching as long as you have a stable memory access pattern like this. The problem is that you don't cache a single float when you access a value, you cache an entire cache line. The way you access matrix B results in a substantial amount of cache being brought in but not fully utilized, so the low spatial locality in this case leads to inefficient cache line utilization. Transposing matrix B, despite the extra computation and even the memory access overhead it incurs, significantly improves the cache line utilization and therefore increases the overall performance at the end of the day. So we can improve the spatial locality by transposing the matrix. What about temporal locality?
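A sketch of the transpose trick, under the same assumptions as the naive version above (gemm_transposed and Bt are hypothetical names): B is copied once into column-major order so the innermost loop reads both operands contiguously and every fetched cache line is fully used.

    #include <vector>

    // Transposed GEMM: pay an O(n^2) copy of B once so the O(n^3) inner loop
    // walks A[i][*] and Bt[j][*] linearly (good spatial locality).
    void gemm_transposed(const float* A, const float* B, float* C, int n) {
        std::vector<float> Bt(static_cast<size_t>(n) * n);
        for (int k = 0; k < n; ++k)          // Bt[j][k] = B[k][j]
            for (int j = 0; j < n; ++j)
                Bt[j * n + k] = B[k * n + j];

        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < n; ++k)  // contiguous reads of both rows
                    acc += A[i * n + k] * Bt[j * n + k];
                C[i * n + j] = acc;
            }
    }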
That's a great visualization for showing it, because you just always want to walk memory linearly, right? One of the most common ways to increase performance in the simplest way: if you have a set of like 10 or 15 items, you don't use a set, even if you're removing and adding. Instead you just put them into an array, and yeah, when you remove you have to shift everything back, but you get this nice, tight array where everything is co-located and you can walk it really swiftly, and even with that shifting it's still better, plus you avoid whatever complication the hashing factor adds. So these types of improvements are wild; they're just not something the average programmer thinks about, but they do exist and there are really good practical implications for them, which is really nice.

To improve temporal locality, we must first understand what causes it to be suboptimal. In the context of our matrix multiplication, every entry in matrices A and B needs to be accessed multiple times; however, when dealing with large matrices, holding an entire row in the cache can be challenging. So how can we make this memory access more efficient? Here comes the magic of linear algebra: each entry in the result matrix, C_ij, equals the sum of A_ik times B_kj for k from 0 to n. Classic. I have never heard someone say that in such a swift way; he just tossed out that quick math so quick and mathy, it was beautiful. But if we divide the matrix into blocks, each entry is still equal to the sum of the multiplications within each block, and when you really think about it, you'll find we are actually doing matrix multiplication in blocks: each block multiplication is essentially another, smaller matrix multiplication. Why is this useful? Well, imagine we choose a block size that's small enough to fit entirely inside the cache. Ideally, by doing so we only need to access each entry in the matrix L times instead of n times. Take a look at this code snippet: while it appears to have more nested loops, the total number of floating-point operations remains unchanged. What we've achieved is a significant improvement in the memory access pattern, resulting in better temporal locality. By breaking the matrix multiplication down into smaller, cache-friendly blocks, we can maximize the reuse of data stored in cache, thereby accelerating our GEMM algorithm.

That's enough for the concept; let's turn our attention to the practical side: performance. Sadly, using the blocked GEMM can be a little more challenging than the transposed GEMM, since you need to decide what block size to use, and you also have five for-loops. It's not just that you have to decide the block size; you've got all five for-loops, really easy to screw up there. Remember, the actual number of memory accesses is the product of the total size of the two matrices, 2n², and the number of memory accesses for each entry, L. Choosing a small block size results in a large block count L, so the total number of memory accesses, 2n²·L, can approach the worst case of 2n³. Okay. On the other hand, selecting a block size that's too large to fit inside the cache may negate the advantage of using blocks altogether. In practice, different machines may prefer different block sizes, okay, and the only way to find the best block size is through experimentation.
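A sketch of the blocking idea, assuming n is a multiple of BLOCK and that C starts zeroed; the exact loop nest and block size in the video may differ, and as noted above only experimentation tells you the right BLOCK for a given machine.

    constexpr int BLOCK = 64;   // tune so three BLOCK x BLOCK tiles fit in cache

    // Blocked (tiled) GEMM: same 2n^3 flops, but each tile of A and B is reused
    // many times while it is still hot in cache (better temporal locality).
    void gemm_blocked(const float* A, const float* B, float* C, int n) {
        for (int bi = 0; bi < n; bi += BLOCK)
            for (int bj = 0; bj < n; bj += BLOCK)
                for (int bk = 0; bk < n; bk += BLOCK)
                    // C tile (bi, bj) += A tile (bi, bk) * B tile (bk, bj)
                    for (int i = bi; i < bi + BLOCK; ++i)
                        for (int j = bj; j < bj + BLOCK; ++j) {
                            float acc = C[i * n + j];
                            for (int k = bk; k < bk + BLOCK; ++k)
                                acc += A[i * n + k] * B[k * n + j];
                            C[i * n + j] = acc;
                        }
    }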
Another important difference between the blocked GEMM and the transposed GEMM is that, well, we don't need to transpose anymore. Our goal is to fit the entire block inside the cache; moreover, people often choose block sizes that are a multiple of the cache line, so transposing the matrix shouldn't improve the performance. Right? Let's transpose it and see what happens anyway.

I have never had to optimize something to this level. When I look at this, none of it is crazy surprising; we all know you want to lay things out in the most memory-efficient way, but I've never had anything at my job where this is something I've had to do. I've never had a locality problem, and so it's very interesting. This seems like game-engine work in my head, this is ECS stuff, this is where the real optimization happens, the things you don't actually think about a lot. It reminds me also of the SMI optimization in V8, where they do small integers: if you have an array of integers that are small, you literally have an array of integers in V8, which allows for pretty fast access, whereas in the oldie days it would have to hop to each heap offset where the integer was stored in the heap, to be cleaned up later. Now it doesn't have to do that, so there's a little something there. I assume it also matters for all the amazing ML stuff; ML is just one gigantic linear equation continuously running, and MLPs are literally just doing sweet, sweet MLP stuff.

Performance increased again. That's a little unexpected: as I mentioned earlier, transposing matrix B doesn't fundamentally change the cache line utilization here; instead, it improved the performance by affecting another parallelism mechanism, SIMD. "SIMD these nuts." You might have noticed that I added an OpenMP SIMD macro just before the innermost loop. This seemingly small addition enables the compiler to leverage SIMD instructions for the dot-product calculations within the blocks. Oh. Now, if you've been following my previous videos, you're probably familiar with SIMD's capabilities: you can perform multiple floating-point operations in a single CPU cycle. But there's a catch: it depends on efficiently loading data into SIMD registers. Yeah. For the blocked GEMM without the transpose, the innermost loop direction only aligns with matrix A; after transposing matrix B, we can load both matrices into SIMD registers easily.

Now let's explore another counterintuitive but highly effective optimization technique: copying blocks into local buffers. In the code you can see we've made a significant change by hardcoding the block size into the algorithm, which provides the compiler with more information for compile-time optimization. Okay. However, the real magic lies in creating local copies of the data for each block. Copying data is often considered an expensive operation, but in this case it almost doubles the performance again, and the reason is once again cache optimization. Notice I used the alignas keyword when declaring the local buffers; this ensures that all local buffers start at addresses that are multiples of 64, which is the length of a cache line on my machine. Yeah. Remember that every time you access an address, the entire cache line containing that address gets loaded into the cache, so by making sure our data structures are aligned with cache lines, we again increase the cache line utilization.
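A sketch of the aligned local-buffer idea combined with the OpenMP SIMD hint, assuming B has already been transposed and the block size is hardcoded; multiply_block, local_a and local_b are assumed names, not the video's actual code.

    #include <cstring>

    constexpr int BLOCK = 64;   // hardcoded so the compiler knows the trip counts

    // Multiply one BLOCK x BLOCK tile pair and accumulate into C.
    // C is assumed zero-initialized; Bt is B stored transposed (row-major).
    void multiply_block(const float* A, const float* Bt, float* C, int n,
                        int bi, int bj, int bk) {
        alignas(64) float local_a[BLOCK][BLOCK];   // start on a cache-line boundary
        alignas(64) float local_b[BLOCK][BLOCK];

        for (int i = 0; i < BLOCK; ++i)            // copy the two tiles locally
            std::memcpy(local_a[i], &A[(bi + i) * n + bk], BLOCK * sizeof(float));
        for (int j = 0; j < BLOCK; ++j)
            std::memcpy(local_b[j], &Bt[(bj + j) * n + bk], BLOCK * sizeof(float));

        for (int i = 0; i < BLOCK; ++i)
            for (int j = 0; j < BLOCK; ++j) {
                float acc = 0.0f;
                #pragma omp simd reduction(+ : acc)   // vectorize the dot product
                for (int k = 0; k < BLOCK; ++k)
                    acc += local_a[i][k] * local_b[j][k];
                C[(bi + i) * n + (bj + j)] += acc;
            }
    }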
The next reason we're using local buffers here is this OpenMP threadprivate macro. I believe that's why Rust pads things out in its structs: if you do two 8-bit members in a Rust struct, it won't stay at two bytes, it gets padded out to these larger, system-word-sized offsets even though you're only using two bytes technically, because it's just fast to read and fast to put those things there.

You might recall the discussion about false sharing: when two threads try to access data on the same cache line, performance can plummet even without locks or other software limitations, because the hardware often intervenes to synchronize the cache line between different physical cores, causing performance degradation. To solve this, we need to guarantee not only that each thread has its own private data, but also that these data blocks reside on different cache lines. I know it sounds a little stupid to say "just keep them on different cache lines," but the solution really is that simple: just use the OpenMP threadprivate macro. It creates a private copy of the buffer for each thread spawned and handles the false-sharing concern perfectly.

Looks good, but there's actually one more problem we can solve: the matrix transposition. As discussed previously, the performance gain from transposing matrix B is mainly attributed to achieving alignment with the innermost SIMD loop in both local_a and local_b. However, what if we could achieve this alignment without the transposition? Yeah, it turns out we can. Okay. If matrix B is not transposed, the code would look like this, but hold on a second: there are still two local buffers aligned the same way, local_b and local_c. So instead of transposing the matrix to let the buffers align with the loop, we can swap the loops to let them align with the buffers. This adjustment doesn't affect the computational logic of the algorithm; it merely transforms the access pattern of the buffers.

Isn't that what I said earlier? Instead of copying, we can just literally walk it in the correct order. Predicted it. I predicted this one, clearly predicted this one, I knew it, I knew the prediction was going to happen, it just felt right. Oh man, that makes perfect sense: if your goal is to access memory in a nice linear way, why transpose when you could just access it that way? "Who's the VTuber now?" This guy is. Now we can get rid of the annoying matrix transposition step. There are also some minor details you can fine-tune in your GEMM implementation; for example, we can clear local_c less frequently by moving the clearing step one loop outward. All these efforts led to a substantial performance improvement, reducing the time needed to process 1K square matrices from 1900 milliseconds to just around 16 milliseconds.
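A sketch of the two fixes just described, with assumed names: threadprivate gives every OpenMP thread its own cache-line-aligned buffers so no two threads write to the same cache line, and swapping the k and j loops lets the innermost loop stream through local_b and local_c in their natural row-major order, so B never needs to be transposed.

    constexpr int BLOCK = 64;

    // One private, cache-line-aligned copy of each buffer per OpenMP thread.
    alignas(64) static float local_a[BLOCK][BLOCK];
    alignas(64) static float local_b[BLOCK][BLOCK];
    alignas(64) static float local_c[BLOCK][BLOCK];
    #pragma omp threadprivate(local_a, local_b, local_c)

    // Assumes local_a and local_b were filled straight from A and B (no transpose)
    // and local_c was cleared once per output tile, one loop further out.
    void multiply_block_no_transpose() {
        for (int i = 0; i < BLOCK; ++i)
            for (int k = 0; k < BLOCK; ++k) {
                const float a = local_a[i][k];
                #pragma omp simd                 // streams local_b[k][*] and local_c[i][*]
                for (int j = 0; j < BLOCK; ++j)
                    local_c[i][j] += a * local_b[k][j];
            }
    }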
That's an impressive improvement, I have to say. I wonder how this scales for just, like, 4x4s. So if you're doing game programming, does this all still pay off at that point, or is n so small that it makes no real difference, or even hurts at that size? Because sometimes optimizations don't work uniformly; they work at certain sizes. "Scales linearly." I mean, does it? "No real difference." My real question is that it may not matter at all on small sizes, right? "You are wrong." It turns out I am actually wrong? No, I mean, I'm just curious. "A 4x4 fits in cache." Okay, that is what she said. She did say that a 4x4 always fits in cache. And it's especially impressive considering there's no fancy GPU acceleration or anything like that involved.

The bad news is that we probably don't want to optimize GEMM, or any other linear algebra operation, ourselves, because there's a much better option called Basic Linear Algebra Subprograms (BLAS). The Intel MKL implementation of BLAS can get you to only 2 milliseconds for a 1K-resolution square-matrix GEMM, plus it also supports matrix multiplications between non-square matrices, unlike the crappy demo we showed today. But you get the point. That demo is great. First off, that demo was fantastic, okay, great job on that demo. Second: two milliseconds. And Vim, by the way, what clearly looks like LazyVim, by the way. "LunarVim." Is this LunarVim? Okay.
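For comparison, this is roughly what handing the same job to a BLAS library looks like. cblas_sgemm computes C = alpha*A*B + beta*C; the header name varies by vendor (mkl.h for Intel MKL, cblas.h for OpenBLAS or the netlib reference), and the wrapper name here is just an assumption.

    #include <cblas.h>   // or <mkl.h> when linking against Intel MKL

    // Single-precision GEMM on n x n row-major matrices via BLAS.
    void gemm_blas(const float* A, const float* B, float* C, int n) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,        // M, N, K (non-square shapes work too)
                    1.0f, A, n,     // alpha, A, leading dimension of A
                    B, n,           // B, leading dimension of B
                    0.0f, C, n);    // beta, C, leading dimension of C
    }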
Maybe approaches similar to the ones we used today are somewhere in their proprietary code base. You say you want to know what makes it another eight times faster than the best we could do? Well, I did learn something from the high-throughput optimization course I took: a lot of tricks, including separating the block hotspot into another module and using assembly intrinsics to optimize it like crazy. My professor also mentioned that even loading data into SIMD registers in different orders can affect the performance, but I never used those tricks in my own projects, so I doubt I can tell you much about them. Anyway, if you enjoyed this video, remember to subscribe to the channel. As always, I hope you cache well, and see you in the next one.

"I hope you cache well." What a great line. Give that thumbs up, give that subscribe, depth buffer. That was a great video, that was really, really well done. I think the thing that makes it so good is that everything he stated was extremely difficult. "How many people are you subscribed to?" Too many. But everything he talked about is a really difficult topic and he did it really well; that was surprising, I'm actually surprised at how well he did that. And I think the second thing is something you see all the time: it's very easy for anybody to look at the matrix multiplication algorithm and think, well, can you really make it that much faster? Even if you spend months optimizing it, can you really make it that much faster? And that thing went from, like, 2 seconds, yeah, 2 seconds, to 2 milliseconds; that library made it a thousand times faster. So I think sometimes you can make things quite a bit faster just by playing around. Not everything can be optimized that way, obviously; if you're working with a web server it's not going to be the same situation. You're not talking about a specific algorithm where you're looking at the assembly instructions; instead it's: how are you managing memory, are you creating a lot of garbage, how often are you in garbage collection, how often are you returning back to the runtime, how much time can you spend just doing your code and the things you need to do and then getting out? There's definitely a whole slew of things that can be very important. Yeah, the animations were very, very good.

"How often in JS?" No, this is actually a really good one; you may not realize how important it is. There are libraries written in native code that can be used from JavaScript, and if you know them and you do a large chunk of your work in those native libraries, it can be really, really effective to avoid doing that work in JavaScript at all. That's most specifically true in Node; in Bun it seems to be less of an issue, you don't get nearly the same wins, because Bun just has a really good path from JavaScript to the runtime, and so a lot of those wins go down. Spending more time in native isn't a 10x increase there; you're getting more like a 2 to 3x increase. "Bun just has a good JIT compiler." It's literally JSC versus V8, and I think V8, if I'm not mistaken, is better, so to say that Bun has something that's much better, I don't know if that's true. I can't tell you if that's true or false because I don't know enough about JSC versus V8. Anyways, the name is ThePrimeagen. I'm not going to do the "cache-gen," okay? We're not doing it.
Info
Channel: ThePrimeTime
Views: 338,439
Keywords: programming, computer, software, software engineer, software engineering, program, development, developing, developer, developers, web design, web developer, web development, programmer humor, humor, memes, software memes, engineer, engineering, Regex, regexs, regexes, netflix, vscode, vscode engineer, vscode plugins, Lenovo, customer service
Id: 24nhC1TMEV4
Length: 21min 20sec (1280 seconds)
Published: Fri Oct 27 2023