"Clean" Code, Horrible Performance

Captions
Some of the most often repeated programming advice, especially to beginner programmers, is that they should be writing "clean" code, and that name is accompanied by a long list of rules that tell you what you should do in order for your code to be clean. Now, there is no benchmark for this; you cannot actually run something and have it tell you how clean your code is. But there are fairly descriptive rules that tell you what you should be doing.

A large portion of these rules don't actually affect the runtime of the code that you write, especially not if it's compiled, so there is really no way to objectively assess whether those ideas are good or bad, and we don't necessarily have to, because they're fairly arbitrary at that point. However, several aspects of clean code, in fact some of the most important ones that are stated, are things we can objectively measure, because they do affect the runtime of the program you are writing. If you go look at a list and pull out the ones that would actually carry through to the code as it is compiled and run on the end user's machine, they are things like these:

Number one: polymorphism is good, and ifs and switches are bad. Those two you can easily act on: always use polymorphism if you can, and get rid of if and switch statements.

Number two: no internals. You should never know the internals of anything you're working on. For example, if you write a function that does something, it should always call something like a polymorphic function on an object; it should never look into the object to see what's actually in there or what kind of object it is. It should never know what the internals of that thing are, ever.

Number three: functions should be small.

Number four: functions should do one thing. "One thing" is never really defined in any particular way, and I don't think that's all that actionable, but it is a general rule of thumb you can see affecting how you structure your code; taken together with "functions should be small", it says you're trying to reduce a function down to the smallest possible thing it could be.

Finally, number five: don't repeat yourself, or DRY.

If you take all of these together, they are actually fairly specific about how any particular piece of code should be created in order for it to be quote-unquote clean. So what I would like to ask is: if we create a piece of code that follows these rules and is clean, how does it perform?

In order to construct what I would consider the most favorable case for a clean code implementation of something, I used their own example code; that way I'm not making something up, I'm just using what they actually give as an example of how you should do something. If you look at clean code examples, you will see this one repeated very often: a base class for a shape, and then a few shapes derived from that base class (circle, triangle, rectangle, square; we could add more if we wanted to), with a virtual function that does something like compute the area. The idea is that we are preferring polymorphism, and preferring functions that do just one thing and are small, and all that good stuff. We have this little tiny area function, and depending on which of the derived classes we have, we will do a different area computation for the square, the rectangle, the triangle, or the circle.
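The exact listings from the video aren't reproduced in these captions, but a minimal sketch of that kind of hierarchy (the identifiers here are my own, approximating the standard clean-code shape example) might look like this:

// Abstract base class exposing one small, "do one thing" virtual function.
class shape_base
{
public:
    virtual ~shape_base() {}
    virtual float Area() const = 0;
};

class square : public shape_base
{
public:
    explicit square(float SideInit) : Side(SideInit) {}
    float Area() const override { return Side * Side; }
private:
    float Side;
};

class rectangle : public shape_base
{
public:
    rectangle(float WidthInit, float HeightInit) : Width(WidthInit), Height(HeightInit) {}
    float Area() const override { return Width * Height; }
private:
    float Width, Height;
};

class triangle : public shape_base
{
public:
    triangle(float BaseInit, float HeightInit) : Base(BaseInit), Height(HeightInit) {}
    float Area() const override { return 0.5f * Base * Height; }
private:
    float Base, Height;
};

class circle : public shape_base
{
public:
    explicit circle(float RadiusInit) : Radius(RadiusInit) {}
    float Area() const override { return 3.14159265f * Radius * Radius; }
private:
    float Radius;
};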
Now, if we imagine using this, if we wanted to find the total area of a series of shapes that we pass in, we would expect to see something like the loop sketched after this passage. You'll notice I haven't used an iterator here, because there was nothing in clean code that suggested you had to use iterators, so I figured I would give them the benefit of the doubt and assume they were just going to use a basic for loop that wouldn't have anything that would confuse the compiler. So here we are: we create an accumulator, we loop over all the shapes in the array we're given, we call the area function on them, and we return. You'll notice these have to be arrays of pointers, because we have no idea how big each of these shapes is; unless we were going to parse out the lengths of the shapes and use some kind of skip procedure to go through them, we have to look at pointers to find out where the shapes actually are. Furthermore, you'll see there's a loop-carried dependency here: the accumulator gets added to each time. So just to make sure we're not seeing some artifact of the loop, I also included a hand-unrolled four-times version, just to make sure the CPU has enough to look at to do adds; we break the dependency chain by having four separate accumulators, but otherwise it's the same code.

If I run that code right there, what I will get is the total number of cycles per shape required to do that operation; I just feed it a big array of shapes and ask how fast it can do them per shape, in other words, how long it takes to do one shape. I time it in two different ways. One is with a repeat count of one, which shows what happens in a fairly cold cache scenario; something sitting in the L3 might be a good example of what that time represents. Then I also do a repeat-count version where we try to prime the cache a little more, which is like the data fitting into an L2, or the branch predictor having a chance to warm up; there we would see an improvement. What we can see from the results is that there's not much of a difference: it's around 35 cycles to do this thing, maybe getting down toward 34 sometimes if you're really lucky. So the clean version of our code took around 35 cycles per shape. That's what we get if we use polymorphism, no internals, small functions, one thing, don't repeat yourself; we're not violating any rules here.

What would happen if we violated just the first rule and said that instead of using polymorphism, we were actually just going to use if/switch statements, and not use a virtual function anymore? What happens to the performance of our routine if we change that? So here I've written the exact same code, but instead of writing it in the polymorphism way, using C++ classes and virtual functions, I've written it using a standard enum and a shape type that effectively just unions them together. So even if something didn't have a height, for example, because it only has one parameter, well, it has one now, because we've done this the old-fashioned way. Then we say that instead of getting the area from a virtual function, you're going to get it from a switch statement, exactly the thing a clean code lecture would tell you never, ever to do. What we do here is switch on the type, and depending on which of the types we have, we compute the exact same thing we were computing in the virtual function for the derived class of that type. Now we have effectively the exact same code.
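Again, these aren't the video's exact listings, but a rough sketch of both halves of what's being described, the virtual-call accumulation loops and the enum/switch replacement, could look like this (names are hypothetical, and the unrolled version assumes the count is a multiple of four for brevity):

#include <cstdint>

// Clean-code driver: an array of pointers to shapes, summed through the
// virtual Area() call.
float TotalAreaVTBL(uint32_t ShapeCount, shape_base **Shapes)
{
    float Accum = 0.0f;
    for(uint32_t I = 0; I < ShapeCount; ++I)
    {
        Accum += Shapes[I]->Area();
    }
    return Accum;
}

// Hand-unrolled four-accumulator variant that breaks the loop-carried
// dependency on the single accumulator.
float TotalAreaVTBL4(uint32_t ShapeCount, shape_base **Shapes)
{
    float Accum0 = 0.0f, Accum1 = 0.0f, Accum2 = 0.0f, Accum3 = 0.0f;
    for(uint32_t I = 0; I < ShapeCount; I += 4)
    {
        Accum0 += Shapes[I + 0]->Area();
        Accum1 += Shapes[I + 1]->Area();
        Accum2 += Shapes[I + 2]->Area();
        Accum3 += Shapes[I + 3]->Area();
    }
    return Accum0 + Accum1 + Accum2 + Accum3;
}

// The "old-fashioned" alternative: one plain struct for every shape, tagged
// with an enum. It is not a literal C union; every shape simply carries a
// Width and a Height (a square or circle repeats its one dimension in both).
enum shape_type : uint32_t
{
    Shape_Square,
    Shape_Rectangle,
    Shape_Triangle,
    Shape_Circle,
};

struct shape_union
{
    shape_type Type;
    float Width;
    float Height;
};

// A switch in place of the virtual call, computing the same four areas.
float GetAreaSwitch(shape_union Shape)
{
    float Result = 0.0f;
    switch(Shape.Type)
    {
        case Shape_Square:    { Result = Shape.Width * Shape.Width; } break;
        case Shape_Rectangle: { Result = Shape.Width * Shape.Height; } break;
        case Shape_Triangle:  { Result = 0.5f * Shape.Width * Shape.Height; } break;
        case Shape_Circle:    { Result = 3.14159265f * Shape.Width * Shape.Width; } break;
    }
    return Result;
}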
Here's the switch statement version of the exact same loop we saw before for the vtable version: we loop over the shapes and just call the get-area switch function instead of calling a function off of some pointer. This offers us an immediate benefit, which is that the shapes can just be in the array; there is no pointer indirection, because they're all the same size. And of course we also get the benefit that the compiler can now see exactly what we're doing, because it doesn't have to wonder about what kinds of types might be derived from a base class; it can see everything that's in this code.

So with those benefits, what can the compiler do for us? If I run these two together and look at the difference in performance between the one that has to use the vtable, because it's using C++ classes, and the one that can just use a basic switch statement, we can compare the two (I'll skip the video here so you don't have to wait for the testing to happen). When we look at the results, what you see is really rather remarkable: just that one change, writing it sort of the old-fashioned way rather than the clean code way, immediately gets you what is effectively a 1.5x performance increase, for not really doing anything other than removing all the extraneous stuff required to have C++ polymorphism. By violating this rule of clean code, which is one of its central tenets, we were able to drop down to about 24 cycles per shape, which means the clean code version that follows rule number one is roughly 1.5 times slower. To put that in hardware terms, it would be like taking an iPhone 14 Pro Max and reducing it down to an iPhone 11 Pro Max: about three or four years of hardware evolution erased, because somebody told you to use polymorphism instead of switch statements.

But we're only just getting started. What if, instead of following the remaining rules and breaking only this one, we also broke number two, the one about not knowing internals? What if our functions could use knowledge of what they were actually operating on to make themselves more efficient in aggregate? If you look back at the get-area switch statement, one of the things you can see is that the cases are all computing the exact same function, just with a different coefficient: they all do something like width times height, or width times width, and then multiply by one half in the case of a triangle or pi in the case of a circle. One of the reasons I think switch statements are great is that they actually show you this; it's really easy to pull out common patterns when you look at the cases put together by function rather than by class. By contrast, if you were to look back at the class version, you would probably never notice this pattern, because each of those classes would probably be put in a different file, and you would never really see that there was quite a bit of similarity between them. Architecturally I generally disagree with that structure anyway, but that's beside the point. The point is we can simplify this quite a bit, and remember, this is not an example that I picked; this is the example that clean code advocates use. I'm not picking an example where you happen to be able to pull out a pattern; I'm using their example, and I can pull out a pattern because, in general, this is what happens in code. So if we actually look at what we can do here, we can just introduce a simple table.
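A sketch of that table-driven collapse, reusing the hypothetical shape_union and shape_type from above, might be as small as this:

// One coefficient per shape type, indexed by the enum
// (1 for square and rectangle, 1/2 for triangle, pi for circle).
static const float CTable[4] = { 1.0f, 1.0f, 0.5f, 3.14159265f };

// The entire switch statement collapses into one expression; it works for
// squares and circles because Height is stuffed with the same value as Width.
float GetAreaUnion(shape_union Shape)
{
    return CTable[Shape.Type] * Shape.Width * Shape.Height;
}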
The table says what coefficient we actually need to use, and we make sure we always stuff the height with the same thing as the width, so that in cases where we were just using the width twice it still works. That allows us to collapse our switch statement into a single simple equation which just looks up, in a table, the one thing that differs between the operations. The two functions that use this are exactly the same as before; we don't have to modify them at all. The only difference is that we now use this to get our area instead of the switch statement.

Let's see what happens if we run that version (again, I'll skip the video forward so you don't have to wait for the timings to complete). What you can see is that by taking advantage of what we know about the actual types we have, and by treating the function we're performing as more important than the types themselves rather than keeping them opaque, we get a massive speed increase. We've gone from something that's merely 1.5 times faster to something that's fully 10 times faster or more, on the exact same problem, using nothing other than a very simple table-driven calculation, because it's very easy to observe the actual patterns in this code when we think of it by function rather than by type and allow functions to know the internals of what they're working on. By actually fusing our data model with our code, as it usually was supposed to be until clean code happened, we were able to get all the way down to about 3.5 cycles per shape. That's a 10x speed improvement over the clean code version that follows the first two rules.

The important thing to remember is that 10x is such a large performance increase that it's not even possible to put it in iPhone terms, because iPhone benchmarks don't go back far enough. If I went all the way back to the iPhone 6, which is as far as common benchmark sites go right now, it's only about three times slower, so at 10x we can't even use phones anymore. If we look at desktops, a 10x speed improvement is like going from the average CPU Mark today all the way back to the average CPU Mark from 2010.
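For completeness before moving on, the driver that uses the table version is the same basic loop as before, only now the shapes sit in the array by value and the leaf call is the table lookup (again a hypothetical sketch):

float TotalAreaUnion(uint32_t ShapeCount, shape_union *Shapes)
{
    float Accum = 0.0f;
    for(uint32_t I = 0; I < ShapeCount; ++I)
    {
        Accum += GetAreaUnion(Shapes[I]);
    }
    return Accum;
}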
So these two rules combined in the clean code idea wipe out 12 years of hardware evolution by themselves. Just those two ideas are enough to erase 12 years of hardware improvements that hardware manufacturers try desperately to bring you.

Okay, but here's the thing: we're only doing the simplest possible operation, right? We are doing these two things by default, because all we're doing is computing the area, and it would be very difficult to make a large function out of computing the area; it's just a multiplication or two. What if we add one more aspect to our problem, just one, so that we can follow these two rules of clean code and see how much worse the clean code gets compared to code that does not try to be clean in that way? Here you can see the exact same hierarchy we had before, but this time we've added one more virtual function that gets us a corner count; in other words, the rectangle has four corners, the triangle has three, a circle has none. Instead of the total area, I'm then going to compute the total area weighted by one over one plus the number of corners. There's no reason for this; I just needed something to compute that was as close as possible to the original example they gave, but which included a second property that wasn't found in the original.

When we do this the clean code way, of course, we call corner count and area as separate virtual functions and then compute what we want to compute. We might even pull this out into yet another function, but what we definitely know is that we cannot fuse these two into one function, because that would not be following the rules of clean code, which say that everything has to come from the simplest possible functions which then compose themselves up. So at best all we can do is this, and at worst we would have to pull it into yet another function, but I'm giving it the benefit of the doubt by leaving it this way. That's what it looks like if we use the clean code version, both the normal way of writing it and the one that breaks the loop-carried dependency, just in case. If we look at the union version, you can see that not very much changes: we just add another switch statement that tells us how many corners there are, and then we compute the exact same thing. Again, the code looks almost identical between the class case and the switch case, the only difference being how they get those two values, whether through a virtual function call through a vtable or through a switch statement. Then we've got the table-driven case, and the table-driven case, as you can see, is awesome, because we can fuse our knowledge of our data with the knowledge of what we are doing. The table itself is the only thing that actually needs to change, because we can bake those weighting terms directly into the table, so we don't have to get any information about our shape other than looking up the one value in the table. That is exactly why table-driving things is so good, and why knowing what you're doing, and what the types are that you're doing it to, is a crucial element of getting reasonable performance out of compute. The two driver functions for these are exactly the same as they were before; the only difference is that we're calling a different function here, but it's the exact same computation. In fact, we could have just changed which table we were using and done the exact same thing.
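To make the three variants of the corner-weighted sum concrete, here is a sketch of only the pieces that change in each style (hypothetical names; the clean-code version assumes shape_base has gained a virtual CornerCount() returning 4, 4, 3, or 0):

// Clean-code style: two separate virtual calls, composed in the driver.
float CornerAreaVTBL(uint32_t ShapeCount, shape_base **Shapes)
{
    float Accum = 0.0f;
    for(uint32_t I = 0; I < ShapeCount; ++I)
    {
        Accum += Shapes[I]->Area() / (1.0f + (float)Shapes[I]->CornerCount());
    }
    return Accum;
}

// Switch style: one more switch statement that returns the corner count.
uint32_t GetCornerCountSwitch(shape_type Type)
{
    switch(Type)
    {
        case Shape_Square:    return 4;
        case Shape_Rectangle: return 4;
        case Shape_Triangle:  return 3;
        case Shape_Circle:    return 0;
    }
    return 0;
}

// Table style: the weighting term 1/(1 + corners) is baked directly into
// the per-shape coefficient, so the inner computation does not change.
static const float CTableCorners[4] =
{
    1.0f / (1.0f + 4.0f),         // square
    1.0f / (1.0f + 4.0f),         // rectangle
    0.5f / (1.0f + 3.0f),         // triangle
    3.14159265f / (1.0f + 0.0f),  // circle
};

float GetCornerAreaUnion(shape_union Shape)
{
    return CTableCorners[Shape.Type] * Shape.Width * Shape.Height;
}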
Now, if we run all of these together, we can see what the differences are now that we've added a second property (again, I'll skip ahead so you don't have to wait). Looking at the results, the case is even worse for quote-unquote clean code: now our switch statement version is almost twice as fast, and our lookup table version is almost 15 times faster. That's just insane, and it only gets worse, because the more you use the clean code methodology, the less a compiler is able to see what you're doing; everything is behind these virtual function calls and obfuscation of that kind, so it becomes harder and harder for the compiler to do its job. And of course you can't really do your own job very well either, because you can't do things like pull values out into tables; the code just isn't structured that way. As you can see, making our problem more complicated doesn't benefit the clean code solution, it actually hurts it. We've gone from a 10x speed difference to a 15x speed difference. That's like pushing the hardware all the way back to 2008, so instead of erasing 12 years we're erasing 14 years, just by adding one new parameter to our definition of the problem. That's it.

Now that's horrible in and of itself, but you'll notice I haven't actually optimized anything here. I don't even use modern CPU features like SSE; I don't use anything that might speed this up more. This is just handing the compiler the most basic possible code we can and not following clean code guidelines, and that alone gets us 15x. But of course, if we wanted to write the optimized version, even just a lightly optimized version, it would be even worse than that. Here's what it looks like if I run all of the routines together, including some actual AVX routines which use the AVX instruction set to get more performance out of the loop (again, I'll skip ahead so you don't have to wait for the timings). What you can see is that in the best-case scenario you're looking at a 21x speed difference between an AVX-optimized routine and the C++ vtable version, and in the worst case you're looking at more like 24x. Again, this is just if someone were to take this code and actually try to optimize it a little bit. I didn't try particularly hard; I just typed in something that would do an AVX loop, both with select-style picking for things like the coefficient and with a table-driven coefficient, just to see what would happen, and that alone ends up with an over-20x speed improvement in either case. And there's no way you're doing any of that using clean code principles, because just the nature of how you have to organize your data for things like AVX means you're never doing any of the things listed in that set. So if you actually want to look at the optimal code, or at least code that's slightly optimized, it's even worse than 15x; 15x is just where you start, before you've done any optimization or tried to target a platform's CPU to the problem at hand, all of which always requires breaking these rules.
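The actual AVX listings aren't reproduced here either, but a minimal sketch of the general idea, assuming the shape data has already been laid out as parallel arrays (itself a departure from the clean-code structure) and that the count is a multiple of eight, might look like this:

#include <immintrin.h>
#include <cstdint>

// Hypothetical structure-of-arrays layout: per-shape widths, heights, and
// precomputed coefficients (with any weighting already baked in).
struct shape_soa
{
    uint32_t Count;   // assumed to be a multiple of 8 for brevity
    float *Widths;
    float *Heights;
    float *Coeffs;
};

float TotalAreaAVX(shape_soa Shapes)
{
    __m256 Sum = _mm256_setzero_ps();
    for(uint32_t I = 0; I < Shapes.Count; I += 8)
    {
        __m256 W = _mm256_loadu_ps(Shapes.Widths + I);
        __m256 H = _mm256_loadu_ps(Shapes.Heights + I);
        __m256 C = _mm256_loadu_ps(Shapes.Coeffs + I);
        Sum = _mm256_add_ps(Sum, _mm256_mul_ps(C, _mm256_mul_ps(W, H)));
    }

    // Horizontally add the eight lanes down to a single float.
    __m128 Low  = _mm256_castps256_ps128(Sum);
    __m128 High = _mm256_extractf128_ps(Sum, 1);
    __m128 S    = _mm_add_ps(Low, High);
    S = _mm_hadd_ps(S, S);
    S = _mm_hadd_ps(S, S);
    return _mm_cvtss_f32(S);
}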
That leaves us with number five, which is don't repeat yourself. Honestly, I don't really have a problem with that rule; as you can see, we didn't really repeat ourselves much. If the rule means don't build two different tables that both encode the same data, then I would disagree with it, because we may have to do that for things like more optimal performance. But if in general it just means don't write the exact same code twice, and prefer to compose that code with other code as necessary, then I agree with that one.

So out of the clean code rules you can actually act on, I would say you have one you might want to think about and four you really definitely shouldn't follow. Why? Because, as you may have noticed, software is extremely slow these days. It performs very, very poorly compared to how fast the hardware can actually do the things we need our software to do. If you ask why software is so slow, well, there are several sources of that slowness, and it depends on what kind of program you're writing and where, but for a certain segment of the computing industry the answer to "why is software so slow" is: because of quote-unquote clean code. These pieces of advice are all absolutely horrible for performance, and you should never actually follow them. They were designed because someone thought they would produce more maintainable code bases, but even if it's true that they do, you have to ask: at what cost? It simply can't be the case that we're willing to give up a decade or more of hardware performance just to make programmers' lives a little bit easier. Our job is to write programs that run well on the hardware we're given, and if this is how badly these rules cause software to perform, they simply aren't acceptable. We can still try to come up with rules of thumb that help you keep your code well organized, easy to maintain, and easy to read; those aren't bad goals. But these rules aren't it, and they need to stop being repeated unless they're accompanied by a big old asterisk that says "and your code will get 15 times slower or more when you do them."
Info
Channel: Molly Rocket
Views: 624,523
Keywords: Handmade Hero
Id: tD5NrevFtbU
Length: 22min 40sec (1360 seconds)
Published: Tue Feb 28 2023