C++ vs Rust: which is faster?

Video Statistics and Information

Captions
[Music] Every year in December, a bunch of programmers come together online and try to solve the same series of 25 coding exercises, in an event called Advent of Code. Some try to be the first to solve each problem, gaining points and shining on the leaderboard, and some are a little more chill about it and use it to learn a new programming language. When I do it, I use it to teach Rust. Last December I wrote 16 articles, going from the basics of error handling to parsing with the nom crate, doing memory profiling, generators, and even creating graphical interfaces, compiling the result to WebAssembly and embedding it on a web page.

But the problems were getting harder and harder, and I wasn't getting any smarter. I was spending more and more time debugging silly off-by-one errors, or having to learn more maths than I signed up for. And since the focus of the series was Rust, and not showing off what a clever boy I am, I switched gears: for day 17, live on stream, I ported someone else's solution from Python to Rust. The result was about what you'd expect: a naive 1:1 Rust translation ended up being 44 times faster than the original Python. Adding an extra data structure and switching to a hash function that's faster for integers ended up making it over 2,000 times faster than the original, reducing runtime from 6 seconds to under 3 milliseconds.

You could make some of the same changes to the Python version, or even switch to a different Python implementation altogether, and that might help narrow the performance gap, but it was never a fair fight to begin with. See, CPython, the main Python implementation, is just an interpreter: it reads the code, translates it to bytecode, and then at runtime it just steps through the bytecode, interpreting it as it goes along. PyPy, however, has a JIT, a just-in-time compiler, which is able to translate bytecode to machine code, and the result can be run directly on the CPU, with tons of air quotes for the pedants in my comment section.
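The "hash function that's faster for integers" change mentioned above is a one-line swap in Rust, because `HashMap` is generic over its hasher. Here's a sketch using a toy identity-style hasher; in practice you'd reach for a crate like `rustc-hash` or `ahash` (the exact crate used isn't named in the video, so treat this as illustrative):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// A toy integer hasher that just folds the key's bytes together.
// Rust's default SipHash is DoS-resistant but comparatively slow for
// small integer keys; crates like rustc-hash do this kind of thing
// properly. This is only a sketch of why the swap is cheap to make.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = (self.0 << 8) | b as u64;
        }
    }
}

// Same API as a regular HashMap; only the hashing strategy differs.
type FastMap<V> = HashMap<u64, V, BuildHasherDefault<IdentityHasher>>;

fn main() {
    let mut map: FastMap<&str> = FastMap::default();
    map.insert(17, "day seventeen");
    assert_eq!(map.get(&17), Some(&"day seventeen"));
}
```

Because the rest of the `HashMap` API is unchanged, the rest of the solution doesn't need to know anything happened.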
If you look at the Computer Language Benchmarks Game, where everyone's trying to prove their language is the fastest, for the n-body program, a math-heavy problem, the fastest solution is currently Rust, and then a JavaScript solution running on Node.js is only eight times slower. Part of the reason is that V8, the JavaScript engine that powers Node.js, has a very good JIT. Even though variables in JavaScript are not statically typed (now it's a number, now it's a string, and now it's an object), the JIT is able to monitor what arguments functions are called with at runtime, and then make guesses. It goes: I'm going to generate code for this add function as if the arguments were always integers, if that's the case all the time. If the guess was right, then things get really fast. But if the function gets called with an argument that isn't an integer: first, you need to do the right thing. You can't add two strings, but you can concatenate them, and that's the same operator in JavaScript. And second, if that happens a lot, then your guess was wrong, and you might need to de-optimize the function and come up with another machine-code version of it that branches on the type of the arguments.

This is, believe it or not, not what today's video is about. But if that got you all curious, you can go read the article "Speculation in JavaScriptCore" on the WebKit blog, which goes into a lot more detail. I felt pretty stupid while reading it, and I wish you the exact same: that just means you found something you can learn about.

So Python vs Rust was never a fair fight, because the languages are just too different. C++, however, is a language that's strongly typed (despite its C heritage), compiled ahead of time, and used in performance-intensive scenarios: it's used to make game engines and browser engines and databases. It's gotta go fast. So on day 18, when I ported a solution from C++ to Rust, I didn't know what to expect. It could be faster, it could be slower. It was slower. Much slower, in fact. I was a bit sad, until I realized
it wasn't slower at doing the heavy computation required by the problem. It was slower because it was doing a bunch of unbuffered writes to a file at the very beginning of the program. In the original solution, just to make sure that the puzzle input was parsed correctly, the author had the program write everything that was parsed back out to a file, so that they could look at it and go: yeah, that looks okay. And they used C APIs like fopen and fprintf, which do buffering: at first it's writing the text to some buffer in memory, and then, when the buffer is full, it issues a write syscall, which is how you politely ask the kernel to perform some I/O, and that's a lot more expensive than writing to memory. But in Rust, I was using just a File, so each individual print would issue one write syscall, and that ended up taking something like 20 milliseconds, whereas the C++ version was done in 9 milliseconds. Once I wrapped the File in a BufWriter, I discovered that the part one solution (there's always two parts) was one and a half times faster in Rust than in C++: we went from 8.9 milliseconds to 6.1 milliseconds.

For part 2 of day 18, the story was much the same. The naive 1:1 Rust translation was already 50% faster than the original, and then, by switching to a different data structure, changing the hashing function, and targeting modern CPU instruction sets, it became more than three and a half times faster. For it to be really fair, you would need a C++ expert (which I'm not, at all) to make similar changes to the C++ version of the code, and the gap would probably narrow significantly. But what's interesting to me is that the Rust version used less syntax, and it was very easy to make those changes and bring it down from 8.5 milliseconds to 3.3 milliseconds. I am biased, of course, since I know more Rust than C++. But in C++ I had a hard time even embedding the input into the executable for benchmarking. I discovered that C++ does have raw string literals.
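The BufWriter fix mentioned above is a one-line wrapper. Here's a sketch; the file name and line contents are made up, not the actual puzzle output:

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

// Each writeln! on a bare File issues its own write syscall; going
// through a BufWriter batches writes in memory and flushes in big
// chunks, which is what C's fopen/fprintf do by default.
fn dump_parsed(out: &mut impl Write, lines: &[&str]) -> std::io::Result<()> {
    for line in lines {
        writeln!(out, "{line}")?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let parsed = ["1,1,1", "2,1,1", "3,1,1"];
    let file = File::create("parsed-debug.txt")?;
    let mut out = BufWriter::new(file); // the one-line fix
    dump_parsed(&mut out, &parsed)?;
    // BufWriter also flushes on drop, but errors are silently lost
    // there, so an explicit flush is good practice.
    out.flush()?;
    Ok(())
}
```

Making `dump_parsed` generic over `Write` means the same code works against a `File`, a `BufWriter<File>`, or a plain `Vec<u8>` in tests.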
The syntax lets you put parentheses inside the double quotes, but then MSVC, Microsoft's C++ compiler, choked on the literal because it was too large. So I ended up using xxd to generate a header file, and then had to figure out the proper way to include that from C++ sources, running into a bunch of template instantiation errors which turned out to be a missing cast or a missing include. I've never missed Rust's diagnostics as much as I did then.

The operator overloading syntax in C++ seems hard to remember, and I say that because the original had a link to a Stack Overflow answer about how to do operator overloading. In Rust, it's just implementing a trait like any other, except we didn't even need to do that, because we were able to derive it with a simple attribute.

Finally, iterators in C++ are very different from Rust iterators. The pattern they use to iterate over a set, removing some elements but keeping others, got me worried, but it turns out that's just what the retain method does in Rust. Another part of the C++ code calls begin on a set, which gives you an iterator to the beginning of the set (as you cannot really tell from the name), and then immediately dereferences it. In Rust, I had to call the first function, which gives you a reference to the first element, but in an Option, because: what if the set is empty? So in Rust I just used unwrap, so that if the set was empty, it would print an error message and safely exit the app. But in C++, if you dereference the iterator returned by begin on an empty set, that's undefined behavior. You know, nasal demons: the thing that can steal your dog and optimize your whole program away. UB is bad. Don't... don't do it.

Anyway, the next day, I thought: you know what, maybe Rust being faster than C++ was a fluke. I'd like to try it on a different solution, just to make sure. And sure enough, on day 19, after porting a solution from that same person, the Rust was slower. 20% slower: C++ ran in 4.5 seconds.
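The retain and first/unwrap patterns described above can be sketched like this (the set contents and the keep-even predicate are made up for illustration):

```rust
use std::collections::BTreeSet;

// Remove some elements, keep others: the C++ erase-while-iterating
// dance is a single method call in Rust.
fn keep_even(set: &mut BTreeSet<u32>) {
    set.retain(|&x| x % 2 == 0);
}

fn main() {
    let mut set: BTreeSet<u32> = [1, 2, 3, 4, 5].into_iter().collect();
    keep_even(&mut set);
    assert_eq!(set.iter().copied().collect::<Vec<_>>(), vec![2, 4]);

    // `first` hands back an Option<&T>: the empty-set case is a value
    // you are forced to handle, rather than undefined behavior like
    // dereferencing begin() on an empty set in C++.
    let smallest = *set.first().unwrap(); // panics cleanly if empty
    assert_eq!(smallest, 2);

    set.clear();
    assert!(set.first().is_none()); // empty set: None, no nasal demons
}
```

A panic from `unwrap` is an orderly crash with a message and a backtrace, which is a very different failure mode from UB.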
Rust ran in 5.4 seconds. And this time around, it wasn't spending any time doing syscalls: it was just doing a lot of recursion, a function calling itself over and over, and I had no idea where to look. We tried callgrind, but that just took forever. We used not-perf, which is like perf, except not. It's a sampling profiler, and it showed that, yeah, the function calls itself a bunch of times recursively. Over a billion times, actually. People pointed out that this solution was doing a lot more work than needed, and I could just switch to a better solution, but I decided to focus on figuring out why essentially the same code was slower in Rust than it was in C++. And it took me a hot minute. We disassembled both solutions and looked at them side by side, and it was pretty hard to make sense of it beyond the first few lines. So we used Ghidra, which can decompile to C, and started renaming registers after what they were used for. This all happened live on stream, and it lasted over three hours, if you want to see the play-by-play.

And finally, I noticed something funny. We're not going to look at the actual code for day 19; we're going to look at simple reproduction cases, because I want this video to be accessible to people who don't speak x86-64 assembly fluently. So we need to build this up a little. Let's start simple: say we're making a function that adds two 64-bit integers. This is what the Rust code for it could look like. When compiling this code, rustc will generate LLVM IR (intermediate representation), LLVM will generate machine code, and we can translate that back to assembly, which is the closest thing to what the CPU actually does, but still human-readable if we try hard enough. We'll be using Compiler Explorer, a free online service maintained by Matt Godbolt, whom you can support financially on GitHub Sponsors, Patreon, and PayPal. Let's first look at the disassembly of an unoptimized version of add_numbers, with overflow checks disabled.
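A minimal version of that add_numbers function, reconstructed from the description (the exact on-screen code isn't in the transcript):

```rust
// Two 64-bit integers in, one out. wrapping_add mirrors the
// "overflow checks disabled" setting mentioned above; a plain `+`
// would panic on overflow in debug builds.
pub fn add_numbers(x: u64, y: u64) -> u64 {
    x.wrapping_add(y)
}

fn main() {
    // In the disassembly discussed below, the constants go straight
    // into the edi and esi registers before the call.
    println!("{}", add_numbers(1111, 2222));
}
```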
There are three places we can put stuff: the heap, which can grow very large but is expensive because we go through an allocator; the stack, which is limited in size but very fast, because we're just moving the stack pointer down to grow it and up to shrink it (depending on the platform); and registers, which are very fast "global variables" directly in the CPU. There are exceptions, but as a rule, let's assume that if we want our code to go fast, we should try and use registers as much as possible.

When we call a function, we are jumping to the start of its code, but before we do, we must put its arguments somewhere. If we can put them in registers, we should do that, and that's what's happening here: in main, we move 1111 into the edi register and 2222 into the esi register, then add_numbers gets called. call is almost like jmp, except it first pushes the return address onto the stack, so we know where to continue execution after the function has returned. And in add_numbers, we can see that the value in rdi (the 64-bit name for edi) is moved to rax, and then the value in rsi is added to rax, and then we return.

A calling convention is where and how we pass arguments to a function. What can we say about the calling convention for this function? Well, it follows the standard x86-64 System V ABI, which says: if the arguments look like numbers, pass the first one in rdi, the second one in rsi, and then rdx, rcx, r8, r9, and that's it. So argument x is in rdi, argument y is in rsi, and the return value is in rax. So far so good.

Now let's look at the assembly for the optimized build of the same code. The code for main is the same, but the code for add_numbers has changed: instead of using a mov to copy the first argument to rax and then an add to add the second, it does both at the same time, using lea. What are those square brackets? Well, if you have mov rax, rdi, it copies the value in the rdi register to the rax register. But if you have mov rax, [rdi], it treats rdi as an address, and copies whatever is at that address to rax. This is useful if rdi is a pointer to the stack, for example.
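Two Rust snippets that typically compile down to exactly this kind of addressing, as a sketch (the function names are mine, and the exact instructions depend on the compiler version and optimization level):

```rust
// An indexed load: on x86-64 this usually becomes something like
//     mov rax, [base + index*8]
// — base pointer and index in registers, scaled by the element size.
// (The bounds check compiles away when the index is provably in range.)
pub fn read_field(values: &[u64], index: usize) -> u64 {
    values[index]
}

// Plain arithmetic that optimizers like to express with lea, using
// the address-computation hardware without touching memory at all:
//     lea rax, [rdi + rsi*4]
pub fn lea_style(x: u64, y: u64) -> u64 {
    x.wrapping_add(y.wrapping_mul(4))
}

fn main() {
    let values = [10, 20, 30];
    println!("{}", read_field(&values, 2));
    println!("{}", lea_style(10, 5));
}
```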
But inside those square brackets, we can have more complicated expressions. If we have mov rax, [rdi + rsi], it adds rdi and rsi, treats that as an address, reads whatever's there, and copies it to rax. This would work if rdi was a pointer and rsi an offset, which happens when reading struct fields, for example. As for lea, it's a bit like mov, except it doesn't read whatever is at that address: it just copies the address itself, hence "load effective address". So what's happening here is that it's using a mechanism meant for computing addresses to do basic arithmetic, which is a very common pattern.

Moving on: what happens if we add more parameters? Then it uses all six registers we can use for integer arguments: rdi, rsi, rdx, rcx, r8, r9. The unoptimized version moves the first argument to rax and then adds all the other arguments, accumulating into rax. And can you guess what the optimized version will do? That's right: much the same, except it adds the first two arguments into rax with lea. Could we do better than this? Probably. But let's move on: what happens if we add eight numbers together? Then the last two arguments spill onto the stack. We can tell because it uses the same square-bracket syntax to read from rsp, the stack pointer, plus some offset.

Let's keep going: what if instead we passed two structs, each having two fields? Even without optimizations, we can see from add_numbers' assembly that all four numbers are passed through registers. main looks inefficient, though: it's allocating space on the stack, moving all our constants there, and then moving them back into registers to call add_numbers. Turning on optimizations gets rid of that nonsense, moving the constants directly into the relevant registers.

And now: what happens if we have three fields per struct, adding six numbers together in total? Stuff gets weird. Real weird. So far we've dealt with 64-bit-wide registers (rdi, rsi, etc.), but we have wider registers: xmm0 and xmm1 are 128 bits wide.
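The struct-passing experiments described above, reconstructed as a sketch (the struct and field names are mine):

```rust
// Two structs with two u64 fields each: all four values fit in
// integer registers under the System V calling convention.
pub struct Pair {
    a: u64,
    b: u64,
}

pub fn add_pairs(x: Pair, y: Pair) -> u64 {
    x.a + x.b + y.a + y.b
}

// Three fields per struct, six u64s total: this is the shape where,
// at the time of the video, rustc stopped using registers and passed
// the structs through memory instead.
pub struct Triple {
    a: u64,
    b: u64,
    c: u64,
}

pub fn add_triples(x: Triple, y: Triple) -> u64 {
    x.a + x.b + x.c + y.a + y.b + y.c
}

fn main() {
    assert_eq!(add_pairs(Pair { a: 1, b: 2 }, Pair { a: 3, b: 4 }), 10);
    assert_eq!(
        add_triples(Triple { a: 1, b: 2, c: 3 }, Triple { a: 4, b: 5, c: 6 }),
        21
    );
}
```

Pasting these into Compiler Explorer with and without `-O` is the easiest way to reproduce the register-vs-stack difference the video walks through.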
And you know what that means: we can pack two 64-bit integers in there. Two quadwords (a "word" being 16 bits, apparently). These are SIMD registers: SIMD for "single instruction, multiple data", and that comes in handy. Let's look at the optimized code and try to go through it. There's a bunch of scary SIMD instructions in there, but the first thing that jumps out is that it's reading the memory rsi and rdi point to. I can tell because there are mov instructions with the source operand in square brackets. The first mov, for example, moves 64 bits' worth of data from wherever rsi points to into the rax register. The movdqu that follows does the same, using rdi as an address, and moving two 64-bit values, two quadwords (the instruction stands for "move double quadword, unaligned") into the xmm0 register.

Let's follow what the function does, step by step, with diagrams. So, originally, both structs are somewhere in memory, each field 64 bits; rdi points to x and rsi points to y. The first mov copies the first field of y, y.a, to rax. Then x.a and x.b are moved, in one instruction, into xmm0. Then y.b and y.c are moved, in one instruction, into xmm1, thanks to that +8 offset here, skipping over the 8 bytes (or 64 bits) of the y.a field. paddq ("add packed quadword integers") adds both quadwords from xmm0 into xmm1: that's two additions with a single instruction. Single instruction, multiple data.

pshufd ("shuffle packed doublewords") is the most complicated instruction here. It has a destination operand, a source operand, and a third operand that defines the order: a series of four 2-bit integers that define which doublewords to copy from the source to the destination. Here, the order operand is decimal 238, which in binary is 1110 1110. If we translate that to four decimal numbers, we get 3, 2, 3, 2, and that corresponds to the higher and lower 32 bits of y.b + x.a, which is currently in xmm1. So all it's really doing is copying y.b + x.a from xmm1 to both halves of xmm0.
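Decoding that order operand is just bit-slicing; here's a small sketch (the function name is mine):

```rust
// Decode a pshufd immediate: four 2-bit indices packed into one byte.
// Slot i of the destination receives the source doubleword selected
// by bits [2i+1 : 2i] of the immediate.
pub fn decode_pshufd_order(imm: u8) -> [u8; 4] {
    [
        imm & 0b11,
        (imm >> 2) & 0b11,
        (imm >> 4) & 0b11,
        (imm >> 6) & 0b11,
    ]
}

fn main() {
    // 238 is 0b1110_1110 — read high-to-low that's 3, 2, 3, 2: every
    // destination slot pulls from the source's upper 64 bits, which is
    // how the video's pshufd duplicates y.b + x.a into both halves.
    assert_eq!(decode_pshufd_order(238), [2, 3, 2, 3]);
}
```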
Then we have another paddq, doing two additions between both pairs of quadwords. We only care about one of the results, so this could be an add, but the other operand is already in xmm1, so this saves us a trip. After that, we move the result we care about out of xmm0 into rcx, add x.c (at address rdi + 16) to rax, and then add rcx, the result of all those SIMD additions, to rax, which gives us our final sum and return value. Easy, am I right? Aren't you glad you clicked on this video?

But let's forget about SIMD stuff for now, possibly forever. One thing that's interesting is that this code passes six quadwords to a function, and it uses six registers to do so. But as soon as we switched over to two structs with three fields each, it stopped using registers and started passing them on the stack. In some cases, that can be slower, and that's exactly why the day 19 solution was slower in Rust than in C++. Kind of. The day 19 solution was passing more integers, but they were 32-bit. The Rust compiler looked at it and went: you know what, that's too many numbers, let's just pass them on the stack. Whereas the C++ compiler (also backed by LLVM) looked at it and went: huh, I bet I can pack two 32-bit values into each of those 64-bit registers, and so it just used registers for everything.

The reason this happened is really complicated, and I'm not going to get into it here. It is the kind of issue where you fix one case, and that causes performance regressions in a lot of other cases. It's been an ongoing fight for years, according to sources who are very tired.

Okay, this is not part of the script, but I wanted to explain. So, for functions that are not exported, that are not part of the public interface of a library: you can't call them from the outside, so you don't have to adhere to a known calling convention, right? You can do whatever you want. And so, instead of using registers like rdi through r9, you could use 13 different registers. You could use xmm0, as part of your own internal calling convention, and
that's what LLVM is supposed to do, but it doesn't, for some reason. And the problem is: if you do that, you have fewer registers you can use in the caller, the code that's calling the function. You might already be using those registers just to hold local variables: instead of allocating them on the stack, you have them in registers, because you have those. But if you're calling a function that has an internal, specific calling convention that uses more registers, then you can use fewer of them in the caller, the function that is calling the other function. And that means you now need to stop using registers for locals; you need to put locals on the stack, and that can get slower. So the answer is: it depends. It depends on what the caller is doing; you can't tell just from the function signature. And that's why it's a bug that's so hard to fix: there's not one good option for which calling convention to use internally. So there you go.

Does this story have a happy ending for us, though? For the day 19 problem, slapping an extern "C" on the function did make it use registers for argument passing, and the compiled Rust code finally became slightly faster than the C++ version. But that was a coincidence. It's not a real, generalized fix: it would depend on the target architecture, among other things. This works on x86-64, with this version of the compiler.

Anyway, before I leave you, I want to show you one more thing. We've seen that, with the current version of the nightly Rust compiler, passing two structs with three fields each uses the stack instead of registers. And as I was discussing this with a friend, a thought occurred: three fields is an odd number of fields. Not just weird, also not even. So maybe something somewhere in the compiler doesn't like odd numbers. What if we passed three structs with two fields each? That's still six fields total. Would it still use the stack? Nope, it uses registers. And I guess this is my long way of saying: if you see a compiler engineer in the wild, ask if they need a hug.
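The extern "C" change is a one-keyword edit on the function definition. This sketch is not the actual day 19 code; the function name, signature, and body are stand-ins for "a hot recursive function taking several 32-bit arguments":

```rust
// Adding `extern "C"` forces the standard C calling convention
// instead of Rust's unspecified internal one. In the video's case
// that happened to put the arguments back into registers — a
// coincidence of this target and compiler version, not a general fix.
pub extern "C" fn count_arrangements(a: u32, b: u32, c: u32, depth: u32) -> u64 {
    if depth == 0 {
        (a + b + c) as u64
    } else {
        // Placeholder recursion standing in for the real billion-call
        // hot loop from the day 19 solution.
        count_arrangements(a, b, c, depth - 1) + 1
    }
}

fn main() {
    assert_eq!(count_arrangements(1, 2, 3, 0), 6);
    assert_eq!(count_arrangements(1, 2, 3, 2), 8);
}
```

Note that `extern "C"` changes the ABI, not the behavior: the function computes the same thing, it just receives its arguments differently.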
[Music] So, funny story: I checked my inbox today, and it had a pitch for a sponsored segment for the video. I was like, heck yeah, I want all that money. And, uh, the email was from a cat. So here you go: today's video is sponsored by cat. Cats are very good. The fluffy... that one sleeps a lot, so he's a very quiet colleague. I don't really have any complaints. He's not helping much, but I'm a self-driven individual, and in this tech industry climate, that's... that's good. Um, cat would like pets. Um, I'm assuming that's cat's way of saying he doesn't get enough. I don't know, I'm just reading from the picture. Um, the cat also wants food. Not cat food: cat wants human food, the thing that smells delicious in the kitchen. That's what cat wants, and cat wants you to know that cat wants that. So yeah, um, check out cats. Um, this is half my cats; I have double the cats that are shown here on camera. Yeah. Uh, cats rock. I don't think... there's no discount code or anything. You can't use code FAST online with cat, which is... [Music] and then pet them, feed them... this is a bunch of cleanup, but you'll see when you get to it [Music]
Info
Channel: fasterthanlime
Views: 381,737
Keywords: rust, c++, x86, assembly, performance, llvm, clang
Id: VMpSYJ_7aYM
Length: 21min 14sec (1274 seconds)
Published: Mon Jan 16 2023