Software Drag Racing: M1 vs ThreadRipper vs Pi

Video Statistics and Information

Captions
Hey, I'm Dave. Welcome to my shop! I'm Dave Plummer, a retired operating systems engineer from Microsoft going back to the MS-DOS and Windows 95 days, and today we're going desktop drag racing again. But this time, instead of four languages on the same machine, it's the same language on four different machines. We'll take last week's winning CPU - the AMD Threadripper - and run it head to head with a new Apple Silicon Mac M1 and two flavors of Raspberry Pi. Wait, did I just say I was going to drag race a Threadripper and an M1 against a couple of Raspberry Pis? Yeah, well, at least one of them is a Pi4, and nobody said these races would be fair, just informative and entertaining. So, while you might already assume the M1 is faster than the Pi, by how much? Is it 5 times faster than a Pi4? 10 times faster than a Pi3? Find out whether the Pi's dreams are just too big, and see the rather surprising and confusing results of the M1 vs Threadripper showdown, all right here today in Dave's Garage. [Intro] Now before I even get into any of the cool code and benchmarking for today, I'm going to quickly tell you precisely why you should click on the thumbs up button for this particular video right up front. It's because I clearly demonstrate that I value your time. Some even say I talk TOO fast, but I could easily have split the two variants of the Pi off into their own episode, then made the M1 its own episode, and then done a roundup episode where we compare them all, and just milked this concept for all it's worth like some would do. But not me. I wouldn't do you like that. We'll do all the benchmarks today, and I'll show you all the results right here today too. You can thank me by subscribing to my channel if you haven't yet done so. Maybe buy one of these sexy mugs from the link in the video description. Vote me up on Reddit. Share this with a nerdy friend. Post it on Twitter. I'm sure you'll think of something.
Well, enough of that then - what's the actual plan for fun and success today? The main thing is that we're going to bust out a crisp new M1 Mac Mini. The real claim to fame for the new M1 is that its single-core performance is unusually spectacular. For certain operations it's about the fastest sequential core out there. I heard it even cheats a little by using the economy cores to somehow help the performance cores along, at least in x86 mode. You know what I say to that? Good! If it makes it faster, just do it. I'm all for it. Add a fan if it comes to that; we'll forgive them in exchange for more speed. At least I will. In the last episode, C# vs C++ vs Python, I introduced what would become our canonical benchmark, our software drag strip if you will: a prime sieve that solves for all the primes under 1 million, as many times as it can in five seconds. The fact that we're working in 64-bit C++ for this episode tells you which language was the winner last week! So everything will be C++, and 32 or 64 bit as appropriate to whatever chip it is. A quick note about the comparison between the Apple M1 and the AMD Threadripper 3970X. Don't let the Threadripper's 32 cores worry you - it'll effectively have 31 of them tied behind its back for this fight, just as the M1 will be working on a single core as well. Even the Pis will be using but one of their four. And for gaming and other serial workloads, single-core speed is the reality of what matters. The AMD 3970X is actually slightly faster than its big brother, the 3990X, on workloads of 32 threads or less, so it's what we'll be using today, and it generally scores about 1260 on Geekbench single-core. The M1 Mac scores about 1700. Given the M1 scores some 35 percent higher in that well-known benchmark, should we expect a 35 percent increase in prime number batches as well?
Maybe, maybe not - it's hard to guess for a number of reasons. Our prime sieve is primarily doing sequential in-cache work against memory. If you want to be fancy about it, a sieve has excellent locality of reference, particularly if most or all of the sieve fits into the CPU cache, so the speed of the interface between the CPU registers and the memory cache is paramount. Our bit array will hold a million bits, but that's still only 125K. That just fits in the M1's unusually large 128K L1 data cache; on the Threadripper and the Pis it spills into L2, but still stays entirely inside the CPU. It makes no difference how fast pushing around large 64-bit registers is or what the floating-point performance is like - all that matters for prime sieve drag racing is the ability to set a bit, clear a bit, and make a decision as quickly as possible. Thus, it highly stresses a few aspects of the CPU while not using others at all. I may not know a great deal about CPU architecture, but I used to skim Microprocessor Report over coffee sometimes, so I think I'm qualified to guess that parallelization and pipelining of the instructions may be key here, as well as optimizing around branch prediction. I'd wager those factors matter more than the memory interface or even the clock speed, for example. Some would argue that therefore this is not really a valid test, or that it's not a valid benchmark. That's nonsense - it IS a valid benchmark. It's certainly not the only benchmark, nor do I in any way claim that it's the best benchmark. It's just one general-purpose benchmark that's illustrative of solving one particular real-world math problem. But guess what? Accelerating in a straight line in a car doesn't test each and every system in a road vehicle either, but a timeslip makes for some serious bragging rights, and being on the leaderboard is cool, and that's why we do it. To get the process rolling on a plain Ubuntu installation, we first have to install the C and C++ compilers.
Let's drop into an SSH console connected to a Pi3B where I've got that process already well underway. [Pi3B Console] As soon as the gcc install completes, we'll do a sudo apt install of g++ to get the C++ compiler that we need. And if you're thinking "Boy, it sure installs a lot faster on Dave's Pi than it does on mine," keep in mind that it's because I've sped the video up. What, did you think I'd make you sit through it live? Not likely! I'll then browse down into my source folder where I've already enlisted in the GitHub project - something you can do as well, and I'll talk more about that later - and into the PrimeCPP folder where the C++ version of the prime sieve lives. Unlike last episode, this time around everyone will be running the exact same code, so other than the variation from one compiler to the next, it's a pure hardware showdown. I've got a small shell script in the folder that will compile it with the options that I used in the episode. The gcc compiler, quite rightly, insisted that I add two additional include files before it would build: cstring for memset and cmath for the square root function. [Go until Nano saves the header file fix] With the compiler errors fixed, I simply execute the run.sh batch file, which compiles and executes the code, and five seconds later we're greeted with our result: 306 passes. And with a fast PC scoring around 7500, now you know roughly how much faster a modern PC is than a Raspberry Pi on a simple single-core workload: about a factor of 25. I actually think that's still fairly impressive - and we haven't tried the newer Pi4 yet either - simply because it means that, regardless of where it ranks relative to a full PC, for about $30 you can buy a small self-contained computer that can solve for all the primes under a million more than 300 times per second. Frankly, I think that's mind-blowing no matter its relative rank.
I tried other variations of the gcc compiler optimizer options, but 306 was my best run on the Pi3. Keep in mind that the Pi3 has been running a 32-bit ARM build, whereas the Pi4 and the M1 will be 64-bit. So perhaps we'll see more improvement than one might expect from clock speed alone when we race the Pi4. There's only one way to find out, so let's ssh into my Pi64 box. [Start the Pi4 shell, let it run and continue narrating] Here too I tried various options, but -Ofast turned out to be the best, cranking out 780 passes, which is two and a half times as fast as the Pi3B. So, if performance is an issue, it certainly seems like the Pi4 is the way to go if you can find one at a fair price. Finally, then, we have the new Apple Silicon M1 Mini. I chose the Mini because (a) who am I kidding, I already owned it, and (b) it should be about the fastest consumer core out there right now since it even has a cooling fan, whereas the MacBook Air might be thermally limited, for example. [Run the M1 clip] Here's where it gets a little weird. As you can see, at about 6200 the M1 Mini comes close to the Threadripper's 7000+ but does NOT surpass it, as I wholly expected it would. I verified that I wasn't running Rosetta, that it was a native binary, properly optimized, and all of that. I tried g++, I tried it in Xcode, but 6500 was the best I could do. Thinking maybe the compiler is smarter on Windows, I started to tinker with the code. But if I did that, I'd have to go back and retest at least the Threadripper, since it would no longer be fair to compare to the old code's results. And so that's what I planned to do. But before I got too far ahead of myself, could I improve the C++ code? The first thing I did was to replace the code that tests for even array indices in the getbit and clearbit functions. They're called millions of times per pass, so their performance is critical. Instead of using modulus 2, I changed it to AND with one. No difference.
Then I thought maybe I was causing alignment issues by managing my memory bits as individual bytes, since ARM chips generally like their data aligned on 64-bit boundaries. As a test, I changed everything to be 64-bit unsigned ints, but it made no difference. I tried a lot of things, actually, and came away impressed with the compilers, because pretty much nothing helped. So no changes to the code were made for this episode. The best result I recorded video of was 6190, but without the screen recorder running I managed a 6500, so that's the number we'll go with to be fair. Still, however, it falls a little short of the Threadripper. As much as I enjoy paying Apple $99 for the right to call myself some kind of Mac developer, I'm not actually a Mac developer. That means I might be doing something really boneheaded. So if you can take the existing code and get a 5-second result much above 6500 on a Mac, please let me know in the comments what, if anything, you did differently! If it happens, I'll make an M1 redemption episode. For now, though, the Threadripper remains King of the Street in our Desktop Drag Racing series. That won't last long: as soon as I get my hands on a 5950X, we'll see if a Ryzen 9 can displace the current champion. I also wanted to talk a little bit about the GitHub repository and the types of submission changes that I'm going to accept into the main tree. I was really kind of flattered and surprised to see so many people interested in and actively contributing to the code that I threw up there. From reading through the pull requests, I learned a great deal about Python, libraries such as numpy, the C# optimizer, and so on. Folks even found two bugs in the algorithm, which I have fixed and will upload shortly. Neither was impacting results in our case, but they could have in others. So what changes will I accept into the tree? Well, things that go out alongside the main code base are usually a no-brainer.
Like someone submitted a Java implementation that's pretty cool, so of course that's in. And if you want to crank out a Turbo Pascal or Ada implementation, those would be cool as well. Be sure to document what's needed to compile and run it, and the more languages, the merrier! The goal in all cases is to stay as similar to the existing code's algorithm as possible so as to make it easier to compare languages directly. That's why my C++ looks like my C# which looks like my Python. I didn't want to specialize too much for any one language. And along the way I've seen some genius submissions using multiple dynamically allocated bit sets and other changes in C++ that are a little too specialized for me to throw into the tree, but anything that doesn't change the code much I'm very happy to look at. And if you've got a significantly faster implementation, feel free to put it in a folder beside the PrimeCPP folder. Call it PrimeCPP-HashTable or whatever makes sense for your case, and I'll check it out. If you haven't already seen the C++ vs C# vs Python episode, be sure to check that out next. And while you're at it, if you liked this episode, there's a good chance you'd also enjoy the Retrocoding in Assembly Language episode. Or if you just want to hear me tell you a few stories, check out Why Are Bluescreens Blue. I'll throw links to both up at the end of the video. Speaking of liking this episode, if you did, please be sure to upvote it and make sure you're subscribed to the channel. When I see new subscriptions, I know I'm making content people appreciate, so I make more of it, and if you have the bell icon on, you'll even be notified when I do, so it's a win-win. If you are subscribed, then I'll see you soon in the next installment of Software Drag Racing in Dave's Garage. In the meantime and in between time, I hope to see you next time, right here in Dave's Garage. [Charts: PC Python vs Pi C performance]
Info
Channel: Dave's Garage
Views: 82,851
Keywords: threadripper vs m1, m1 vs threadripper, m1 vs pi, threadripper vs pi, pi vs threadripper, threadripper vs m1 vs pi, pi vs m1, software drag racing, daves garage, 3970, M1, amd3970x, 3970x, apple silicon m1, apple m1, benchmarks, windows, primes, fastest cpu, cpu comparison, drag race, 3990, 3990x, m1 macbook pro, mac mini, m1 mini, mac vs windows, m1 speed, m1 fast, apple m1 speed, apple m1 fast, m1 faster than, m1 slower than
Id: l1j-aF_wyzU
Length: 15min 43sec (943 seconds)
Published: Sun Apr 04 2021