Hey, I'm Dave. Welcome to
my shop! I'm Dave Plummer, a retired operating systems
engineer from Microsoft going back to the MS-DOS and Windows 95 days, and today we're
going desktop drag racing again. But this time instead of four languages on
the same machine, it's the same language on four different machines. We'll take last week's winning CPU - the AMD
Threadripper - and run it head to head with a new Apple Silicon Mac M1 and two flavors
of Raspberry Pi. Wait, did I just say I was going to drag race
a Threadripper and an M1 against a couple of Raspberry Pis? Yeah, well, at least one of them is a Pi4,
and nobody said these races would be fair, just informative and entertaining. So, while you might already assume the M1
is faster than the Pi, by how much? Is it 5 times faster than a Pi4? 10 times faster than a Pi3? Find out whether the Pi's dreams are just too big,
and see the rather surprising and confusing results of the M1 vs Threadripper showdown, all right
here today in Dave's Garage. [Intro]
Now before I even get into any of the cool code and benchmarking for today, I'm going
to quickly tell you precisely why you should click on the thumbs up button for this particular
video right up front. It's because I clearly demonstrate that I
value your time. Some even say I talk TOO fast, but I could
easily have split the two variants of the Pi off into their own episode, then made
the M1 its own episode, and then had a roundup episode where we compare them all, and just
milked this concept for all it's worth like some would do. But not me. I wouldn't do you like that. We'll do all the benchmarks today and I'll
show you all the results right here today too. You can thank me by subscribing to my channel
if you haven't yet done so. Maybe buy one of these sexy mugs from the
link in the video description. Vote me up on Reddit. Share this with a nerdy friend. Post it on Twitter. I'm sure you'll think of something. Well, enough of that then, what's the actual
plan for fun and success today? Well, the main thing is that we're going to
bust out a crisp new M1 Mac Mini. The real claim to fame for the new M1 is that
its single core performance is unusually spectacular. For certain operations it's about the fastest
sequential core out there. I heard it even cheats a little by using the
economy cores to somehow help the performance cores along, at least in x86 mode. You know what I say to that? Good! If it makes it faster, just do it. I'm all for it. Add a fan if it comes to that, we'll forgive
them in exchange for more speed. At least I will. In the last episode, C# vs C++ vs Python,
I introduced what would become our canonical benchmark, our software drag strip if you
will: a prime sieve that strives to solve for all the primes under one million. It does so as many times as it can in five seconds. The fact that we're working in 64-bit C++ for this episode tells you which language won last week! So everything will be C++, compiled as 32- or 64-bit as appropriate for each chip.
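In case you're new here, here's the shape of that drag strip as a minimal sketch. It's not the repository code verbatim and the names are illustrative, but the idea is the same: sieve out the composites up to a million, repeat for five seconds, count the completed passes.

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Minimal sketch of the drag strip: run complete passes of a
    // simple sieve of Eratosthenes for five seconds, then report
    // how many passes finished.
    static int sievePasses(int limit = 1000000) {
        using Clock = std::chrono::steady_clock;
        int passes = 0;
        const auto start = Clock::now();
        while (std::chrono::duration<double>(Clock::now() - start).count() < 5.0) {
            std::vector<bool> composite(limit + 1, false);
            for (int p = 2; p * p <= limit; ++p)
                if (!composite[p])
                    for (int m = p * p; m <= limit; m += p)
                        composite[m] = true;
            ++passes;
        }
        return passes;
    }

    int main() { std::printf("Passes: %d\n", sievePasses()); return 0; }

A quick note about the comparison between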
Apple M1 and the AMD Threadripper 3970X. Don't let the Threadripper's 32 cores worry
you - it'll effectively have 31 of them tied behind its back for this fight, just as the
M1 will be working on a single core as well. Even the Pis will be using but one of their
four. But when it comes to gaming and other largely serial
workloads, single-core speed is the reality of what matters. The AMD 3970X is actually slightly faster
than its big brother, the 3990X, on workloads of 32 threads or less. So, it's what we'll be using today, and they
generally score about 1260 on Geekbench. The M1 Mac scores 1700 in Geekbench. Given the M1 scores some 35 percent higher
in that well-known benchmark (1700 divided by 1260 is about 1.35), should we expect a 35 percent increase in prime number batches
as well? Maybe, maybe not; it's hard to guess for
a number of reasons: our prime sieve is primarily doing sequential in-cache work against memory. If you want to be fancy about it, a sieve
has excellent locality of reference, particularly if most or all of the sieve fits into the
CPU cache, so the speed of the interface between the CPU registers and the memory cache is
paramount. Our bit array will hold a million bits, but
that's still only 125K. That should easily fit in the L1 cache of
the Threadripper and the M1. It'll fall to the L2 cache in the Pis, but
still all inside the CPU. It
makes no difference how fast pushing around large 64 bit registers is or what the floating
point performance is like - all that matters for prime sieve drag racing is the ability
to set a bit, clear a bit, and make a decision as quickly as possible. Thus, it highly stresses a few aspects of
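To make that concrete, here's a sketch of the hot inner operations, again with illustrative names rather than the exact repository code. One byte holds eight flags, so a shift picks the byte and a mask picks the bit:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Minimal sketch of the bit-array primitives the sieve hammers on:
    // index >> 3 selects the byte, index & 7 selects the bit within it.
    struct BitArray {
        uint8_t *bits;
        explicit BitArray(size_t n) : bits(new uint8_t[n / 8 + 1]) {
            memset(bits, 0xFF, n / 8 + 1);   // start with every bit set
        }
        ~BitArray() { delete[] bits; }
        bool get(size_t i) const { return bits[i >> 3] & (1u << (i & 7)); }
        void clear(size_t i)     { bits[i >> 3] &= ~(1u << (i & 7)); }
    };

Thus, the sieve highly stresses a few aspects of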
the CPU while not using others at all. I may not know a great deal about CPU architecture,
but I used to skim Microprocessor Report over coffee sometimes, so I think I'm qualified
to guess that parallelization and pipelining of the instructions may be key here, as well
as optimizing around branch prediction. I'd wager those factors matter more than the
memory interface or even the clock speed, for example. Some would argue that therefore this is not
a really valid test, or that it's not a valid benchmark. That's nonsense. It's A valid benchmark. It's certainly not the only benchmark nor
do I in any way claim that it's the best benchmark. It's just one general-purpose benchmark that's
illustrative of solving one particular real world math problem. But guess what? Accelerating in a straight line in a car does
not test each and every system in a road vehicle either, but a timeslip makes for some serious
bragging rights, and being on the leaderboard is cool, and that's why we do it. To get the process rolling on a plain Ubuntu
installation, we first have to install the C and C++ compilers. Let's drop into an SSH console connected to
a Pi3B where I've got that process already well underway. [Pi3B Console]
As soon as the gcc install completes we'll do a sudo apt install of g++ to get the C++
compiler that we need. And if you're thinking "Boy, it sure installs
a lot faster on Dave's Pi than it does on mine" keep in mind that it's because I've
sped the video up. What, did you think I'd make you sit through
it live? Not likely! I'll then browse down into my source folder
where I've already enlisted in the GitHub project - something you can do as well, and
I'll talk more about that later - and into the PrimeCPP folder where the C++ version
of the prime sieve lives. Unlike last episode, this time around everyone
will be running the exact same code, so other than the variation from one compiler to the
next it's a pure hardware showdown. I've got a small shell script in the folder
that will compile it with the options that I used in the episode. The gcc compiler, quite rightly, insisted
that I add two additional include files before it would build. I had to include cstring for memset and cmath
for the square root function. [Go until Nano saves the header file fix]
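For reference, the fix is just two lines at the top of the source file. My shell script isn't shown on screen, so take the compile line below as an assumption about the sort of flags I used, not a transcript of run.sh:

    // The two headers g++ insisted on before it would build:
    #include <cstring>   // memset
    #include <cmath>     // sqrt
    // Built with something along the lines of:
    //   g++ -O3 -o Primes PrimeCPP.cpp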
With the compiler errors fixed I simply execute the run.sh shell script, which compiles and executes
the code, and five seconds later we're greeted with our result: 306 passes. And with a fast PC scoring around 7500, now
you know roughly how much faster a modern PC is than a Raspberry Pi on a simple single
core workload: about a factor of 25. I actually think that's still fairly impressive
- and we haven't tried the newer Pi4 yet either - simply because it means that regardless
of where it ranks relative to a full PC, for about $30 you can buy a small self-contained
computer that can solve for all the primes under a million more than 300 times every five seconds. Frankly, I think that's mind-blowing no matter
its relative rank. I tried other variations of the gcc compiler
optimizer options but 306 was my best run on the Pi3. Keep in mind that the Pi3 has been running
a 32-bit ARM build whereas the Pi4 and the M1 will be 64 bit. So perhaps we'll see more improvement than
one might expect from just clock speed alone when we race the Pi4. There's only one way to find out, so let's
ssh into my Pi64 box. [Start the Pi4 shell, let it run and continue
narrating] Here too I tried various options but -Ofast
turned out to be the best, cranking out 780 passes, which is two and a half times as fast
as the Pi3B. So, if performance is an issue, it certainly
seems like the Pi4 is the way to go if you can find them at a fair price. Finally, then, we have the new Apple Silicon
M1 Mini. I chose the Mini because (a) who am I kidding,
I already owned it, and (b) it should be about the fastest consumer core out there right
now since it even has a cooling fan, whereas the Macbook Air might be thermally limited,
for example. [Run the M1 clip]
Here's where it gets a little weird. As you can see, at about 6200 the M1 Mini
comes close to the Threadripper's 7000+ but does NOT surpass it, as I wholly expected
it would. I verified that I wasn't running Rosetta,
that it was a native binary, properly optimized, and all of that. I tried g++, I tried it in Xcode, but 6500
was the best I could do. Thinking maybe the compiler is smarter on
Windows, I started to tinker with the code. But if I did that, I'd have to go back and retest
at least the Threadripper, since it would no longer be fair to compare to the old code's
results. And so that's what I planned to do. But before I get too far ahead of myself,
could I improve the C++ code? The first thing I did was to replace the code
that tests for even array indices in the getbit and clearbit functions. They're called millions of times per pass,
so their performance is critical. Instead of using modulus 2, I changed it to an AND with one. No difference.
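For the curious, here's what that micro-change amounts to, with illustrative function names; both forms test the low bit of the number, and an optimizing compiler emits the same code for either, which is why it made no difference:

    // Two spellings of the same test: is the low bit clear?
    // Optimizing compilers generate identical code for both.
    bool isEvenMod(unsigned int n) { return n % 2 == 0; }
    bool isEvenAnd(unsigned int n) { return (n & 1) == 0; }

Then I thought maybe I was causing alignment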
issues by managing my memory bits as individual bytes, since ARM chips generally like their
data blocks aligned on 64-bit boundaries. As a test I changed everything to be 64-bit
unsigned ints, but it made no difference.
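Sketched out, that experiment looks something like this (again illustrative names, not my exact code):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // The alignment experiment: keep the flags in 64-bit words so
    // every access is naturally aligned on an 8-byte boundary.
    struct BitArray64 {
        std::vector<uint64_t> words;
        explicit BitArray64(size_t n) : words(n / 64 + 1, ~0ULL) {}
        bool get(size_t i) const { return words[i >> 6] & (1ULL << (i & 63)); }
        void clear(size_t i)     { words[i >> 6] &= ~(1ULL << (i & 63)); }
    };

I tried a lot of things, actually, and came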
away impressed with the compilers because pretty much nothing helped. So no changes to the code were made for this
episode. The best result I recorded video of was 6190,
but without the screen recorder running I managed a 6500, so that's the number we'll
go with to be fair. Still, it falls a little short of
the Threadripper. As much as I enjoy paying Apple $99 for the
right to call myself some kind of Mac developer, I'm not actually a Mac developer. That means I might be doing something really
boneheaded. So if you can take the existing code and get
a 5 second result much above 6500 on a Mac, please let me know what, if anything, you
did differently in the comments! If it happens, I'll make an M1 redemption
episode. For now, though, the Threadripper remains
King of the Street in our Desktop Drag Racing series. That won't last long. As soon as I get my hands on a 5950X we'll
see if a Ryzen 9 can displace the current champion. I wanted to talk a little bit about the GitHub
repository and the types of submission changes that I'm going to accept into the main tree. I was really kind of flattered and surprised
to see so many people interested in and actively contributing to the code that I threw up there. From reading through the pull requests, I
learned a great deal about Python, libraries such as NumPy, the C# optimizer, and
so on. Folks even found two bugs in the algorithm
that I have fixed and will upload shortly. Neither was affecting the results in our case,
but they could have in others. So what changes will I accept into the tree? Well, things that go out alongside the main
code base are usually a no brainer. Like someone submitted a Java implementation
that's pretty cool so of course that's in. And if you want to crank out a Turbo Pascal
or Ada implementation, they would be cool as well. Be sure to document what's needed to compile
and run it, and the more languages, the merrier! The goal in all cases is to stay as similar
to the existing code's algorithm as possible so as to make it easier to compare languages
directly. That's why my C++ looks like my C#, which
looks like my Python. I didn't want to specialize too much for any
one language. And along the way I've seen some genius submissions
using multiple dynamically allocated bit sets and other changes in C++ that are a little
too specialized for me to throw into the tree, but anything that doesn't change the code
much I'm very happy to look at. And if you've got a significantly faster implementation,
feel free to put it in a folder beside the PrimeCPP folder. Call it PrimeCPP-HashTable or whatever makes
sense for your case, and I'll check it out. If you haven't already seen the C++ vs C#
vs Python episode, be sure to check that out next. And while you're at it, if you liked this
episode there's a good chance that you'd also enjoy the Retrocoding in Assembly Language
episode. Or if you just want to hear me tell you a
few stories, check out Why are Bluescreens Blue. I'll throw links to both up at the end of
the video. Speaking of liking this episode, if you did,
please be sure to upvote it and make sure you're subscribed to the channel. When I see new subscriptions, I know I'm making
content people appreciate, so I make more of it, and if you have the bell icon on, you'll
even be notified about them when I do, so it's a win-win. If you are subscribed, then I'll see you soon
in the next installment of Software Drag Racing in Dave's Garage. In the meantime and in between time, I hope
to see you next time, right here in Dave's Garage. [Charts: PC Python vs Pi C performance]