Linux v Windows: Which is FASTER? - Software Drag Racing!

Video Statistics and Information

Reddit Comments

Wait you pronounce it C-lang not clang?

👍︎︎ 6 👤︎︎ u/DeadLink404 📅︎︎ Apr 12 2021 🗫︎ replies

Code is available at http://github.com/davepl/Primes in the PrimeCPP_PAR folder.

If any kernel gurus have an explanation for the weird "bump" in perf it shows between 16 and 31 cores, I'd love to know what's going on. I've heard the scheduler is NUMA aware and so on but how does it actually increase perf over baseline?

Could it somehow be running at a higher clock speed by distributing the work to different AMD core groups or something fancy?

👍︎︎ 5 👤︎︎ u/daveplreddit 📅︎︎ Apr 12 2021 🗫︎ replies
Captions
[Music] Hey, I'm Dave. Welcome to my shop. I'm Dave Plummer, a retired operating systems engineer from Microsoft going back to the MS-DOS and Windows 95 days, and today we're taking the software drag racing series to the next level with some hardcore multi-core action. We're bringing out the big iron to test the raw, no-excuses speed of Linux versus Windows. It'll be a brutal head-to-head showdown on the same hardware, with no quarter asked and none given. We'll be doing our testing on the big Threadrippers, and not only am I going to compare the raw real-world performance within the operating systems, I'm also going to test the four most popular compilers, so you'll also know how Visual C++ compares to GCC and Clang in their generated code performance. Join me today as we create a scorching new primes benchmark that will truly let us unleash the beast within today's most powerful CPUs.

[Music] If single-core CPU performance is analogous to traditional drag racing with a single engine, then multi-core processors are closer to the unlimited tractor pulls with their massive multi-engine rigs. This is our third episode in the software drag racing series; if you haven't seen the other two yet, tell your assistant to hold your calls for half an hour, because you're in for a treat. Our first episode pitted Python against C# and C++, and C++ emerged the clear winner. Now, even Python can be made reasonably performant if you use a library such as NumPy to help with the bit twiddling, but in the end I decided to standardize on C++, due not only to its high speed but because of its wide acceptance across a broad range of systems. Plus, it's my native tongue. All the testing to date has revolved around running a prime sieve whose job it is to find the prime numbers under 1 million. It does that as many times as it can in 5 seconds, and the number of completed passes becomes its score. For some rough comparison, a Raspberry Pi scores in the 500 range, while a relatively modern PC scores
around 5,000. On these bigger machines we'll be running the sieve to 10 million and then to a billion instead of just 1 million, so just a heads up: you can't compare numbers from this episode to previous ones because of that. Everything we've done so far, however, has been single-core work. It does one thing at a time as fast as it can, and it repeats it as many times as it can in a row within the five seconds. While one core of your CPU is working feverishly away at maximum speed, all the other cores are just sitting idly by, watching the show. Single-core performance may not be as sexy as the numbers you get out of a massively parallel EPYC chip, but single-core performance is still what determines how snappy a machine feels. That's because single-core performance is really what determines how long it takes to complete any given task; multi-core performance is indicative of how many of those tasks can be completed all at once.

[Music] If some complicated task takes 30 seconds to complete, and it cannot be broken down any further, such that it's truly a single-core workload, then it doesn't matter how many cores you have: you're not going to get an answer back for 30 seconds. A great example of single-core work can be found in most video games. While additional cores can help a game like Flight Simulator render more terrain more quickly by doing that in parallel, certain things are simply dependent on one another. That is to say, you can't do A and B at the same time if starting work on B requires the results out of A first; you have to finish A before you can take the results, put them into B, and even start on B, so putting them on separate cores would buy you nothing. And that chain of events, where A provides results to B, which are then used by C, and so on, is called the critical path. Let's say your critical path takes 20 milliseconds to process; your maximum frame rate is then 50 frames per second. Adding more cores doesn't speed up the frame rate; it only allows additional work to be done
alongside the critical path. To go back to my analogy earlier, more cores will allow you to do additional work, but they don't speed up the work that you already have to do, unless you can somehow break that work up into parallel chunks. Some problems lend themselves easily to parallelization. For example, take a game like chess, where the computer has to evaluate the possible moves for each piece on the chessboard. Given the state of the board, the possible moves for each piece can be evaluated entirely separately from one another. The computer can run off and score the possible moves for the bishop and the possible moves for the queen on two different cores and then compare the results when they're ready; that should, in theory, be about twice as fast as doing it one after the other. What about our prime sieve? Well, it turns out that solving primes is somewhere in between, and winds up fairly hard to effectively make parallel. The reasons become apparent once you really understand how the prime sieve algorithm works, so let's revisit the algorithm for a moment to refresh ourselves on how it operates. First, we write a list of all the numbers up to our limit that we want to consider. Next, we look for the first number that has not been eliminated from that list of possible primes. We circle it as a prime, and then run through the list eliminating all the subsequent multiples, because anything that has that number as a factor other than itself is by definition not a prime number, so we cross them off the list. Rinse and repeat: we keep looking for the next prime and eliminating all of its multiples until we've reached the square root of our upper limit, and at that point we're done, and what's left in our list are all prime numbers. The problem with trying to parallelize this algorithm is that the elimination-of-multiples step depends entirely on the finding-the-next-prime step. There's no way around that. You might be tempted to simply spin off the marking of the multiples onto its own
thread. That, however, is a recipe for disaster, because it would introduce a subtle bug. Consider the case where, for whatever reason, our scanning for primes goes first, or faster, than the elimination of the multiples, which is running on a separate thread. It could find a number that should have been eliminated as a multiple and erroneously accept it as a prime, simply because the elimination hasn't happened yet, since it's running behind on the scanning thread. It can be a tough one to catch in the debugger if that happens. These two steps are essentially a critical path, but perhaps one or both of those steps can be internally parallelized. I'm pretty confident that, because it's impossible to predict how far away the next prime is with any certainty, there's not much to be gained from trying to break up the scanning-for-the-next-factor pass; it's probably more overhead than it'd be worth. But what about the elimination of the multiples from the list? That's where I focused my energy, based on my thinking that the elimination of the multiples of the number three involves some three million steps in a sieve of 10 million; I figured that would be worth splitting up for sure. What I did, then, was to break up the range into batches of a certain size. I decided that each thread would process batches of ten thousand multiples, for example. Then thread A would mark off all the multiples of five between zero and fifty thousand, and thread B would mark off all the multiples from fifty thousand to one hundred thousand. That allows you to run twenty threads in parallel, each marking off a small section of the multiples. I varied how big that batch was by hand in order to optimize as best I could, and it all worked; it just really wasn't any faster. I tweaked the batch size and special-cased the small factors, but somewhat to my surprise, the overhead of creating and scheduling extra threads simply wasn't worth the win from parallelization in this case. My guess is that there are only a few small factors; by the time
you're at the number 100, the batches are small enough that only a single core can really be used anyway. I scratched my head for a while and read a few papers on the subject, but at the end of the day I couldn't make the process significantly faster unless the sieve size was very, very large. If you are scanning for the primes up to 10 billion, marking the multiples of three clearly benefits from multiple cores, but when you're only talking about going up to one million, the thread overhead seems to about offset any benefit. Having invested an hour or so into it, and having come up empty, I'd be mighty impressed if somebody could come up with a readable parallel version, without a bunch of special cases, that actually outperforms a single-threaded sieve to 1 million. Check it in! It turns out, however, that I was in a sense barking up the wrong tree. If the goal is to benchmark how much work a modern CPU can do in a short period of time, parallelizing the algorithm is but one approach. After all, when you're doing thousands of passes per second, why not simply schedule those passes on different cores? And so that's precisely what I did. I looked around for some code that I could steal that would do sieves in parallel, but I'll let you in on a secret: I'm a bit of a lazy dude. Everything I found was complicated producer-consumer patterns with mutexes and deadlock prevention and all kinds of the sort of C++ template madness that keeps me awake at night. I figured, if the goal is to keep all your cores as busy as possible, just create one thread per core and keep each core entirely consumed with cranking out passes of the prime sieve. Keep track of how many passes get done across all the cores in your five-second burst, and you've got your result. The beauty of this simple approach is that it's only a few lines of code, and the sieve itself doesn't change at all. First, I need to know how many cores your CPU has, as that's how I know how many threads to create. My 3970X reports 64, and it has 32 hyperthreaded cores. The 3990X chip, however, also reports 64. I'm not sure if that's a bug in the C runtime, or a feature, or a limitation of a single NUMA node, but the CRT doesn't guarantee that this number matches the hardware; the documentation says it is merely guidance. That's why, in the end, I wound up adding a command-line switch to let you plainly specify how many worker threads you actually want. Without going too far down the CPU architecture rabbit hole, it's worth noting that this algorithm really stresses the CPU memory cache. In a traditional architecture, all the cores share the same cache, so it just works. With chips like the AMD Threadripper, however, the cores are broken up into groups of eight, and each group shares its L2 and L3 cache. As you might imagine, there's a lot of magical plumbing that goes on to keep each CPU's view of memory consistent with the others when every group of eight has its own independent cache. After all, perhaps one core off in another group is marking a multiple that a core in a different CPU group is about to read; each needs a consistent view of memory and simply cannot trust what's in its own cache. This all happens deep within the CPU, well below the level that we program at, but it has real-world implications for speed and performance. One of those implications is that creating threads is also not free. I rely on a one-thread-per-sieve approach, and I figure the amount of work to compute the primes to 10 million likely justifies the overhead of creating and destroying one thread. After adding the code to run the sieves in parallel, and the ability to specify how many threads to run from the command line, I then compiled and built the exact same C++ program four different ways: with Visual C++ for Windows, with the Clang compiler for Windows, with GCC for Linux, and then finally with the Clang compiler for Linux. Not only do I think this will tell us a lot about the code performance we get out of the various compilers, but the fact that I'm going to be able
to compare Clang-generated code directly between Windows and Linux makes that comparison much more of an apples-to-apples comparison. It should, at least in theory, eliminate much of the variation from the compiler and truly let us compare the operating systems. My first approach was to run it from the command line and then copy and paste the results into an Excel chart. Well, that turned out to be way too tedious, especially if you consider I had to do at least 64 cases for a minimum of four compiler and OS variations. So my next step was to write a simple Python program to launch the benchmark repeatedly while increasing the thread count from 1 to 64. I added a command-line option to quiet down the program's output such that it produced only the timings; the Python script captures the info from standard out and prints the results in a nice table that I can copy and paste straight into Excel. Once I've got all the data in Excel, I can graph it, which will reveal a great deal, so let's have a look. As we can immediately see, the Visual C++ compiler's implementation lags the rest by about 10 percent. I took a brief look at the generated code, and the biggest difference seems to be that GCC and Clang treat the sieve memory as blocks of 64 bits rather than 32: the assembly code from VC++ shifts the index by 5, whereas GCC and Clang plainly shift it by 6.
I don't actually believe this is due to better code generation, but rather to the way in which the supplied runtime implements the vector of bools; that's up to the compiler vendor. And if we look at the Visual C++ code, we can drill down through vector<bool> and see that it has the constant VBITS, which turns out to be 8 bits per byte times the number of bytes in the base type, and the base type it uses here is an unsigned int. That's a 32-bit type instead of a 64-bit type, and I have a hunch that perf could be improved with a change to a 64-bit base type; after all, the system writes 64-bit words, and that's where it's doing all its masking. If we look at the assembly language for the generated code in the two cases, we can see that Clang generates slightly different, and more efficient, code than VC++ when it comes to clearing out every nth bit. Check out the Clang implementation, and compare that to the more naive and literal Visual C++ approach that loads the register from memory, clears the bit with the BTR instruction, and then stores it back. I haven't counted cycles by hand or anything yet, but I'm fairly certain that this is where the difference between them comes in. And finally, to answer the big question of which operating system is faster: the answer is that it's going to depend on the CPU you're using. In terms of desktop, it appears that Clang for Windows is the absolute fastest, by a thin margin, up until about 14 cores are running. But if you have more cores than that fully engaged, then not only does Linux start to pull ahead at that point, but it makes an absolute leap at about the 16-CPU mark. This is something I can't explain, and I'd love to hear your theories. Is Linux being smart about pooling threads rather than creating additional ones beyond a certain point? Even so, it wouldn't account for the sudden bump, nor the subsequent contraction back at around 32 CPUs. I don't have access to any high-core-count Xeons to see if they behave similarly, but the code is all in the GitHub repository. If
there's sufficient interest, I can also likely put up a binary for those of you who don't know how, or perhaps just don't want, to compile your own. It'll give you something to test with, but just zipped up; I don't want to do a fancy setup and everything for it yet. Let me know if that is something you want to see. My hunch was a simple one, and that was that threads are simply heavier weight on Windows. I think my benchmark may actually be a degenerate case for Windows, because it simply creates too many threads. But how many threads is too many? Each sieve launches on its own thread, and my feeling was that running a sieve to about 10 million would be enough work to justify the life and death of a thread, but once you get a lot of cores running on a big CPU, that amounts to thousands of threads per second, which is not an efficient approach. The compiler could be smarter for me, or I could be smarter for my own sake. To find out if there was any merit to the hunch, I did a simple test: I made the sieve a no-op, so that I was merely creating and destroying threads about as fast as I could, and sure enough, Windows was about the same speed while doing almost no work, meaning that a lot of its time had in fact been spent in thread overhead. It's time to take a look at the final results, with each compiler and OS put to maximum load generating sieves up to one billion. Task Manager confirms in each case that the CPU is in fact fully loaded. Since we know that only about one thread per second is being created on average for the one-billion case, thread creation overhead shouldn't matter here, but the context-switch efficiency of the operating system does. That's a measure of how seamlessly it can share the CPU amongst the 64 busy threads without wasting too many cycles on logic and bookkeeping and so on. There's a book I keep on my bookshelf that I'll link to in the description, and it's called How to Lie with Statistics. It's a good book, but I just like to have it on my bookshelf for how it looks, and everyone likes to
thumb through it. In it, you learn handy tricks, such as where to set the origin of your chart axis. Consider this example. Here are those final results for the sieves up to 1 billion. With the origin based at 0, it looks like they're all about the same, and in one sense they are, because the difference really is only a few percent, so you don't really see it. But if we zoom into the chart by pulling the origin all the way up to 0.9, we can make it look like Linux absolutely trounced Windows, by focusing on just the very top of the chart, where it turns out that Linux ekes out a three percent win over Windows. Where the chart should really be drawn is a matter of some journalistic integrity, but since that doesn't apply to me, I'll just put it where it looks about right visually, which is at 0.75. Drag racing is often decided by hundredths or even thousandths of a second, and this case is no exception: by the thinnest of margins, the official win for number-crunching prime performance goes today to Linux. Don't forget to sub for future rounds in the Linux vs. Windows showdown. If you enjoyed this round of the showdown, please be sure to leave me a thumbs up, and to make sure you're subscribed to the channel. I'm still trying to figure out the right balance between coding and exposition, and your subscriptions in large part determine where I focus my time. If I see a bunch of new subscriptions from a particular video, then I know I'm going in the right direction, and I'll then make more like it. And if you turn on the bell icon, you'll even be notified when I do; it's a win-win. And don't forget to grab your Dave's Garage mug, which is guaranteed to make your coffee actually taste better (not a guarantee). Besides, all profits from mug sales go to autism research, and you probably need a mug anyway, so why not own a mug that's cool and helped out a kid? Besides the mugs, however, I'm not selling anything, and I don't have any Patreons; I'm just in this for the subs and likes, so I'd be sure
appreciative if you left me one of each before you left. In the meantime, and in between time, I hope to see you next time, right here in Dave's Garage.

[Outtakes] It's worth noting that this algorithm really stretches the CPU memory cache... ah, no, stresses. Well, it could stretch it, I guess, if it's working. Producer-consumer patterns with mutexes and deadlock prevention and all kinds of the sort of C++ template... template madness. Template... template... Everything I found was pretty complicated producer-consumer patterns with mutexes and deadlock prevention and all kinds of the sort of C++ temp... temp... If you're scanning for the primes up to 10 billion, marking the multiples of three... oh, so thirsty, and so refreshing, from my Dave's Garage mug.
Info
Channel: Dave's Garage
Views: 84,141
Keywords: windows vs linux, linux vs windows, linux vs windows 10, speed, performance, benchmark, test, linux benchmark, windows benchmark, windows subsystem for linux, task manager, open source, linux 2020, linux distro, microsoft windows, windows subsystem for linux 2, wsl, wsl 2, wsl2, prime, prime sieve, solve primes, factor primes, windows 10, operating system, windows 10x, reactos
Id: h4IDTblOHYY
Length: 17min 26sec (1046 seconds)
Published: Mon Apr 12 2021