AMD says these are the same... We DISAGREE. - Testing 12 of the same CPUs for Variance

Captions
If you went to the grocery store, gave them your money, and left with 10% less Vegemite than you paid for, you'd probably be pretty ticked off. But what if I told you that happens every day, except with $350 CPUs? Well, it's true. These 12 chips were obtained over the span of a month from 8 different sources spread across 3 separate countries, and the real-world difference between the best and the worst of them is 8% in Factorio and as much as 12% in CSGO.

This has caused a lot of problems for us over the last 6 months. You see, in April, I gave the Labs team the goal of increasing our testing throughput so that we could give you guys more juicy benchmarks in our reviews. Unfortunately, the math was not in our favor. The typical turnaround to troubleshoot, benchmark, write, film, edit, and QC our coverage of a new product is 5 to 8 business days. At about 5 minutes per test, times 3 runs for consistency checks, times however many products we want to compare, times however many games or applications you guys want to see, it is easy to run out of time. Now, we could shorten each test, but we've found that longer tests are more consistent from run to run and end up being more reflective of the real-world experience of using the product. So that's out. And I guess we could work through the nights like a bunch of Adderalled-out master's students, but come on guys, we're not that young anymore. Automation with Markbench does help a lot, but if our system hangs in the middle of the night, which happens sometimes, we're right back where we started. That leaves us with one real option: building up more of our test benches and running our tests in parallel. Except, as I already told you, 8% difference in Factorio, 12% in CSGO.

Now, when I signed the procurement authorization for 11 CPUs, I knew that we were likely to find an outlier or two, but then it turned out that this rabbit hole went way deeper than I could possibly have known. Like this deep segue to our sponsor, Nexigo. Whether you're in need of webcams or VR accessories, Nexigo has products that'll make you Nexi-go, "Wow, those are cool." Learn more about them at the link below.

Now, before you start sharpening your pitchforks and demanding that AMD's executives be turned into thermal paste or something, it's worth noting that most of our chips were within a few percentage points of each other, even in CSGO at 1440p, which was the test that saw the greatest overall spread. Also, when we expand our comparison to include our full suite of games, that maximum difference falls to around 2.5%. Far less outrageous, so you don't really have to worry about the buddy in front of you in line getting a way better gaming experience for the same price, but it's still too big for us to buy any two random 7800X3Ds, use them to test two different GPUs in parallel, and then say, well, these results should be comparable. The obvious conclusion at this point, then, is, "Guys, there's something wrong with your CSGO test. I think it's time to get good." But the thing is, there aren't a lot of variables here, and that's by design. We used the same motherboard, same memory, same Windows drive and install. We even tested using phase-change thermal pads to ensure that our paste application wasn't an issue, and we chucked our bench in the thermal chamber just for good measure.
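(Quick aside: if you want to see how fast that per-test math blows up, here's a rough back-of-the-envelope sketch. The product and game counts are made-up placeholders, not our actual suite.)

```python
# Back-of-the-envelope bench-time math (illustrative numbers only).
MINUTES_PER_RUN = 5     # roughly how long one benchmark pass takes
RUNS_PER_TEST = 3       # repeated runs for consistency checking
num_products = 10       # hypothetical number of products in a comparison
num_tests = 20          # hypothetical number of games/applications tested

total_minutes = MINUTES_PER_RUN * RUNS_PER_TEST * num_products * num_tests
print(f"Total bench time: {total_minutes / 60:.0f} hours, or "
      f"{total_minutes / 60 / 8:.1f} eight-hour days on a single bench")
```

Even with those modest numbers, a single bench eats most of a review window, which is exactly why parallel benches are so tempting.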
Anyway, we are very confident that our numbers are valid, and we're gonna have the process doc linked in the video description if you wanna have a look. Which is all fine and good, but it doesn't answer the much bigger question: why do these CPUs vary so much in the first place? AMD gave them the same model number and specifications, AMD charges the same price, so they should have the same performance, right? Well, back in the day that was true. CPUs would run at their rated speed, and if they didn't, it meant they were either broken or about to be. But thanks to a relatively recent innovation called dynamic frequency scaling, that's no longer the case. You see, no two pieces of silicon are the same, and whether it's through rolling improvements to manufacturing or just sheer blind luck, you can end up with a processor that is capable of better than the advertised clock speed. In the old days, you could unlock this extra performance manually through overclocking, but nowadays processors adjust their own speeds on the fly based on a whole host of factors, including user-configurable power profiles and thermal limits. AMD's approach is to allow the CPU to clock as high as it's able until the CPU die average reaches about 90 degrees Celsius, at which point the clocks will be dialed back until it reaches equilibrium. Sounds good, right? I mean, why be bound by some artificial performance limiter when I've got a golden chip that can go higher? And I actually agree, but for the folks who end up with a lesser chip, it can feel a little bit like missing out, even if AMD is careful to only guarantee clock speeds that 100% of the chips can hit. And as I mentioned before, it's also very inconvenient for our parallel testing endeavors.

So what do we do? Testing. Lots of it. After throwing Cinebench at our very first CPU, we ran into our very first roadblock: the numbers from run to run can be vastly different. I am talking 300-point spreads on a single CPU run back to back. What the heck, right? How on earth are we supposed to narrow down which two CPUs are within 1% of each other if one CPU isn't within 1% of itself? As it turns out, software was the culprit, and software is notoriously hard to account for. Have you ever opened up Task Manager right when you boot up your system? There are no programs running, nothing should be happening... except, wrong. What was that? Here's the thing: even when you're doing nothing, your operating system is busy managing all the behind-the-scenes work that keeps your system running, like updating the weather widget, synchronizing the clock with a trusted time server, prepping the next thing it thinks you might need, installing updates, and so much more. And we don't really get to decide when that stuff happens, which means that no one result can ever be taken as gospel truth. We do have custom Windows images that are intentionally debloated to remove some startup processes to help with this, but that only partially mitigates the issue, and it introduces new ones, like making our results slightly less representative of the typical user. We feel this trade-off is worthwhile because it helps us better isolate our variables in testing, but it's still not enough. To further reduce the amount of work that Windows does in the foreground of our benchmarks, we can also increase a process's priority. In Cinebench, that took us from scores varying by hundreds of points down to tens on the same CPU.
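As a rough illustration of that priority trick, here's a minimal sketch using Python's third-party psutil package to bump a running benchmark to HIGH priority on Windows. The process name is a hypothetical placeholder, and this isn't necessarily how Markbench does it internally.

```python
# Hedged sketch: raise a benchmark process to HIGH priority on Windows so
# background OS housekeeping is less likely to steal CPU time mid-run.
import psutil

def set_high_priority(process_name: str) -> None:
    # Walk the process list and bump any match to HIGH priority.
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] == process_name:
            proc.nice(psutil.HIGH_PRIORITY_CLASS)  # Windows-only constant
            print(f"Set {process_name} (PID {proc.pid}) to HIGH priority")

set_high_priority("Cinebench.exe")  # hypothetical process name
```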
That's a big improvement, and enough to use Cinebench for our binning process, but we're not out of the woods yet. You see, with some tests it's not enough to use the same hardware at the same process priority, because the benchmarks themselves have built-in inconsistencies. Red Dead Redemption 2, for example, simulates physics and AI behavior during a bench run. That's a really good thing, because if that stuff were canned, our results would not be comparable to actually playing the game. But the bad news is that it means sometimes Arthur loses his hat, sometimes he doesn't; sometimes the horse gets shot, sometimes it doesn't, which can impact our run-to-run consistency. Can we ever fully account for this? Unfortunately not, but by running each test multiple times and then taking an average, we can get a pretty good picture, and we can bake that expectation of noise into our data analysis. Which finally happens now. Sorry for all the preamble.

First up, gaming. For the sake of legibility, we named each of our samples after a Pokémon. Why Pokémon? I don't know, because it seemed better than deadly diseases. Anywho, looking at the geometric mean of our gaming results, we found a 2.07% spread in average frames per second between the best-performing and the worst-performing CPUs and a 2.46% spread in our 1% lows. This puts all but one of our CPU samples within three frames per second of each other, which gives us confidence that we'll be able to find some close-enough CPUs. But given that 2.46% isn't 1%, it also tells us that we can't just pull any three chips at random, nor can we simply look at the average. Returnal, for instance, is a benchmarker's dream because, A, it's actually a good game that people might want to play, and B, it is a stunningly consistent benchmark, which is great for producing results that we can trust when we're comparing GPUs. But the real world is a lot messier than Returnal, and while most of our other games, both at 1440p and 1080p, showed a similarly small level of variance in CPU performance, in a couple of games, notably Total War: Warhammer III and Cyberpunk, we found larger variance in the 1% lows. This indicates that, as run, these games are more CPU-bottlenecked, which better reveals the deficiencies of our worst chips. But as you're about to see, not all CPU-bound games are bound in the same ways.

We went into this process thinking, "Ah, CSGO, what a classic CPU gaming benchmark. It's a shame that it's been replaced by CS2," and we came out of it thinking, "Ah, CS, good riddance." I mean, on the one hand, it certainly does separate the CPUs from each other, and our slowest chip, Corsola, was the slowest-est in CSGO. But on the other hand, the overall variance is so high and so different from the entire rest of our test suite that it becomes almost an outlier data point, having an outsized impact on our results. And this could be for a number of reasons. First, CSGO uses a game engine that is older than YouTube, which has been useful over the years since it was originally built for single-core CPUs, and it can make use of just about all the single-threaded performance you can give it. But it also means that its performance requirements just aren't very similar to more modern games that are gonna wanna see a number of fast cores rather than just one or two.
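For context, a geometric-mean spread like the 2.07% figure mentioned above boils down to something like this minimal sketch. The per-game FPS numbers and CPU names here are invented purely for illustration.

```python
# Minimal sketch of the spread calculation: take the geometric mean of each
# CPU's per-game average FPS, then compute the best-to-worst percentage gap.
from statistics import geometric_mean

fps_by_cpu = {                      # hypothetical average-FPS results per game
    "Corsola": [142, 388, 96, 211],
    "Zapdos":  [145, 401, 97, 214],
    "Mewtwo":  [144, 398, 97, 213],
}

geo = {cpu: geometric_mean(fps) for cpu, fps in fps_by_cpu.items()}
best, worst = max(geo.values()), min(geo.values())
print(f"Spread between best and worst CPU: {(best / worst - 1) * 100:.2f}%")
```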
Second, CSGO itself is also old, old enough that any modern gaming CPU can run it so fast that no professional esports gamer could even tell the difference anyway, and so fast that limitations in the software itself can start to rear their ugly heads, which adds potential variables. Basically, CSGO is having its Quake III Arena moment: after a long run, it's time to drop it. And when we reviewed the overall variance numbers without CSGO, it showed just how much of an outsized influence it had on our results. The new results show far less variance, closing the spread to just 0.46% and 1.43% for the average and 1% lows respectively, which is somewhat reassuring for you, the consumer. But it still doesn't change the fact that our runaway loser, Corsola, is a dud. Corsola consistently underperformed the rest of the chips by so much that when we remove it from our results, our overall spread in performance goes from 1.43% in our lows to 0.86%. That is a massive decrease. So what the heck is wrong with this thing? We don't know for sure, but one guess is that the 3D V-Cache on this chip could be struggling in some way, because it fumbles pretty hard in our Factorio test, where most of the benchmark can actually fit in that 3D V-Cache. Another possibility is that it could be the PCIe controller, the part of the CPU that communicates with our PCIe lanes and, consequently, our GPU. This idea comes from the fact that when it comes to productivity performance, sure, it still ain't top of the class, but it isn't flunking like it does in gaming.

Speaking of, we actually found greater variance between our chips in our productivity tests, which kind of makes sense, since we no longer have the GPU getting in the way of raw CPU performance. 7-Zip brought us a spread of around 3-4% for compression and decompression, and Blender hovers in the same realm, along with our video and audio encoding suites. The biggest contributor to the size of the spread of our sample, though, is Lugia, who takes up the bottom spot in pretty much every productivity benchmark. Since it wasn't so bad in gaming, this leads us to believe that perhaps there's a problem with the integrated heat spreader, but AMD has made that much more difficult to evaluate now that all of their CPUs just kind of run at the same temperature and then adjust their clock speeds to reach their thermal limit.

So across our small sample, variance in performance is present but not egregious. Of course, we aren't just trying to quantify variance; what we're trying to find is equivalence. So how do we do that? It turned out to be a bit tricky. We ended up using Euclidean distance to determine which CPUs were the most similar. Unconventional, but also kind of cool. Here's how it works. First, we scaled our data so that our five-digit Cinebench scores don't overshadow those low FLAC encode numbers. Then we took those scaled numbers and treated each one as a coordinate for a point in multidimensional space. Think about it kind of like this: if we took a plane and chose two points, those would each have an X and a Y coordinate, and the Euclidean distance is the distance between those two points. The closer together the points are, the more similar they are. And this can be applied to points in any number of dimensions; in our case, a 12-dimensional space for productivity and a 19-dimensional one for gaming. Since we're weighting all of our tests equally, we can then do a bunch of comparisons to determine which CPUs are the most similar to one another.
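Here's a minimal sketch of that idea, assuming min-max scaling and invented benchmark numbers; it's meant to show the shape of the calculation, not our exact pipeline.

```python
# Hedged sketch of the equivalence search: scale every benchmark so no single
# test dominates, then compute pairwise Euclidean distances between CPUs and
# report the closest pair. All data here is made up for illustration.
import numpy as np

cpus = ["Eevee", "Mewtwo", "Raikou", "Zapdos", "Corsola"]
scores = np.array([  # rows = CPUs, columns = benchmarks (invented values)
    [18050, 142.1, 96.3, 55.2],
    [18110, 142.4, 96.5, 55.4],
    [18090, 142.3, 96.4, 55.3],
    [18020, 141.0, 95.1, 55.0],
    [17650, 138.2, 92.8, 54.1],
])

# Min-max scale each benchmark column to the 0-1 range so units don't matter.
lo, hi = scores.min(axis=0), scores.max(axis=0)
scaled = (scores - lo) / (hi - lo)

# Pairwise Euclidean distance between every pair of CPUs (rows of `scaled`).
dist = np.linalg.norm(scaled[:, None, :] - scaled[None, :, :], axis=-1)

# Ignore the zero self-distances on the diagonal, then find the closest pair.
i, j = np.unravel_index(np.argmin(dist + np.eye(len(cpus)) * 1e9), dist.shape)
print(f"Most similar pair: {cpus[i]} and {cpus[j]} (distance {dist[i, j]:.3f})")
```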
From those comparisons, four emerged as extremely comparable: Eevee, Mewtwo, Raikou, and Zapdos, with Zapdos being the least equivalent. So, sorry bro, just the other three; Zapdos is outside of our tolerance in CSGO. But the issue is that the CPUs that did perform identically in that one game were not the tightest across the rest of the suite, meaning that they can't really be trusted on games that, you know, you might actually be able to play. In conclusion, productivity saw these CPUs perform within roughly 0.24% of one another, and in gaming we see a spread of 0.86% in the 1% lows and just 0.1% in average frame rates. Now that's tight. Tight enough, we figure, that it will allow us to directly compare GPU results across our test benches.

Wait, benches? Oh yeah, you see where I'm going with this. We found near-identical CPUs, but what about the other components? Do they vary? Time for another round of testing. The main secondary performance contributors in your GPU test bench are going to be your motherboard and your RAM, but since those still run at fixed clock speeds, we're not expecting nearly as much variance. With RAM, for example, you set the speeds in your BIOS, and then it's either capable of them or it's not, in which case your system is unstable and probably crashes. All of our testing is done at the recommended RAM speed from AMD, 6,000 megatransfers per second. And if you want to learn even more about how we test our hardware, we've got a recent exclusive over on floatplane.com where we have a feature-length deep dive looking at the improvements we've made to our testing processes.

Anyway, to validate our hypothesis, we took one of our future test CPUs, Eevee, and threw it into both of our new parallel benches. In gaming, we landed on 0.45% variance in the 1% lows and less than a tenth of a percent in average FPS. That is more than acceptable. And in productivity, we ended up in the 0.13% neighborhood. That means in our upcoming GPU reviews performed on these three parallel benches, we're gonna consider our results to be accurate within plus or minus about 0.25%. Of course, that doesn't mean our results will be identical to what you'd see with your CPU, or to other media's, and this is one of the big reasons that we have always encouraged our viewers to look at reviews from multiple outlets whenever making a purchase decision.

Oh, before you ask, by the way, there does not appear to be any foul play from AMD with respect to review sample selection, so you don't have to pick a reviewer, for example, that buys their own CPUs versus one that gets seeded. At least we don't think so. We'd have to buy hundreds of CPUs to know for sure, but it appears that the unit that was sent to us for review, which is Raikou, falls somewhere in the good-but-not-exceptional range. So I think we can put that conspiracy theory to rest. Another before-you-ask is: yes, driver updates, operating system updates, and new software that we add to our test suite could change our CPU performance spread in the future, and we're gonna do our best to maintain our data integrity by performing periodic, what we're gonna call, equivalence checks, because you guys have asked for reliable, trustworthy information, and you deserve it. Which brings us to a big issue: why is this task falling to random YouTubers? I mean, the automotive industry, for instance, has government bodies that are dedicated to verifying the performance of vehicles and ensuring that companies aren't cheating on their testing. Then they dole out big fines when they inevitably do cheat on their testing.
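As a quick aside before we get back to that point: one of those periodic equivalence checks could look something like this minimal sketch, assuming a single reference workload and the ±0.25% comfort zone mentioned above. The bench names and results are illustrative, not real data.

```python
# Rough sketch of a periodic "equivalence check": re-run a reference workload
# on each parallel bench and flag any bench that drifts past the tolerance.
TOLERANCE_PCT = 0.25               # the +/- 0.25% comfort zone from the video

reference_fps = 211.4              # hypothetical baseline result for Eevee
bench_results = {"Bench A": 211.6, "Bench B": 210.9, "Bench C": 212.3}

for bench, fps in bench_results.items():
    drift = (fps / reference_fps - 1) * 100
    status = "OK" if abs(drift) <= TOLERANCE_PCT else "RE-VALIDATE"
    print(f"{bench}: {drift:+.2f}% vs reference -> {status}")
```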
Anyway, with computer hardware, there's no such oversight. We and our peers are this thin, open-mouthed-thumbnail line between you and getting ripped off, and that's a big problem. I mean, for one thing, we don't have access to the types of testing that large tech companies have, and we don't operate at the kind of scale where we can say for sure if an observation is a fluke or if it's the result of conniving suits trying to save a quick buck. Even buying 11 chips for this investigation was a huge investment from our side, and not the sort of thing that we can do for every single review. Unfortunately, all I'm doing is ranting right now. I don't have a solution to this other than, well, we're gonna keep trying, gosh darn it. But it just struck us as we worked on this project that the fact that these companies don't have to report things like estimated performance in a regulated and standardized fashion is kind of crazy, especially if you consider the kind of money they're asking for their most expensive CPUs.

So what's next? Well, first is gonna be going through the exact same rigmarole with however many 4090s it takes to parallelize our CPU test platforms, and then, slowly but surely, we're gonna be improving our automations and increasing our test volume, especially once we get the Labs website up and running. But good things take time, and we aren't going to rush a good thing. Especially, I'm not going to rush this segue to our sponsor, DeleteMe.

Your personal information sounds like it should stay personal, right? I mean, it's right there in the name. Well, data brokers and other sketchy companies disagree, so they're sharing your data online like it's a family-style dinner. "Eat, eat, you're skin and bones," is what they're saying to each other. Thankfully, DeleteMe is here to crash the party. They'll find out who's spilling your info and get it removed so that scammers can't use it to batter you with robocalls and spam emails. Never mind that it can also lead to fraud or identity theft, because DeleteMe can mind that for you. Now, wiping out data held by hundreds of sites by yourself sounds borderline impossible, which is why this whole time I've been trying to tell you that DeleteMe can do it so you don't have to. Their nifty software and expert squad can sweep it away in minutes, not hours. DeleteMe averages over 2,000 pieces of personal data removed per customer in their first two years. Yeah, go on, DeleteMe, get 'em. And you should get on over to the link below and use code LTT for a sweet 20% off.

If you guys enjoyed this video, why not check out our motherboard turbo nerd edition video, where we went into what exactly all those little things that look like cities and towns on the PCB are?
Info
Channel: Linus Tech Tips
Views: 1,232,975
Keywords: 7800x3d, ryzen 7800x3d, ryzen 7 7800x3d, ryzen 5 7800x3d, ryzne 9 7800x3d, intel vs ryzen, lmg labs, ltt labs, labs testing, marckbench, markbench
Id: os-jXiYRihI
Length: 21min 37sec (1297 seconds)
Published: Thu Jan 25 2024