Intel's CPUs Are Failing, ft. Wendell of Level1 Techs

Captions
...and then Linux is like, "I don't..." and it turns off a P-core and you lose two threads. It is a crash dumpster fire. So I saw a comment on Reddit that said "GPU VRAM limitation." How could that be? Yeah, it's got to be a game bug, right? No, not so fast there, Chuckles, that's actually a CPU problem. Yeah, Chuckles.
Hey everyone, so I'm joined by Wendell, and you just published a video about Intel. Oh, I think it's bad. It's real bad. It's bad. Is Intel screwed? Yeah, they might be. Okay. So when I started this I was like, ah, you know, they'll figure this out with microcode and AGESA, like getting the equations just right and making sure everything with turbo... We're talking about, of course, the crashing that gamers are experiencing with the 13900K, 14900K, KF, KS, and the other variants with that, and it's been kind of a wild ride.
Before that, this video is brought to you by the NZXT C1500 Platinum power supply, which has passed the Cybenetics Titanium standard for efficiency. NZXT has been certified for efficient operation of upwards of 94% and uses a 140 mm fan to help run quieter under load. This 180 mm form factor power supply is fully modular, using capacitors rated for 105°C. Learn more at the link in the description below.
Yes, so this story, for anyone who's not fully caught up: this story's had a lot of different, I'll call them red herrings. Maybe they're related in different ways, but the biggest one has been power profiles, right? There was this whole thing, the rumors about motherboard vendors implementing Intel baseline power profile, extreme, and all these other ones, which are causing problems with CPUs. Yeah. And so that's kind of where everyone was looking for a while, and really spending resources: okay, so is it the boosting behavior and the power limits, the PL1 and PL2 that Intel institutes, that cause this instability? We got a tip recently that I'm not going to go into in this video. It's different than what you're doing, though, and maybe related. We'll see. We'll find out. We'll see, we'll
see. We've got some interesting leads. I can't say more because I don't want Intel to know what I know, but let's talk about your stuff first, since that'll give us a good primer.
It's the game development side of it. So you look at games, and games crash, and you get game crash telemetry. And so I reached out to some contacts that I have, two different games, both based on Unreal, and I got access to their full crash database. So these are crash dumps... It is a crash dumpster fire, yes, as you said earlier. Does it rise to the level of pants-on-fire bad? It does, I think, and I didn't think so when I started looking at this. But fast forward: okay, you're looking at a game database, and it's like, okay, is this person overclocking their machine? Have they degraded their CPU because they gave it 1.7 volts or something crazy? You don't really know. And a lot of the crash database stuff is not structured in a way to deal with inconsistent crashing. And there are some consistent crashes, like the GPU out-of-VRAM error. There's another one: RAD Game Tools put out a bulletin that said if you have one of these specific Intel CPUs, chances are you're affected, because you'll get a decompression error. So this is a detected error, but that code is so old it's unlikely that it has a new bug that is tripped up by these particular Intel CPUs. They basically said, hey, if the problem points to us, it's not our problem. Yeah, basically it's like, good luck, gamers. And it's been like five months, and Intel hasn't come out and said something that I would love to see, which is, "Hey gamers, we will take care of you, we'll make you whole, we'll figure out what the problem is and we'll take care of it." It's been like five months.
So let's start a little bit with the beginning of this. The first part is talking about crash dumps and game development, right? So we were talking about, you said sort of briefly here, but one of the challenges with random Reddit posts online especially is
you don't have all the information. It's really hard to know: was it this guy's particular configuration that he created (yeah, right), or was it actually an Intel silicon problem, or whatever, maybe BIOS configuration? So the reason that you started looking at server-side stuff... Yeah, had to switch gears. It's the reliability, right? Yeah. So there's a lot of really interesting stuff in the crash database, but not all of it was actionable. I mean, I found errors where it seemed like CPU crashing was creating NVMe errors, and it's like, what do you do with that? And is your crash database set up to deal with that? So we flipped it around and got to talking to data center providers, and yeah, it turns out that the 13900K, 14900K, KF, KS, those are popular for game servers versus, you know, big giant Xeon systems, because of the high core clock speed and because of something called the blast radius, which is: when one of the servers goes out, how many players does it take down? Oh, okay, that's a good name for that. Yeah, it's like the blast radius of this server dying is only going to take out a couple of dedicated game servers running on this machine. Right.
And then looking at it from that angle actually sort of released the floodgates. In reaching out to those contacts, those companies have been super frustrated, because they're just on this continuous cycle. It seems to me, you know, based on the anecdotal evidence of talking to them, that they've been replacing hardware and replacing CPUs. They get a system stable, and it'll be stable for a few months, and then something happens and it becomes progressively more unstable.
So let me interrupt for a second. Server providers who use a K-SKU CPU, are they less likely to use... so to me it seems unreasonable or weird for a server provider to, say, use a Z-series motherboard and overclock the product. Yeah, they'll actually use a W680. Okay. So in the whole universe I'm talking about a total of around 210 systems across... well, actually
probably closer to 250 across three providers. I just added a third provider, but the original data set was about 210 systems across two providers, and it was a mix of Supermicro and ASUS W680 motherboards. And the failure rate between ASUS-brand motherboards and Supermicro-brand motherboards was almost the same, and that was 50%. 50% of the CPUs that I encountered in the 13th and 14th gen cohort experienced some type of failure in a seven-working-day period, and that's where they're running continu... That's insane to me. So if it's 50%, and I trust the data source you have, I guess the question I have is, we were talking about this earlier, but 50% seems like such an enormous amount that it would be the only thing you see on every hardware Reddit. I guess the difference is these are server providers, so they're less likely to be vocal about it. It's a business issue. Yeah. One of them said that they pulled all their 13th gen, and Intel had agreed to provide 14th gen as a replacement, and that was fine for a while, but now the 14th gen are starting to have similar issues. Okay. Some of it is also down to DDR5 memory speeds, and so some of the machines we've been able to bring back into stability by using DDR5-4200. Oh my God. Yeah, 40... and limiting the multiplier to 53, which I think was in some of the... Are you better going with Ryzen at that point? Okay, yeah, I mean, like, the 7950X is basic... Historically those providers have not used Ryzen in those kinds of volumes, but at Computex we saw a crapload of AM5-based systems that were set up for these kinds of deployments.
Well, it seems terrifying for Intel if the server companies... because server companies have enormous buying power. So if you start scaring them away and they go to AMD because they don't want to have a blast radius of players every time a system goes down, that just seems not sustainable. Yeah, half the systems with the W680, and these are W680 power defaults too, so that's, I think, 125 watt TDP. You look at the boards
and the boards say, we're going to run these K-series CPUs at the lower power targets to begin with. These are CPUs that have never seen a higher power wattage. And it's interesting that it's still the... and I want to call it a failure, because it's not like failure the way you would experience on, like, a gaming computer. Like, if your computer only did something weird once in a week, right, you might... I mean, that's just Windows, right? But these are Linux servers. They're running 24/7, and there's something in the system log, or there's a game crash, or there's, like, the kernel, the Linux kernel log. I have one system that has one P-core and it'll just turn off. It'll run fine for like 4 days, and then Linux is like, "I don't..." and it turns off a P-core and you lose two threads. And that's really wild. Running y-cruncher to try to do burn-in testing, it's stable, it's fine, and then it gets to the SNT testing in y-cruncher and it dies every time. But I've only got like one system at one provider and two systems at another provider that die in exactly that way. Okay. It's really wild. And the 50% that are fine work fine.
Is that consistently one thing? No, no. The CPU, sometimes it's an E-core, sometimes it's a P-core, sometimes it's a memory operation. One of the y-cruncher tests' memory footprint is only about 21 meg, so that's like 90% operating from cache, and it still will have errors sometimes. Maybe that kind of gets into the tip we are withholding a little bit for another video, but interesting.
So then, if you're seeing across some providers a 50% failure rate, let me start with this, because I think most people watching are going to be consumers. The extrapolation is: does that mean it is reasonable to expect a similar failure rate in the consumer market? I would assume no, only because I haven't seen enormous volume. Maybe? But it looks like you're going for "maybe." Yeah, yeah, because keep in mind that in this server context, running 24/7 for 7 days might translate to running a month at eight
hours a day. So it's an aging... maybe. Maybe, yeah. We have to couch a lot of this in "maybe" because it's not 100% certain. But yeah, one error a week on something like that, even though it's not running in an overclocking configuration, isn't... I mean, you wouldn't want that, but it's not, you know, crashing every 10 minutes. But there are some CPUs in the population that are crashing every 10 minutes, even at DDR5-4200, even at a multiplier lock of 52, even disabling E-cores. That's like, yeah, if I'm a server provider and I have a server going down every 10 minutes, I feel like you have to take it out of commission. Yeah. Because it's going to hurt your customer base, because the customer, whether it's a game company or whatever, is going to be pissed at the server provider. Yeah. One of the companies has told me that they have stopped listing new systems for sale based on 14th gen CPUs until they have a satisfactory explanation. Wow, okay.
In a parallel thread, I also reached out to OEMs like Dell, Lenovo, and HP, seeing if they were willing to go on or off record for what they were going to leak, and the memos that they have from those are like, 10 to 25% of CPUs may need to be replaced. Yeah, again, hearsay. That's a rumor. Maybe, like, if I'm seeing a 50% failure rate, maybe they can fix half of those failures in an AGESA update, or, AGESA... in a microcode update or in a BIOS update or, you know, something like that. And the reason I've got AGESA on the brain is we have seen this kind of thing from AMD as well. AMD has had these kinds of instability that they've been able to resolve with microcode updates, but then, famously, you know, the exploding CPUs problem, that is not something you can fix with microcode alone, especially after you've already got a damaged CPU. Oh no, there's no fixing it once it's damaged, physically damaged. The thing that sets this apart here is that the W680 chipset theoretically has never exposed those CPUs to untoward conditions. So unless there is some kind of
insane motherboard behavior like we saw in some of the X3D exploding boards, you can kind of rule that out if it's working as designed. So it doesn't seem like it's a Z690 or Z790 problem. Yes. It starts to get into... so the initial topic was, is Intel just pushing this too hard? And that seemed reasonable to believe. There's no 12th gens that have these problems in the cohort. Yeah, what's going on there? Yeah, and 12, 13, 14 are all fairly related. So, I mean, I think I found a total of four odd crashes out of millions of data points, versus tens of thousands of odd crashes from game crashes for 13th and 14th gen, and like maybe four AMD crashes.
Whenever we make a deep-dive video, we have to leave stuff out because of time, where the video just gets too long and no one's going to watch it. Is there anything that you left out that you want to kind of throw out there? I can't... It seems like 50% is, like, such a large number that I'm not comfortable with it. It seems like there's a missing piece to this puzzle, because more people would be climbing the walls. But at the same time, when I'm talking about a 50% failure, I'm talking about one failure in 24 hours a day, 7 days a week, running for a week, right? And so it's just like, this system has a problem. But some of those systems have problems like they turn off, and, you know, we monitor the PDU, we monitor... You just got, like, the Men in Black brain wipe. We monitor the PDU, because the data center knows how much power each system is drawing, and we also know the temperature, because we were logging the temperatures, because these are Linux servers. So we know all of that. And when you're running on a W680, the temperature on those, with the cooling that the data center has, is like 55, 60, 65. You know, 70 is like the hot spot temperature. You know, I think the hottest... Not thermal. Yeah. Across the whole population of hundreds of servers, the hottest
hot spot I saw was 83°C. Okay, well, that seems completely reasonable, especially given that it's probably in a hotter environment surrounded by other servers. Yeah.
So, okay, one of the messages you sent to me I think is worth talking about. If it's this bad, and I don't know exactly how you phrased it, my interpretation was it's almost like whatever gains Intel has had with this generation seem now to be almost unfairly represented. Yeah. This is kind of my interpretation here, but if a company's selling a product that can win certain benchmarks or comparisons, but it does so in a way which is not actually stable for daily use, then to me it's not really valid in our type of benchmarking. Well, we have data center logs from where these systems first went online, and with these systems first going online, you know, 6 months ago, they would pass these specific tests. Rerunning these specific tests now on the exact same hardware, it will not pass. That's wild. Yeah. So have they robbed their competitors of more of a victory by effectively having an unachievable long-term performance? Yeah, I mean, but half of them are okay. Yeah, right. Okay. So it's just like, did they have two lines? You need to chop it, Intel. Yeah, it's just, what is going... But if you're a gamer, that's terrible, and I think that's where Intel could come out and say, "Hey, we're going to figure this out and we're going to make you whole."
Intel does know there's an issue, so we got that. That's nice. I was going to say, do you think they know what it is? But I guess since I think I know what it is, they probably know what it is. I strongly suspect they know what it is too, and I strongly suspect that there are those inside of Intel that think that it cannot be completely solved via a software update. I don't think so. Yeah. We'll find out. I would have argued with you on that before I started, because it's like, ah,
these things happen, I'm sure it's fine. I'm less sure, after having worked on this for four months, that it is fine. Well, I would have also argued with me on that point. I do it frequently. But I think, so, Hardware Unboxed, shout out to them, did some excellent testing on the different power plans for motherboards and showing how it affects performance. This was very important, because if Intel or motherboard vendors change the boosting behavior after the fact, then that has to be accounted for at some point, because it's going to be different. And I would have, based on the information we had, fully assumed that's kind of the direction from which these CPUs are becoming unstable. A power issue makes a lot of sense. It's like, are they boosting too high? Are they becoming thermally unstable? Is there degradation of the chip because of that? That would make sense on a Z-series chipset, but the W680 is a little bit of a smoking gun here. Yeah.
So that's the key point I want everyone in the audience to take away: specifically what Wendell's looking at on server with W680, that's kind of the key. Now, to caveat it, that's not to say that the power profile stuff is not to blame as well, right? It could also be to blame. It could be a separate problem. Yeah, it could be all of those things working together. And again, none of that matters as long as you have Intel and partners coming together to say, "Sorry, gamers, we'll make you whole." I feel like the reason that I got some of the information I did, which I don't trust fully because I got it this way, is because there are people that are unhappy, that are large-volume customers, that don't feel they have been made whole. Well, that's why it's so important to get multiple sources, because, you know, the first one maybe has an axe to grind, so you try to get some other sources, and if you've got multiple, plus this scale of systems, then at least at some level the information is probably
accurate. So it definitely points you in a direction. I think also some of the misdirection and crashing is why this hasn't bubbled up sooner. It's because everybody's sort of working under different assumptions. There is a reasonable-sounding explanation on the table, but the messaging that Intel seems to be giving large OEMs like Dell, HP, and Lenovo seems very different than smaller providers that are only buying CPUs a couple of thousand at a time. Smaller providers buying a couple thousand at a time, see how big Intel is, right? Yeah. I mean, the numbers we've heard for potential defects are in the millions, from information I've received. I have not verified that. I don't know how accurate it is. A random sampling of 250, again, it's not a huge population, but if you've got a game... You know, one of the game developers said straight up, "It's going to cost me over $100,000 in lost players," to say nothing of everything else. Well, also the knock-on effect of how it affects their business in the future. Yeah, because the player doesn't care. Yeah. They literally have screenshots from their support forum that are "this game is buggy" because of the GPU out-of-VRAM error, and it's not physically possible for one of the games to use more than 20 gigs of VRAM.
Yeah. Well, so this kind of reminds me of the Zotac situation. Totally different problem. The thing that it reminds me of, though, is with Zotac, as soon as it was business-to-business, it really mattered and it needed to be solved immediately. And so here you're talking about servers and Intel. That's B2B. Yeah. The level of concern should be pretty escalated at Intel. Yeah. Intel has, like, famously... wow, nice, we got some lights. Nice. We'll see if we get arrested in a few minutes with the spotlights. Historically, though, Intel's pretty tight-lipped about things. Well, you don't want to spook everybody, because, you know, shareholders get wind of that and then it's just like, oh no, what's
going on. Whereas you can go to your data center providers and be like, "Hey, we'll give you 14th gen to replace your 13th gen, are you good with that?" It's like, "But I still have to pay all those guys to change those CPUs." "Yeah, well, we'll throw in five or 10 extra." Right. Great, great.
So where do you see this going next? What do you want to see from Intel? I really hope that Intel would come out and say, "Hey, we're going to make gamers whole," because this kind of stuff happens. Let me ask, you keep saying that, so let's define that. What does that mean? Replace the CPU, or fix the problem with whatever it is. And I think that probably, of the CPUs that are experiencing these kinds of issues, on the order of half of them will probably end up needing to be replaced. That would be great. That'd be a great start. "Experiencing issues," that was the caveat you had there, the qualifier. Yeah. So you think half that experience issues will need to be replaced, which would be about 25% of the population that I have. Right. And that's even at really super conservative settings, so, even though you got a K processor, you may have rolled more of a one than you realize if you're going to be running at, like, DDR5-5200 or 4200 or 4400 or whatever. So that would be great.
And then the second thing is, there's a lot more data out there in those databases. Go hunting. It's hard to do, you know, "select star where CPU is unstable," but you can maybe start correlating the data where it's these specific CPUs, and just looking to see if the type of failure is maybe a little irregular versus play time. So, like, I had to put together the marketing analytics database, which is how much time people are spending in games (the developers don't usually get that, as it turns out), and say, okay, from this population that's playing the game this long, how many crashes are they experiencing, grouped by CPU? It's actually very hard to do that, even with very advanced tools,
and so now that we've pointed that out, maybe more people will say, hey, we should do that internally and look for it. One of the game companies said they're going to have to roll back some bans. They thought people were cheating because the state of the game client was inconsistent with the server for some people, enough that they were like, "We don't know what they're doing, but the game client is inconsistent with the server, we're just going to ban him." Yeah. Okay, so this really causes a lot of, yeah, a chain reaction. Yeah.
Okay, so that's kind of what Wendell wants to see in the next steps. I guess if you're a game developer or you work for one, maybe raise the issue to the server team or something. Go to your database and look. Steam has a lot of good telemetry for this kind of stuff, so if you've got a Steam game and you've added the telemetry, you can probably... but you add your own. Valve doesn't really provide anything amazing; you add your own stuff on top of it, but all the plumbing and user agreements for that kind of stuff are there. Right.
Okay, well, on our side, I only recently received a tip that seems pretty credible, and Wendell and I have been talking about it. We're going to see what we can do. It's an interesting explanation. Yeah. It would take some time to really thoroughly work through, but it is something we're capable of, and I'll leave it there. So if you have one of those CPUs, good luck. I'm so sorry. Yeah. Or if you do have one of those, email either one of us and we'll figure it out. We have some ideas, we have some stuff we can do with it. Particularly useful if it has failed and it's no longer working. But check out his video. It's going to be on Level1 Techs. Please go watch it. Wendell has done an excellent job recently, especially "Intel has a problem." Yeah. So you've got that one, and then also a Microsoft Qualcomm video. Yeah, that one I have had a chance to watch through. That's really
good. So it's a little bit clickbait, but it's probably okay. No, it's fine, it's not clickbait. You deliver on the title. So check them out, Level1 Techs, I'll link it below. And thank you for joining me. No, thanks for having me, it's awesome. We'll see you all next time. No more pants-on-fire craziness.
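The kind of query Wendell describes in the conversation (crash counts grouped by CPU and normalized by play time, rather than "select star where CPU is unstable") could be sketched roughly as below. This is only an illustration under stated assumptions: the `sessions` and `crashes` tables, their columns, the function name `crash_rate_by_cpu`, and the sample rows are all hypothetical and made up for the example, not taken from any real crash or analytics database mentioned in the video.

```python
import sqlite3


def crash_rate_by_cpu(sessions, crashes):
    """Return {cpu_model: crashes per 100 play-hours}, highest rate first.

    sessions: iterable of (player_id, cpu_model, hours_played)  # hypothetical schema
    crashes:  iterable of (player_id, crash_type)               # hypothetical schema
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sessions (player_id TEXT, cpu_model TEXT, hours REAL)")
    db.execute("CREATE TABLE crashes (player_id TEXT, crash_type TEXT)")
    db.executemany("INSERT INTO sessions VALUES (?, ?, ?)", sessions)
    db.executemany("INSERT INTO crashes VALUES (?, ?)", crashes)
    rows = db.execute(
        """
        WITH playtime AS (            -- total hours played per CPU model
            SELECT cpu_model, SUM(hours) AS hours
            FROM sessions GROUP BY cpu_model
        ),
        crash_counts AS (             -- crashes attributed to each player's CPU
            SELECT s.cpu_model, COUNT(*) AS n
            FROM crashes c
            JOIN (SELECT DISTINCT player_id, cpu_model FROM sessions) s
              ON s.player_id = c.player_id
            GROUP BY s.cpu_model
        )
        SELECT p.cpu_model, 100.0 * COALESCE(k.n, 0) / p.hours AS rate
        FROM playtime p
        LEFT JOIN crash_counts k ON k.cpu_model = p.cpu_model
        WHERE p.hours > 0
        ORDER BY rate DESC
        """
    ).fetchall()
    db.close()
    return dict(rows)


if __name__ == "__main__":
    # Made-up sample rows, purely to show the shape of the data.
    sessions = [
        ("a", "i9-13900K", 40.0),
        ("b", "i9-14900K", 60.0),
        ("c", "i9-12900K", 100.0),
        ("d", "Ryzen 9 7950X", 100.0),
    ]
    crashes = [
        ("a", "gpu_out_of_vram"),
        ("a", "decompression_error"),
        ("b", "gpu_out_of_vram"),
        ("c", "gpu_out_of_vram"),
    ]
    for cpu, rate in crash_rate_by_cpu(sessions, crashes).items():
        print(f"{cpu}: {rate:.2f} crashes per 100 play-hours")
```

Normalizing by play-hours is the point of the exercise: a raw crash count would mostly reflect which CPUs are popular, while a per-hour rate makes it possible to compare a 13th or 14th gen cohort against 12th gen or AMD. One simplification to be aware of: a player who appears with two different CPU models would have each crash counted against both.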
Info
Channel: Gamers Nexus
Views: 23,312
Keywords: gamersnexus, gamers nexus, computer hardware, intel instability, intel unstable cpu, intel 14900k unstable, intel 14900k crashing, intel 14900k gpu vram, intel 13900k crashing, level1 techs
Id: oAE4NWoyMZk
Length: 23min 59sec (1439 seconds)
Published: Thu Jul 11 2024