Current CPUs are Overheating? The Honest Opinion of an Intel Engineer

Video Statistics and Information

Captions
Welcome back to another video, again joined by Mark. We decided to have one more talk, especially about temperatures of parts and power density. There was some interesting stuff you wanted to share with us, so we thought we'd record again.

All right. So what we're talking about is some of the comments that people make, or at least that I see online. I think you're more in touch with the community than I am, but I do pay attention to what people say in the reviews. One of the things that has been happening more lately is people talking about the temperature they're seeing out of the CPU. Over the last few generations, I would say, the temperatures have been getting progressively warmer, at least the observed temperatures when the parts are under load. I see a lot of people in the community going, "Oh, what's going on? My CPU was sitting at 80, and now it's sitting at 90, and now it's 95. Is this a big deal or not?" I wanted to really talk about the physics behind it and what's going on, because there are some actual good explanations for why you see those temperatures going up. I wouldn't say that seeing a high temperature necessarily equates to "my part is bad" or "my thermal solution is bad." We can get into how you troubleshoot a thermal solution, but that's a whole other discussion in and of itself. In terms of whether the CPU is operating correctly, and where we are pushing them to go, I think that's the interesting conversation.

I mean, it seems to be a very emotional and very subjective topic, because I see the exact same comments you see. Somebody says, "My CPU is running too hot," and then I ask, "What's the temperature?" and he's like, "80 degrees Celsius."

Right, and when I see those comments I'm like, you're well within spec. In fact, you've got 20 degrees of margin. Why are you holding back? Push a little bit further.
Right. And again, I don't want to say you should be going right up to our limit temperature and just staying there all day long. That's not what I mean. When the CPU is idle, yeah, let it cool down, let it get nice and cool and sit there. But if you're pushing it under load, and you've got a good thermal solution and you're sitting there at your PL1, it's fine if you hit that temperature limit during those periods of time.

What I wanted to talk about is the power density. What we were talking about before we turned the camera back on is the way voltage used to scale with area. As we went through the different lithography nodes, voltage would scale down, and so we would stay more or less at the same power density for a given circuit design as we went node to node. In the last 10 or 15 years the voltage scaling is no longer happening, or at least it's happening at a lower rate, and so now we see power density starting to increase. The interesting thing is there are two things happening. One is that our power density is starting to go up every generation, or at least it is as we stay around 1.3 volts at our peak. The other is that we're also pushing our frequencies higher and higher. I've been at Intel long enough that I came in when Pentium 4 was being introduced, with the NetBurst architecture, and one of the things we were talking about at the time was the big frequencies it was going to hit.

I mean, finally you made a 6 gigahertz device, right?

Yeah, now, two decades later. And the part that I find interesting is that we did eventually get to those frequencies, on a different microarchitecture. As our process nodes have improved and as our design capabilities have improved, we're slowly getting there. But as the power densities are going up, and as we're getting better at getting to those high frequencies, or being able to
push the cores as much as they can, we start seeing the transient temperature effect come into play. Where it used to be... and I'll draw on the board here, because this is where things start getting interesting. I'll use different colors that might show up better. So this is time, and this is our temperature, and I'm going to put this as my limit. Normally, when we're designing a system and a thermal solution, we would design for our temperature to be right at the limit when you're running at your peak frequency and at your peak power. When we introduced turbo, we started saying, hey, there's quite a bit of time where it takes a long time for the part to saturate, and we can go to a higher frequency than what the system can support from a sustained perspective, so we'll go a little bit higher. So we pushed a little bit, and we pushed a little bit, and we've gotten a lot better at pushing that. What we have now is, when we go to our peak turbo frequencies, we're pushing to power levels that are much higher than the system can sustain. You'll notice, in our K-SKU parts, we now actually have our default settings, our PL2, or our max turbo power... actually it's not the max turbo power, I forget what we're calling it now.

I mean, you're just referring to PL1 and PL2, right?

Yeah, PL1 and PL2, that's what everybody...

Yeah. You got rid of the Tau? Does that exist anymore?

It does exist, it's still there. But what we recognized is, rather than artificially throttle back, if you're not hitting any other limits, go ahead and run up at PL2 if your system can handle it. Otherwise, rather than worry about tuning Tau perfectly, just let the thermal control circuit back off on frequency when you hit your limits. But what's happening is, let's say this is a long, long time ago: we would slowly... this was our temperature response, so this
would be at the sustained power limit. We'd very slowly, over the course of about 20 or 30 minutes, saturate. Now if I go to a much higher power level, I start getting something that looks like this. Actually, that would have gone above the line, so let me say here. It's easier if I do it on the screen, but we'll draw this live. As they go to higher and higher power levels, you get a much steeper spike.

This part of the curve is exponential?

Yes. And so what's happening is, when I go to even just single-core turbo, which, if you look at the power the part is consuming, is well below PL2 for example, the power density is really high. So you'll see this big spike initially, which is going to drive your temperature up very close to the limit. So what happens is, if I'm monitoring my CPU while it's active, you're going to observe more periods of operation where you're up near the limit, just because you're going to bounce through those lower portions of the curve much faster. You're going to be sitting closer to the limit more often. And honestly, we're designing them now to spend more time there. From an architectural perspective, I want to give as much performance as I can. So I'm treating this not as something that I have to stay far away from, but more as: how do I balance the maximum performance I can give against the physical capabilities of the system? I'm constantly going to be trying to push up to that if I've got more headroom.

Yeah, because you can utilize more...

Yeah, exactly. So 15 years ago, when I designed a CPU that would only go up to the limit at saturation, I would say, yeah, if you're seeing 100 degrees C or so all the time, there's an issue, because most of the time you wouldn't be pushing that power level. But now, since we're consuming so much more power than the system can sustain, we're going to jump up close to that
limit as often and as fast as we can, just to give that peak frequency or that peak performance when it's needed.

So basically you're designing the chips to run at these temperatures, rather than designing them to stay away from these temperatures?

Yeah. They've always been designed to be able to hit those temperatures, but from a control perspective we're taking advantage of that much more now than we ever used to. We're starting to say, okay, what can we do in order to get performance back? We're looking at that whole operating curve, or the whole temperature band I should say, as space that we can use to provide performance back.

This kind of reminds me, to be fully honest, of the statement that AMD published regarding their Ryzen 7000 CPUs, because in their launch reviews a lot of CPUs were peaking out at 95 degrees Celsius. Obviously you're probably not going to comment on the competition, which is totally fine, but I'm just saying what a lot of people out there will probably think. AMD said, okay, the chips are designed to constantly run at 95 degrees Celsius, which kind of makes sense. Maybe you can transfer this to an Intel product: if you look at the spec of the part and it's rated at maybe 95 degrees Celsius, then technically it's fine for the chip to always be running at this temperature, right? But lower would be better.

"Lower is better" is kind of a funny way to phrase it. For example, as somebody who's using the CPU, I actually want the most performance possible. If I'm sitting there just doing web browsing, I don't want it burning a ton of power, so I actually would want it cooler if I'm doing something like that. There are a couple of things that happen if I go to high temperature. I start getting more leakage, which means I'm burning
more energy. So if I'm not doing something that needs high compute capability, then yeah, I would want it to be somewhat cooler. But if I'm sitting there running a heavy workload, say photo editing or video editing, benchmarking, or something like that, then I would look at any time that it's below that 100 degrees C and say, oh, I had a little bit more headroom, I could have gone one tick higher on the frequency curve.

So you're actually wasting potential if you're not running at the limit, basically.

Yeah. But if you're not running, then yeah, back off, let the thing cool down.

Yeah, obviously.

But under load, sure, go up to the limit. We have the thermal control circuit in place. Tom's Hardware years and years ago did a video, when we first introduced it, where I think they took the heatsink off to see what would happen. That circuit still exists. We're still working on it and adding more capability to it to handle the transients that we have now. But it's there to make sure that you don't actually get the part into a place where it's going to be damaged.

And talking about temperatures, there's also something I always wondered about. Obviously you can get a readout from the individual cores, and you get a readout that's made up of, like, an average of those cores, for example, or the highest of those cores. But within the core you probably also have a ton of other sensors. So let's talk about that, because there's some interesting stuff in temperature measurement. I'm not sure most people are aware of this, but the temperature you read out in software is somewhat of an artificial value. Is it like an average of a ton of sensors, or how does it work?

Yeah, so let's see if we can sketch it out, because it gets pretty complicated. So this
is...

That means I will have to grab the camera again.

So let's say this is my die, and I'm going to draw my cores here, and I'm going to draw something with integrated graphics, and then I've got something else over here. This looks similar to our Alder Lake layouts. As you mentioned, there are multiple sensors in each of the cores. I'm just going to draw four, and then there might be a couple over here, and one or two over here. When you go in and read the registers, we report one temperature for each core, and then there's also the package temperature. The package temperature is the highest value of any sensor on the die. So the package temperature is just the max that you can find.

But this could be different on individual CPUs, right? For one it may be core two, for another it's core three, but it will automatically just report the highest value?

Yeah, and we did that just so it's easier for our system vendors or people who are writing the fan speed control, so there's just one register for them to worry about. There are actually better ways to do fan speed control than just relying on that legacy feature, but yeah, the package temperature is just the maximum of any sensor. I don't know if there's a way outside of the debug environment to pull each sensor inside the core individually. I think right now when you pull the registers, you only get the maximum temperature being reported for each core, but there's no way to find which sensor is the hot spot. And this is where the interesting thing comes in, because, as we were talking about power density earlier, as we go further and further into the future on future nodes, we start getting peak power density in smaller and smaller areas. That means the gradients within the core start becoming larger, and so
then it becomes more interesting in terms of, well, where do I put the sensor? There's an interesting debate on the technical side. This is where I think the enthusiast community is fantastic in terms of pointing out things where we could do better. But on the flip side, the part that I think is problematic is that sometimes people start going, "Oh, people don't like high reported temperatures. Well, if I don't want a high reported temperature, I'm just going to put a temperature sensor here."

I mean, that's exactly what I meant. You don't know where the temperature comes from, you don't know where your sensor placement is. It doesn't even mean that the reported temperature is the highest temperature inside the chip itself.

It's absolutely not the highest temperature inside the chip itself. The reason is that the sensors themselves take up physical space, and the sensor circuit doesn't generate very much power at all. So it's one of those measurement uncertainties. What is it, the Heisenberg uncertainty principle, right? You can either know where something is or how fast it's moving, but not both simultaneously. Same thing with temperature. As soon as I drop a temperature sensor into the circuit, I've displaced the actual transistors that are generating heat, and they've moved. So my hot spot might have been right here until I put the sensor in, and then it moves a little bit. So it's an interesting game. Well, I shouldn't say game. A fair amount of my personal time is spent doing the analysis to see where the actual hot spots on the core are and where we should be putting sensors. And I guarantee the actual sensor reading is not the true hot spot within the core. There is a gradient there that is accounted for in our reliability work. So again, coming back to hitting 100 degrees C: that's the sensor
temperature, that's not the actual hot spot. We're accounting for that gradient and that offset as part of the reliability work and the testing that is being done. But for me personally, I like putting the sensor at the hot spot, or as close as I can, because then we don't have to take as much margin into account. I don't have to put buffers in place in the controls that we have. If people just want to see the lowest temperature possible, well, then I'll put a sensor over here. But now I have big gradients that I have to account for, and that means I'm having to throttle back a little sooner than I would have otherwise, on some workloads that might not have needed to be throttled at all. So again, we want the sensors near the hot spot. We actually do want them showing near the limits frequently, because that shows that we're giving the full performance, and it also says that we're controlling things nicely.

So if you place the sensor very close to the hot spot, and let's say you set the max temp to 100 degrees Celsius and the temperature sensor is reading out 99, isn't the hot spot in reality then maybe 105 or something like that?

Right. The hard part is I don't have a good way to go in and measure the actual hot spot in live silicon. I can do it in model space, with computational fluid dynamics or something like that. I can generate power maps, put those in, and come up with a way to say, here's what I expect the gradient to be. But doing the post-silicon correlation is actually really, really challenging. There's some cool metrology that we have, which I can't talk about in more detail, that gives us some insight into it, so I've got confidence in our models. But it's really hard to go in and take a snapshot of an actual workload. If you look at the core itself, the reason we have all these different sensors is because we have the different hot
spots, and the reason we have the different hot spots is that there are different features within the core that do different things depending on the instruction set. Even with something like Cinebench, which as end users we talk about as just a single title, there are different areas within the core that are utilized as that workload actually runs. And it's not, okay, for the first part of the workload I'm going to use this part of the core, and then later in the workload I'll use this part of the core. It's literally...

It's like milliseconds.

Exactly. So the hot spots will bounce around quite a bit. Post-silicon, when you go in and say, okay, I want to figure out what the hot spot is relative to my sensor, and you're trying to take a snapshot of that, that's really hard.

Yeah, because you also would have to cool the part at the same time, from where you're actually measuring on top.

Yes, it's pretty difficult. So I would say there is a gradient. Where is the actual hot spot? Professionally there's actually quite a bit of debate in terms of how much hotter it is. The interesting thing is, if I talk to the test team, they'll swear up and down, nope, no part of the die can be above that limit that we have in the spec. However, knowing that when they're running the test they're relying on the sensors, there has to be a gradient, just thermally. That's just the way physics works. If you look at the transistors themselves in the circuits, the timing it takes to move bits from one part of the core to another is dependent on the frequency that the whole core is running at. So if you've got one temperature here and one temperature there, that's going to make for some interesting things. What happens is, if we're running the tests and you think everything's exactly at 100 degrees C, but you've really got those gradients in place, it means
you're stressing parts of the circuit at higher temperatures than what you were initially designing them for. This is where there's some tension in the system: okay, how do you make sure that you have everything totally uniform?

I mean, it's never going to be fully uniform, but at the end you also have to do your normal evaluation and just see, does it work or not?

Yeah, and if it works and it passes the reliability tests, then you're in good shape. But that part is, I think, interesting as we go further into the future, with the power density continuing to climb. It's job security for me.

That makes sense. So, as a summary of the debate about "my chip is running too hot": what do you personally consider a good operating temperature for a desktop CPU in a load scenario? In a high-load scenario, maybe running a benchmark, you would say that if you're not running at the limit you're wasting potential, right?

Yeah, assuming that your system is designed well. Now, to be fair, me personally, I never even watch the temperature on my own machines at home, other than when I first build them, to make sure that I didn't mess up the thermal interface material when I installed it. And maybe I'll run a workload just to make sure that when I'm up at my PL1 or PL2, I'm able to hit the clock speeds I'm expecting for a reasonable length of time. After that I don't even pay attention to it. But I would say in general, if I'm under load, I'd want to be operating right up to that limit, using as much frequency headroom as I have available in the system. And don't set the TCC offset lower, because you're just going to give some performance back. There's not much point in it.

So that should maybe give you a bit more confidence, because that's one of the typical comments you get, and it's very subjective for a lot of people. If they're
commenting, "My temperature is too high," and you ask them, "What is too high?", we can just follow the specs.

Yeah, that's why we publish them. ark.intel.com, that's where you can look up your CPU, and we publish the specs there.

The only other spec we would need is the max voltage we should use for overclocking, but that's probably a completely different...

Yeah, it's a different discussion. That's harder to answer, honestly, because then it gets into the conversation of what lifetime you want and things like that. When we quote max voltages, we're quoting them at our spec conditions, and overclocking is throwing the spec out the window.

All right, that's maybe material for a different video. So I think we're going to end it here, again above 20 minutes. All right, thanks again for your information, and see you next time. [Music]
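
The transient behaviour described in the conversation (a slow climb to saturation at the sustained power limit, and a much faster rise toward the temperature limit at higher turbo power) can be sketched with a first-order lumped thermal model. This is only an illustration under stated assumptions, not Intel's model: the thermal resistance, thermal capacitance, ambient temperature, and power levels are made-up values, the function names are hypothetical, and a real package has several time constants rather than one.

```python
import math

# All values below are illustrative assumptions, not figures from the video.
T_AMBIENT_C = 30.0   # assumed ambient / coolant temperature
T_LIMIT_C = 100.0    # temperature limit discussed in the conversation
R_TH_K_PER_W = 0.5   # assumed junction-to-ambient thermal resistance
C_TH_J_PER_K = 800.0 # assumed lumped thermal capacitance (tau = R*C = 400 s)


def temperature_at(power_w, t_s):
    """Temperature after t_s seconds at constant power, starting from ambient."""
    tau = R_TH_K_PER_W * C_TH_J_PER_K
    rise_at_saturation = power_w * R_TH_K_PER_W
    return T_AMBIENT_C + rise_at_saturation * (1.0 - math.exp(-t_s / tau))


def time_to_limit(power_w):
    """Seconds until the limit is reached, or infinity if it never is."""
    rise_at_saturation = power_w * R_TH_K_PER_W
    if T_AMBIENT_C + rise_at_saturation <= T_LIMIT_C:
        # The curve only approaches the limit asymptotically (sustained case).
        return math.inf
    fraction = (T_LIMIT_C - T_AMBIENT_C) / rise_at_saturation
    tau = R_TH_K_PER_W * C_TH_J_PER_K
    return -tau * math.log(1.0 - fraction)


if __name__ == "__main__":
    for power in (140.0, 180.0, 250.0):  # hypothetical sustained and turbo powers
        print(f"{power:.0f} W: limit reached after {time_to_limit(power):.0f} s")
```

In this toy model the 140 W case saturates exactly at the limit and never crosses it, while the higher power levels reach the limit in finite time, and sooner the higher the power. That is the qualitative point made on the whiteboard; the actual numbers depend entirely on the assumed constants.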
Info
Channel: der8auer EN
Views: 161,209
Id: h9TjJviotnI
Length: 22min 27sec (1347 seconds)
Published: Thu Mar 09 2023