CPUs Are Out of Order - Computerphile

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
Things we talked about a spectrum meltdown and they rely on some of the more advanced ways that the CPU operates It's probably worth diving down and actually looking at how a CPU actually executes the code be right I mean, we've touched on this before we did a video on pipelining we did a video on caching, but also delve down and see What happens below the surface when we actually get our CPU to execute our code? Let's start by having a simple example: A line of code that we might want to look at what happens. Let's take a line of code that takes a variable Let's take a line of code. It's gonna add up A plus B plus C plus D Times e so I've written this in this sort of see like language So we're gonna do this calculation now as I'm sure most of us are aware When we take that and put into our C compiler run it it gets converted into the machine code that the CPU executes so we take that client of code, and then we'd have to Convert that into the machine code, and then the CPU Executes that machine code so a program like this would end up looking and I'm going to use arm assembly here Just because I know it better than the anything else perhaps for the first instruction. We would load the value for memory of a Into registers, let's pick our zero. We've got 14 or so of them We can use the 16 of them but some of them get used for different things that we don't really use so although the value of a Into our zero next thing we want to do is you want to add that to the value of B Then after make sure we'll get the operator precedence right so we can load the value of B into a register So let's loading the value of C Here into another register And we might as well do D. And E. As well so load or three come on D. And We'll load our four With E as well, and now we can start Adding these things up multiplying them to produce The actual result we want now we're going to make sure we get the precedence right But we could either start by adding a and B together then add on C. And then Multiply D. And E and have them together or we could do that one first I'm just going to start going from left to right as long as the math is right We'll get the right result so we'll add together a and B now I put those two values in r0 and r1 and we need to store the results somewhere We are going to need the value of a again after this, so we'll reuse the register R 0 so we're saying put into R 0 the value of R 0 plus R 1 so this is adding together storing the result in R. 0 so we now added a and B together We want to add on C. And so we could do the same thing add to R 0 The value in R. 0 which is now because of this instruction a plus B want to add on the value in R 2 there's now about a plus B plus C in Our 0 now we need to do the multiplication And we need to do that separately before we add it on so we get the right result so we'll multiply And we'll see we've got an arm, too cheap here, so we've got the multiply instruction there And we need to put the results on whether it's use our 5 D. Which we put in R. 3 and E Which we put in R? 4 and then we want to add the result of that onto the value In our 0 and now our 0 contains the result of a plus B. Plus C plus D times E. And We could then store that back into X So that line of code there at one line of C code would become what 1 2 3 4 5 6 7 8 9 10 different lines an assembler and I've numbered them because I'm going to Refer to them at different times so we can say searching one instruction 5 etc to refer to the different ones now We might expect that our CPU will just xu instruction 1 the new instruction 2 instruction 3 instruction 405 and so on in order To generate the result and some cpus do in fact work exactly like that, but actually if you think about What the cpus and what these descriptions are actually doing you might think well actually? when I get this first one I've got to go an access memory and As we talked about in the caching video many years ago, cache is perhaps a an old-fashioned English word but it basically just means a small place where we can store things so you might use it to store your hidden treasure if you're a pirate or to store Your food for winter on a modern CPU probably say around 200 Nanoseconds to actually go and get the value out of your main memory and load it into the register now of course If these are already cached in the same bit of memory, then you may find that these all execute very quickly We don't know that this isn't the only way we could write this program because if we take this instruction here instruction 6 Where we do the add of r0 and r1 to add up a and B. Well. We've got those two values here They're already in the registers at this point in the program So there's nothing to stop us moving this instruction up there and it would still have exactly the same effect so instruction 6 could be moved to me between instructions 2 & 3 And then we do the next instruction which was the same as instruction 3 here? which would be LDR R to come of the values in memory that's representing the letter the variable see how exactly the same effect. We just moved that Instruction earlier so you could rewrite this program in various different ways now Why is that interesting? well when we think about how a CPU is designed and that you will have various different what impress be termed execution units within there now one of them is what's generally referred to as the ALU or the arithmetic and logic unit and that's the bit of your CPU that does Addition it does subtraction it does sort of logical operators and or and so on But you also have other bits inside there And one of the bits you'll often have in a modern CPU is it part of your CPU that handles loading and storing Values from memory sometimes interact sometimes they don't now Assuming that they are separate parts of the CPU if we look back at our instructions here. We execute instruction 1 It uses a load store. You need to get a value for memory we execute instruction 2 It uses the load store unit to get a value for memory instruction 3 It uses a load store unit to get a value for memory for uses the load store unit to get a value for memory 5 uses the load store unit to get a value for a memory 6 changes and uses the ALU as 2 7 8 & 9 before insertion turn uses the load store unit so we've got a pretty sequential series the first 5 instructions all execute using the load store part of the CPU the next four instructions execute using the ALU and The final instruction again uses the load store unit but as we said we can reorder that into this version here using instructions w x y and z Differentiate them and we execute the first instruction instruction w uses a load store unit instruction X Uses a load store unit instruction Y uses the ALU restrictions ed uses the load store unit Okay, what difference does that make well let's think about what's happening when we're using the load store unit the ALU isn't being used that part of the CV is just sitting there not being used and When we're using the ALU the load store units sitting there not being used, that's what we saw there But does that have to be the case could we actually design it, and you probably guess the answer is that yes? We can so that While the load store unit say is being used that we can run the instructions on the ALU part as well I'd turn the paper round and I'm going to draw This as a sort of timeline so these are our two units and we've got time running along this side as well I'm using the computer for our paper in a Radically different orientation, but never mind, so we're going to execute the instructions On here and the first thing that happens is that we execute instruction W No problem That's going to take certain amount of time to actually that's using the load store unit to execute it These are being fetched and decoded and sort of executed by the different execution units we then execute the next instruction which is X and we couldn't execute this any earlier because the load store unit was being used to execute that one so no difference than what? We had before we're using this one after the other we now come to execute The add instruction now we can't execute this any earlier than this point in time Because this depends on the value of registers r0 and r1 which aren't set? until this point so we need those two values so we can start doing instruction why here now actually It's an ADD it's not going to take as long as fetching things from memory because it's all inside the CPU so we can use A smaller box and we can put instruction Y there and this depends on the value being fetched from there And I'm just going to show this as an arrow here, but the next instruction load r2 comma C well I doesn't depend on anything except the value in there Marie and our load/store units not being used So if we build our CPU right? There's nothing to start that Instruction being executed at the same time and that means that actually when we come to the next instruction Which would be which will be the best instruction to execute next in this example. Let's go back to our program We've executed instructions one to six and three already That's w x y&z we've rewritten the mass let's put instruction seven here What was instruction seven and this is now going to become? I'm gonna have to use it's gonna become instruction a I'll hopefully remember to say instruction a but You can guess the colonics are referring to a on its own is probably the variable if not is probably the instruction so we can now execute instruction a and again instruction a depends on two things It depends on the value of R. 0 which is going to come from this instruction so we have to have that ready But it also depends on the value of R 2 which is coming from this instruction so we have to have that ready as well so it can actually happen any point before This point in time so this would be the LDR R 2 comma dot and this is the add R 0 and this is the next add, but again we can start trying to leave more the instructions because I okay well That's what instruction for here at the same time. We'll call this instruction B And so on we put that at that point we can execute Instruction B at the same time as we do way and I'm really confusing myself with pens here and so again We've saved some time because rather than having to execute that in the same thing we can do these two things at the same time Now to be able to do this we need these instructions need to execute on different execution units we couldn't for example Execute to add instructions at the same time because we haven't got to Al use well, though There's no reason why you can build a CPU with two Al use if you look at modern CPU designs from Intel AMD arm and cetera they all have often have multiple Al used or allow you to do just that but because the different types we can execute them at the same time and the reason we can do that is because They don't depend on the results of one to work out the other so they're working on different things and they're using different parts of the CPU and The CPU that enables you to do this is what's known as a superscalar CPU because it can run multiple instructions at the same time will you continue doing this and we'd end up we execute instruction B then we've got to execute instruction C instruction D uses a Multiply and actually on a CPU probably got a separate execution unit which does and multiplies because you can actually do them faster that way So you have a multiply unit as well so we can execute that multiply D up there We think well okay? Can we do the other at the same time well no because we need the result of that as well so we can then execute the ad down here before finally, and it just fits on the paper like that so we can actually squash things up and we're going to save some time because if you think about it you have the original order of the program and Here's one. I made earlier All right, or as in I'm just about to draw and Shawn will do some very clever Cutting so even if we had a superscalar processor. We've only got one load store unit we've only got one Al you really got one multiply unit we wouldn't have any opportunities with this program To run two instructions at the same time so this version of the program would Still take ten instructions this one still takes ten instructions, but with a superscalar processor we have the opportunity to sort of execute two instructions at the same time because they use different bits of the CPU now you need to design the CPU to allow that but that enables us to Speed things up a little bit because while this is working to get the value for memory. We can execute some more instructions Now that's all very well and superscalar processors started to appear in the mid 90s things like the six eight thousand and sixty the Pentium I think was superscalar But they require the code to be written in a way That enables this to happen so this program wouldn't have been able to do anything This one would but as we said when we were developing this we could work out which Instructions we could move around to get that speed up based on What those instructions depended on so this instruction? We said what what six became why only depended on the values of R 0 and R 1 which has been set by instructions 1 & 2 so we can move that earlier Without affecting anything in our program because it only depended on those 2 values so we can either do this in the compiler or by hand if you write in the assembly yourself like we just did here or It's possible to let the CPU work it out, and so what a modern CPU does what's called an out of water CPU is Reorders the instructions without supposedly breaking the rules of What each instruction does so it'll still execute it as if it was written like this? And it won't change break any of the rules of that but it will say well hang on it will spot that this instruction could happen earlier and So move it earlier to gain some of that parallelism in fact then execute them together at the same time And that works generally get well But as we saw with things like Spector and meltdown if you allow things to happen too far earlier and start doing what's called speculative Evaluation where you say okay? I've got the stuff. I need to execute it now I don't if I need the result but I might do so I'll execute it anyway, and then if I need it I've already done it and if I don't need it while I was still waiting for this to come in anyway So it doesn't matter that I've done it. I've not lost Any time well Then it's turned out that you can have sort of side channels where you can sort of see that that's happened or not Which is caused a few issues with computing? It goes along here like this Intersects the curve somewhere else flips over and it's over here, so this is for G Now we won't look at any more right the edge of a formula for this is just mathematics to do with lines and the tangent of this curve It's actually not very complicated the point is that what we're doing is by multiplying G
Info
Channel: Computerphile
Views: 183,805
Rating: undefined out of 5
Keywords: computers, computerphile, computer, science, CPU, University of Nottingham, Dr Steve Bagley, Spectre, Meltdown, Computer Science, CPU Exploits, out of order, Speculative Execution
Id: _qvOlL8nhN4
Channel Id: undefined
Length: 15min 9sec (909 seconds)
Published: Fri Jan 19 2018
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.