Handmade Hero Day 112 - A Mental Model of CPU Performance

Captions
We're starting ten minutes over today because we're going to need the time. All right, hello everyone, and welcome to Handmade Hero, the show where we code a complete game from scratch, live on Twitch: no libraries, no engines. This is essentially a series about how to program everything in games, so we cover everything, including how you would program an engine if you're making your own. Today's topic speaks straight to that point, because it's not something you'd do much of if you were only writing fluffy, high-level game code: optimization.

We've gotten to a point where we finally have some code in Handmade Hero that does enough work that it's actually causing us a framerate problem, and what we want to do is learn some optimization techniques to speed it up so it won't cause us a problem during development. Our goal right now isn't to optimize it down as fast as it can possibly go, but we do need to know enough optimization to handle that piece. And the principles of optimization are more or less always the same, whether you're trying to produce the most optimal version of something or just an optimal-enough version, because it's the same process; it's just a question of when you stop iterating on it. So it's a good time to introduce optimization and talk about how we do it, because we're going to do it at multiple points on the stream, and this is definitely one of those points.

I'm going to go over to the blackboard and give a really quick introduction to optimization: what it is, how to think about it, and how we are going to be thinking about it. Then we'll start building a few primitives we're going to need to do our optimization work. We will probably not do any actual optimization today, because, as I'm about to explain, the first step of optimization never has anything to do with actually optimizing code. It has to do with measuring and determining a bunch of things about your code before you start; if you don't do that, you're missing out on a big part of optimization. So let's talk about that here on day 112.

First, a caveat: I am NOT one of the world's best optimizers, so you're not going to be getting any incredible gems of wisdom here. I am a good-enough optimizer. I optimize code enough to get it to where it needs to be for shipping purposes, but I am not one of those people you go to when it's "oh my god, we have this thing and it needs to be optimized within an inch of its life." Those are the other guys, at RAD: Jeff and Fabian, or other people who spend all their time working on optimization. So I'm giving you the generalist view of optimization: how do you make code that's fairly good and fairly performant? A lot of the techniques apply broadly, but there's a certain level beyond which I won't be able to guide you, which gets into crazy stuff like cache-line aliasing issues. We won't really get into those other than to mention that they exist, and I'll point things out as we go so that, if you want to be one of those people who spends all their time on optimization (and there are people who really like that), you're aware of what you need to go learn to push optimization past that point.
All right. Essentially, what optimization is about is understanding that there is a CPU and a GPU in your computer, and everything that happens every frame of your game goes through one of those two things, and often both. They look incredibly similar: both are chips, either soldered onto your motherboard or sitting on a card, and what those chips do is decode an instruction stream. We went over this in the very first streams: there's an encoding, bytes in the chip's native format, that tells it what to do, and for each instruction it does some computation. The instructions we tend to care about most fall into two broad categories. There are loads and stores, which deal with memory: loads grab things from memory and pull them into the registers or caches of the chip, and stores take things we've computed and push them back out. And then there are math operations (I'd call them ALU instructions, but that's a little ambiguous, so let's just say math operations): the stuff we actually want to do.

And here is a very, very important thing to understand, regardless of which chip we're talking about, because both of them work this way now: math tends to be wide. It tends to be SIMD, single instruction, multiple data, which means exactly what it sounds like: there is one instruction, but it operates on multiple pieces of data at once. In the minimal case that's four things, lanes 0, 1, 2, 3, all of which have the same thing happen to them. So the way we've been writing code, with things like "float x = y + 3", is not at all how modern processors actually work. That does not happen in a CPU or a GPU, ever. We write it that way only because we haven't been caring whether the code is optimal. What will actually happen 99% of the time, if you write that line, is that the processor does the operation on four floating-point values: four x's get four y's plus 3, replicated four times, and it simply throws away three of the results and keeps one. I'm not joking; that's really what happens. The reason is that most processors are designed this way, and four is the minimum number of lanes you'd be throwing away: on an Intel CPU there are often as many as eight, soon to be sixteen, and on a GPU it can be as many as 64.
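To make the wide-versus-scalar point concrete, here's a minimal sketch using SSE intrinsics; the function and variable names are mine, not anything from the stream:

```cpp
#include <xmmintrin.h> // SSE intrinsics

// One lane of work the "scalar" way; the hardware still pushes this
// through a wide unit and discards the unused lanes.
float AddScalar(float Y)
{
    return Y + 3.0f;
}

// The same operation four wide: one instruction, four useful results.
void Add4Wide(float *X, float *Y)
{
    __m128 WideY = _mm_loadu_ps(Y);          // load Y[0..3]
    __m128 Three = _mm_set1_ps(3.0f);        // broadcast 3.0 into all four lanes
    __m128 WideX = _mm_add_ps(WideY, Three); // one add, four sums
    _mm_storeu_ps(X, WideX);                 // store X[0..3]
}
```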
Again, I wish I were making this up; I'm not. So when you encounter an instruction like "add two numbers together" or "multiply two numbers together," any math operation, you are typically talking about something that is either slightly wide, at least four things wide, or absurdly wide, like thirty-two or sixty-four wide. Why do they do that? It sounds kind of crazy. I'm not a hardware guy, so this is the hand-wavy reason; if you want a more detailed one, you have to ask a serious hardware person, someone who works on chip design for a living. The hand-wavy reason is that it is not free for CPUs and GPUs to decode and process these instruction streams. There's a lot of work going on there.

Think about what one of these instructions actually is. An instruction might be something like an add: add the contents of r1 and r2 and put the result in r0, so it computes r0 = r1 + r2. That might be what's encoded in the actual processor: some bytes that encode the fact that we want to add those registers. The registers are the smallest working set of things the processor operates on; maybe there are 16 of them, who knows, it differs from processor to processor, but basically everything looks like this. And if you remember back to the beginning of Handmade Hero, our instruction streams typically looked like: grab something from memory and put it in one of these registers, grab another thing from memory and put it in another register, operate on them, then write the result back out from the destination register to memory. To give a complete instruction stream for people who don't remember that stuff well: we load some memory into r1, we load something into r2, we execute the add (which adds r1 and r2 together into r0), and then we store r0 back out somewhere. So we load two values into the registers of the processor, we add them together, and we write the result out. A really simple, stupid instruction stream, but I'm just getting you back into that frame of mind, since we haven't talked about this in a while; that's what's actually happening inside the processor.
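Here's that blackboard instruction stream written out as a hedged sketch; the mnemonics and register names (r0, r1, r2) follow the blackboard, not any real instruction set:

```cpp
// The blackboard's register-machine view of C = A + B:
//
//     load  r1, [A]      ; pull A from memory into register r1
//     load  r2, [B]      ; pull B from memory into register r2
//     add   r0, r1, r2   ; r0 = r1 + r2
//     store [C], r0      ; push the result back out to memory

int A, B, C; // values that live in memory

void AddThroughRegisters(void)
{
    C = A + B; // compiles to roughly the load/load/add/store stream above
}
```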
Now, the point I'm trying to make for the SIMD discussion is that none of this is free. A bunch of circuitry actually has to do all this work. It has to select registers for the add; it has to know we're taking this register, adding it to that one, and writing the result into this one; it has to look at whether the result gets used somewhere, whether things can be pipelined, whether another add could be issued alongside. If there were another add right after this one, could the two issue in parallel? The answer is no if the second one needs the result of the first; but if it targeted, say, r4, it could, and so on. So the work is actually incredibly expensive: decode the incoming instruction stream, decide how to issue it, go find free units (processors typically have multiple adding circuits, so figure out which adders are free at any given time; maybe this add goes to one and that add goes to a different one), and so forth. We won't go into much of how that works, because it isn't super relevant unless you want to go down the super-hardcore optimization path, where you start needing to know about things like arithmetic-unit pairing, and reservation slots, and whether something can issue on this cycle or the next. If you want to be a super hardcore optimizer you do need to know that; we probably won't ever really get into it on this stream, because you probably won't need it.

The point is that it's incredibly, incredibly complicated, and that complexity means that if you typically do operations on a bunch of data at once, the processor gets to leverage all of that thinking: once it figures out what it needs to do, it can do it on a lot of things at once, and that's a total win in terms of circuitry. It also means the amount of memory that has to be spun through to decode the instructions is much less. If I can load four pieces of data with one instruction, that's three fewer instructions, a quarter of the instructions, compared to four loads for four things. Same with the adds: one add that adds four things instead of four adds. So if you know your heavy workloads can typically be written in this style, it saves a lot of complexity in the processor and it saves a lot of instruction footprint: a lot less memory going into what's typically called the I-cache, the instruction cache, the place where instruction streams get read out and decoded into what they actually mean for the microprocessor. A lot more fits in there, because you simply have fewer instructions.

Basically all processors work this way now. SIMD is everywhere: there isn't a processor you're likely to program on, CPU or GPU, that doesn't look like this. So when we talk about optimization, this is the model we're thinking in, and it has essentially three parts, which I'll codify now that I've got the SIMD part out of the way (hopefully you now understand that when I say the instructions are wide, it means they operate on many things at once). We have the instructions themselves, which are what the CPU or GPU actually has to do; we have the cache; and we have memory.
Whenever we talk about optimizing something, whether on the CPU or the GPU, everything we're going to do always fits this model, because this is all computers can do: it's always a question of moving things from memory into a cache, from the cache into registers inside the processor, manipulating them with instructions like those adds we talked about, and then writing them back: back to the cache, maybe in a different place, and then eventually back out to memory. The GPU and the CPU both have memory, which fills some kind of local cache; the cache gets loaded into registers; the registers get operated on; and results flow back out the same way. And whenever this is happening, the instructions are going to be wide: almost everything we do will be one of those single-instruction-multiple-data operations, at least four wide, possibly more. So the model for everything we do is: an instruction stream grabs things from memory and pulls them into the cache, pulls them from the cache into the registers, operates on the registers to compute some values, and pushes results back out into the cache, which then goes back out to memory.

We talked about this a little before, but I'll mention it briefly again: a cache is something that is usually on the processor (sometimes it's not, but usually it is), and caches have names like L1, L2, L3, where the number indicates how removed they are from the CPU. In a sense you could think of the registers themselves, the things the CPU actually directly addresses and can access essentially instantaneously (that's not entirely true, but they're the fastest thing the processor works with), as an L0. Then the processor has a series of increasingly large memories it can access: the L1 cache, the L2 cache, the L3 cache (which it sometimes doesn't have), and then main memory. The cost of getting things from each of these gets increasingly expensive as you go outward: an L1 cache may be something like 16 cycles to grab from, an L2 cache more than that, an L3 cache more than that, up to main memory, which may be something like 300 cycles. For comparison, one of these instructions, like an add, might be two cycles. So if you want to think about how expensive something is: the further I get from the registers the instruction stream actually operates on, the more expensive it gets to pull something in. Keep this diagram in your mind; we're going to refer to it a lot.
So first, let's talk about the thing I just mentioned: cycles. Cycles are basically the smallest unit the CPU enumerates work in: when you ask the CPU to do something, it typically takes some number of them. What they more or less correspond to is that processors run at a certain rate; they have internal clocking (how that clocking works is far beyond my knowledge, and I'm sure it works very differently now than in the old days when I learned roughly what it was), and every individual cycle the processor can do something. So when we talk about a processor and how fast it is, take the one in this machine: if I pull up the system details, you can see it's a 3.19 GHz processor (there are actually two of them in there), and 3.2 GHz is the rated speed, so we'll use that.

Gigahertz is an SI unit, so it's an actual thousand, not 1024 like gigabytes: a thousand is kilohertz, a thousand thousand is megahertz, a thousand thousand thousand is gigahertz. So we're talking about 3.2 times 1,000 times 1,000 times 1,000 cycles per second: that's how many cycles per second we expect this processor to run at. It can change, because processors have lower power states and power-saving modes and might underclock themselves; there are all kinds of reasons it might fluctuate. But at maximum, running full bore, that's what it will do: 3,200,000,000, basically 3.2 billion cycles per second.

Now, here's what's interesting about that. Even though we don't need it at this very moment, I'm going to open up Emacs so I can use the quick-calc feature we were using (I love Emacs when it does that for you): 3.2 times a thousand times a thousand times a thousand, there are my cycles per second. But our game runs at 30 frames per second, so every frame we get that number divided by 30, which comes out to about 107 million cycles per frame. Let's write that down: roughly 107 million cycles per frame, on the CPU. That means we have 107 million cycles to get everything we need done in a frame, and if we don't get it all done in those 107 million cycles, our game will not hit 30 frames a second. It's that simple.
So the first thing I want you to think about in terms of optimization is that you should always know what this number is. Always start out by asking: given whatever hardware I have to work with, what is the base amount of performance I actually have? What are those numbers? This is the first primary number to think about: 107 million cycles per frame is how many cycles we have on one core of one CPU to do our work. Of course, that's not the entire picture, because a machine may have multiple processors and multiple CPU cores, and technically each core gets this many; so depending on the machine we're targeting, say a four-core machine, we actually have more than this if we can divide the work among multiple cores. We'll get to that later. For now, focus on one core of one CPU: all the work that has to be done on the CPU must fit into that number.

And it's already a pretty low number; 107 million is not that high. Why do I say that? Think about how many pixels there are on a 1920×1080 screen, an HD display: about 2 million. So if you think about how many cycles per pixel there actually are to process, if we're doing software rendering like we are right now, it's not many: 107 million divided by 2 million is roughly 50 cycles per pixel. That's it. Not a very big number at all. I just want to put that in your head: 107 million cycles per frame is actually not that many. This is why you see a lot of software that doesn't run at 30 frames a second: you open some app and it's super laggy and janky. It actually matters that this number not be exceeded, and often it is.

The next thing: you probably won't ever actually get all 107 million cycles per frame. Maybe sometimes you will, but other times there's work that has to happen that isn't yours. For example, when we call Windows to display the frame, or call Windows to get the state of the joystick, there's a bunch of stuff we don't control that uses some of those cycles. So depending on which core and which CPU we're talking about, we may not get all 107 million cycles for our own use. In today's multi-core world, if your user's machine is set up for gaming and is relatively clean, without tons of stuff running in the background (no Acrobat Reader sitting there checking for updates, no Adobe Creative Cloud verifying its license is valid), then on a bunch of those cores you may actually get the full 107 million cycles per frame; but on at least one of them, the one where you interface with Windows, you won't. That's the next thing to be aware of.
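To keep that arithmetic handy, here's the back-of-the-envelope budget as a tiny C++ program; the 3.2 GHz and 30 fps figures are the stream's, everything else is just arithmetic:

```cpp
#include <stdio.h>

int main(void)
{
    double CyclesPerSecond = 3.2e9;                       // 3.2 GHz, SI units
    double CyclesPerFrame = CyclesPerSecond / 30.0;       // ~106.7 million at 30 fps
    double PixelCount = 1920.0*1080.0;                    // ~2.07 million pixels
    double CyclesPerPixel = CyclesPerFrame / PixelCount;  // ~51 cycles per pixel

    printf("cycles per frame: %.0f\n", CyclesPerFrame);
    printf("cycles per pixel: %.1f\n", CyclesPerPixel);
    return 0;
}
```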
All right, keep that in mind. Next: what actually is a cycle? We know the processor runs at this rate, but we don't really know what that means, so it helps to have some understanding of what happens in the CPU on a cycle. What follows is a very, very high-level overview of what happens in a cycle on a processor; it's not a hardware explanation, just a mental model for roughly what's going on, so please take it with that large grain of salt.

First, remember I told you there's that thing called an I-cache. A processor usually has some kind of cache that holds the instructions, and those instructions are actually decoded; they're usually in something called microcode. Microcode instructions are not necessarily the instructions we see when we start looking at compiled code to optimize it. They might be: an add like the one above might really just be one instruction in the microcode. But other instructions might become multiple microcode instructions. So there are really two kinds of instruction, if you will: there's the memory where our code lives, holding the instructions the compiler produced, and there's the I-cache, and when instructions get loaded into the I-cache it's really load-and-decode: code we wrote, compiled into instructions by the compiler, gets loaded and decoded into a series of microcode instructions that don't necessarily correspond one-to-one with the instructions we see. What I'm about to describe is true of microcode instructions, not the instructions we see. Sometimes you'll have the benefit of knowing what that microcode is and sometimes you won't; it depends on how well documented the processor you're working with is and how much the manufacturer wants to share.

So we pull things into this I-cache, which is just like the data cache we were talking about before except that it usually caches decoded instructions, and the processor reads out of it, fetching instructions from it. On one cycle, the processor is going to fetch some number of instructions, maybe as many as four; who knows how many, it depends on the processor. And it may fetch them out of order: in the old days this was very simple, always in order, but nowadays it doesn't even necessarily fetch them in order. The processor looks at some window of instructions in the microcode, fetches the ones it thinks it can do, and issues some number of them (I think up to four is typical on a modern processor, but let's just say some number).
Those fetched instructions, say instruction 0, instruction 1, instruction 2, instruction 3, then get issued: they go off to parts of the chip to be operated on. They go out to units; we don't know what kind of units from here: maybe one is an arithmetic logic unit, an ALU, maybe one is a memory unit going off to fetch some data. The point is we go into the instruction cache, get some instructions back, and issue them.

Now, about that "order?" question mark: modern processors, to speed things up, issue things out of order. What that means is that if the instruction stream says A, B, C, D, E, F, G, there's a part of the processor that looks at those and goes: I can execute A now; I can execute B now; C I can't execute, because C depends on A and B having completed (say C adds the results of A and B), so I can't issue it right now. But I can still issue other instructions, because I can issue, say, four instructions this cycle; so I'll grab D and E as well, since they don't depend on C. And it remembers that it hasn't executed C, so later, once A and B complete, it can grab C and issue it. Modern Intel processors, for example, are heavily, heavily out of order. They work around the latency you can imagine here (if I have to wait for these two things to finish, I can't issue this one) by keeping a very large window of instructions that they look at, and grabbing maximally from that window. Hopefully that makes some sense.

Once the processor has those instructions and issues them, they go into what's generally called a pipeline, and the issue is actually the first step of that pipeline. The pipeline is enumerated in cycles: on any given cycle, instruction issue is happening, but simultaneously the units are working on whatever instructions they were last given, and each instruction may take a certain number of pipeline stages to complete. So an add, for example, might take two stages. Typically (and again, I'm not a hardware guy, so take this with a grain of salt), on the cycle an instruction is issued, it does the first part of the work in the unit: so there's stage zero, stage one, and so on, each one cycle. Let's say our add is going into an arithmetic logic unit.
An ALU is the kind of unit that does things like adds, ORs, masks, bit operations. The add goes to a unit that's free and does stage one of the add. On the next processor cycle it moves into the next stage and does add stage two, and only after that, let's say, does it actually complete and get written out. You have to look up in the documentation, for any given processor and any given instruction, what this actually looks like, but that's what a cycle actually means: one tick of these stages. So if you ask me how many cycles an add takes, and I tell you two cycles to complete, that means it uses this stage and this stage, and only after that are the results actually available. Hopefully that makes sense; is everybody clear on that? Who knows. I'm going to assume you're clear on it; you can ask questions in the Q&A.

Why do they do this? What's the point of pipelining? The reason is that it's a way of reusing parts of units at a faster rate. It's the same reason you have a separate washer and dryer when you do laundry. Let's say you have a washer and a dryer for your clothes in your house: most houses in America do; it's less common elsewhere (in Japan a lot of people line-dry, which is in fact better for the environment, but in America it's "environment be damned, we're going to have a robot wash and dry our clothes"). Think about how that works. I put clothes load 0 into the washer; you can think of that as stage zero. When the washer is done, we go to the next stage: load 0 moves from the washer to the dryer, and now I can use the washer again, so I load it with load 1. The next time around, load 0 comes out of the dryer into the laundry basket, ready to be worn; load 1 moves into the dryer; and load 2 goes into the washer.
And what you can hopefully see by doing this is how many loads of laundry I can actually get through in a fixed amount of time. Let's follow it through: load 2 moves to the dryer with nothing new going in, and load 1 comes out to the basket; then, in one final step, I take load 2 out of the dryer and into the basket. Count them up: stage 1, stage 2, stage 3, stage 4, stage 5. So done this way, three loads of laundry only needed five steps through my washer and dryer.

Now suppose I had some other mystical, mythical device (although they do make these): a combined washer-plus-dryer that takes as long as a wash cycle plus a dry cycle, so two stages, with no pipelining. Load 0 goes in; it's still in there on the next stage; then it comes out. The next load goes in and comes out two stages later; the final load goes in and comes out. If this machine takes the combined time of the two separate machines, then I blow a whole extra stage doing my laundry: six stages, versus the five I needed before, because the pipelined version lets me overlap that first stage. And it keeps getting worse as I crank the number of loads up: at stage six of the pipelined version the next load, load 3, comes out, whereas on the combo machine I'd have to go to stage eight to get load 3 out, so now we're a whole two stages ahead. Every additional load costs the combo machine one more stage than the pipeline does, so asymptotically it always takes twice as long to push anything through this kind of unpipelined device as through the pipelined one.

So processors, in order to maximize the value of all the circuitry they have, are now all based on a pipelining model like this, because it lets them use the first part of a unit while the second part is completing; it's a way to reuse circuitry even when there are dependencies. Let me zoom out a little; hopefully that makes sense.

That gets us another part of our puzzle. We've got our code memory, we've got our I-cache with the microcode in it, and we're grabbing things and issuing them. Now there are two pieces of information we need to care about: a number called throughput and a number called latency, and for every instruction we'd like to know, if possible, what both of them are. Latency means how long it takes an instruction to go from being issued to being complete; that's end to end. Throughput is a number that tells me how many of them I can do if they're pipelined fully, issuing back to back.

Translating that into the washer/dryer diagram: the latency does not change based on the pipelining. The number of cycles it takes to wash and dry one load is two no matter which setup I use. If I have something that needs washing and drying right now, because I'm going to a meeting and it's the only thing I can wear, it's going to take me two cycles. That's latency. On the other hand, if all I care about is that it's laundry day and I have a hundred loads to do, and I want to know roughly how long that will take, throughput is your number for that: it tells you how long things take once you cut out the startup and shutdown time.
The shutdown is typically called drain-out, which is basically flushing the pipeline; I don't remember if there's a term for the startup (it might just be "startup"), which is getting the pipeline going. So what throughput tells you is how long things take to push through the pipeline once we've lopped off the drain-out and the startup; it basically says: if I put one in, how long until one comes out? I don't care which one. In this case that's one stage for the separate washer and dryer, and still two for the combo machine. So the combo is a latency-2, throughput-2 system, and the separate washer and dryer is a latency-2, throughput-1 system, which means it's twice as fast for operating on large sets of data, though for a single item they're the same. Put another way: if you only ever do one load of laundry at a time, you'd be better off with the washer/dryer combo, because you'd do less work yourself; but if you wash lots of loads, you want the pipelined setup, the separate washer and dryer.

So typically, when we're talking about instruction performance, we're only ever going to talk about the throughput number. (I shouldn't say we never care about latency; we'll use the term latency for other things.) With respect to instructions, we typically don't care about latency a whole lot, because our goal is to structure the code so that we're always working on things in a pipelined fashion, so that we're not waiting around for the latency. When we look up how long an add takes, we'll be looking at the throughput of an add, not the latency, because we assume we're always operating on a lot of things at once. Sometimes that won't be true, but that's our goal anyway.
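As an aside, here's a hedged sketch of what latency-bound versus throughput-bound looks like in actual code; the function names are mine, and the multiple-accumulator trick is a standard illustration rather than anything from the stream:

```cpp
// Dependent chain: each add needs the previous result, so every iteration
// pays the add's full latency.
float SumDependent(float *A, int Count)
{
    float Sum = 0;
    for(int I = 0; I < Count; ++I)
    {
        Sum += A[I]; // the next iteration can't start until this finishes
    }
    return Sum;
}

// Four independent chains: the adds can issue back to back and keep the
// pipeline full, so the loop runs at the add's throughput instead.
// Assumes Count is a multiple of 4, to keep the sketch short.
float SumPipelined(float *A, int Count)
{
    float S0 = 0, S1 = 0, S2 = 0, S3 = 0;
    for(int I = 0; I < Count; I += 4)
    {
        S0 += A[I + 0];
        S1 += A[I + 1];
        S2 += A[I + 2];
        S3 += A[I + 3];
    }
    return (S0 + S1) + (S2 + S3); // combine the chains at the end
}
```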
If we understand that, it's pretty straightforward, so let's talk about where latency actually does come into play. This is measuring essentially the same things, but in a different context: let's go up to our caches, because that's where latency really does cause us a huge problem. Latency and throughput translate over there almost exactly; unfortunately we usually care about them in the opposite way, though sometimes not. (Where am I at on time? Oh man, we've gone through almost all of it, so this is basically going to be entirely a blackboard session. That's all right; this stuff has to be covered.)

One terminology note: for memory, for whatever reason, the word that's actually used is bandwidth rather than throughput. They kind of mean the same thing, though not exactly, as you'll see in a second. Latency means exactly the same thing in both contexts.

So what happens with our memory and our caches? The pipeline diagram we drew, with stages issuing and instructions retiring and so on, describes how we think about what the processor does with stuff it already has, stuff in its registers that it can operate on right now. The question is: what about stuff the processor does not have, that it has to go and get? There are typically two things to worry about there. One is the total amount of memory we can move through the processor at maximum speed. That's our throughput-like number, but it's called bandwidth: if everything is flowing smoothly in a perfectly pipelined fashion, how much memory can we move in? There's a thing called a bus that takes memory from the motherboard and drives it into the caches, and similarly there are lines (not really a bus) that let data travel from the cache memory on the chip into the register file on the chip, and each of those has a maximum speed. So there's a bandwidth on both. Typically you don't care that much about cache bandwidth, though sometimes you do, but you almost always care about the bandwidth to main memory, because the maximum bandwidth to and from memory determines the total amount of stuff you could possibly ever operate on. It's the speed at which things can cycle through the processor, and whatever that speed is, that's the maximum you will ever process, because at a bare minimum, even if you did nothing but move data from one place to another, it has to come into the processor and back out. Memory bandwidth, typically given in something like gigabytes per second, is what determines how much memory we can move through the processor at peak.

The latency on the memory side is exactly analogous to the instruction case: it's how long it takes from the time we ask for a piece of memory to the time we get it back. It works just like instructions: if we issue a load and memory takes 300 cycles, then 300 cycles will transpire after we issue that load before we get that piece of memory back. What that means is that if we issued a load and all our subsequent instructions operated on the thing being loaded, we would literally suffer a 300-cycle delay while all those instructions wait for the result of that load to come back. That's pretty horrible, because 300 is a ton of cycles; an add, like I said, is about two-cycle throughput, so it's a massive hit. And that is not a made-up number: memory will often be 200 or 300 cycles to access. That's not unusual; it's certainly in the hundreds even on very fast memory systems. (I suppose it could be better if you have a very slow processor: in the old days, when processors were extremely slow, memory was effectively instantaneous; it basically came right to you. But as processors got faster and faster, the memory couldn't keep up, and this cycle count keeps going up.)
Hopefully that makes sense, and it's where the concept you've probably heard of, the cache miss, comes from; you hear that a cache miss is very expensive. The structure is: there are caches on the processor, memory fills the caches, and the caches are what we load from. These caches are, like I said, something like 16 cycles (I think that's about an L2 speed; an L1 might be two cycles, I'm not sure), but the point is that the caches are way, way faster to fetch from than memory. So if you can write algorithms that tend to work entirely out of things that are in the cache and don't have to go to main memory, you will not suffer cache misses (a cache miss being when you ask for something, it's not in the cache, and the processor has to go to main memory to get it), and your code will run much, much faster, because it never takes any of those latency hits.

What's important to understand is that managing how the cache gets filled is a huge part of modern optimization. There are instructions we can issue (we'll see them as we go) called prefetches, and a prefetch is a way of telling the processor: hey, here's some stuff I want you to put into the cache, because I'm going to need it later, like 300 cycles later, and I want you to get going on it now, so that I can do a bunch of other operations and when I come back, the thing will be in the cache. There's a tremendous amount of this kind of thinking: thinking about the cache, when to fill it, and how to do as much work as you can with the stuff that's in it before it gets flushed out and replaced by something else, all in order to avoid those very expensive latency hits.
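For what a manual prefetch looks like in practice, here's a minimal sketch using the SSE prefetch intrinsic; the loop, the names, and the distance of eight elements ahead are all invented for illustration, not tuned values:

```cpp
#include <xmmintrin.h>

void TouchPixels(unsigned int *Pixels, int Count)
{
    for(int I = 0; I < Count; ++I)
    {
        // Ask the cache to start pulling in memory we'll want several
        // iterations from now, so the fetch overlaps the work below.
        // (Prefetching past the end of the buffer is harmless; prefetch
        // hints never fault.)
        _mm_prefetch((char *)(Pixels + I + 8), _MM_HINT_T0);

        Pixels[I] |= 0xFF000000; // stand-in for real per-pixel work
    }
}
```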
To the same effect, that window I was talking about, the out-of-order window, is yet another way the processor tries to overcome latency. Imagine some not-particularly-good code that does some loads and then immediately starts doing adds on the loaded values, and the data has never been seen by the processor before: it's not in the cache, so it has to go out to main memory, and the processor has 300 cycles to fill. If the window is fairly large, say 50 or 100 cycles' worth of stuff, the processor can start grabbing other things to do while it waits for that load, and not waste all 300 cycles.

Another thing you may have heard of is hyper-threading. What hyper-threading is is a processor keeping two states internally, say state 0 and state 1, exactly as if it were two complete processors: all the state of the processor is represented in both. You give it two sets of code to run, as if you had two separate threads on two separate processors. Whenever it sees that one of them would stall on a memory hit (it ran out of things in the window to execute; everything is waiting on memory), it flips over to the other state and sees if it can grab anything there; that one does some work, probably ends up stalling on memory itself, and the processor switches back to state 0. So hyper-threading is yet another piece of technology Intel introduced basically to get around the fact that memory is very, very slow. There are lots of mechanisms in here designed to hide what are called latency bubbles: times when the processor is waiting on memory and literally can't do anything, because everything it might want to do is waiting on a very long fetch from memory. Hyper-threading is one way around them; the out-of-order window is another; and the rest relies on us, to issue prefetches and to write cache-friendly code (it gets called "cache coherent": basically code that uses the cache efficiently). That part is all up to us.

So now we come, finally (and the timing is perfect; we've got about eight minutes left), to the end of the introduction to thinking about what's actually going on. This works out well, because tomorrow we can talk about optimization the process; today was essentially optimization the platform. Here's what we're trying to think about. You have a CPU and a GPU; right now we're only concentrating on the CPU, but almost all of this also applies to the GPU, oddly enough. Our goal is to figure out a good set of instructions that: doesn't have too big a footprint, so it doesn't blow out the I-cache (because that's a cache too, the one the instructions come from; we're even thinking about not making our code too large, if that makes sense); does the right set of operations to maximize the amount of time it's operating on things already in the cache; knows what memory it's going to use ahead of time, if possible, and pre-fills the cache with it where possible; can potentially be split among multiple hyperthreads, and perhaps even multiple processors, to maximize the number of things the processor can consider at any given time; and whose instruction stream, as much as possible, issues the maximum number of instructions it can on every cycle, keeping the pipeline maximally filled with the things we want done. If we can do all that, we can get the most out of the 107 million cycles per frame we have and turn them into a very fast, high-performance piece of code.

All of these things come into play to different degrees depending on the type of code. In some code it's all about the memory, all about managing how the memory comes through; in other code it's all about the instructions, about figuring out a good way to structure what we're doing; but in most code it's both: first make sure the memory doesn't stall you out, and then, once it doesn't, make sure you do the least amount of work on it possible. So that is optimization.

Now I'm going to mention an adjunct, something we won't talk about too much in this particular pass, because we're starting at the base level of optimization, which is how to write the code so it runs quickly. You might call everything above "performance," and performance is the hardest one to think through, because there's so much domain knowledge: look at all the stuff I had to draw out here, and I didn't even talk about everything; I didn't talk about aliasing in the cache or any of that. Performance is the one with the most domain knowledge of the processor.
Its adjunct is efficiency, and efficiency is the easiest to understand; it will literally take me less than the couple of minutes we have left to tell you what it is. Efficiency means not doing work you don't have to do. Efficiency is all about the algorithm. When we talk about performance, we're usually talking about making a particular algorithm run as fast as possible using all of that nastiness, thinking through all of that complexity, which you have to do if you want code to run quickly. But there's a whole other side of the coin, one that comes before any of that: what are you actually doing, and did you have to do it at all? A very inefficient algorithm makes performance optimization uninteresting, because a more efficient algorithm that hasn't been performance-optimized might beat it anyway.

In fact, we already saw this once. Remember when I first showed you how to fill a triangle? We had a triangle, and we iterated over every pixel on the screen asking "is it inside the triangle?", because I was just trying to demonstrate how the algorithm worked. That ran so slowly that one little tiny triangle would completely tank the frames per second. We then switched to iterating only over the pixels inside the bounding rectangle of the triangle, and it got much faster. (It may actually have been a rectangle we were filling rather than a triangle; I don't remember, but you get my point.) That is an example of efficiency: we didn't do any performance optimization, and the cost to compute any given pixel was exactly the same; we just checked fewer pixels. No amount of performance optimization will ever make up for efficiency, so you always want to make sure you've done some work to be reasonably efficient before you start optimizing. In our case we roughly know what we're doing: we'll talk about a little efficiency work, because we have a little of it to do in the software renderer, but most of our work there will be performance work, because we know that at the end of the day we've got pixels to fill.

So that's really everything about optimization at the high level. We have efficiency and we have performance. Efficiency means doing the least amount of work you can to get the result you need; performance means structuring the data, and the actual operations on that data, so that they move through the processor in the fewest cycles they possibly can. And like I said, we probably won't hit idealized performance on anything we do; we won't spend enough time on performance to get the absolute best. What we'll shoot for is the 80 percent case or so: a pretty well optimized way of writing the algorithm (not the algorithm itself, I should say, but the way of writing it), one that could still easily be beaten by somebody willing to spend a month just working on it.
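To make the fill example concrete, here's a hedged reconstruction of the two approaches; IsInsideTriangle and PlotPixel are placeholder declarations standing in for the renderer's routines, not Handmade Hero's actual code:

```cpp
// Placeholders for the real renderer's test and plot routines.
bool IsInsideTriangle(float X, float Y, float AX, float AY,
                      float BX, float BY, float CX, float CY);
void PlotPixel(int X, int Y);

static float Min3(float A, float B, float C)
{float M = (A < B) ? A : B; return (M < C) ? M : C;}
static float Max3(float A, float B, float C)
{float M = (A > B) ? A : B; return (M > C) ? M : C;}

// Inefficient: test every pixel on the screen against the triangle.
void FillBrute(int Width, int Height,
               float AX, float AY, float BX, float BY, float CX, float CY)
{
    for(int Y = 0; Y < Height; ++Y)
        for(int X = 0; X < Width; ++X)
            if(IsInsideTriangle((float)X, (float)Y, AX, AY, BX, BY, CX, CY))
                PlotPixel(X, Y);
}

// Efficient: identical cost per pixel tested, but only the bounding box.
void FillBounded(float AX, float AY, float BX, float BY, float CX, float CY)
{
    int MinX = (int)Min3(AX, BX, CX), MaxX = (int)Max3(AX, BX, CX);
    int MinY = (int)Min3(AY, BY, CY), MaxY = (int)Max3(AY, BY, CY);
    for(int Y = MinY; Y <= MaxY; ++Y)
        for(int X = MinX; X <= MaxX; ++X)
            if(IsInsideTriangle((float)X, (float)Y, AX, AY, BX, BY, CX, CY))
                PlotPixel(X, Y);
}
```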
So tomorrow we'll talk about how we actually do that. This is how much budget we have; how do we think through how we're going to work with those cycles? How do we come up with an estimate for how many cycles we think we should be taking? How do we check against that estimate, and how do we measure how much we are actually taking? All of that is what we'll talk about tomorrow. For now we'll go to the Q&A, since I introduced a lot of stuff on the stream. Just in case anyone has questions, I'd like to have a brief Q&A where people can ask about them, so if you do have a question, please go ahead and ask it now, put "Q:" in front of it, and I will see if I can clarify things for you.

otamaclick asks: "Would you be willing to make more blackboard episodes? This is very informative." I always make blackboard episodes when they come up. If there's something in particular you think we didn't cover in enough detail on a previous episode, please let me know and we could do a blackboard session on it eventually. But any time I come to a point in Handmade Hero where I need to explain something, we do a blackboard episode, and we've had multiple blackboard episodes in the past when we've gotten to those things. I never skimp on the blackboard; when it's time to blackboard, we blackboard.

grumpygiant256 asks: "Are you going to be using anything like VTune for measuring performance?" I don't know; probably not. I think we'll probably just RDTSC it. Correct me if I'm wrong, but VTune still costs a lot of money if I'm not mistaken, and I'd like to keep the stream to things that people can actually use themselves. So we'll try to write enough of our own stuff that we can do it ourselves, and leave VTune as an exercise for the reader with $500, unless they're giving it away for free now; I'm not sure.

"How are instructions written in cache memory?" I have absolutely no idea, to be honest. I really don't. That would be a great question for someone at Intel. I don't know if they document it; they might, but they might not, and it may just be a trade secret. I've programmed for 30 years now and I've never once seen somebody talk about the actual format of the microcode, literally the format. It's not that I tried to go find it and couldn't; I've just never seen it come up. It might be that if you search for something like "Intel microcode format", or the x86 or x64 microcode format, there's a page somewhere that talks about it. Maybe Intel has a doc on it; it could be in the system architecture manual, I don't know. That would be a great question for somebody at Intel: is it public, and if so, where?
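(Since VTune is off the table, here is a minimal sketch of what "just RDTSC it" might look like, assuming GCC/Clang-style intrinsics, with a made-up workload standing in for whatever we actually want to measure. The time-stamp counter ticks at a fixed reference rate and isn't serialized against out-of-order execution, so treat the number as a rough estimate:

    #include <stdio.h>
    #include <x86intrin.h>   // __rdtsc on GCC/Clang; MSVC has it in <intrin.h>

    // Hypothetical stand-in for the code we actually want to measure.
    static int DoWorkload(void)
    {
        int Sum = 0;
        for(int I = 0; I < 1000000; ++I) {Sum += I;}
        return Sum;
    }

    int main(void)
    {
        unsigned long long Start = __rdtsc();   // read the time-stamp counter
        int Result = DoWorkload();
        unsigned long long Elapsed = __rdtsc() - Start;
        printf("result %d in ~%llu cycles\n", Result, Elapsed);
        return 0;
    }

In practice you'd run the workload many times and take the minimum, since any single measurement can be polluted by interrupts or cold caches.)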
"Do we manually issue prefetching, or is that something inferred by the CPU by looking at how we access memory?" It can actually be both, but what I was talking about on this diagram is manual. When they introduced SIMD instructions into the processor way back in the Pentium days (I think it was the P6, maybe, or it might have been the P5 with MMX; it's too long ago, I don't remember), way back when, forever ago, they introduced manual prefetching to the processor. So if you write some code and you find you're getting a lot of cache misses, but you know which memory you need and the processor just isn't predicting it properly, you can put in prefetches. If that's the only problem, you just put in prefetches that say: when I'm doing this part of the loop, go prefetch the data that's, say, four iterations past here. It does that each time, so the prefetch always leads the access, like a sniper leading a target, and the processor always knows, "okay, I've got to bring that in." They also have non-temporal stores and that sort of thing; there are actually quite a few such instructions. Probably not as many as you'd like if you're a hardcore optimizer; I'm sure there are plenty of things you'd want for programming the cache that it doesn't do. But for a lightweight optimizer such as myself, there are plenty of instructions in there, stuff like prefetching and non-temporal stores, that let you fix a basic problem with the cache. Things like cache aliasing, which I only vaguely understand and have never optimized around, are where a hardcore optimizer might say, "I wish they gave me more micro-control; I could do more with it." To give you some perspective, that's why some people really loved optimizing for the SPUs on the PlayStation 3. On the Cell processor there were these SPUs, units whose fast memory was essentially fixed: something like 256K of local store that you micro-program yourself. You said what to move in and when, and what to move out and when, and you had complete control, so the programmer could do whatever they wanted. People who really like that kind of control could find great optimizations there; it's a lot of power they don't have on an Intel processor.
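(Here is a rough sketch of both of those, using the SSE intrinsics. The array, the work done per element, and the prefetch distance are all invented for illustration; the real lead distance is something you tune by measuring:

    #include <xmmintrin.h>   // _mm_prefetch, _mm_stream_ps, _mm_sfence, _mm_set1_ps

    // Manual prefetch: "lead the target" by asking for data a fixed number
    // of iterations ahead, so it's already in cache when the loop gets there.
    void ProcessArray(float *Values, float *Out, int Count)
    {
        for(int I = 0; I < Count; ++I)
        {
            if((I + 16) < Count)
            {
                _mm_prefetch((char *)&Values[I + 16], _MM_HINT_T0);
            }
            Out[I] = 2.0f*Values[I];
        }
    }

    // Non-temporal stores: write four floats at a time straight to memory,
    // bypassing the cache, for output we won't read back any time soon.
    // Assumes Dest is 16-byte aligned and Count is a multiple of 4.
    void StreamFill(float *Dest, int Count, float Value)
    {
        __m128 Four = _mm_set1_ps(Value);
        for(int I = 0; I < Count; I += 4)
        {
            _mm_stream_ps(&Dest[I], Four);
        }
        _mm_sfence();   // make the streaming stores visible before continuing
    }
)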
"I know this is a long way off, but after Handmade Hero is done, do you plan to continue educational streams?" I can try to, yeah, but at some point I'll probably want to stop doing them five days a week. Maybe we'll do one educational stream a week after Handmade Hero is done, or something like that, just to make it a little easier on me.

"How often do you estimate the actual amount of work prior to implementing a feature, versus just implementing it and measuring it?" It depends on the thing. If I actually have some code that I want optimized, I always do at least two things: one, I try to figure out how fast it should go at peak speed, and two, I try to figure out what my target speed would be for it. I try to get both of those in mind. I basically say: all right, suppose everything issued exactly right and I did the minimum number of ops for the math I need to do; given the processor I'm on, and this many cycles per frame, what speed should this thing run at? I almost always do that if I'm actively trying to optimize. That said, I am NOT a great optimizer; I'm not necessarily the guy you want to take advice from, so that may not be such a great idea. But I always like back-of-the-envelope computations of that kind, which is what I'm going to talk about on the next stream: a rough estimate of how fast this thing should run. The reason I like it is that it gives me something to measure against. If the result doesn't turn out to be within a reasonable percentage of the estimate, I can ask why: what was I not thinking about that's causing me to not get what I want? For example, if you get into a situation where you're thinking, "I should be able to do this many loads and this many multiplies and this many stores, all pipelined properly, and it should work out to this," and it's not working out to that, it's a great time to call up Intel, call up whoever, ask your optimization friends: what's going on here, can you give me some information? I've seen great results from that. Some optimizers who are much more serious than I am, like Mike Sartain for example, I've seen end up at places where they say, "this is running too slowly, and I've computed everything out and I know it is," and they've gone and talked to Intel, and gotten Intel to run top-secret traces of the code (traces they couldn't necessarily even see how Intel produced, because it's internal-only stuff), and Intel sends back, "oh, the problem is there's a thing in the processor that works like this, and you're hitting it because you're alternating these two things." And then it's, "okay!" So a really hardcore optimization guy knows the boundary that's stated on the box, sees if they can hit it, and when they don't hit it and don't know why, they ask the processor manufacturer and find out. That's the hardcore version. I always at least try to do the first part of that process, which is knowing where I'm at relative to where I should be, because I think that's a pretty important thing to have. Otherwise you don't really know where you are: "is this code optimized or isn't it? Who knows, because I don't know how fast it should go, and if I don't know how fast it should go, I don't know whether how fast it's running right now is pretty fast or pretty slow."

"So if memory takes a few hundred cycles, what impact does it have if the instructions have to reach out to the hard drive?" The answer is: a lot less now, because of SSDs, so it's getting better. But the way to think about the hard drive is that it just adds yet another piece. I think we drew this diagram a long time ago on Handmade Hero, actually, but since it's a little more relevant now I'll draw it again. You can think of it this way: register, L1, L2, L3, memory, and then just add another box, drive. You can even add two of them: drive, then network. It's cheapest at one end (registers are almost free) and very expensive at the other, with everything in between. That's why you end up with this concept that, somewhere out there, latency starts being the dominating thing you think about, because it gets so big: 300 cycles, up to thousands of cycles, or tens of thousands of cycles, or way worse, depending on what you're fetching from where. As latency starts to dominate, as you move out in that direction, everything tends to be overlapped in a huge way. With drive stuff, you issue reads far ahead: on the thing I'm doing at work right now, I issue drive reads two seconds in advance of where they'll be used, or more. So we're talking six billion instructions in advance, at least, and sometimes more depending on the circumstance. That's just how that is.
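(To put hypothetical numbers on the kind of estimate described a couple of answers back; every figure here is invented for illustration, not measured:

    screen:      960 x 540                 ~= 518,400 pixels to fill per frame
    ideal work:  ~4 ops/pixel, 4-wide SIMD, ~2 wide ops issued per cycle:
                 518,400 * 4 / 4 / 2       ~= 260,000 cycles if everything issues perfectly
    budget:      ~107,000,000 cycles/frame

A perfect fill would be well under one percent of the frame, so if measurement says the fill is taking, say, 20 million cycles, the gap between 260 thousand and 20 million is exactly the thing worth chasing.)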
StartYourPancakes, two questions: "One, are there ever any cases where we have to worry about one of our instructions being decoded into multiple micro-instructions without our knowledge? Two, in optimizing, have you set up the code in such a way that you can optimize things function by function with these eventualities in mind, or will we have to restructure some of the functions to allow them to be optimized?"

Question one: typically I never think about microcode all that much, because the throughput and latency numbers tend to be all you really need to know for baseline optimization like we're doing. If you're trying to do super hardcore optimization, then yes, that is actually something you need to worry about, because of the part I was talking about a little earlier; you can see exactly why it comes into play. It's the issuing that kills you. When things get decoded into multiple microcode instructions, that puts more pressure on your issue count. If something gets blown out into seven micro-instructions, then the thing you thought was one instruction is actually going to take two cycles just to issue, even if the pieces are independent, and possibly more; it could take up to seven cycles if they're all serially dependent. So something being broken out into microcode, if the pieces are serially dependent and there's no code around it that can be executed to hide that bubble, could absolutely be a problem. But is that common? No. The common case is that you don't have to think about microcode. Will it happen if you're a super hardcore optimizer? I suspect the answer is absolutely yes. Will it happen if you're just a reasonable optimizer such as myself? No; you don't tend to push things to the limit of peak performance where you'd have to start thinking about things like that.

Question two, whether I've set up the code so that we can optimize things function by function: the answer is generally yes, and generally that's a good thing to be doing. When I write code, I try to write it knowing what things are expensive, and really, that boils down to order stuff. Remember order notation, the O(N), O(N^2) thing? Basically what you want to think about is how many iterations. You've got for-loops, sometimes for-loops inside for-loops, and so on; for that little piece of code in the middle, how many iterations does it have to do? If you keep that in the back of your head, you're always going to roughly know what the slow parts of the game will be, because the things that execute the most frequently are the things that dominate. In the case of the software renderer, that's pixels: it's a loop over X and Y, so it's quadratic in the dimensions of the thing on the screen; it's filling in an area. So I just tend to know, when I'm writing a software renderer, what the expensive things are going to be. I know I've got relatively big sprites on the screen that fill up a fair number of pixels, and I know I have a fairly low number of sprites by comparison. I've got the number of sprites, maybe ten thousand or something total, and I've got the number of pixels, and I know that's in the millions.
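(Putting hypothetical numbers on that comparison; the sprite count and sprite size are made up:

    per-sprite setup:  10,000 sprites * 1 iteration each       ~= 10^4 iterations/frame
    per-pixel fill:    10,000 sprites * (64 * 64 pixels each)  ~= 4 * 10^7 iterations/frame

The inner pixel loop runs thousands of times more often than the per-sprite code, so that's where the cycles go.)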
And from numbers like that I can just tell: this is the thing I need to be most concerned about, and this other thing is less so. So already, when you're writing the code, it should be obvious to you, just as you're typing it in, which places are going to have to be optimized. It's that rectangle fill; that's going to be the very most important thing to optimize, and everything else will be secondary to it, so you've got to make sure the code is all about allowing you to optimize that rectangle fill.

At the end of the day, as long as you're not doing totally crazy stuff in your code (and this is one of the reasons, again, that I also hate object-oriented programming), I find that if you write things in a straightforward C programming style, the way you would normally write the code tends to end up looking like how it will look when it's optimized. The only time that doesn't happen is because C's structs weren't designed when SIMD was a thing, so you typically have to do a little bit of reorganization for the SIMD aspect; that changes things a little. But I find that the way people normally write things in an OOP style is very different from how you would write them if you were trying to optimize, and that's yet another reason I don't like that paradigm. It doesn't mean you can't learn to write your OOP in a style that makes it okay performance-wise; it just means your natural instincts will often be wrong in that domain. That's another reason I like the C style: the natural way people write something in C tends to be pretty close to how it should be laid out when you actually need to optimize it.
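(A sketch of the kind of SIMD-driven struct reorganization mentioned there; the entity layout is invented for illustration. The usual move is from "array of structures", where the four X values a 4-wide SIMD add wants are strided through memory, to "structure of arrays", where they sit next to each other:

    #include <xmmintrin.h>

    // Array of structures (AoS): the natural C layout.
    typedef struct {float X, Y, VelX, VelY;} entity_aos;

    // Structure of arrays (SoA): the same data, rearranged so that
    // X[0..3] load as a single SIMD register.
    typedef struct
    {
        float *X;
        float *Y;
        float *VelX;
        float *VelY;
        int Count;
    } entity_soa;

    // X += VelX*dt, four entities at a time.
    // Assumes the arrays are 16-byte aligned and Count is a multiple of 4.
    void IntegrateX(entity_soa *E, float dt)
    {
        __m128 DT = _mm_set1_ps(dt);
        for(int I = 0; I < E->Count; I += 4)
        {
            __m128 X = _mm_load_ps(&E->X[I]);
            __m128 V = _mm_load_ps(&E->VelX[I]);
            _mm_store_ps(&E->X[I], _mm_add_ps(X, _mm_mul_ps(V, DT)));
        }
    }
)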
"Would it be inefficient to offload the cache to an SSD with minimal RAM usage, or would the latency be too much?" I'm sorry, I'm not sure I totally understand the question. An SSD is basically just very slow RAM, so it sits where I drew it, out there in the hierarchy. You would never offload things to an SSD directly; you would offload them to memory first and then to the SSD, usually, because you want to overlap that flush. But yeah, I'm not sure I totally understand the question, so sorry about that.

"'Premature optimization is the root of all evil': what's your take on that quote?" My take is that it's largely a meaningless quote, unfortunately, because "premature" is the problem. It's obviously true on the face of it. Well, actually, I don't know if it's true on the face of it, because "the root of all evil" might be a bit much; the statement that would be unequivocally true is just "premature optimization is bad". The problem with the statement is that saying "premature optimization" doesn't really help anyone, because how do you know what's premature? On the one hand, optimizing a bunch of things at the instruction level when your game is already running at 30 frames a second might obviously be stupid and premature; yeah, that was a huge waste of everyone's time, and it makes the code more brittle and harder to change. Obviously that's true. But that's not normally where you make the mistake. Where you tend to make the mistake is: do I need to store these things in a spatial hierarchy or not? Do I need to do this or not? The problem is that doing or not doing those things tends to have architectural implications for the rest of the code, and it's not as simple as just not doing it until you absolutely have to, because doing it later might mean changing how a bunch of stuff was architected, and then that optimization came too late; post-mature optimization becomes the problem. So what I find wrong with the quote is not the quote itself; it's that knowing whether something is premature is actually very difficult, and not as easy as the quote makes it sound.

"Is there any way to use or avoid hyperthreading to your advantage?" Yes, and we will be doing that on the stream. We'll be using both hyperthreading and regular threading in the renderer, and we'll talk about how to use those, probably on the next stream.

"What would you tell someone who doesn't like Emacs?" I'd tell them not to use Emacs. I use Emacs because it's the editor I'm most comfortable with, but I think everyone should use whatever editor they want, provided they can be roughly as fast as you see me code on stream. If you can code that fast in whatever editor you like, you're good. If you can't, you might want to consider either learning to use your editor better or switching to another editor (not necessarily Emacs, just a better one) so that you can be at least roughly as fast as I am, because that's roughly the speed I think you want to go at. If you're going significantly slower than I go, that's probably a bad sign, especially because I code slower on stream than I do in real life, since I'm talking and such. So if you can't at least hit that speed, think about whether you're facile enough with the editor you're using. But it doesn't have to be Emacs.

"Does hyperthreading reduce maximum bandwidth because it has to switch between states, or can both states operate at the same time?" The bandwidth is typically shared between hyperthreads, so bandwidth doesn't get any higher just because there's hyperthreading. Whether it's "reduced" is a little hard to say, because bandwidth is a measure of how much data moves into the processor, and that stays the same; but if you have two hyperthreads, they are sharing that bandwidth, so you don't get double the bandwidth just because there are two of them.

"In your experience, what drives the 'good enough' optimization, and how does a novice get a handle on that?" It's actually very simple; the "good enough" part is pretty straightforward. You have a game; it needs to run at 30 frames a second; you have some expensive stuff you're doing in that game. It is good enough when it runs at 30 frames a second on your target platform. That is the end of it. So it's usually pretty easy to know when you've hit it. Hitting it can be hard if you're trying to do too much, but knowing when you've hit it is easy: the game runs at 30 frames a second, done. And for any given piece, the way you know how close that piece is to being optimal is by doing that back-of-the-envelope calculation I was talking about before.
It's about having some idea of how fast the processor could ideally do this workload, and then seeing how fast you are actually doing it; that's a good measure of how optimized your performance is. Efficiency, though: there's not a lot you can do about efficiency, I'll be honest. Efficiency is very hard to know, because how do you know whether the algorithm you're using is efficient? Sometimes you can figure it out by going, "oh, somebody has proven this algorithm takes at least this many iterations; it's provably O(N), and I'm only doing O(N), so I know I'm relatively efficient." Sometimes that's true. But sometimes it's very hard to know, because it's like: okay, yes, it's proven to be O(N), but I'm accelerating it with this lookup structure that makes it faster most of the time, if the data is distributed well; but what if it's not? Well, maybe I could use a different spatial partition and then it would be distributed better... and so on and so on. There are all these really nuanced considerations, so that part is very hard to know, and sometimes the answer is just that you iterate on both the algorithm and the performance, because you don't know until you get yourself down to 30 frames a second. It's a grind, and there's nothing else you can do; sometimes that's just the case.

So it looks like that's about it; we're out of time and I don't see any more questions on optimization. sudonym73, you were posting a GitHub link there for latency numbers: do you want to go ahead and post that to the forum so we can have it up there? I could tag it as a permanent post or something; that might be cool. For now, let's wrap things up. Tomorrow, hopefully, now that we've gotten all the preliminaries out of the way, I can talk briefly about the stuff we need to do so we can actually get to some code, because I'd like to do a couple of code things relatively soon for us to start working with. Anyway, there we go: that was basically a giant blackboard session on optimization topics. Hopefully it set the stage so that when I go through these subjects and reference things, like the L3 cache, or the latency of an instruction, or its throughput, or whatever, it will be nice and easy for everyone to understand, and we'll all be on the same page. If some of this was confusing, please go back and re-watch it, or ask questions on the forums, so that we can all get on that page, because as I go through this I'm going to refer to those things, like instruction throughput and that sort of thing, and assume we're now clear on what they mean.

All right, thank you everyone for joining me for another episode of Handmade Hero. It's been a pleasure talking about optimization with you; no coding today, but hopefully tomorrow. I hope this cleared some things up and got us on the same page for what we're about to do this week, next week, and so on. If you'd like to follow along with the code we're going to write this week, pre-ordering the game gets you the source code: go to handmadehero.org, pre-order the game there, and it comes with the source code, which you can download every night after we're done here.
The forums are a great place to go if you have questions you want to ask, or if you want to look up some of the stuff we've got up there, like the ports to Mac and Linux; we've also got coding resources and an episode guide, good stuff up there, so check that site out if you're a follower of the stream. We also have a Patreon if you want to support the video series and what we're doing here; we would love that, and you can always subscribe there, which is very much appreciated. And we have a Twitter bot which tweets the schedule, so if you're trying to catch the stream live on any particular day, that's the place to go; it'll always tell you the schedule. Right now we're at 5 p.m. Pacific Standard Time every day except Friday, when we do a morning stream at 9 a.m. So please check that out; it's a great way to remind yourself what the schedule is. Thanks, everyone, for joining me, and I will see you guys tomorrow. Take it easy, everyone.
Info
Channel: Molly Rocket
Views: 14,113
Keywords: Handmade Hero
Id: qin-Eps3U_E
Length: 86min 14sec (5174 seconds)
Published: Mon May 04 2015