When Optimisations Work, But for the Wrong Reasons

Captions
Ever played a game with crazy amounts of stuff, yet running so smoothly it seems like magic? Behind that buttery smooth performance is more than just powerful hardware. I started as a graphics engineer almost 20 years ago, and since then I've always been fascinated with optimizations, but one stands out as both incredibly simple and yet more complex in WHY it works than most people realise. Let's examine one of the most effective optimization techniques available, and understand it to the same depth that experienced engineers do.

Let's start with a little world and we'll add a bunch of stuff to it. This will serve as kind of our testbed to show things off, so we'll keep it reasonably simple. Let's go ahead and add an object to the world here, so we've got our player and this object. The idea is relatively simple: this object will have several versions, each being progressively simpler. So here's the most complex or detailed version of the object, and we can pan the camera down the line, looking at simpler and simpler versions. These are called levels of detail, or LODs. When we place one of these objects in the world, we'll start it nearby for convenience, and as the object gets further and further away from the camera, we cycle through the different levels of detail for that object (there's a rough code sketch of this switching at the end of this section). So we're a bit further away, we switch to a lower level of detail. A bit further, and we do it again, switching to an even lower level of detail, that sorta thing. How many levels of detail a game uses is a bit of a tossup: since authoring them takes artist time, you may only have enough budget to build a high and a low level of detail, or you may have money to burn and build a half dozen versions of the asset.

This brings us to a really awesome optimization that's used quite a bit, called imposters. For many people this is just another level of detail, or a whole other technique; it doesn't really matter. Let's take these 2 objects here in the distance. From this viewpoint, what are the differences between them? Not a whole lot, but as we move forward, we can see that something is kinda off with one of them. This one looks perfectly normal, but the other one is clearly just a picture that's rotating with us. I can freeze it so that we can walk around it and see that this is just a floating texture. This is a billboard or imposter, and from far enough away, you really can't tell the difference. With a little bit of extra work, these imposters can be really advanced: you can render out normal maps along with the diffuse, you can render them from multiple viewpoints and blend between them, giving you a highly versatile and incredibly low-cost stand-in for the original asset. This scene here from my previous grass video features something like 50k trees of various types, and it's trivially cheap to render because most of them are imposters.

The question now is, why does this technique work so well? The intuitive answer is that these models have fewer vertices, and so it sorta makes sense that fewer vertices equals less work on the GPU. But pressed for more information than that, this is where the explanation stops for many people. That answer isn't wrong, per se, but at the same time, there's so much more going on here.
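As a minimal sketch of that distance-based switching, here's what a simple LOD table and selection function might look like. The mesh names and distance thresholds below are made up for illustration; they're not taken from the video.

```typescript
// Hypothetical LOD table: each entry pairs a mesh with the camera distance
// up to which it stays active. Thresholds and names are illustrative only.
interface LodLevel {
  mesh: string;        // stand-in for an actual mesh/geometry handle
  maxDistance: number; // use this level while distance <= maxDistance
}

const lodLevels: LodLevel[] = [
  { mesh: "tree_lod0_high",   maxDistance: 25 },
  { mesh: "tree_lod1_medium", maxDistance: 75 },
  { mesh: "tree_lod2_low",    maxDistance: 150 },
  { mesh: "tree_imposter",    maxDistance: Infinity }, // flat billboard beyond that
];

// Pick the simplest acceptable level for the current camera distance.
function selectLod(distanceToCamera: number): string {
  for (const level of lodLevels) {
    if (distanceToCamera <= level.maxDistance) {
      return level.mesh;
    }
  }
  return lodLevels[lodLevels.length - 1].mesh;
}

console.log(selectLod(10));  // "tree_lod0_high"
console.log(selectLod(200)); // "tree_imposter"
```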
If you were to go and buy a state-of-the-art GPU, they list their performance these days in TFLOPS. That's an absurd amount of computing power, so why does just reducing the number of vertices in the scene make such an outsized difference? There's more to the story.

Back in the '90s, GPUs had what was called a "fixed function pipeline", which meant that the various stages of the GPU were set in stone. The hardware had different units allocated to different tasks. What this meant practically was that resources weren't shared; it was more like an assembly line, with things going in one end, passing through a series of fixed steps, and coming out fully processed on the other end. This all changed around 2004, when AMD, or ATI at the time, for those of us who have been doing this for a long time now, developed the Xenos GPU for the Xbox 360. This was one of the earliest designs with a unified shader architecture, meaning that the shader processing units of the GPU were shared between vertex and pixel shading duties. Nvidia followed not too long after, I think around 2006 with the Tesla generation of GPUs, their first unified shader architecture. What this means is that the GPU can now perform a sort of load balancing between the demands of different types of processing. The GPU is free to dynamically allocate resources between, say, vertex and pixel processing, leading to more efficient GPU utilization, whereas before, with the fixed function pipeline, it couldn't.

So with that in mind, my uber powerful fancy new GPU realistically shouldn't be bothered too much by the introduction of even millions of extra vertices, because a single fullscreen pass already touches a couple of million pixels and barely makes a dent in performance. If vertices and pixels basically share the same hardware now, there shouldn't be a difference.

Let's look at this example. I have a fullscreen effect going, meaning that I have an expensive fragment shader running on the whole screen, every single pixel. It could be doing anything, changing the colour, doing various blurs; modern games do so many things as post effects that the exact contents of this shader aren't that important. The ONLY important point here is that we're running at 100 fps. Now, let's work through some numbers together to get a sense of the scale of work the GPU is currently doing. The screen I'm recording on is 1920x1080, and 1920 multiplied by 1080 is 2,073,600 pixels in total, or let's round that to a cool 2M pixels, and I'm doing a tonne of work in the pixel shader. Here is the vertex shader; it's incredibly bare, there's barely any work in here. In fact, the fragment or pixel shader is doing an order of magnitude more work than this. But if I do something really dumb, just to illustrate a point, and make a buttload of triangles, in fact 1 quad for every pixel on the screen, so with 2M pixels that means around 4M triangles, then despite the awesome power of this GPU, the framerate drops catastrophically. It absolutely grinds to a crawl. So why is it that the raw power of our GPU doesn't translate to handling these 4M triangles effortlessly? Something else is going on here; let's unravel this further.

If we're looking at a single quad, that's composed of 2 triangles, which are themselves composed of 4 vertices; it's not a whole lot (the quick arithmetic below spells out how that scales).
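Here's a quick back-of-the-envelope tally of that "one quad per pixel" stunt, matching the numbers quoted above; just a sketch with the resolution hard-coded.

```typescript
// Counts for the "one quad per pixel" experiment at 1080p.
const width = 1920;
const height = 1080;

const pixels = width * height; // 2,073,600, or "a cool 2M pixels"
const quads = pixels;          // one quad per pixel
const triangles = quads * 2;   // ~4.1M triangles

console.log({ pixels, quads, triangles });
```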
If we expand that out to a full screen of quads, we get somewhere in the ballpark of 4M triangles, and thus roughly 4M vertices in total. If you think about a screen having 2M pixels already, and we're able to do many, many fullscreen passes with all of the associated shader computation that comes with them, that number of vertices isn't impressive.

We can try out a few different numbers to see how this affects things. We already know that a single quad is pretty quick, but what happens as we increase the number of quads, and thus the number of triangles and vertices? We can start by just doubling the number of quads, and we don't see much difference in performance; if anything is happening, it's pretty minor at best. But as we scale the count up aggressively, as we make the quads smaller and thus increase the number of quads needed to fill the screen, effectively increasing the triangle density on the screen, we start to notice a dropoff in performance. Initially it's nothing crazy, but what we see is that it's not a gradual decline; there's a real dropoff at some point. Between tiles roughly sized 4x4, that's 4 pixels high by 4 pixels wide, and 2x2, we see an absolutely catastrophic dropoff in performance. It's just bananas how bad this gets. (There's a small sketch of generating this kind of quad grid after the pipeline overview below.)

Let's talk a bit about how GPUs work to better understand why this is failing so badly. There's a really great post by Nvidia called "Life of a Triangle - Nvidia's Logical Pipeline" that gives you a behind-the-scenes look at what happens between that API call and when things finally appear on the screen. AMD also has a really great talk entitled "ALL THE PIPELINES – JOURNEY THROUGH THE GPU", which has a nice overview of the different stages that happen between you attempting to draw something and pixels appearing on the screen. We'll loosely follow along with Nvidia's, but I'll provide a link to the AMD one in the description.

So if we imagine things starting out in our own code, we initiate the whole thing with some sort of draw call. In WebGL this might look something like gl.drawElements, in Vulkan it might be something like vkCmdDrawIndexed; the specific commands aren't important. What happens now is that the driver validates whatever you sent, making sure the data even makes sense, before prepping it into a GPU-friendly format. That makes its way over to the GPU, which does some processing; let's kinda just skim this, the important takeaway being that it creates batches of triangles to work on, and these are sent to the parts of the GPU that Nvidia calls GPCs, or graphics processing clusters. Let's not wander too far into the weeds here, but mentally go with the idea of GPCs as being capable of handling a portion of the graphics rendering independently from the others. So obviously, the more of these you have, the better, because your GPU can theoretically do more work simultaneously. The important part at this stage is that these will grab the vertex data for the triangles and begin scheduling work for each vertex. The vertex processing happens at this point, and the Nvidia doc goes over various tricks and optimizations they use, which aren't super important for the sake of simplicity.
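To make the tile-size experiment mentioned above concrete, here's a rough sketch of how a grid of screen-space quads might be generated as an indexed mesh, the kind of buffers you'd eventually hand to a call like gl.drawElements. The structure and names here are my own assumptions, not the actual setup used in the video.

```typescript
// Build an indexed grid of screen-aligned quads, tileSize pixels on a side.
function buildQuadGrid(tileSize: number, width = 1920, height = 1080) {
  const cols = Math.ceil(width / tileSize);
  const rows = Math.ceil(height / tileSize);
  const positions: number[] = []; // x, y pairs in pixel coordinates
  const indices: number[] = [];   // 2 triangles (6 indices) per quad

  for (let y = 0; y < rows; y++) {
    for (let x = 0; x < cols; x++) {
      const base = positions.length / 2; // index of this quad's first vertex
      positions.push(
        x * tileSize,       y * tileSize,
        (x + 1) * tileSize, y * tileSize,
        x * tileSize,       (y + 1) * tileSize,
        (x + 1) * tileSize, (y + 1) * tileSize,
      );
      indices.push(base, base + 1, base + 2, base + 2, base + 1, base + 3);
    }
  }
  return { positions, indices, quads: cols * rows, triangles: cols * rows * 2 };
}

// 4x4 tiles -> 129,600 quads; 2x2 -> 518,400; 1x1 -> 2,073,600 (~4.1M triangles).
console.log(buildQuadGrid(4).quads, buildQuadGrid(2).quads);
```

In a test like this, the buffers would be uploaded once and drawn with a single indexed draw call; only the tile size changes between runs.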
In the AMD talk, this next step is specifically called the Primitive Assembler (PA) stage, whose job is to put together a triangle and then forward it on to the rasterization stage. It's a bit like a step in an assembly line that takes a bunch of pieces, the individual vertices, and builds your primitives for you, usually triangles.

The next stage basically involves figuring out who's going to do the fragment shader work, so at a coarse level they test the triangle against rectangles of the screen and divvy up the work. You can see in Nvidia's post that, depending on the screen rectangles, they hand the work off to different GPCs, and AMD's has a similar note about doing scan conversion to test triangle overlap. The end result is that a bunch of fragment or pixel shader work gets queued up. What the GPU does here is take 2x2 quads of pixels, that's 4 pixels per quad, and queue those up together. Both Nvidia and AMD reference this, and the simple reason is that at that size they can do some efficient calculations. Stuff like texture gradients for mip mapping: a 2x2 quad has just enough information to do that, while not being more complex to implement from a hardware point of view. So pretty much everyone does this. At this point there's a bit more work for depth testing, blending, and various other operations before finally getting output, and that's how pixels are born.

One obvious question that falls out of this is: if 2x2 quads of pixels are the smallest unit the GPU works with, what happens if I draw a triangle smaller than that? Good question, and this is where you realize that GPUs are basically built with, you could say, an assumption about what you're going to draw. And that's pretty reasonable; hardware engineers are looking for ways to maximize the GPU's potential and then teach you what to do and what not to do. So basically, there's an assumed ratio between triangle size and pixels covered, and past some point they'll still support whatever it is you're trying to do, but in a very disapproving way.

When you try to draw something smaller, the GPU absolutely can still do it, but you've got this 2x2 quad, which is the smallest unit the GPU works with. The triangle only covers this one pixel of the quad, so what happens to the other 3? The answer is that the GPU still does the work, but then it just throws away the result. In fact, it's always doing this; most of the time you just don't notice it. In this case it threw away 75% of the work, but let's imagine a bigger triangle on a bigger screen, being rendered and touching a bunch of pixels. Except we now know that the GPU works with quads, so it's actually touching a whole bunch of quads, some of which lie right on the edges of the triangle. All of the quads that lie on the edge of the triangle have some pixels inside the triangle and some that aren't, but the GPU performs the work for the entire quad each time. There's not much you can do about this; it's a natural consequence of rendering. Some of this work the GPU is just going to have to throw away; it'll be wasted. But what you CAN affect is what kind of triangles you feed to the GPU.
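As a rough way to put numbers on that waste, here's a tiny model of 2x2-quad utilization. The coverage figures used below are illustrative assumptions, not measurements from the video.

```typescript
// Rough model of 2x2-quad waste: the GPU shades whole 2x2 pixel quads, so a
// triangle covering only a few pixels still pays for every pixel in each quad
// it touches. The inputs here are made-up examples, not profiled numbers.
function quadUtilization(pixelsCovered: number, quadsTouched: number) {
  const pixelsShaded = quadsTouched * 4;            // 4 pixels per 2x2 quad
  const utilization = pixelsCovered / pixelsShaded; // fraction of shading kept
  return { pixelsShaded, utilization, wasted: 1 - utilization };
}

// A sub-pixel triangle touching one quad: 1 useful pixel, 3 thrown away (75% waste).
console.log(quadUtilization(1, 1));

// A long, thin triangle: lots of edge quads, so coverage per quad stays low.
console.log(quadUtilization(200, 180)); // ~72% of the shading discarded
```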
Really small triangles are an obviously terrible case: you get really poor quad utilisation, throwing away obscene amounts of GPU performance. But we can go a bit further and examine different topologies for a mesh. By topology, I mean the way the vertices are arranged to form the mesh. You can triangulate a mesh in many ways, and it's important to understand that they're not all treated the same by the GPU.

Emil Persson, a senior graphics engineer at Epic Games, known extremely well within graphics circles as Humus, has a great blog which I've referenced in the past; for example, in the last video I talked about occlusion systems, and one point I made was based on a talk he did for Avalanche Studios. Anyway, there's a nice article there about how the topology of a mesh affects performance. The article examines in detail how different triangulations of the same mesh result in different performance, while things like the mesh's area and the number of triangles are kept constant. The findings shouldn't surprise us: the test increased the number of vertices and gauged performance, and the max-area approach worked out best, while making ever thinner triangles did poorly. You already know why: you end up with poor quad utilization, and thus the GPU does a lot of work that it has to throw away. As Emil's research showed, the specific way a mesh is triangulated has a very direct impact on performance, and better triangulation can lead to better quad utilization, and thus less wasted work from the GPU.

So, given that understanding, what happens if instead of 1 quad per pixel we push things further? Let's start upping the number: 4 quads per pixel, 9 quads per pixel, even 16 quads per pixel. What you'll see is that the framerate tanks further and further. It's a lot of triangles, but not THAT many: it's 2M quads to start, growing to roughly 8M, 18M, and 32M (the quick tally at the end of this section works these out). This was an interesting topic of exploration on g-truc's blog, which looked at the effect of subpixel triangles on performance. As they noted near the bottom, as you crank the number of vertices up, the framerate absolutely tanks, even if you're rendering at a lower resolution. That post is quite old now, and I don't have access to the exact setup that was used there, so my numbers don't match exactly, but I am seeing terrible performance, which is realistically all that matters. This goes back to the idea that your GPU is this super powerful monster that can take on anything, and a bunch of triangles, in the grand scheme of things, isn't that much. So then what's behind this drop in performance?

Let's revisit that primitive assembly stage, but this time through the lens of AMD's RDNA architecture, as detailed in their "Journey through the GPU" talk. We touched on this earlier: we can think of primitive assembly like a step in an assembly line that takes a bunch of pieces, the individual vertices, and builds your primitives for you, usually triangles. On slide 19, on the RDNA architecture, various parts of the pipeline have been labelled, and there's this one here, called the Primitive Assembler, which as we know assembles vertices into a triangle and then outputs it to the next stage, often called rasterization. This is where it gets interesting.
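Here's that tally: the quad and triangle counts when drawing N quads per screen pixel at 1080p, as a quick sketch of the arithmetic.

```typescript
// Quads and triangles when drawing N quads per screen pixel at 1080p.
const pixelCount = 1920 * 1080; // ~2.07M

for (const quadsPerPixel of [1, 4, 9, 16]) {
  const quadCount = pixelCount * quadsPerPixel;
  console.log({
    quadsPerPixel,
    quadsMillions: (quadCount / 1e6).toFixed(1),         // ~2.1, 8.3, 18.7, 33.2
    trianglesMillions: (quadCount * 2 / 1e6).toFixed(1), // ~4.1, 16.6, 37.3, 66.4
  });
}
```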
How fast primitive assembly runs will obviously vary from architecture to architecture, but you can sometimes go and find the information. For example, we can look at the RDNA architecture whitepaper from AMD, which goes into some detail on how the Navi-class GPUs work. There's a lot of info in there, but we can focus on what we want, which is some info about primitive assembly. Specifically, they talk about how they scale the architecture, with a detailed breakdown of how the Navi-class GPUs, like the Radeon RX 5700 XT which came out in 2019, are structured. They consist of what are called shader engines; the Radeon RX 5700 XT has 2 of these, and within a shader engine they're further divided into shader arrays, which contain, among other things, a primitive unit and a rasterizer. This is analogous to the Nvidia setup we saw earlier, where they partitioned their GPUs into what were called GPCs, or graphics processing clusters.

The neat thing is that you can connect this to a section later in the whitepaper, where they talk about how the primitive units assemble triangles. Let's just read it directly: "The primitive units assemble triangles from vertices and are also responsible for fixed-function tessellation. Each primitive unit has been enhanced and supports culling up to two primitives per clock, twice as fast as the prior generation. One primitive per clock is output to the rasterizer."

So each primitive unit can output 1 primitive per clock. While the GPU may be exceedingly powerful in some respects, for example pixel processing, it's still bound by the rate at which the primitive assembler can pass things off to the rasterization stage. That's neat, because it lets you know more specifically what the upper limit is, for a given architecture, on how fast these things can feed triangles to the rasterization stage (there's a rough back-of-the-envelope version of that limit at the end of this section). This is going to vary wildly from GPU to GPU and architecture to architecture; it may be a good idea to work against hard numbers if you're on a console, but personally I think it's best to just get the broad-strokes idea.

By throwing more and more vertices at the GPU, even if they're not visible, you're forcing them through primitive assembly, which may bottleneck things depending on the card you're using. Slightly older cards especially couldn't output that many primitives per clock, so you end up starving later stages, since this is a bit like an assembly line. As g-truc points out on their blog, if you're dumping out a zillion small triangles, you're not making the most of the rasterization stage that follows; in fact, a tonne of performance is getting flushed down the toilet.

So now we have all the tools and foundation to go back to the simple optimization we introduced in the beginning, and understand at a much deeper level what's actually happening. Whatever your preconceptions of why level of detail worked, now you understand that the GPU really isn't all that good at rendering a bunch of micro triangles, and spends a disproportionate amount of computing power doing it. Computing power that COULD have been used to make other stuff look more awesome. We also now understand that by inundating the GPU with useless small triangles, beyond all that 2x2 quad stuff, we're forcing everything through primitive assembly.
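As a rough back-of-the-envelope sketch of that ceiling: if each primitive unit hands one triangle per clock to the rasterizer, as the quote above describes, the peak triangle rate is just the number of units times the clock speed. The unit count and clock used below are illustrative assumptions, not figures taken from the whitepaper.

```typescript
// Best-case ceiling on triangles per frame, given the "one primitive per clock
// per primitive unit" rate quoted above. Unit count and clock are assumptions.
function maxTrianglesPerFrame(
  primitiveUnits: number,    // assumed number of primitive units on the GPU
  trianglesPerClock: number, // 1 per primitive unit, per the RDNA quote above
  clockHz: number,           // assumed engine clock
  fps: number,               // target framerate
) {
  const trianglesPerSecond = primitiveUnits * trianglesPerClock * clockHz;
  return trianglesPerSecond / fps;
}

// Assuming 4 primitive units at ~1.9 GHz, targeting 60 fps:
console.log(maxTrianglesPerFrame(4, 1, 1.9e9, 60)); // ~1.27e8 triangles/frame, best case
```

Real frames come nowhere near that number, since this assumes the assembler never stalls and every other stage keeps up; the point is only that the ceiling exists and is finite.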
A few years ago, Unreal unveiled Nanite, a really interesting new technology that I'm sure you've heard of by now. If you haven't, no worries: Unreal's Nanite is, in essence, a continuous and automatic level of detail system. There's no need for artists to author LODs; Unreal handles it automagically. We can look through their SIGGRAPH 2021 presentation, "Nanite: A Deep Dive", and one interesting thing that pops up is that they confirm a lot of what we've talked about here. On one page, about 80 pages in (this is a VERY long presentation), they talk about some of the technical decisions they made. They mention just how bad tiny triangles are, and they point out that GPUs are built with parallel pixels in mind, but not parallel triangles, which jibes with what we've seen. I mean, they're the Epic team; they've done their homework. As they mention in their slides, a lot of modern GPUs can set up at most 4 triangles per clock, so a primitive assembly bottleneck is a very real problem that they set out to solve, or at least work around. What they did was write a software rasterizer for micro triangles, and that ended up being considerably faster. We don't have to understand the details of their implementation, but now at least we understand their motivation.

If you think you'd like to learn more about gamedev from an experienced graphics engineer, I have a variety of courses available, very suitable for beginners, so if you find yourself wanting to get a bit better at gamedev and delve a little deeper into some subjects, check them out. Otherwise, if you'd still like to support me and help choose what topic I cover next, you can vote on my Patreon page. Cheers.
Info
Channel: SimonDev
Views: 773,947
Keywords: simondev, game development, programming tutorial
Id: hf27qsQPRLQ
Length: 22min 18sec (1338 seconds)
Published: Mon Jan 29 2024