Ever played a game with crazy amounts of
stuff, yet running so smoothly it seems like magic? Behind that buttery smooth
performance is more than just powerful hardware. I started as a graphics engineer
almost 20 years ago, and since then I've always been fascinated with optimizations. One stands out as incredibly simple, yet the reason WHY it works is more complex than most people realise. Let's examine one of the most effective optimization techniques
available, and understand it to the same level of depth that experienced engineers do.
Let’s start with a little world and we’ll add a buncha stuff to it. This will serve
as kinda our testbed to show things off, so we’ll keep it reasonably simple.
Let’s go ahead and add an object to the world here, so we’ve got our player and
this object. The idea is relatively simple, this object will have several versions, each
being progressively simpler. So here’s the most complex or detailed version of the object,
and we can sort of pan the camera down the line looking at the simpler and simpler versions.
These will be called levels of detail, or LOD.
What we’re going to do is, when we place one of
these objects in the world, we’ll start it nearby for convenience, and the idea is as the object
gets further and further away from the camera, we cycle through the different levels of detail
for that object. So we’re a bit further away, we switch to a lower level of detail. A bit further,
and we can do that again, switch to an even lower level of detail, that sorta thing. How many levels of detail games use is a bit of a toss-up; since authoring them takes artist time, you may only have enough budget to build a high and a low level of detail, or you may have money to burn and can build a half dozen versions of the asset.
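If you wanted to sketch that switching logic in code, it could look something like this. To be clear, this is just a minimal illustration; the mesh names, thresholds, and the LodLevel structure are all made up for the example, not from any particular engine.

```ts
// Minimal distance-based LOD selection sketch (illustrative thresholds).
// Assumes each object carries an array of levels, ordered most to least detailed.

interface LodLevel {
  mesh: string;        // handle/name of the mesh to draw at this level
  maxDistance: number; // use this level while camera distance <= maxDistance
}

function selectLod(levels: LodLevel[], distanceToCamera: number): LodLevel {
  for (const level of levels) {
    if (distanceToCamera <= level.maxDistance) {
      return level;
    }
  }
  // Past the last threshold, fall back to the coarsest level (or an imposter).
  return levels[levels.length - 1];
}

// Example usage with made-up numbers:
const treeLods: LodLevel[] = [
  { mesh: "tree_lod0", maxDistance: 25 },            // full detail
  { mesh: "tree_lod1", maxDistance: 75 },            // simplified
  { mesh: "tree_lod2", maxDistance: 200 },           // very simple
  { mesh: "tree_imposter", maxDistance: Infinity },  // billboard
];
console.log(selectLod(treeLods, 120).mesh); // "tree_lod2"
```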
This brings us to a really awesome optimization
that’s used quite a bit, called imposters. For many people, this is just another level
of detail, or a whole other technique, I mean it doesn’t really matter.
Let’s take these 2 objects here in the distance. From this viewpoint, what are the
differences between those 2? Not a whole lot, but as we move forward, we can see that
something is kinda off with one of them. This one looks perfectly normal, but the other
one is clearly just a picture that’s rotating with us. I can freeze it so that we can walk around it
and see that this is just a floating texture. This is a billboard or imposter, and from far enough
away, you really can’t tell the difference.
With a little bit of extra work, these imposters can be really advanced: you can render out normal maps along with the diffuse, and you can render them from multiple viewpoints and blend between them, giving you a highly versatile and incredibly low-cost stand-in for the original asset.
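The "picture that rotates with us" effect boils down to orienting a textured quad towards the camera every frame. Here's a rough sketch of the simplest version of that, a yaw-only billboard; the hand-rolled vector math is just to keep the example self-contained, a real engine would use its own math library.

```ts
// Sketch: orient a billboard quad so it always faces the camera (vertical axis locked),
// which is the "picture that rotates with us" effect.

type Vec3 = { x: number; y: number; z: number };

function billboardYawToCamera(objectPos: Vec3, cameraPos: Vec3): number {
  // Angle around the vertical axis that points the quad's face at the camera.
  const dx = cameraPos.x - objectPos.x;
  const dz = cameraPos.z - objectPos.z;
  return Math.atan2(dx, dz); // rotate the quad by this yaw each frame
}

// Example: camera off to the side -> quad yaws to face it.
const yaw = billboardYawToCamera({ x: 0, y: 0, z: 0 }, { x: 10, y: 2, z: 10 });
console.log(yaw); // ~0.785 rad (45 degrees)
```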
This scene here from my previous grass video
features something like 50k trees of various types, and it’s trivially cheap to render
due to most of them being imposters.
The question now is, why does
this technique work so well?
The intuitive answer is that these models have fewer vertices, and so it sorta makes sense that having fewer vertices equals doing less work on the GPU. But pressed for more information than that, this is kinda where the explanation stops for many people. This answer isn't wrong, per se, but at the same time, there's so much more going on here.
If you were to go and buy a state-of-the-art GPU, the spec sheets these days list performance in TFLOPS, which is an absurd amount of computing power. So why does just reducing the number of vertices in the scene make such an outsized difference? There's more to the story.
Back in the 90's, GPUs had what was called a "fixed function pipeline", which meant that the various stages of the GPU were set in stone. The hardware had different units allocated to different tasks. What this meant practically was that resources weren't shared; it was more like an assembly line, with things going in one end, going through a series of fixed steps, and then coming out fully processed on the other end.
This all changed around 2004, when AMD, or ATI as it was back then for those of us who have been doing this a long time, developed the Xenos GPU for the Xbox 360. This would be one of the earliest designs to unify the shader architecture, meaning that the shader processing units of the GPU were shared between vertex and pixel shading duties. Nvidia followed not too long after, I think it was around 2006, with the Tesla generation of GPUs, their first unified shader architecture.
What this means is that the GPU can now perform sort of load balancing between the demands
of different types of processing. The GPU is free to dynamically allocate resources
between say, vertex and pixel processing, leading to more efficient GPU utilization.
Whereas before, with the fixed function pipeline, it couldn't. So with that in mind, my uber powerful fancy new GPU realistically shouldn't be bothered too much by the introduction of even millions of extra vertices, because a single fullscreen pass already touches a couple million pixels and doesn't even make a small dent in performance. If vertices and pixels are basically shared on the hardware now, there shouldn't be a difference.
Let’s look at this example, I have this fullscreen
effect going, meaning that I have an expensive fragment shader running on the whole screen,
every single pixel. I could be running anything here: changing the colour, doing various blurs. In reality, there are so many things that modern games do as post effects that the exact contents of this shader aren't that important.
The ONLY important point here is that we’re
running at 100 fps. Now, let’s work through some numbers together, to get a sense of the scale
of work the GPU is currently doing. So this screen that I’m recording on is 1920x1080, so 1920
multiplied by 1080 is 2073600 pixels in total, or let’s round that to a cool 2M pixels, and I’m
doing a tonne of work in the pixel shader.
Here is the vertex shader. It's incredibly bare, there's barely any work in here; in fact the fragment or pixel shader is doing an order of magnitude more work than this. Butttt if I do something really dumb, just to illustrate a point, and make a buttload of triangles, in fact 1 quad for every pixel on the screen, then with 2M pixels that means we've got around 4M triangles. Despite the awesome power of this GPU, the framerate drops catastrophically. It absolutely just grinds to a crawl, so why is it that the raw power of our GPU doesn't translate to handling these 4M triangles effortlessly? Something else is going on here, let's unravel this further.
If we're looking at a single quad, that's composed of 2 triangles which are themselves built from 4 vertices; it's not a whole lot. If we expand that out to a full screen of quads, we get somewhere in the ballpark of 4M triangles, and millions of vertices to go along with them. If you think about a screen having 2M pixels already, and we're able to do many many many fullscreen passes, and all of the associated shader computation that comes with it, that number of vertices isn't impressive.
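Just to make those numbers concrete, here's the arithmetic for the one-quad-per-pixel experiment written out. The vertex total depends on whether the quads share vertices, so both extremes are shown; the resolution is the 1920x1080 screen from before.

```ts
// Worked numbers for the "one quad per pixel" experiment at 1920x1080.

const width = 1920;
const height = 1080;

const pixels = width * height;                  // 2,073,600 (~2M)
const quads = pixels;                           // one quad per pixel
const triangles = quads * 2;                    // 4,147,200 (~4M)
const vertsIfUnshared = quads * 4;              // ~8.3M if every quad owns its 4 vertices
const vertsIfGrid = (width + 1) * (height + 1); // ~2.08M if it's one connected grid

console.log({ pixels, quads, triangles, vertsIfUnshared, vertsIfGrid });
```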
We can try out a few different numbers, to
see how this affects things. We already know that a single quad is pretty quick, but what
happens as we increase the number of quads, and thus the number of triangles and vertices?
We can start out by just doubling the number of quads used, and we don’t see much in the way
of differences in performance. If anything is happening, it’s pretty minor at best. Butttt
as we scale the count up aggressively, as we make the size of the quads smaller, and thus the
# of quads needed to fill up the screen increases, effectively increasing the triangle density
on the screen, we start to notice a dropoff in performance. Initially, it’s nothing crazy, but
what we see is that it’s not a gradual decline, there’s a real dropoff at some point.
Between tiles roughly 4x4 in size, that's 4 pixels high and 4 pixels wide, and then 2x2, we see this absolutely catastrophic dropoff in performance. It's just bananas how bad this gets.
Let's talk a bit about how GPUs work to better understand why this is failing so badly.
There’s this really great post by Nvidia called “Life of a Triangle - Nvidia’s Logical Pipeline”
that kinda gives you a behind the scenes look at what happens between that API call and
when things finally appear on the screen.
AMD also has a really great talk entitled “ALL
THE PIPELINES – JOURNEY THROUGH THE GPU“ which has a nice overview of the different stages
that basically happen between you attempting to draw something, and pixels appearing on the
screen. We’ll loosely follow along with Nvidia’s, but I’ll provide a link to the
AMD one in the description.
So if we imagine things starting out in our own code, we'll initiate the whole thing with some sort of draw call. In WebGL this might look something like gl.drawElements, Vulkan might be something like vkCmdDrawIndexed, etc. I mean the specific commands aren't important. What happens now is that the driver goes ahead and validates whatever you sent, making sure the data even makes sense, before prepping it into a GPU-friendly format.
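For reference, here's roughly what that looks like from the application side in WebGL2. This is just a fragment of a sketch: it assumes the shaders, vertex buffers, and a vertex array object have already been created and bound elsewhere, and the index count is an arbitrary example.

```ts
// Where the whole pipeline starts from the app's point of view: a draw call.
// Minimal WebGL2 sketch; setup (shaders, buffers, VAO) is assumed to exist already.

const canvas = document.querySelector("canvas")!;
const gl = canvas.getContext("webgl2")!;

function drawMesh(indexCount: number): void {
  // The driver validates this call (types, ranges, bound state) and packages
  // it into a GPU-friendly command stream before anything reaches the hardware.
  gl.drawElements(gl.TRIANGLES, indexCount, gl.UNSIGNED_SHORT, 0);
}

drawMesh(36); // e.g. a cube: 12 triangles * 3 indices each
```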
That makes its way over to the GPU, which does some processing; let's kinda just skim this. The important takeaway here is that it creates batches of triangles to work on, and these are sent to the parts of the GPU that Nvidia calls GPCs, or graphics processing clusters. Let's not wander too far off into the weeds here, but let's just mentally go with the idea of GPCs as being capable of handling a portion of the graphics rendering independently from the others. So obviously then, the more of these you have, the better, because that means your GPU can theoretically do more work simultaneously. The important part at this stage is that these will grab the vertex data for the triangles, and begin scheduling work for each vertex.
It’ll do the vertex processing at this point,
and the Nvidia doc goes over various tricks and optimizations they use, which aren’t
super important for the sake of simplicity. In the AMD talk, they specifically refer to
this as the Primitive Assembler (PA) stage, whose job is to put together a triangle and then
forward that on to the rasterization stage.
This is a bit like a step in an assembly
line that takes a bunch of pieces, the individual vertices, and builds your
primitives for you, usually triangles.
The next stage basically involves figuring out
who’s going to be doing the fragment shader work, so at a coarse level they’re kinda testing
the triangle against quads on the screen and divvying up the work. You can see in Nvidia's diagram that, depending on the screen rectangles, they're handing off the work to different GPCs, and AMD's has a similar note about doing some scan conversion to test triangle overlap. The end result is that a bunch of fragment or pixel shader work gets queued up. What it does here is take these 2x2 quads of pixels, that's 4 pixels per quad, and queue those up together. You can see both Nvidia and AMD reference this, and the simple reason is that at that size, they can do some efficient calculations. Stuff like texture gradients for mip mapping, well a 2x2 quad has just enough information to do that, while not being more complex to implement from a hardware point of view. So pretty much everyone does this.
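To see why 2x2 is "just enough", here's a sketch of the texture gradient idea in plain TypeScript. Within the quad you can take a finite difference of the UVs across and down, which tells you how big one pixel is in texel space, and that picks the mip level. This mirrors the standard GLSL-style mip selection math rather than any specific piece of hardware; the inputs are illustrative.

```ts
// A 2x2 quad gives you horizontal and vertical UV differences "for free",
// which is exactly what mip selection needs.

type Vec2 = { u: number; v: number };

function mipLevel(uvQuad: [Vec2, Vec2, Vec2, Vec2], textureSize: number): number {
  const [topLeft, topRight, bottomLeft] = uvQuad;
  // ddx/ddy: per-pixel UV change across and down the 2x2 quad, scaled to texels.
  const ddx = {
    u: (topRight.u - topLeft.u) * textureSize,
    v: (topRight.v - topLeft.v) * textureSize,
  };
  const ddy = {
    u: (bottomLeft.u - topLeft.u) * textureSize,
    v: (bottomLeft.v - topLeft.v) * textureSize,
  };
  const lenX = Math.hypot(ddx.u, ddx.v);
  const lenY = Math.hypot(ddy.u, ddy.v);
  // Larger footprint in texel space -> higher (blurrier) mip level.
  return Math.max(0, Math.log2(Math.max(lenX, lenY)));
}
```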
At this point, there’s a bit more work to do depth testing, blending, or various other
operations before finally getting output, and that’s how pixels are born.
One obvious question that falls out of this is, if 2x2 quads of pixels are the smallest
possible unit the GPU works with, what happens if I draw a triangle smaller than that?
Good question, and this is where you realize that GPUs are basically built with, you could say, an assumption about what you're going to draw. And that's pretty reasonable; hardware engineers are looking for ways to maximize the GPU's potential and then teach you what to do and what not to do. So basically, there's an assumption about the ratio between triangle size and pixels covered, and at some point past this, they'll still support whatever it is you're trying to do, but in a very disapproving way.
When you try to draw something smaller, the GPU absolutely can still do it, but what happens is
that you’ve got this 2x2 quad, which is kinda like the smallest unit the GPU works with. The triangle
inhabits this quad here, what happens to these other 3? The answer is that the GPU still does the
work, but then it just throws away the result.
In fact, it’s always doing this, just most of the
time, you don’t notice it. In this case, it threw away 75% of the work, but let’s imagine that we
have a bigger triangle, so let’s look at a bigger screen here, and then we’ve got a triangle that’s
being rendered, so it’s touching a bunch of pixels right? Except we now know that the GPU works with
quads, so it’s actually touching a whole buncha quads, some of which lie right on the edges of
the triangle. All of these quads here that lie on the edge of the triangle, they have pixels that
lie within the triangle, and some that don’t. But the GPU will perform the work for the entire quad
each time. There’s not much you can do about this, it’s a natural consequence of rendering. Some
of this work, the GPU is just going to have to throw it away, it’ll be wasted.
But what you CAN affect is what kind of triangles you feed to the GPU. Really small triangles are an obviously terrible case: you get really poor quad utilisation, throwing away obscene amounts of GPU performance.
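You can put a rough number on that waste with a one-liner: the useful fraction of shading is the pixels actually inside the triangle divided by four times the quads the triangle touches. The counts below are made-up inputs, just to show the two extremes.

```ts
// Rough feel for quad utilisation: the GPU shades whole 2x2 quads, so the
// useful fraction is pixelsCovered / (4 * quadsTouched). Inputs are illustrative.

function quadUtilisation(pixelsCovered: number, quadsTouched: number): number {
  return pixelsCovered / (quadsTouched * 4);
}

// A big triangle: only the edge quads waste a little.
console.log(quadUtilisation(10_000, 2_700)); // ~0.93 -> roughly 7% thrown away

// A sub-pixel triangle: 1 pixel lit, but a whole quad shaded.
console.log(quadUtilisation(1, 1)); // 0.25 -> 75% of the shading thrown away
```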
But we can go a bit further and examine
different topologies for a mesh. By topology, I mean the way the vertices are arranged to form
the mesh. You can triangulate a mesh in many ways, and it’s important to understand that they’re
not all treated the same by the GPU.
Emil Persson, a senior graphics engineer at Epic Games, known extremely well within graphics circles as Humus, has a great blog which I've referenced in the past; for example, in the last video I talked about occlusion systems, and one point I made was based on a talk he did for Avalanche Studios. Anyway, there's a nice article of his talking about how the topology of a mesh affects performance. What the article examines in detail is how different triangulations of the same mesh result in different performance, while things like the mesh's area and the number of triangles are kept constant.
The findings shouldn't surprise us. As we can see, what the test did was increase the number of vertices and gauge performance. We can see that the max area approach worked out best, while making ever thinner triangles did poorly, and you already know why: you end up with poor quad utilization, and thus the GPU does a lot of work that it has to throw away.
As Emil’s research showed, the specific way that
a mesh is triangulated has a very direct impact on performance, and better triangulation
can lead to better quad utilization, and thus less wasted work from the GPU.
So given that understanding, what happens if, instead of 1 quad per pixel, we push things further? Let's start upping the number: 4 quads per pixel, or 9 quads per pixel, or even 16 quads per pixel, or whatever. And what you'll see is that the framerate will start to tank further and further. It's a lot of geometry, but not THAT much: 2M quads to start, growing to 8M, 18M, and then 32M.
This was an interesting topic of exploration on g-truc's blog, which looked at the effect of subpixel triangles on performance. As they noted near the bottom, as you crank the number of vertices up, the framerate absolutely tanks, even if you're rendering at a lower resolution. This post is quite old now, and I don't have access to the exact setup that was used there, so my numbers don't match exactly, but I am seeing terrible performance, which is realistically all that matters.
This goes back to the idea that your GPU is this
super powerful monster that can take on anything, and a buncha triangles, in the grand
scheme of things, isn’t that much. So then what’s behind this drop in performance?
Let’s revisit that primitive assembly stage, but this time through the lens of AMD’s RDNA
architecture, as detailed by their ‘Journey through the GPU’ talk. We touched on that earlier,
where we can think of Primitive Assembly like a step in an assembly line that takes a bunch of
pieces, the individual vertices, and builds your primitives for you, usually triangles.
So on slide 19, on the RDNA architecture, we can see that various parts of the pipeline have been labelled. And there's this one here, called the Primitive Assembler, which as we know assembles vertices into a triangle and then outputs it to the next stage, often called rasterization.
This is where it gets interesting. This will obviously vary from architecture to architecture, but you can sometimes go and find the information. For example, we can look at the RDNA architecture whitepaper from AMD, which goes into some detail on how the Navi-class GPUs work. There's a lot of info in here, but we can kinda just focus on what we want, which is some info about primitive assembly.
Specifically, they talk about how they scale the architecture. They have this detailed breakdown of how Navi-class GPUs, like the Radeon RX 5700 XT which came out in 2019, are structured. They consist of what are called shader engines, the Radeon RX 5700 XT has 2 of these, and within a shader engine, they're further divided into shader arrays, which contain, among other things, a primitive unit and a rasterizer.
This is analogous to the Nvidia setup we saw earlier, where they partitioned their GPUs into what were called GPCs, or graphics processing clusters.
So the neat thing is then, you can connect that to
this section later in the whitepaper, where they talk about how the primitive units assemble
triangles. Let’s just read this directly:
The primitive units assemble triangles
from vertices and are also responsible for fixed-function tessellation. Each primitive
unit has been enhanced and supports culling up to two primitives per clock, twice as fast
as the prior generation. One primitive per clock is output to the rasterizer.
So each primitive unit can output 1 primitive per clock. So while the GPU may
be exceedingly powerful in some respects, for example pixel processing, it’s still bound
by the rate the primitive assembler can pass things off to the rasterization stage.
So that's neat, because it lets you know more specifically what the upper limit is, for a given architecture, on how fast these things can feed triangles to the rasterization stage.
This is going to vary wildly from GPU to GPU and architecture to architecture; it's maybe worth digging into if you're working on a console and have some hard numbers to work against, but personally I think it's best to just get the broad-strokes idea.
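For a back-of-the-envelope feel, you can turn the whitepaper's "one primitive per clock" into a ceiling. The specific figures below are my assumptions for an RX 5700 XT class part (4 primitive units, roughly 1.9 GHz), not numbers from the whitepaper quote itself, so treat them as ballpark only.

```ts
// Back-of-the-envelope ceiling on triangle throughput from primitive assembly.
// Assumed: ~4 primitive units (2 shader engines x 2 shader arrays) at roughly
// 1.9 GHz, each outputting 1 primitive per clock as the whitepaper quote says.
// Real workloads bottleneck on other things well before this ceiling.

const primitiveUnits = 4;        // assumption: one per shader array
const clockHz = 1.9e9;           // assumption: rough boost clock
const primsPerClockPerUnit = 1;  // from the quote above

const trianglesPerSecond = primitiveUnits * primsPerClockPerUnit * clockHz; // ~7.6e9
const targetFps = 100;
const trianglesPerFrameCeiling = trianglesPerSecond / targetFps;            // ~76M

console.log({ trianglesPerSecond, trianglesPerFrameCeiling });
// Tens of millions of sub-pixel triangles per frame eat a big chunk of that
// budget while contributing almost nothing visible.
```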
By throwing more and more vertices at the GPU, even if they're not visible, you're forcing them through primitive assembly, which may kinda bottleneck things depending on the card you're using. Slightly older cards especially couldn't output that many primitives per clock, and so you end up starving the later stages, since this is a bit like an assembly line.
As g-truc points out on their blog, if you're dumping out a zillion small triangles, you're not making the most of the rasterization stage that follows. In fact, a tonne of performance is getting flushed down the toilet.
So now we have all the tools and foundation
to go back to this simple optimization that we introduced in the beginning, but understand at
a much deeper level what’s actually happening.
Whatever your preconceptions of why level
of detail worked, now you understand that the GPU really isn’t all that good at
rendering a bunch of micro triangles, and spends a disproportionate amount of computing
power doing it. Computing power that COULD have been used to make other stuff look more awesome.
We also now understand that by inundating the GPU with useless small triangles, beyond all that 2x2 quad waste, we're also forcing everything through primitive assembly.
A few years ago, Unreal unveiled Nanite, a really interesting new technology that I’m
sure you’ve heard of by now. But if you haven’t, no worries, Unreal’s Nanite is, in
essence, a continuous and automatic level of detail system. No need for artists to
do this, Unreal handles it automagically.
We can look through their SIGGRAPH 2021 presentation entitled "Nanite: A Deep Dive", and one interesting thing that pops up is that they confirm a lot of what we've talked about here. On this page, about 80 pages in, this is a VERY long presentation, they talk about some technical decisions they made. We can see here that they mention just how crappy tiny triangles are, and they point out that GPUs are built with parallel pixels in mind, but not triangles, which jibes with what we've seen. I mean, they're the Epic team, they've done their homework.
As they mention in their slides, a lot of modern GPUs can set up 4 triangles per clock at most, so a primitive assembly bottleneck is a very real problem that they set out to solve, or at least work around. So what they did was write a software rasterizer for micro triangles, and that ended up being considerably faster. We don't have to understand the details of their implementation, but now AT LEAST we understand their motivation.
If you think you’d like to learn more about
gamedev from an experienced graphics engineer, I have a variety of courses available, very
suitable for beginners, so if you find yourself wanting to get a bit better at gamedev and delve a little deeper into some subjects, check them out. Otherwise, if you'd still like to support
me and help choose what topic I cover next, you can vote on my Patreon page.
Cheers