Ever played a game with crazy amounts of stuff, yet it runs so smoothly it seems like magic? Behind that buttery smooth performance is more than just powerful hardware. I started as a graphics engineer almost 20 years ago, and I’ve been fascinated with optimizations ever since, but one stands out: it’s incredibly simple, yet the reason WHY it works is more complex than most people realise. Let’s examine one of the most effective optimization techniques available, and understand it to the same level of depth that experienced engineers do.
Let’s start with a little world and add a bunch of stuff to it. This will serve as our testbed for showing things off, so we’ll keep it reasonably simple.
Let’s go ahead and add an object to the world here, so we’ve got our player and this object. The idea is relatively simple: this object will have several versions, each one progressively simpler. So here’s the most complex, most detailed version of the object, and we can pan the camera down the line, looking at simpler and simpler versions. These are called levels of detail, or LODs.
What we’re going to do is, when we place one of these objects in the world, we’ll start it nearby for convenience, and as the object gets further and further away from the camera, we cycle through the different levels of detail for that object. A bit further away, we switch to a lower level of detail. A bit further still, and we do it again, switching to an even lower level of detail. How many levels of detail a game uses is a bit of a tossup: authoring them takes artist time, so you may only have enough budget to build a high and a low level of detail, or you may have money to burn and can build half a dozen versions of the asset.
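Just to make that concrete, here’s a minimal sketch of what distance-based LOD selection can look like. The mesh names and distance thresholds are made up for illustration; a real engine would also want some hysteresis so objects don’t pop back and forth right at a boundary.

```typescript
// Minimal distance-based LOD selection sketch (names and thresholds are made up).
interface LODLevel {
  mesh: string;        // stand-in for a mesh handle in a real engine
  maxDistance: number; // use this level while the object is closer than this
}

// Ordered from most to least detailed; the last entry could be an imposter.
const lodLevels: LODLevel[] = [
  { mesh: 'tree_high',     maxDistance: 25 },
  { mesh: 'tree_medium',   maxDistance: 75 },
  { mesh: 'tree_low',      maxDistance: 150 },
  { mesh: 'tree_imposter', maxDistance: Infinity },
];

function selectLOD(cameraPos: [number, number, number],
                   objectPos: [number, number, number]): LODLevel {
  const dx = objectPos[0] - cameraPos[0];
  const dy = objectPos[1] - cameraPos[1];
  const dz = objectPos[2] - cameraPos[2];
  const distance = Math.sqrt(dx * dx + dy * dy + dz * dz);
  // Pick the first level whose threshold we haven't passed yet.
  return lodLevels.find(l => distance < l.maxDistance) ?? lodLevels[lodLevels.length - 1];
}

console.log(selectLOD([0, 0, 0], [0, 0, 100]).mesh); // -> 'tree_low'
```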
This brings us to a really awesome optimization that’s used quite a bit, called imposters. Whether you consider them just another level of detail or a whole separate technique doesn’t really matter.
Let’s take these 2 objects here in the distance. From this viewpoint, what’s the difference between them? Not a whole lot, but as we move forward, we can see that something is kinda off with one of them. This one looks perfectly normal, but the other one is clearly just a picture that’s rotating with us. I can freeze it so that we can walk around it and see that this is just a floating texture. This is a billboard, or imposter, and from far enough away, you really can’t tell the difference.
With a little bit of extra work, these imposters can get really advanced: you can render out normal maps along with the diffuse, and you can render them from multiple viewpoints and blend between them, giving you a highly versatile and incredibly low cost stand-in for the original asset.
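For the curious, here’s a rough sketch of the core trick behind a billboard: building a quad that always faces the camera. This is just one way to do it, on the CPU, using the camera’s right and up vectors; in practice you’d more likely do this in a vertex shader, and the function names here are my own.

```typescript
// Sketch: build the 4 world-space corners of a camera-facing (spherical) billboard.
// Right/up come straight from the view matrix, so the quad always faces the camera.
type Vec3 = [number, number, number];

function billboardCorners(center: Vec3, camRight: Vec3, camUp: Vec3,
                          width: number, height: number): Vec3[] {
  const hw = width * 0.5, hh = height * 0.5;
  const corner = (sx: number, sy: number): Vec3 => [
    center[0] + camRight[0] * sx * hw + camUp[0] * sy * hh,
    center[1] + camRight[1] * sx * hw + camUp[1] * sy * hh,
    center[2] + camRight[2] * sx * hw + camUp[2] * sy * hh,
  ];
  // Corners in order: bottom-left, bottom-right, top-right, top-left (two triangles' worth).
  return [corner(-1, -1), corner(1, -1), corner(1, 1), corner(-1, 1)];
}

// Camera right/up would normally be pulled from the view matrix each frame.
console.log(billboardCorners([0, 0, -10], [1, 0, 0], [0, 1, 0], 2, 3));
```

Freezing the billboard, like in the demo, is just a matter of caching camRight and camUp instead of updating them every frame.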
  This scene here from my previous grass video 
features something like 50k trees of various   types, and it’s trivially cheap to render 
due to most of them being imposters.
The question now is: why does this technique work so well?
The intuitive answer is that these models have fewer vertices, and it sorta makes sense that fewer vertices equals less work on the GPU. But pressed for more information than that, this is where the explanation stops for many people. That answer isn’t wrong, per se, but at the same time, there’s so much more going on here.
If you were to go and buy a state-of-the-art GPU, they list their performance these days in TFLOPS. That’s an absurd amount of computing power, so why does just reducing the number of vertices in the scene make such an outsized difference? There’s more to the story.
Back in the 90s, GPUs had what was called a “fixed function pipeline”, which meant that the various stages of the GPU were set in stone. The hardware had different units allocated to different tasks. What this meant practically was that resources weren’t shared; it was more like an assembly line, with things going in one end, passing through a series of fixed steps, and coming out fully processed at the other end.
This all changed around 2004, when AMD, or ATI at the time for those of us who have been doing this for a long while, developed the Xenos GPU for the Xbox 360. This was one of the earliest designs to unify the shader architecture, meaning that the shader processing units of the GPU were shared between vertex and pixel shading duties. Nvidia followed not too long after, I think around 2006, with the Tesla generation of GPUs, their first unified shader architecture.
What this means is that the GPU can now perform a sort of load balancing between the demands of different types of processing. The GPU is free to dynamically allocate resources between, say, vertex and pixel processing, leading to more efficient GPU utilization, whereas before, with the fixed function pipeline, it couldn’t. So with that in mind, my uber powerful fancy new GPU realistically shouldn’t be bothered too much by the introduction of even millions of extra vertices, because a single fullscreen pass already touches a couple of million pixels and barely makes a dent in performance. If vertices and pixels are basically shared on the hardware now, there shouldn’t be much of a difference.
Let’s look at this example. I have a fullscreen effect going, meaning there’s an expensive fragment shader running on the whole screen, every single pixel. It could be doing anything: changing the colour, doing various blurs; in reality there are so many things that modern games do as post effects that the exact contents of this shader aren’t that important.
The ONLY important point here is that we’re running at 100 fps. Now, let’s work through some numbers together to get a sense of the scale of work the GPU is currently doing. The screen I’m recording on is 1920x1080, and 1920 multiplied by 1080 is 2,073,600 pixels in total, or let’s round that to a cool 2M pixels, and I’m doing a tonne of work in the pixel shader.
Here is the vertex shader. It’s incredibly bare, there’s barely any work in here; in fact the fragment or pixel shader is doing an order of magnitude more work than this. But if I do something really dumb, just to illustrate a point, and make an absurd number of triangles, in fact 1 quad for every pixel on the screen, then with 2M pixels we’ve got around 4M triangles. Despite the awesome power of this GPU, the framerate drops catastrophically. It absolutely grinds to a crawl, so why is it that the raw power of our GPU doesn’t translate to handling these 4M triangles effortlessly? Something else is going on here, so let’s unravel this further.
If we’re looking at a single quad, that’s 2 triangles sharing 4 vertices; it’s not a whole lot. If we expand that out to a full screen of quads, we get somewhere in the ballpark of 4M triangles, and a few million vertices along with them. If you think about the screen already having 2M pixels, and we’re able to do many, many fullscreen passes with all of the associated shader computation that comes with them, that number of vertices isn’t impressive.
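If you want to sanity-check those counts, here’s a tiny sketch that works them out for a fullscreen grid of quads. It’s just arithmetic, not my actual demo code; a tileSize of 1 gives you the one-quad-per-pixel case.

```typescript
// Sketch: counts for a fullscreen grid with one quad per tile of tileSize x tileSize pixels.
function fullscreenQuadGrid(width: number, height: number, tileSize: number) {
  const cols = Math.ceil(width / tileSize);
  const rows = Math.ceil(height / tileSize);
  const quadCount = cols * rows;
  return {
    quads: quadCount,
    triangles: quadCount * 2,          // two triangles per quad
    vertices: (cols + 1) * (rows + 1), // shared grid vertices; unshared quads would need 4 each
    indices: quadCount * 6,            // 3 indices per triangle
  };
}

console.log(fullscreenQuadGrid(1920, 1080, 1));
// -> { quads: 2073600, triangles: 4147200, vertices: 2076601, indices: 12441600 }
```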
We can try out a few different numbers to see how this affects things. We already know that a single quad is pretty quick, but what happens as we increase the number of quads, and thus the number of triangles and vertices? We can start by just doubling the number of quads, and we don’t see much in the way of a performance difference. If anything is happening, it’s pretty minor at best. But as we scale the count up aggressively, as we make the quads smaller and thus increase the number of quads needed to fill the screen, effectively increasing the triangle density on the screen, we start to notice a dropoff in performance. Initially it’s nothing crazy, but what we see is that it’s not a gradual decline; there’s a real cliff at some point. Between tiles roughly 4x4 in size, that’s 4 pixels high by 4 pixels wide, and 2x2, we see this absolutely catastrophic dropoff in performance. It’s just bananas how bad this gets.
Let’s talk a bit about how GPUs work to better understand why this is failing so badly. There’s this really great post by Nvidia called “Life of a Triangle - Nvidia’s Logical Pipeline” that gives you a behind-the-scenes look at what happens between that API call and when things finally appear on the screen.
AMD also has a really great talk entitled “ALL THE PIPELINES – JOURNEY THROUGH THE GPU”, which has a nice overview of the different stages that happen between you attempting to draw something and pixels appearing on the screen. We’ll loosely follow along with Nvidia’s, but I’ll provide a link to the AMD one in the description.
So if we imagine things starting out in our own code, we’ll initiate the whole thing with some sort of draw call. In WebGL this might look something like gl.drawElements, in Vulkan it might be something like vkCmdDrawIndexed, and so on; the specific commands aren’t important. What happens now is that the driver validates whatever you sent, making sure the data even makes sense, before prepping it into a GPU-friendly format.
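In WebGL2 terms, the draw-call end of things might look like the snippet below. It assumes the shader program, vertex array, and index buffer are already created and bound, since that setup isn’t the interesting part here.

```typescript
// Sketch: kicking off the draw. Everything before this (buffers, program, VAO) is
// assumed to be set up and bound; this single call is what the driver validates
// and turns into GPU work.
const canvas = document.querySelector('canvas') as HTMLCanvasElement;
const gl = canvas.getContext('webgl2') as WebGL2RenderingContext;

function drawGrid(indexCount: number): void {
  // 32-bit indices (gl.UNSIGNED_INT) are needed once a mesh has more than 65k vertices,
  // which the one-quad-per-pixel grid very much does.
  gl.drawElements(gl.TRIANGLES, indexCount, gl.UNSIGNED_INT, 0);
}

drawGrid(12_441_600); // ~2M quads * 2 triangles * 3 indices each
```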
That makes its way over to the GPU, which does some processing. Let’s just skim this; the important takeaway is that it creates batches of triangles to work on, and these are sent to parts of the GPU that Nvidia calls GPCs, or graphics processing clusters. Let’s not wander too far off into the weeds here; just go with the mental model of GPCs as units capable of handling a portion of the graphics rendering independently of the others. So obviously, the more of these you have, the better, because your GPU can theoretically do more work simultaneously. The important part at this stage is that these will grab the vertex data for the triangles and begin scheduling work for each vertex.
It’ll do the vertex processing at this point, and the Nvidia doc goes over various tricks and optimizations they use, which we’ll skip for the sake of simplicity. In the AMD talk, they specifically refer to this as the Primitive Assembler (PA) stage, whose job is to put together a triangle and then forward it on to the rasterization stage.
This is a bit like a step in an assembly line that takes a bunch of pieces, the individual vertices, and builds your primitives for you, usually triangles.
The next stage basically involves figuring out who’s going to be doing the fragment shader work, so at a coarse level the hardware tests the triangle against tiles of the screen and divvies up the work. You can see in Nvidia’s diagram that, depending on the screen rectangles, the work gets handed off to different GPCs, and AMD’s has a similar note about doing scan conversion to test triangle overlap. The end result is that a bunch of fragment or pixel shader work gets queued up. What the hardware does here is take 2x2 quads of pixels, that’s 4 pixels per quad, and queue those up together. Both Nvidia and AMD reference this, and the simple reason is that at that size, they can do some efficient calculations. Stuff like texture gradients for mip mapping: a 2x2 quad has just enough information to do that, without being much more complex to implement from a hardware point of view. So pretty much everyone does this.
At this point, there’s a bit more work to do: depth testing, blending, and various other operations before the final output, and that’s how pixels are born. One obvious question that falls out of this is: if 2x2 quads of pixels are the smallest unit the GPU works with, what happens if I draw a triangle smaller than that?
Good question, and this is where you realize that GPUs are basically built with an assumption about what you’re going to draw baked in. And that’s pretty reasonable: hardware engineers are looking for ways to maximize the GPU’s potential and then teach you what to do and what not to do. So basically, there’s an assumed ratio between triangle size and pixels covered, and past a certain point, the hardware will still support whatever it is you’re trying to do, but in a very disapproving way.
When you try to draw something smaller, the GPU absolutely can still do it, but here’s what happens: you’ve got this 2x2 quad, which is the smallest unit the GPU works with, and the triangle only inhabits this one pixel here, so what happens to the other 3? The answer is that the GPU still does the work, but then it just throws away the result.
In fact, it’s always doing this; most of the time you just don’t notice it. In this case, it threw away 75% of the work, but let’s imagine that we have a bigger triangle. So let’s look at a bigger screen here, with a triangle being rendered, touching a bunch of pixels, right? Except we now know that the GPU works with quads, so it’s actually touching a whole bunch of quads, some of which lie right on the edges of the triangle. All of these quads on the edge of the triangle have some pixels that lie within the triangle and some that don’t. But the GPU will perform the work for the entire quad each time. There’s not much you can do about this; it’s a natural consequence of rendering. Some of this work the GPU is just going to have to throw away; it’ll be wasted.
But what you CAN affect is what kind of triangles you feed to the GPU. Really small triangles are an obviously terrible case: you get really poor quad utilisation, throwing away obscene amounts of GPU performance.
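You can get a feel for this wasted work with a back-of-the-envelope estimator like the one below. It samples pixel centres, which is only an approximation of real rasterization rules, but it’s enough to show how the ratio collapses for tiny triangles.

```typescript
// Sketch: rough quad-utilization estimate for a single triangle.
// Utilization = covered pixels / (2x2 quads touched * 4), since whole quads get shaded.
type Pt = [number, number];

function edge(a: Pt, b: Pt, p: Pt): number {
  return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]);
}

function quadUtilization(a: Pt, b: Pt, c: Pt, width: number, height: number): number {
  let covered = 0;
  const quads = new Set<string>();
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const p: Pt = [x + 0.5, y + 0.5]; // pixel centre
      const w0 = edge(a, b, p), w1 = edge(b, c, p), w2 = edge(c, a, p);
      const inside = (w0 >= 0 && w1 >= 0 && w2 >= 0) || (w0 <= 0 && w1 <= 0 && w2 <= 0);
      if (inside) {
        covered++;
        quads.add(`${x >> 1},${y >> 1}`); // which 2x2 quad this pixel lives in
      }
    }
  }
  return quads.size === 0 ? 0 : covered / (quads.size * 4);
}

console.log(quadUtilization([0, 0], [60, 0], [0, 60], 64, 64).toFixed(2));
// -> 0.98: a big triangle keeps most quad lanes busy
console.log(quadUtilization([10.2, 10.2], [11.0, 10.2], [10.2, 11.0], 64, 64).toFixed(2));
// -> 0.25: a sub-pixel triangle wastes 3 of the 4 lanes in its quad
```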
  But we can go a bit further and examine 
different topologies for a mesh. By topology,   I mean the way the vertices are arranged to form 
the mesh. You can triangulate a mesh in many ways,   and it’s important to understand that they’re 
not all treated the same by the GPU.
Emil Persson, a senior graphics engineer at Epic Games, known extremely well within graphics circles as Humus, has a great blog which I’ve referenced in the past; for example, in the last video I talked about occlusion systems, and one point I made was based on a talk he did while at Avalanche Studios. Anyway, there’s a nice article of his about how the topology of a mesh affects performance. The article examines in detail how different triangulations of the same mesh result in different performance, while things like the mesh’s area and the number of triangles are kept constant.
The findings shouldn’t surprise us. The test increased the number of vertices and gauged performance, and we can see that the max-area approach worked out best, while making ever thinner triangles did poorly, and you already know why: you end up with poor quad utilization, and thus the GPU does a lot of work that it has to throw away.
As Emil’s research showed, the specific way a mesh is triangulated has a very direct impact on performance, and a better triangulation can lead to better quad utilization, and thus less wasted work on the GPU.
So given that understanding, what happens if, instead of 1 quad per pixel, we push things further? Let’s start upping the number: 4 quads per pixel, 9 quads per pixel, even 16 quads per pixel. And what you’ll see is that the framerate tanks further and further. It’s a lot of triangles, but not THAT many: 2M quads to start, growing to 8M, 18M, and 32M, so roughly 4M up to 64M triangles.
This was an interesting topic of exploration on g-truc’s blog, which looked at the effect of subpixel triangles on performance. As noted near the bottom, as you crank the number of vertices up, the framerate absolutely tanks, even if you’re rendering at a lower resolution. That post is quite old now, and I don’t have access to the exact setup used there, so my numbers don’t match exactly, but I am seeing terrible performance, which is realistically all that matters.
This goes back to the idea that your GPU is this super powerful monster that can take on anything, and a bunch of triangles, in the grand scheme of things, isn’t that much. So then what’s behind this drop in performance?
Let’s revisit that primitive assembly stage, but this time through the lens of AMD’s RDNA architecture, as detailed in their “Journey through the GPU” talk. We touched on this earlier: we can think of primitive assembly like a step in an assembly line that takes a bunch of pieces, the individual vertices, and builds your primitives for you, usually triangles. On slide 19, on the RDNA architecture, we can see that various parts of the pipeline have been labelled.
And there’s this one here, called the Primitive Assembler, which, as we know, assembles vertices into a triangle and then outputs it to the next stage, often called rasterization. This is where it gets interesting. This will obviously vary from architecture to architecture, but you can sometimes go and find the information. For example, we can look at the RDNA architecture whitepaper from AMD, which goes into some detail on how the Navi class of GPUs works. There’s a lot of info in here, but we can focus on what we want, which is some info about primitive assembly.
Specifically, they talk about how they scale the architecture. They have a detailed breakdown of how Navi class GPUs, like the Radeon RX 5700 XT which came out in 2019, are structured. They consist of what are called shader engines, and the Radeon RX 5700 XT has 2 of these. Each shader engine is further divided into shader arrays, which contain, among other things, a primitive unit and a rasterizer.
This is analogous to the Nvidia setup we saw earlier, where they partitioned their GPUs into what were called GPCs, or graphics processing clusters.
So the neat thing is, you can connect that to this section later in the whitepaper, where they talk about how the primitive units assemble triangles. Let’s just read this directly:
  The primitive units assemble triangles 
from vertices and are also responsible   for fixed-function tessellation. Each primitive 
unit has been enhanced and supports culling up   to two primitives per clock, twice as fast 
as the prior generation. One primitive per   clock is output to the rasterizer.
So each primitive unit can output 1 primitive per clock. While the GPU may be exceedingly powerful in some respects, pixel processing for example, it’s still bound by the rate at which the primitive assembler can pass things off to the rasterization stage. That’s handy to know, because it tells you more specifically what the upper limit is, for a given architecture, on how fast these things can feed triangles to the rasterization stage. This is going to vary wildly from GPU to GPU and architecture to architecture; it may be worth chasing down if you’re working on a console and have hard numbers to work against, but personally I think it’s best to just get the broad-strokes idea.
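Purely as a back-of-the-envelope, and under assumed numbers (4 primitive units for an RX 5700 XT class part, a clock of roughly 1.9 GHz, and 1 primitive out per clock per unit), you can sketch the kind of ceiling this puts on triangle throughput. Real throughput will be lower, since the front end rarely runs flat out.

```typescript
// Back-of-the-envelope ONLY, with ASSUMED numbers: 4 primitive units, ~1.9 GHz clock,
// 1 primitive handed to the rasterizer per clock per unit. A ceiling, not a measurement.
const primitiveUnits = 4;   // assumed: 2 shader engines x 2 shader arrays
const clockHz = 1.9e9;      // assumed boost-ish clock
const peakTrianglesPerSecond = primitiveUnits * clockHz;

const frameBudgetMs = 10;   // a 100 fps target
const trianglesPerFrame = peakTrianglesPerSecond * (frameBudgetMs / 1000);
console.log(`peak setup rate: ${(peakTrianglesPerSecond / 1e9).toFixed(1)}B triangles/s`);
console.log(`best case per 10 ms frame: ${(trianglesPerFrame / 1e6).toFixed(0)}M triangles`);
```

Under those assumed numbers, the 32-quads-per-pixel experiment’s 64M triangles would eat most of a 10 ms frame on primitive setup alone, before any shading happens, and that’s assuming the front end never stalls.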
By throwing more and more vertices at the GPU, even if they’re not visible, you’re forcing them through primitive assembly, which may bottleneck things depending on the card you’re using. Slightly older cards especially couldn’t output that many primitives per clock, and so you’ll end up starving later stages, since this is a bit like an assembly line.
As g-truc points out on their blog, if you’re dumping out a zillion small triangles, you’re not making the most out of the rasterization stage that follows. In fact, a tonne of performance is getting flushed down the toilet.
So now we have all the tools and foundation to go back to the simple optimization we introduced in the beginning, and understand at a much deeper level what’s actually happening.
Whatever your preconceptions of why level of detail worked, now you understand that the GPU really isn’t all that good at rendering a bunch of micro triangles, and spends a disproportionate amount of computing power doing it. Computing power that COULD have been used to make other stuff look more awesome. We also now understand that by inundating the GPU with useless small triangles, beyond all that 2x2 quad waste, we’re forcing everything through primitive assembly as well.
A few years ago, Unreal unveiled Nanite, a really interesting technology that I’m sure you’ve heard of by now. But if you haven’t, no worries: Unreal’s Nanite is, in essence, a continuous and automatic level of detail system. No need for artists to author LODs, Unreal handles it automagically.
We can look through their SIGGRAPH 2021 presentation entitled “Nanite, a Deep Dive”, and one interesting thing that pops up is that they confirm a lot of what we’ve talked about here. On this page, about 80 pages in (this is a VERY long presentation), they talk about some technical decisions they made. We can see that they mention just how crappy tiny triangles are, and they point out that GPUs are built with parallel pixels in mind, but not parallel triangles, which jibes with what we’ve seen. I mean, they’re the Epic team, they’ve done their homework.
As they mention in their slides, a lot of modern GPUs can set up at most 4 triangles per clock, so a primitive assembly bottleneck is a very real problem that they set out to solve, or at least work around. So what they did was write a software rasterizer for micro triangles, and that ended up being considerably faster. We don’t have to understand the details of their implementation, but at least now we understand their motivation.
If you think you’d like to learn more about gamedev from an experienced graphics engineer, I have a variety of courses available that are very suitable for beginners, so if you find yourself wanting to get a bit better at gamedev and delve a little deeper into some subjects, check them out. Otherwise, if you’d still like to support me and help choose what topic I cover next, you can vote on my Patreon page.
Cheers