Ever played a game with crazy amounts of
stuff, yet running so smoothly it seems like magic? Behind that buttery smooth
performance is more than just powerful hardware. I started as a graphics engineer
almost 20 years ago, and since then I've always been fascinated with optimizations. One stands out as incredibly simple, yet the reason WHY it works is more complex than most people realise. Let's examine one of the most effective optimization techniques
available, and understand it to the same level of depth that experienced engineers do.
Let’s start with a little world and we’ll add a buncha stuff to it. This will serve
as kinda our testbed to show things off, so we’ll keep it reasonably simple.
Let’s go ahead and add an object to the world here, so we’ve got our player and
this object. The idea is relatively simple, this object will have several versions, each
being progressively simpler. So here’s the most complex or detailed version of the object,
and we can sort of pan the camera down the line looking at the simpler and simpler versions.
These will be called levels of detail, or LOD.
What we’re going to do is, when we place one of
these objects in the world, we’ll start it nearby for convenience, and the idea is as the object
gets further and further away from the camera, we cycle through the different levels of detail
for that object. So we’re a bit further away, we switch to a lower level of detail. A bit further,
and we can do that again, switch to an even lower level of detail, that sorta thing. How many levels of detail games use is a bit of a toss-up; since authoring them takes artist time, you may only have enough budget to build a high and a low level of detail, or you may have money to burn and can build a half dozen versions of the asset.
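If you wanted to sketch that switching logic in code, it could look something like this. To be clear, this is just a minimal illustration; the mesh names, thresholds, and the LodLevel structure are all made up for the example, not from any particular engine.

```ts
// Minimal distance-based LOD selection sketch (illustrative thresholds).
// Assumes each object carries an array of levels, ordered most to least detailed.

interface LodLevel {
  mesh: string;        // handle/name of the mesh to draw at this level
  maxDistance: number; // use this level while camera distance <= maxDistance
}

function selectLod(levels: LodLevel[], distanceToCamera: number): LodLevel {
  for (const level of levels) {
    if (distanceToCamera <= level.maxDistance) {
      return level;
    }
  }
  // Past the last threshold, fall back to the coarsest level (or an imposter).
  return levels[levels.length - 1];
}

// Example usage with made-up numbers:
const treeLods: LodLevel[] = [
  { mesh: "tree_lod0", maxDistance: 25 },            // full detail
  { mesh: "tree_lod1", maxDistance: 75 },            // simplified
  { mesh: "tree_lod2", maxDistance: 200 },           // very simple
  { mesh: "tree_imposter", maxDistance: Infinity },  // billboard
];
console.log(selectLod(treeLods, 120).mesh); // "tree_lod2"
```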
This brings us to a really awesome optimization
that’s used quite a bit, called imposters. For many people, this is just another level
of detail, or a whole other technique, I mean it doesn’t really matter.
Let’s take these 2 objects here in the distance. From this viewpoint, what are the
differences between those 2? Not a whole lot, but as we move forward, we can see that
something is kinda off with one of them. This one looks perfectly normal, but the other
one is clearly just a picture that’s rotating with us. I can freeze it so that we can walk around it
and see that this is just a floating texture. This is a billboard or imposter, and from far enough
away, you really can’t tell the difference.
With a little bit of extra work, these imposters can be really advanced: you can render out normal maps along with the diffuse, and you can render them from multiple viewpoints and blend between them, giving you a highly versatile and incredibly low-cost stand-in for the original asset.
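The "picture that rotates with us" effect boils down to orienting a textured quad towards the camera every frame. Here's a rough sketch of the simplest version of that, a yaw-only billboard; the hand-rolled vector math is just to keep the example self-contained, a real engine would use its own math library.

```ts
// Sketch: orient a billboard quad so it always faces the camera (vertical axis locked),
// which is the "picture that rotates with us" effect.

type Vec3 = { x: number; y: number; z: number };

function billboardYawToCamera(objectPos: Vec3, cameraPos: Vec3): number {
  // Angle around the vertical axis that points the quad's face at the camera.
  const dx = cameraPos.x - objectPos.x;
  const dz = cameraPos.z - objectPos.z;
  return Math.atan2(dx, dz); // rotate the quad by this yaw each frame
}

// Example: camera off to the side -> quad yaws to face it.
const yaw = billboardYawToCamera({ x: 0, y: 0, z: 0 }, { x: 10, y: 2, z: 10 });
console.log(yaw); // ~0.785 rad (45 degrees)
```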
This scene here from my previous grass video
features something like 50k trees of various types, and it’s trivially cheap to render
due to most of them being imposters.
The question now is, why does
this technique work so well?
The intuitive answer is that these models have fewer vertices, and so it sorta makes sense that having fewer vertices equals doing less work on the GPU. But pressed for more information than that, this is kinda where the explanation stops for many people. This answer isn't wrong, per se, but at the same time, there's so much more going on here.
If you were to go and buy a state-of-the-art GPU, the spec sheets these days list performance in TFLOPS, which is an absurd amount of computing power. So why does just reducing the number of vertices in the scene make such an outsized difference? There's more to the story.
Back in the 90's, GPUs had what was called a "fixed function pipeline", which meant that the various stages of the GPU were set in stone. The hardware had different units allocated to different tasks. What this meant practically was that resources weren't shared; it was more like an assembly line, with things going in one end, going through a series of fixed steps, and then coming out fully processed on the other end.
This all changed around 2004, when AMD, or ATI as it was back then for those of us who have been doing this a long time, developed the Xenos GPU for the Xbox 360. This would be one of the earliest designs to unify the shader architecture, meaning that the shader processing units of the GPU were shared between vertex and pixel shading duties. Nvidia followed not too long after, I think it was around 2006, with the Tesla generation of GPUs, their first unified shader architecture.
What this means is that the GPU can now perform sort of load balancing between the demands
of different types of processing. The GPU is free to dynamically allocate resources
between say, vertex and pixel processing, leading to more efficient GPU utilization.
Whereas before, with the fixed function pipeline, it couldn't. So with that in mind, my uber powerful fancy new GPU realistically shouldn't be bothered too much by the introduction of even millions of extra vertices, because a single fullscreen pass already touches a couple million pixels and doesn't even make a small dent in performance. If vertices and pixels are basically shared on the hardware now, there shouldn't be a difference.
Let’s look at this example, I have this fullscreen
effect going, meaning that I have an expensive fragment shader running on the whole screen,
every single pixel. I could be running anything here: changing the colour, doing various blurs. In reality, there are so many things that modern games do as post effects that the exact contents of this shader aren't that important.
The ONLY important point here is that we’re
running at 100 fps. Now, let’s work through some numbers together, to get a sense of the scale
of work the GPU is currently doing. So this screen that I’m recording on is 1920x1080, so 1920
multiplied by 1080 is 2073600 pixels in total, or let’s round that to a cool 2M pixels, and I’m
doing a tonne of work in the pixel shader.
Here is the vertex shader. It's incredibly bare, there's barely any work in here; in fact the fragment or pixel shader is doing an order of magnitude more work than this. Butttt if I do something really dumb, just to illustrate a point, and make a buttload of triangles, in fact 1 quad for every pixel on the screen, then with 2M pixels that means we've got around 4M triangles. Despite the awesome power of this GPU, the framerate drops catastrophically. It absolutely just grinds to a crawl, so why is it that the raw power of our GPU doesn't translate to handling these 4M triangles effortlessly? Something else is going on here, let's unravel this further.
If we're looking at a single quad, that's composed of 2 triangles which are themselves built from 4 vertices; it's not a whole lot. If we expand that out to a full screen of quads, we get somewhere in the ballpark of 4M triangles, and millions of vertices to go along with them. If you think about a screen having 2M pixels already, and we're able to do many many many fullscreen passes, and all of the associated shader computation that comes with it, that number of vertices isn't impressive.
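Just to make those numbers concrete, here's the arithmetic for the one-quad-per-pixel experiment written out. The vertex total depends on whether the quads share vertices, so both extremes are shown; the resolution is the 1920x1080 screen from before.

```ts
// Worked numbers for the "one quad per pixel" experiment at 1920x1080.

const width = 1920;
const height = 1080;

const pixels = width * height;                  // 2,073,600 (~2M)
const quads = pixels;                           // one quad per pixel
const triangles = quads * 2;                    // 4,147,200 (~4M)
const vertsIfUnshared = quads * 4;              // ~8.3M if every quad owns its 4 vertices
const vertsIfGrid = (width + 1) * (height + 1); // ~2.08M if it's one connected grid

console.log({ pixels, quads, triangles, vertsIfUnshared, vertsIfGrid });
```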
We can try out a few different numbers, to
see how this affects things. We already know that a single quad is pretty quick, but what
happens as we increase the number of quads, and thus the number of triangles and vertices?
We can start out by just doubling the number of quads used, and we don’t see much in the way
of differences in performance. If anything is happening, it’s pretty minor at best. Butttt
as we scale the count up aggressively, as we make the size of the quads smaller, and thus the
# of quads needed to fill up the screen increases, effectively increasing the triangle density
on the screen, we start to notice a dropoff in performance. Initially, it’s nothing crazy, but
what we see is that it’s not a gradual decline, there’s a real dropoff at some point.
Between tiles roughly 4x4 in size, that's 4 pixels high and 4 pixels wide, and then 2x2, we see this absolutely catastrophic dropoff in performance. It's just bananas how bad this gets.
Let's talk a bit about how GPUs work to better understand why this is failing so badly.
There’s this really great post by Nvidia called “Life of a Triangle - Nvidia’s Logical Pipeline”
that kinda gives you a behind the scenes look at what happens between that API call and
when things finally appear on the screen.
AMD also has a really great talk entitled “ALL
THE PIPELINES – JOURNEY THROUGH THE GPU“ which has a nice overview of the different stages
that basically happen between you attempting to draw something, and pixels appearing on the
screen. We’ll loosely follow along with Nvidia’s, but I’ll provide a link to the
AMD one in the description.
So if we imagine things starting out in our own code, we'll initiate the whole thing with some sort of draw call. In WebGL this might look something like gl.drawElements, Vulkan might be something like vkCmdDrawIndexed, etc. I mean the specific commands aren't important. What happens now is that the driver goes ahead and validates whatever you sent, making sure the data even makes sense, before prepping it into a GPU-friendly format.
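For reference, here's roughly what that looks like from the application side in WebGL2. This is just a fragment of a sketch: it assumes the shaders, vertex buffers, and a vertex array object have already been created and bound elsewhere, and the index count is an arbitrary example.

```ts
// Where the whole pipeline starts from the app's point of view: a draw call.
// Minimal WebGL2 sketch; setup (shaders, buffers, VAO) is assumed to exist already.

const canvas = document.querySelector("canvas")!;
const gl = canvas.getContext("webgl2")!;

function drawMesh(indexCount: number): void {
  // The driver validates this call (types, ranges, bound state) and packages
  // it into a GPU-friendly command stream before anything reaches the hardware.
  gl.drawElements(gl.TRIANGLES, indexCount, gl.UNSIGNED_SHORT, 0);
}

drawMesh(36); // e.g. a cube: 12 triangles * 3 indices each
```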
That makes its way over to the GPU, which does some processing; let's kinda just skim this. The important takeaway here is that it creates batches of triangles to work on, and these are sent to the parts of the GPU that Nvidia calls GPCs, or graphics processing clusters. Let's not wander too far off into the weeds here, but let's just mentally go with the idea of GPCs as being capable of handling a portion of the graphics rendering independently from the others. So obviously then, the more of these you have, the better, because that means your GPU can theoretically do more work simultaneously. The important part at this stage is that these will grab the vertex data for the triangles, and begin scheduling work for each vertex.
It’ll do the vertex processing at this point,
and the Nvidia doc goes over various tricks and optimizations they use, which aren’t
super important for the sake of simplicity. In the AMD talk, they specifically refer to
this as the Primitive Assembler (PA) stage, whose job is to put together a triangle and then
forward that on to the rasterization stage.
This is a bit like a step in an assembly
line that takes a bunch of pieces, the individual vertices, and builds your
primitives for you, usually triangles.
The next stage basically involves figuring out
who’s going to be doing the fragment shader work, so at a coarse level they’re kinda testing
the triangle against quads on the screen and divvying up the work. You can see in Nvidia's diagram that, depending on the screen rectangles, they're handing off the work to different GPCs, and AMD's has a similar note about doing some scan conversion to test triangle overlap. The end result is that a bunch of fragment or pixel shader work gets queued up. What it does here is take these 2x2 quads of pixels, that's 4 pixels per quad, and queue those up together. You can see both Nvidia and AMD reference this, and the simple reason is that at that size, they can do some efficient calculations. Stuff like texture gradients for mip mapping, well a 2x2 quad has just enough information to do that, while not being more complex to implement from a hardware point of view. So pretty much everyone does this.
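To see why 2x2 is "just enough", here's a sketch of the texture gradient idea in plain TypeScript. Within the quad you can take a finite difference of the UVs across and down, which tells you how big one pixel is in texel space, and that picks the mip level. This mirrors the standard GLSL-style mip selection math rather than any specific piece of hardware; the inputs are illustrative.

```ts
// A 2x2 quad gives you horizontal and vertical UV differences "for free",
// which is exactly what mip selection needs.

type Vec2 = { u: number; v: number };

function mipLevel(uvQuad: [Vec2, Vec2, Vec2, Vec2], textureSize: number): number {
  const [topLeft, topRight, bottomLeft] = uvQuad;
  // ddx/ddy: per-pixel UV change across and down the 2x2 quad, scaled to texels.
  const ddx = {
    u: (topRight.u - topLeft.u) * textureSize,
    v: (topRight.v - topLeft.v) * textureSize,
  };
  const ddy = {
    u: (bottomLeft.u - topLeft.u) * textureSize,
    v: (bottomLeft.v - topLeft.v) * textureSize,
  };
  const lenX = Math.hypot(ddx.u, ddx.v);
  const lenY = Math.hypot(ddy.u, ddy.v);
  // Larger footprint in texel space -> higher (blurrier) mip level.
  return Math.max(0, Math.log2(Math.max(lenX, lenY)));
}
```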
At this point, there’s a bit more work to do depth testing, blending, or various other
operations before finally getting output, and that’s how pixels are born.
One obvious question that falls out of this is, if 2x2 quads of pixels are the smallest
possible unit the GPU works with, what happens if I draw a triangle smaller than that?
Good question, and this is where you realize that GPUs are basically built with, you could say, an assumption about what you're going to draw. And that's pretty reasonable; hardware engineers are looking for ways to maximize the GPU's potential and then teach you what to do and what not to do. So basically, there's an assumption about the ratio between triangle size and pixels covered, and at some point past this, they'll still support whatever it is you're trying to do, but in a very disapproving way.
When you try to draw something smaller, the GPU absolutely can still do it, but what happens is
that you’ve got this 2x2 quad, which is kinda like the smallest unit the GPU works with. The triangle
inhabits this quad here, what happens to these other 3? The answer is that the GPU still does the
work, but then it just throws away the result.
In fact, it’s always doing this, just most of the
time, you don’t notice it. In this case, it threw away 75% of the work, but let’s imagine that we
have a bigger triangle, so let’s look at a bigger screen here, and then we’ve got a triangle that’s
being rendered, so it’s touching a bunch of pixels right? Except we now know that the GPU works with
quads, so it’s actually touching a whole buncha quads, some of which lie right on the edges of
the triangle. All of these quads here that lie on the edge of the triangle, they have pixels that
lie within the triangle, and some that don’t. But the GPU will perform the work for the entire quad
each time. There’s not much you can do about this, it’s a natural consequence of rendering. Some
of this work, the GPU is just going to have to throw it away, it’ll be wasted.
But what you CAN affect is what kind of triangles you feed to the GPU. Really small triangles are an obviously terrible case: you get really poor quad utilisation, throwing away obscene amounts of GPU performance.
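You can put a rough number on that waste with a one-liner: the useful fraction of shading is the pixels actually inside the triangle divided by four times the quads the triangle touches. The counts below are made-up inputs, just to show the two extremes.

```ts
// Rough feel for quad utilisation: the GPU shades whole 2x2 quads, so the
// useful fraction is pixelsCovered / (4 * quadsTouched). Inputs are illustrative.

function quadUtilisation(pixelsCovered: number, quadsTouched: number): number {
  return pixelsCovered / (quadsTouched * 4);
}

// A big triangle: only the edge quads waste a little.
console.log(quadUtilisation(10_000, 2_700)); // ~0.93 -> roughly 7% thrown away

// A sub-pixel triangle: 1 pixel lit, but a whole quad shaded.
console.log(quadUtilisation(1, 1)); // 0.25 -> 75% of the shading thrown away
```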
But we can go a bit further and examine
different topologies for a mesh. By topology, I mean the way the vertices are arranged to form
the mesh. You can triangulate a mesh in many ways, and it’s important to understand that they’re
not all treated the same by the GPU.
Emil Persson, a senior graphics engineer at Epic Games, known extremely well within graphics circles as Humus, has a great blog which I've referenced in the past; for example, in the last video I talked about occlusion systems, and one point I made was based on a talk he did for Avalanche Studios. Anyway, there's a nice article of his talking about how the topology of a mesh affects performance. What the article examines in detail is how different triangulations of the same mesh result in different performance, while things like the mesh's area and the number of triangles are kept constant.
The findings shouldn't surprise us. As we can see, what the test did was increase the number of vertices and gauge performance. We can see that the max area approach worked out best, while making ever thinner triangles did poorly, and you already know why: you end up with poor quad utilization, and thus the GPU does a lot of work that it has to throw away.
As Emil’s research showed, the specific way that
a mesh is triangulated has a very direct impact on performance, and better triangulation
can lead to better quad utilization, and thus less wasted work from the GPU.
So given that understanding, what happens if, instead of 1 quad per pixel, we push things further? Let's start upping the number: 4 quads per pixel, or 9 quads per pixel, or even 16 quads per pixel, or whatever. And what you'll see is that the framerate will start to tank further and further. It's a lot of geometry, but not THAT much: 2M quads to start, growing to 8M, 18M, and then 32M.
This was an interesting topic of exploration on g-truc's blog, which looked at the effect of subpixel triangles on performance. As they noted near the bottom, as you crank the number of vertices up, the framerate absolutely tanks, even if you're rendering at a lower resolution. This post is quite old now, and I don't have access to the exact setup that was used there, so my numbers don't match exactly, but I am seeing terrible performance, which is realistically all that matters.
This goes back to the idea that your GPU is this
super powerful monster that can take on anything, and a buncha triangles, in the grand
scheme of things, isn’t that much. So then what’s behind this drop in performance?
Let’s revisit that primitive assembly stage, but this time through the lens of AMD’s RDNA
architecture, as detailed by their ‘Journey through the GPU’ talk. We touched on that earlier,
where we can think of Primitive Assembly like a step in an assembly line that takes a bunch of
pieces, the individual vertices, and builds your primitives for you, usually triangles.
So on slide 19, on the RDNA architecture, we can see that various parts of the pipeline have been labelled. And there's this one here, called the Primitive Assembler, which as we know assembles vertices into a triangle and then outputs it to the next stage, often called rasterization.
This is where it gets interesting. This will obviously vary from architecture to architecture, but you can sometimes go and find the information. For example, we can look at the RDNA architecture whitepaper from AMD, which goes into some detail on how the Navi-class GPUs work. There's a lot of info in here, but we can kinda just focus on what we want, which is some info about primitive assembly.
Specifically, they talk about how they scale the architecture. They have this detailed breakdown of how Navi-class GPUs, like the Radeon RX 5700 XT which came out in 2019, are structured. They consist of what are called shader engines, the Radeon RX 5700 XT has 2 of these, and within a shader engine, they're further divided into shader arrays, which contain, among other things, a primitive unit and a rasterizer.
This is analogous to the Nvidia setup we saw earlier, where they partitioned their GPUs into what were called GPCs, or graphics processing clusters.
So the neat thing is then, you can connect that to
this section later in the whitepaper, where they talk about how the primitive units assemble
triangles. Let’s just read this directly:
The primitive units assemble triangles
from vertices and are also responsible for fixed-function tessellation. Each primitive
unit has been enhanced and supports culling up to two primitives per clock, twice as fast
as the prior generation. One primitive per clock is output to the rasterizer.
So each primitive unit can output 1 primitive per clock. So while the GPU may
be exceedingly powerful in some respects, for example pixel processing, it’s still bound
by the rate the primitive assembler can pass things off to the rasterization stage.
So that's neat, because it lets you know more specifically what the upper limit is, for a given architecture, on how fast these things can feed triangles to the rasterization stage.
This is going to vary wildly from GPU to GPU and architecture to architecture; it's maybe worth digging into if you're working on a console and have some hard numbers to work against, but personally I think it's best to just get the broad-strokes idea.
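For a back-of-the-envelope feel, you can turn the whitepaper's "one primitive per clock" into a ceiling. The specific figures below are my assumptions for an RX 5700 XT class part (4 primitive units, roughly 1.9 GHz), not numbers from the whitepaper quote itself, so treat them as ballpark only.

```ts
// Back-of-the-envelope ceiling on triangle throughput from primitive assembly.
// Assumed: ~4 primitive units (2 shader engines x 2 shader arrays) at roughly
// 1.9 GHz, each outputting 1 primitive per clock as the whitepaper quote says.
// Real workloads bottleneck on other things well before this ceiling.

const primitiveUnits = 4;        // assumption: one per shader array
const clockHz = 1.9e9;           // assumption: rough boost clock
const primsPerClockPerUnit = 1;  // from the quote above

const trianglesPerSecond = primitiveUnits * primsPerClockPerUnit * clockHz; // ~7.6e9
const targetFps = 100;
const trianglesPerFrameCeiling = trianglesPerSecond / targetFps;            // ~76M

console.log({ trianglesPerSecond, trianglesPerFrameCeiling });
// Tens of millions of sub-pixel triangles per frame eat a big chunk of that
// budget while contributing almost nothing visible.
```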
By throwing more and more vertices at the GPU, even if they're not visible, you're forcing them through primitive assembly, which may kinda bottleneck things depending on the card you're using. Slightly older cards especially couldn't output that many primitives per clock, and so you end up starving the later stages, since this is a bit like an assembly line.
As g-truc points out on their blog, if you're dumping out a zillion small triangles, you're not making the most of the rasterization stage that follows. In fact, a tonne of performance is getting flushed down the toilet.
So now we have all the tools and foundation
to go back to this simple optimization that we introduced in the beginning, but understand at
a much deeper level what’s actually happening.
Whatever your preconceptions of why level
of detail worked, now you understand that the GPU really isn’t all that good at
rendering a bunch of micro triangles, and spends a disproportionate amount of computing
power doing it. Computing power that COULD have been used to make other stuff look more awesome.
We also now understand that by inundating the GPU with useless small triangles, beyond all that 2x2 quad waste, we're also forcing everything through primitive assembly.
A few years ago, Unreal unveiled Nanite, a really interesting new technology that I’m
sure you’ve heard of by now. But if you haven’t, no worries, Unreal’s Nanite is, in
essence, a continuous and automatic level of detail system. No need for artists to
do this, Unreal handles it automagically.
We can look through their SIGGRAPH 2021 presentation entitled "Nanite: A Deep Dive", and one interesting thing that pops up is that they confirm a lot of what we've talked about here. On this page, about 80 pages in, this is a VERY long presentation, they talk about some technical decisions they made. We can see here that they mention just how crappy tiny triangles are, and they point out that GPUs are built with parallel pixels in mind, but not triangles, which jibes with what we've seen. I mean, they're the Epic team, they've done their homework.
As they mention in their slides, a lot of modern GPUs can set up 4 triangles per clock at most, so a primitive assembly bottleneck is a very real problem that they set out to solve, or at least work around. So what they did was write a software rasterizer for micro triangles, and that ended up being considerably faster. We don't have to understand the details of their implementation, but now AT LEAST we understand their motivation.
If you think you’d like to learn more about
gamedev from an experienced graphics engineer, I have a variety of courses available, very
suitable for beginners, so if you find yourself wanting to get a bit better at gamedev and delve a little deeper into some subjects, check them out. Otherwise, if you'd still like to support
me and help choose what topic I cover next, you can vote on my Patreon page.
Cheers