Despite having been a graphics programmer
since the early 2000s, the first time I stepped into Horizon Forbidden West, I was taken aback by the sheer scale of the world. How do games like this, or Assassin’s Creed, or Spider-Man, which also boast enormous, detailed worlds, draw all of this without compromising on performance? It’s really interesting, because the techniques
used have been built on and refined for decades at this point. So the first method is relatively simple. Let’s start with a world and populate it
with some stuff. It doesn’t really matter what that stuff
is, but we’ll keep this relatively simple. Now, let’s take a look from above, so let’s
pretend that we’re taking an eagle’s eye view of what the player is seeing. So that’s the player down there, and we
can visualize what they see with this green box. This is called the view frustum; it’s what they can see from their perspective. A view frustum is basically like a box. It’s defined by 6 sides: you have the left and right, which of course correspond to the left and right edges of your screen; you have the top and bottom, which correspond to the top and bottom of your screen; and finally you have the near and far planes, the near being how close things can get before they’re cut off, and the far being how far you can see. Those 6 planes define everything that’s visible to the player. On top of that, each object in the scene can
be thought of as having a simple volume around it. Spheres are popular because doing math with spheres is really simple; things like distance calculations are a breeze. You can get a much tighter fit with a box though, the downside of course being that the math tends to be a tiny bit more complex, but not overly so, making boxes a nice tradeoff between fit and efficiency. Now that we have this idea of a viewing volume,
and each object having a simple bounding volume, you’re simply going to do an intersection test: for every single object in the scene, you test whether its bounding volume intersects with the view frustum. Everything that’s completely outside is discarded, because it’s not visible, and everything that’s inside or bordering is drawn. Remember, this test needs to be conservative, meaning that if you draw a little too much, that’s totally OK, because that’s better than drawing too little.
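Just to make that concrete, here’s a minimal sketch of a sphere-versus-frustum test. The Vec3, Plane and Sphere structs are just illustrative helpers, not any particular engine’s types, and the usual convention of six planes with inward-facing normals is assumed.

```cpp
struct Vec3   { float x, y, z; };
struct Plane  { Vec3 normal; float d; };      // plane equation: dot(normal, p) + d = 0
struct Sphere { Vec3 center; float radius; };

float SignedDistance(const Plane& pl, const Vec3& p) {
  return pl.normal.x * p.x + pl.normal.y * p.y + pl.normal.z * p.z + pl.d;
}

// Conservative on purpose: if the sphere merely straddles a plane we keep it,
// so we might draw a little too much, but never too little.
bool IntersectsFrustum(const Plane frustum[6], const Sphere& s) {
  for (int i = 0; i < 6; ++i) {
    if (SignedDistance(frustum[i], s.center) < -s.radius) {
      return false;   // completely outside this plane: cull it
    }
  }
  return true;        // inside or bordering: draw it
}
```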
If we do that, here’s our scene: everything that’s being discarded is coloured red, while everything that’s being kept is coloured green. And as an example, if our world happens to contain hundreds of objects, well, with this one simple pass we’ve already whittled that number down considerably. But frustum culling, as awesome as it is,
isn’t usually enough, as the recently released Cities: Skylines 2 found out. Ignoring all the other problems with the game,
this report from Tom’s Hardware shows that they relied solely on frustum culling, which
is a really big mistake. It’s the first step, not the only step. You need to go further, and this is where
occlusion culling comes in. Let’s take this simple example of a camera
and some objects sitting in the world. We can easily discard what’s blatantly not visible with our view frustum check: everything outside of the view frustum has already been discarded, leaving just the things inside it. Now imagine a big object here, near the camera,
let’s say that it’s a wall, or a building, or anything big and solid. You can’t see past this thing, so we can
draw lines from the camera to the edges of this object, and extend them beyond, forming
a volume, an area where realistically, nothing inside of it is visible to you. So that’s occlusion, but how do we calculate
occlusion? And most importantly, how do games manage
to do it so fast? Because this needs to be done near instantly. Let’s start with a really simple idea, that
occlusion culling isn’t all that different from frustum culling, at least in this simple
case here. In frustum culling, we’re getting rid of
what’s OUTSIDE of this area, defined by our view or our camera. Taking a super simple approach to occlusion culling, it’s almost the reverse of that: instead of getting rid of everything that’s outside this area, we want to get rid of everything that’s inside of it. Inside, outside; outside, inside. That in itself isn’t all that complex,
and perhaps surprisingly, some big games in the past have been shipped with this extremely
simple but effective approach. Basically, all that it entails is that somebody
goes into Maya, or whatever the 3D editing software is, and they manually author occlusion
volumes. So let’s say that you have your object in
the world, you’ve got a big fat tree or rock or whatever, doesn’t really matter
that much. Then an artist would secretly place a box,
or several boxes, inside, and these boxes aren’t visible to players, they’re only
used by the rendering system. Ideally, you’d place a box that fits pretty
snug inside, without spilling out. And you’d do this for every object in the
world that fits some sort of criteria, like it’s big enough to be important, that sort
of thing. Artists didn’t have time to go and make
an occlusion volume for every pebble and twig on the ground, and you as an engine programmer
don’t want to have to deal with that much data anyway. So how is it that you actually perform occlusion culling with one of these boxes? There are a bunch of ways, and it’s not particularly complex. Let’s draw a border, and this will be our screen, and now let’s show the box on the screen. We basically need to project the box to screen space, which might sound fancy, but it just means figuring out where it lands on the screen. See this silhouette here? It’s defined by at least 4 and up to 6 sides, which you can create planes out of, forming a viewing frustum, same as the one we just dealt with for the camera, except in reverse.
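Here’s one way that check could look, reusing the Plane and Sphere helpers from the frustum sketch earlier. This is just a sketch of the idea, not any shipped engine’s code: the planes face inward, one per silhouette edge of the occluder, plus a cap plane lying on the occluder itself so that things sitting in front of it aren’t culled.

```cpp
#include <vector>

// For occlusion the test flips compared to frustum culling: the object has to
// be ENTIRELY inside the occluder's frustum before we're allowed to skip it.
bool IsOccluded(const std::vector<Plane>& occluderFrustum, const Sphere& s) {
  for (const Plane& pl : occluderFrustum) {
    if (SignedDistance(pl, s.center) < s.radius) {
      return false;   // part of the sphere pokes out, so we can't prove it's hidden
    }
  }
  return true;        // fully inside the occluded volume: safe to skip drawing
}
```

Note the comparison against +radius instead of -radius: frustum culling only rejects things that are fully outside, while occlusion culling only rejects things that are fully inside.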
So now that you’ve got your data, your set of all the occluders in the scene, the first thing you’re probably going to do is a basic frustum culling pass on them, to make sure they’re even relevant. This is where things get a little custom for
each approach, but basically you’re going to pick a bunch of them. How you pick them is kinda up to you; optimization is a game of tradeoffs, you need to decide how much time to burn in order to save time, and this is just another instance of that. Pick too many, and you may just spend way
too much time going through them. Pick too few, and you may not cull enough. In practice, you may just do something super
simple, like pick the N closest ones and hope for the best, which, according to their SIGGRAPH presentation, is what Just Cause 2 did. The developers at Avalanche Studios outlined a simple and easy approach, with a SIMD-optimized box culling system, which they called BFBC, or brute force box culling, and that was it, no fancy structures, nothing. The very first game I worked on, Prototype,
did something really similar, if I’m remembering right, it’s been so long since I looked
at that code. So while tree structures can be awesome and theoretically faster, a well-optimized, hand-written, SIMD brute-force version may be more than enough. If you catch yourself wondering, “wouldn’t a tree be faster?”, watch my video “Memory, Cache Locality, and why Arrays are Fast”. Theoretically the tree should win, but a brute-force, SIMD-optimized version may absolutely destroy it on modern hardware.
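To give a flavour of what “no fancy structures” means, here’s a sketch of the flat-array idea. I’m using bounding spheres and plain scalar code to keep it short, where those systems used boxes and hand-written SIMD, but the point is the data layout: tight, contiguous arrays that a vectorizer can chew through.

```cpp
#include <cstdint>
#include <vector>

// Structure-of-arrays layout: one flat array per component, no pointers to chase.
struct SpheresSoA {
  std::vector<float> x, y, z, radius;
};

// Test every sphere against one plane and clear the visibility flag of anything
// that's fully outside it. Tight loops over flat data like this are exactly
// what brute force is fast at.
void CullAgainstPlane(const SpheresSoA& s,
                      float nx, float ny, float nz, float d,
                      std::vector<uint8_t>& visible) {
  for (size_t i = 0; i < s.x.size(); ++i) {
    const float dist = nx * s.x[i] + ny * s.y[i] + nz * s.z[i] + d;
    if (dist < -s.radius[i]) {
      visible[i] = 0;
    }
  }
}
```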
So once you’ve got something like that in place, we can drop a simple occluder into the scene, and bam, suddenly a lot of objects
that are behind it aren’t getting coloured green anymore, they’re yellow, which in
this case means they are being occluded. It’s not perfect, and some things can slip
through the cracks, we can trivially set up a situation that doesn’t work the way we
might expect it to. Like here we have 2 occluders, side by side,
and 1 object behind them, but in the middle. Since neither of these occluders fully occludes
the object behind them individually, and we don’t create a sort of union of occluders,
because that’s complicated, that object gets flagged as drawn. So that’s it, that’s how modern games
do occlusion culling? Well, you can use this approach if you need
a decent, but not absolutely state-of-the-art, system. What it does do, though, is serve as a great jumping-off point for understanding some of the other techniques. When you draw a scene, you don’t only generate
the colour that’s on the screen, you also generate what’s called a depth buffer, and
this is really useful for the GPU. What this does is keep track of the depth of every pixel on the screen, and this allows us, and the GPU, to do a variety of things. For the GPU, it means it can do things like
render things in the right order, but also discard things that aren’t visible. If we didn’t have a depth buffer, the scene
would be a mess, parts of objects would draw through others, and then you’d have to resort
to sorting the individual triangles of the entire scene in order to have any hope of
drawing things somewhat OK. With a depth buffer, the GPU can simply compare the depth of what you’re rendering against what’s already there: if it’s closer, it gets drawn, and if not, it doesn’t.
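Spelled out as plain C++, rather than the fixed-function hardware that actually does it, the depth test boils down to a couple of lines:

```cpp
#include <cstdint>
#include <vector>

// One colour value and one depth value per pixel; a pixel only gets written if
// the new fragment is closer than whatever is already stored there.
void WritePixel(int x, int y, float depth, uint32_t colour, int width,
                std::vector<float>& depthBuffer,
                std::vector<uint32_t>& colourBuffer) {
  const size_t i = static_cast<size_t>(y) * width + x;
  if (depth < depthBuffer[i]) {   // closer than what's already there?
    depthBuffer[i]  = depth;
    colourBuffer[i] = colour;
  }
  // otherwise it's hidden behind something, so it gets discarded
}
```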
For us, it’s pretty interesting too, because we have a texture with the depth of the scene available, so we can create effects
like this fire that doesn’t clip into the scene, or even cheat and make really cheap
reflections, like we’re seeing in this water. The interesting part, in the context of occlusion culling, is how it allows us to discard entire objects that aren’t visible. If we already had a depth buffer drawn, let’s
ignore the problem of HOW do we generate one, and let’s say that I can somehow magic one
into existence at the beginning of the frame, how useful would it be? The answer is of course, SUPER USEFUL. If we SOMEHOW had that, we’d be able to
read from it, and do comparisons, so if I wanted to check if I can draw an object, we
could take its bounding volume, and we could take the screen-space bounds of that,
map those to our depth buffer, and do a comparison of the object’s depth in the scene with
the values in the depth buffer. If it’s behind the values in the depth buffer,
poof, we eliminate the entire object without the GPU ever having done any work at all. In fact, we could go a step further: say I take that depth buffer and progressively downsample it to form a mip chain, or hierarchy
of these occlusion maps. Well then, when we project the screen space
bounds of our object, we could simply choose the appropriate mip level so that we only
have to read 4 texels, and bam, an even faster and easier comparison. These are called hierarchical z-buffers, often referred to as HZBs, which is what we’ll call them from this point forward. So remember, an HZB is this hierarchy of z-buffers, a chain of progressively downsampled occlusion maps.
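Here’s a rough sketch of what that per-object test could look like, written CPU-side so it’s easy to read. It assumes a depth convention where bigger means farther, that each mip stores the farthest depth of the texels beneath it, and that the object’s bounding box has already been projected to screen space; none of this is any particular engine’s code.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct HzbMip {
  int width, height;
  std::vector<float> depth;   // farthest depth per texel
};

// mips[0] is full resolution; every level after it is half the size.
bool IsVisibleHzb(const std::vector<HzbMip>& mips,
                  float minX, float minY, float maxX, float maxY,   // screen pixels
                  float objectNearestDepth) {
  // Pick the mip where the object's screen rect covers at most ~2x2 texels.
  const float sizePx = std::max(maxX - minX, maxY - minY);
  int mip = static_cast<int>(std::ceil(std::log2(std::max(sizePx, 1.0f))));
  mip = std::min(mip, static_cast<int>(mips.size()) - 1);

  const HzbMip& m = mips[mip];
  const float texel = static_cast<float>(1 << mip);
  const int x0 = std::clamp(static_cast<int>(minX / texel), 0, m.width  - 1);
  const int y0 = std::clamp(static_cast<int>(minY / texel), 0, m.height - 1);
  const int x1 = std::clamp(static_cast<int>(maxX / texel), 0, m.width  - 1);
  const int y1 = std::clamp(static_cast<int>(maxY / texel), 0, m.height - 1);

  // Read the handful of texels covering the rect and keep the farthest value.
  float farthest = 0.0f;
  for (int y = y0; y <= y1; ++y) {
    for (int x = x0; x <= x1; ++x) {
      farthest = std::max(farthest, m.depth[y * m.width + x]);
    }
  }
  // Only if even the object's closest point is behind the farthest stored
  // depth can we declare it hidden; otherwise assume it's visible.
  return objectNearestDepth <= farthest;
}
```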
But we have a chicken-and-egg situation here: all of this depends on HAVING a depth buffer, and that’s created as part of rendering the scene. But we need the depth buffer to decide WHAT to render,
to save us work. Which is a bit of a problem, isn’t it? Early GPUs were pretty restricted in what they allowed you to do, and one thing that wasn’t easy was reading the depth
buffer back so that we could play with the data ourselves. One way around that is to not bother with
the GPU at all, and implement a software rasterizer. I mean, Quake came out in 1996 and required
a Pentium 75, so there’s no reason that with vastly more powerful hardware, you shouldn’t
be able to spit out a few polygons on screen really really quickly. And you can do all sorts of optimizations
here, remember the occlusion culling doesn’t have to be perfect, it has to be good enough
and really fast. So one thing that you can do is instead of
drawing a full-resolution buffer, you could render to half resolution, or go even further and use quarter resolution; whatever works, right? And you can still hand-build your occluders,
well I mean have artists make them, and your software rasterizer is basically only responsible
for drawing out something that would look low-poly even in the PS1 era. You also have a lot of flexibility in your
occluder shapes now, so say, a wall with a window was now possible. There wasn’t any good way to do that before,
but now it’s pretty easy to handle that case, so this is huge progress. As we pan out, we can see that the window
is being taken into account for the occlusion culling, which is amazing. The approach itself isn’t that complex. The initial steps of this are really similar
to what we had before, we do basic frustum culling and such on the scene, and we also
get the list of occluders. The new part is here: we have our new HZB, and that can be whatever resolution is good enough and fast. Then you’d draw all of your occluders into
that, and once that’s done you’ll go ahead and downsample that to form the hierarchy. Then you use that filled in hierarchy to manually
test objects to see if they’re visible or not, and that’s it. Conceptually, it’s not super complex. So here we can see the HZB in action with
rendered objects. In the top right corner, we’ve got the top mip level on display, basically the depth buffer that was drawn out. We’ve also got a few more levels of the HZB underneath, at various mip levels. One of the really interesting things to notice is how, as you go towards the lowest resolution, the bright or “far away” pixels kinda take over everything. That’s because we do this conservatively: as we downsample, we always pick the furthest-away value.
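That downsample step is tiny. Here’s roughly what it looks like, using the same HzbMip struct as the earlier sketch and assuming power-of-two dimensions so we don’t have to worry about edge texels:

```cpp
#include <algorithm>

// Every texel in the next mip takes the FARTHEST (largest) of the four texels
// underneath it, which is why the far values spread as the resolution drops.
HzbMip Downsample(const HzbMip& src) {
  HzbMip dst;
  dst.width  = std::max(src.width  / 2, 1);
  dst.height = std::max(src.height / 2, 1);
  dst.depth.resize(static_cast<size_t>(dst.width) * dst.height);

  for (int y = 0; y < dst.height; ++y) {
    for (int x = 0; x < dst.width; ++x) {
      const int sx = x * 2, sy = y * 2;
      const float d00 = src.depth[(sy    ) * src.width + sx    ];
      const float d10 = src.depth[(sy    ) * src.width + sx + 1];
      const float d01 = src.depth[(sy + 1) * src.width + sx    ];
      const float d11 = src.depth[(sy + 1) * src.width + sx + 1];
      dst.depth[y * dst.width + x] = std::max(std::max(d00, d10),
                                              std::max(d01, d11));
    }
  }
  return dst;
}
```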
Now of course, doing it this way isn’t perfect, but it gives you a bunch of advantages over the previous, box-only method. Recall the example we looked at earlier, of
having an object sitting behind 2 occluders, which failed miserably because of the approach
being used. Well, using this new approach of generating
your occlusion map CPU-side, the union of occluders is not only possible, but a natural
consequence of the approach, as you can see here. We’ve got 2 occluders, side by side, and
they’re occluding the object behind it. In fact, we can change this up and now we
can have a whole bunch of thin objects, lined up side by side, and the overall net effect
of them is the same as having 1 giant wall. Look at the object behind, it now stays yellow,
the system is now reporting it as occluded. Or better yet, we can use non-box shapes: here I’ve got a disc, and the technique just doesn’t care at all; the shape of your occluders doesn’t matter in the slightest. Here, we’re using the infamous Stanford bunny, but as an occluder. The shape of your occluders no longer matters,
which is amazing, because it suddenly grants this huge amount of flexibility that was lacking
before. An example of a game that launched with this
was Killzone 3. In their SIGGRAPH 2011 talk, Guerrilla Games talks about using the SPUs on the PS3. Now if you’ve never done any PS3 development,
the PS3 shipped with what was called the Cell Broadband Engine: under the hood you had a PowerPC main CPU, an Nvidia GPU, and developers had access to, I believe, 6 of the SPUs, or Synergistic Processing Units, which were these stupidly powerful but difficult-to-use
vector processors. Anyway, programming for them was a pain, but
if you had the right workload and a lot of patience, once you got them working, boy could
they do a lot. So Killzone 3 generated their occlusion map
using a simplified version of the scene, a lower-resolution depth buffer, and some highly optimized SPU-side rendering. The Battlefield 3 developers gave a similar talk at GDC 2011, entitled Culling the Battlefield, where they went into detail on a really similar
approach for the famous Frostbite engine used in so many EA games. But let’s be real, GPUs are just way better at drawing than the CPU is, because the raw horsepower they have is absolutely
unmatched. And that gap is only growing every year, not
getting smaller. Ideally, then, you want the GPU to simply tell us what we want to know, which is exactly what started happening. We started getting support for things like
hardware occlusion queries, a mechanism for just asking the GPU, hey how much did you
draw? The idea is pretty simple, you could draw
part of your scene, like maybe just the major occluders. Then once you’ve done that, you want to
figure out all the small things, whether they’re visible or not, so you take the bounding volumes
of those objects, and issue queries to the GPU asking if they’re visible or not. If they’re not visible, you can ignore them,
if they are, you can choose to render them for real later.
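Here’s roughly what one of those queries looks like in OpenGL 3.3 or later. The Object type and the drawBoundingBox helper are stand-ins for whatever your engine has; the query calls themselves are the real API.

```cpp
#include <GL/glew.h>   // or whichever loader you use

// Assumes the major occluders have already been drawn into the depth buffer.
bool IsProbablyVisible(const Object& object) {
  GLuint query = 0;
  glGenQueries(1, &query);

  // Don't actually write colour or depth while testing the bounding box.
  glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
  glDepthMask(GL_FALSE);

  glBeginQuery(GL_ANY_SAMPLES_PASSED, query);
  drawBoundingBox(object);                 // cheap proxy, not the real mesh
  glEndQuery(GL_ANY_SAMPLES_PASSED);

  glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
  glDepthMask(GL_TRUE);

  // Reading the result right away stalls until the GPU catches up, which is
  // exactly the problem described below; real engines defer this read.
  GLuint anySamplesPassed = 0;
  glGetQueryObjectuiv(query, GL_QUERY_RESULT, &anySamplesPassed);
  glDeleteQueries(1, &query);
  return anySamplesPassed != 0;
}
```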
Articles like “Hardware Occlusion Queries Made Useful” appeared in places like the GPU Gems series, going into detail on
how you could use these queries to build very general purpose occlusion culling systems. The team on Splinter Cell Conviction went
this route at first, ditching their precomputed visibility system in favour of a new one built
around this newly exposed functionality. The problem was that issuing a lot of these
queries incurred CPU overhead, and you don’t get the answer back immediately. Both of these pose significant problems. If you want to cut down on queries, your first
idea is probably some sort of hierarchy of bounding volumes, where each node represents the bounding volume of all its children. So this root node, for example, is the bounding
volume of everything. So given a node in the tree, you’d do a
query using that node’s combined bounding volume, saving you the CPU overhead of having to run a query on each object. If that query fails, the entire subtree isn’t visible, and you can safely skip it.
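Sketched out, the traversal looks something like this. Node, QueryHandle and the two query functions are made-up stand-ins for whatever wraps the hardware queries, and notice that waiting on each result before deciding whether to recurse is exactly the stall that’s about to bite us.

```cpp
#include <vector>

struct Node {
  // the bounding volume of this node and everything beneath it lives here
  std::vector<Node*> children;
  std::vector<int>   objects;   // leaf payload: indices of renderable objects
};

void CullHierarchy(Node* node, std::vector<int>& visibleObjects) {
  QueryHandle q = IssueOcclusionQuery(node);   // draw the node's bounding volume
  if (!QueryPassed(q)) {                       // blocks until the GPU answers!
    return;                                    // whole subtree hidden, skip it
  }
  for (int obj : node->objects) {
    visibleObjects.push_back(obj);
  }
  for (Node* child : node->children) {
    CullHierarchy(child, visibleObjects);
  }
}
```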
But that, of course, runs headfirst into the other issue: you don’t get the answer immediately. So you could just stall and wait for the answer,
which is super easy, and just an awful idea. You don’t want the GPU or CPU just sitting
there idle. So then you come up with various attempts
to mitigate THAT problem, like Nvidia was proposing here, but ultimately, it’s a whole
can of worms and difficult to get working well, which is why the Splinter Cell team
just wholly abandoned all the work they did and went in another direction. Late in the project, they pivoted away from
their query based system to something that’s getting a lot closer to what’s in use today. They take their set of occluders, and remember
these can be simplified versions of existing geometry, and render them out to a depth buffer. Now they have a depth buffer, and they use
that to build a depth hierarchy, similar to what we showed before. So each level is half the resolution of the
last, and you take the maximum depth of the 4 texels from the higher resolution map that
turn into 1 in the lower resolution. Then they take everything they want to test and draw it as a set of points: each point knows the screen-space box extents of the object it’s testing and the depth it’s testing at, and it’s responsible for outputting a single pixel marked visible or not visible. Then that final, single render target is read
back from the GPU to the CPU, and now they can see the results of their queries. And the awesome thing about this approach
was that most of it happened on the GPU; there’s only this single stall at the end, reading back. But in reality, renderers have a crapload of work to do on the CPU as well, so if you can front-load this occlusion work, meaning you kick it off as early in your frame update as you can, then fill the gap with whatever other bookkeeping work you can find, let the GPU do its thing, and only then do your single read-back, it’s not the end of the world. One set of problems that we’d like to get
away from, though, is the manual work of authoring occlusion geometry and having to select occluders to use. That’s time that could be spent
on making more content, polish, etc. One observation that you can make about any
given scene is that, from one frame to the next, very little changes. In reality, to generate the current frame,
we mostly take the last frame and mutate it a tiny bit. Objects might be moving around, but at 60fps,
for example, that translates into barely any movement. Same with the camera: it will have moved, but only ever so slightly. What if we could exploit this somehow? One attempt to exploit this temporal coherence
was described by the developers from Assassin’s Creed. Full disclosure, whether or not they came
up with it first, I have no idea, but it’s where I saw it talked about. Their approach seemed to be to take a bunch
of the nearby stuff and render it into a depth buffer. Then, they’d take last frame’s depth buffer,
reproject that to the current frame, and combine those 2 together. Wait, what’s reprojection? The idea isn’t overly complex. If you know roughly where things were last frame, where the camera was, and which direction things were moving, then you can reproject them, or in other words, guess where they should be this frame. That’s all.
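For the camera-only case, a reprojection sketch using GLM-style math looks like this. Moving objects would also need their own motion applied on top, and depth conventions vary between APIs, so treat it as the shape of the idea rather than drop-in code.

```cpp
#include <glm/glm.hpp>

// Take a pixel (as a 0..1 UV) plus its depth from last frame, reconstruct the
// world position, and re-project it with this frame's camera.
glm::vec2 Reproject(glm::vec2 uvLast, float depthLast,
                    const glm::mat4& lastViewProj,
                    const glm::mat4& currViewProj) {
  // Last frame's position in normalized device coordinates ([-1, 1] range).
  glm::vec4 ndc(uvLast.x * 2.0f - 1.0f, uvLast.y * 2.0f - 1.0f,
                depthLast * 2.0f - 1.0f, 1.0f);

  // Undo last frame's camera to get back to a world-space point.
  glm::vec4 world = glm::inverse(lastViewProj) * ndc;
  world /= world.w;

  // Apply this frame's camera to find where that point lands now.
  glm::vec4 clip = currViewProj * world;
  glm::vec3 now  = glm::vec3(clip) / clip.w;
  return glm::vec2(now.x * 0.5f + 0.5f, now.y * 0.5f + 0.5f);
}
```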
Anyway, they describe using this approach of combining the old and new, creating the HZB from that, and then doing their occlusion
queries. So that’s really a neat approach, but unfortunately quite difficult to manage properly, with a lot of special-casing needed in the engine and implementation to
accommodate this approach. The basic assumption, that stuff doesn’t
change too much from frame to frame, that’s still good, but the technique wasn’t quite
right. So that leads us to more or less what’s
being used today in engines like Unreal. The basic idea is that all the stuff that
passed the visibility test last frame, they’re very likely visible this frame, so take those
probably visible objects, and use them as occluders. So we draw all of those, and then we take
the depth or z-buffer and we build the HZB. Now, you might be thinking this doesn’t seem totally right, because let’s say that an object was off-screen before, and now it’s suddenly on-screen. It should be occluding things, but that’s not happening, right? That’s OK; this approach is conservative,
you might overdraw a bit. Like I said before, that’s better than missing
things. Secondly, because it was off-screen, and we
only drew things that were on-screen, we didn’t even draw this new object. Well, that’s a problem, so we need to move
on to the 2nd phase of this algorithm. This approach is known as 2-pass occlusion
culling because, well, it has 2 passes. It’s right there in the name. It’s not a mystery. In this 2nd pass, we’re going to go over
everything that wasn’t drawn the first time, and we’re going to re-test it against the
HZB that we built in step 1. So that big shiny new object that moved into
the middle of the screen? That will obviously pass the test, and get
drawn. And because nowadays we have compute shaders, which can generate the arguments for draw calls themselves, this entire pipeline, all of these steps, takes place on the GPU, which is awesome.
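Pulled together, the whole two-pass flow has roughly this shape. This is C++-flavoured pseudocode, not Unreal’s code: every type and function in it (Scene, Object, Hzb, DrawObjects, BuildHzbFromDepthBuffer, IsVisibleAgainstHzb, EverythingDrawnThisFrame) is a stand-in, and in a real engine each of these steps is a compute shader plus indirect draws rather than a CPU loop.

```cpp
#include <vector>

void RenderFrame(Scene& scene) {
  // Pass 1: draw whatever passed the visibility test LAST frame. Those objects
  // are very likely still visible, and they become this frame's occluders.
  DrawObjects(scene.visibleLastFrame);
  Hzb hzb = BuildHzbFromDepthBuffer();   // depth buffer -> mip chain of farthest values

  // Pass 2: re-test everything that wasn't drawn in pass 1 against that HZB and
  // draw whatever turns out to be visible, like objects that just came on screen.
  std::vector<Object*> newlyVisible;
  for (Object* obj : scene.notDrawnInPass1) {
    if (IsVisibleAgainstHzb(hzb, obj)) {
      newlyVisible.push_back(obj);
    }
  }
  DrawObjects(newlyVisible);

  // Everything drawn this frame seeds next frame's pass 1.
  scene.visibleLastFrame = EverythingDrawnThisFrame(scene, newlyVisible);
}
```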
What does the future bring? I’m not sure. There’s really exciting work being presented where individual objects are now being broken apart and occlusion-culled, which is a logical next step, and then you have technology like Nanite in Unreal Engine, which is something of an automatic level-of-detail system. Very cool stuff. Be under no illusions that this is a complete
picture of how visibility is done, but it should give you a pretty decent overview of
what’s used today, and how we got there. We haven’t touched on things like precomputed
visibility, or other aspects of just making the scene cheaper in general to render. I can take this scene, and with very little
in the way of visible differences, the scene can suddenly render significantly faster. There’s so much work that goes into modern
rendering, and we’ve just started scratching the surface. Cheers