Despite having been a graphics programmer
since the early 2000s, the first time I stepped into Horizon Forbidden West, I was taken aback by the sheer scale of the world. How do games like this, or Assassin’s Creed, or Spider-Man, which also boast enormous, detailed worlds, draw all of this without compromising on performance? It’s really interesting, because the techniques
used have been built on and refined for decades at this point. So the first method is relatively simple. Let’s start with a world and populate it
with some stuff. It doesn’t really matter what that stuff
is, but we’ll keep this relatively simple. Now, let’s take a look from above, so let’s
pretend that we’re taking an eagle’s eye view of what the player is seeing. So that’s the player down there, and we
can visualize what they see with this green box. This is called the view frustum; it’s what they can see from their perspective. A view frustum is basically like a box. It’s defined by 6 sides: you have the left and right, which of course correspond to the left and right edges of your screen; you have the top and bottom, which correspond to the top and bottom of your screen; and finally you have the near and far planes, the near being how close things can get before they’re cut off, and the far being how far you can see. Those 6 planes define everything that’s visible to the player. On top of that, each object in the scene can
be thought of as having a simple volume around it. Spheres are popular because doing math with spheres is really simple; things like distance calculations are a breeze. You can get a much tighter fit with a box though, the downside of course being that the math tends to be a tiny bit more complex, but not overly so, making boxes a nice tradeoff between fit and efficiency. Now that we have this idea of a viewing volume,
and each object having a simple bounding volume, you’re simply going to do an intersection test: for every single object in the scene, you test whether its bounding volume intersects with the view frustum. Everything that’s completely outside is discarded, because it’s not visible, and everything that’s inside or bordering is drawn. Remember, this test needs to be conservative, meaning that if you draw a little too much, that’s totally OK, because that’s better than drawing too little.
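Just to make that concrete, here’s a minimal sketch of a sphere-versus-frustum test. The Vec3, Plane and Sphere structs are just illustrative helpers, not any particular engine’s types, and the usual convention of six planes with inward-facing normals is assumed.

```cpp
struct Vec3   { float x, y, z; };
struct Plane  { Vec3 normal; float d; };      // plane equation: dot(normal, p) + d = 0
struct Sphere { Vec3 center; float radius; };

float SignedDistance(const Plane& pl, const Vec3& p) {
  return pl.normal.x * p.x + pl.normal.y * p.y + pl.normal.z * p.z + pl.d;
}

// Conservative on purpose: if the sphere merely straddles a plane we keep it,
// so we might draw a little too much, but never too little.
bool IntersectsFrustum(const Plane frustum[6], const Sphere& s) {
  for (int i = 0; i < 6; ++i) {
    if (SignedDistance(frustum[i], s.center) < -s.radius) {
      return false;   // completely outside this plane: cull it
    }
  }
  return true;        // inside or bordering: draw it
}
```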
If we do that, here’s our scene: everything that’s being discarded is coloured red, while everything that’s being kept is coloured green. And as an example, if our world happens to contain hundreds of objects, well, with this one simple pass we’ve already whittled that number down considerably. But frustum culling, as awesome as it is,
isn’t usually enough, as the recently released Cities: Skylines 2 found out. Ignoring all the other problems with the game,
this report from Tom’s Hardware shows that they relied solely on frustum culling, which
is a really big mistake. It’s the first step, not the only step. You need to go further, and this is where
occlusion culling comes in. Let’s take this simple example of a camera
and some objects sitting in the world. We can easily discard what’s blatantly not visible with our view frustum check: everything outside of the view frustum has already been discarded, leaving just the things inside it. Now imagine a big object here, near the camera,
let’s say that it’s a wall, or a building, or anything big and solid. You can’t see past this thing, so we can
draw lines from the camera to the edges of this object, and extend them beyond, forming
a volume, an area where realistically, nothing inside of it is visible to you. So that’s occlusion, but how do we calculate
occlusion? And most importantly, how do games manage
to do it so fast? Because this needs to be done near instantly. Let’s start with a really simple idea, that
occlusion culling isn’t all that different from frustum culling, at least in this simple
case here. In frustum culling, we’re getting rid of
what’s OUTSIDE of this area, defined by our view or our camera. Taking a super simple approach to occlusion culling, it’s almost the reverse of that: instead of getting rid of everything that’s outside this area, we want to get rid of everything that’s inside of it. Inside, outside; outside, inside. That in itself isn’t all that complex,
and perhaps surprisingly, some big games in the past have been shipped with this extremely
simple but effective approach. Basically, all that it entails is that somebody
goes into Maya, or whatever the 3D editing software is, and they manually author occlusion
volumes. So let’s say that you have your object in
the world, you’ve got a big fat tree or rock or whatever, doesn’t really matter
that much. Then an artist would secretly place a box,
or several boxes, inside, and these boxes aren’t visible to players, they’re only
used by the rendering system. Ideally, you’d place a box that fits pretty
snug inside, without spilling out. And you’d do this for every object in the
world that fits some sort of criteria, like it’s big enough to be important, that sort
of thing. Artists didn’t have time to go and make
an occlusion volume for every pebble and twig on the ground, and you as an engine programmer
don’t want to have to deal with that much data anyway. So how is it that you actually perform occlusion culling with one of these boxes? There are a bunch of ways, and it’s not particularly complex. Let’s draw a border, and this will be our screen, and now let’s show the box on the screen. We basically need to project the box to screen space, which might sound fancy, but it just means figuring out where it lands on the screen. See this silhouette here? It’s defined by at least 4 and up to 6 sides, which you can create planes out of, forming a viewing frustum, same as the one we just dealt with for the camera, except in reverse.
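Here’s one way that check could look, reusing the Plane and Sphere helpers from the frustum sketch earlier. This is just a sketch of the idea, not any shipped engine’s code: the planes face inward, one per silhouette edge of the occluder, plus a cap plane lying on the occluder itself so that things sitting in front of it aren’t culled.

```cpp
#include <vector>

// For occlusion the test flips compared to frustum culling: the object has to
// be ENTIRELY inside the occluder's frustum before we're allowed to skip it.
bool IsOccluded(const std::vector<Plane>& occluderFrustum, const Sphere& s) {
  for (const Plane& pl : occluderFrustum) {
    if (SignedDistance(pl, s.center) < s.radius) {
      return false;   // part of the sphere pokes out, so we can't prove it's hidden
    }
  }
  return true;        // fully inside the occluded volume: safe to skip drawing
}
```

Note the comparison against +radius instead of -radius: frustum culling only rejects things that are fully outside, while occlusion culling only rejects things that are fully inside.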
So now that you’ve got your data, your set of all the occluders in the scene, the first thing you’re probably going to do is a basic frustum culling pass on them, to make sure they’re even relevant. This is where things get a little custom for
each approach, but basically you’re going to pick a bunch of them. How you pick them is kinda up to you; optimization is a game of tradeoffs, you need to decide how much time to burn in order to save time, and this is just another instance of that. Pick too many, and you may just spend way
too much time going through them. Pick too few, and you may not cull enough. In practice, you may just do something super
simple, like pick the N closest ones and hope for the best, which, according to their SIGGRAPH presentation, is what Just Cause 2 did. The developers at Avalanche Studios outlined a simple and easy approach, with a SIMD-optimized box culling system, which they called BFBC, or brute force box culling, and that was it, no fancy structures, nothing. The very first game I worked on, Prototype,
did something really similar, if I’m remembering right, it’s been so long since I looked
at that code. So while tree structures can be awesome and theoretically faster, a well-optimized, hand-written, SIMD brute-force version may be more than enough. If you catch yourself wondering, “wouldn’t a tree be faster?”, watch my video “Memory, Cache Locality, and why Arrays are Fast”. Theoretically the tree should win, but a brute-force, SIMD-optimized version may absolutely destroy it on modern hardware.
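To give a flavour of what “no fancy structures” means, here’s a sketch of the flat-array idea. I’m using bounding spheres and plain scalar code to keep it short, where those systems used boxes and hand-written SIMD, but the point is the data layout: tight, contiguous arrays that a vectorizer can chew through.

```cpp
#include <cstdint>
#include <vector>

// Structure-of-arrays layout: one flat array per component, no pointers to chase.
struct SpheresSoA {
  std::vector<float> x, y, z, radius;
};

// Test every sphere against one plane and clear the visibility flag of anything
// that's fully outside it. Tight loops over flat data like this are exactly
// what brute force is fast at.
void CullAgainstPlane(const SpheresSoA& s,
                      float nx, float ny, float nz, float d,
                      std::vector<uint8_t>& visible) {
  for (size_t i = 0; i < s.x.size(); ++i) {
    const float dist = nx * s.x[i] + ny * s.y[i] + nz * s.z[i] + d;
    if (dist < -s.radius[i]) {
      visible[i] = 0;
    }
  }
}
```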
So once you’ve got something like that in place, we can drop a simple occluder into the scene, and bam, suddenly a lot of objects
that are behind it aren’t getting coloured green anymore, they’re yellow, which in
this case means they are being occluded. It’s not perfect, and some things can slip
through the cracks, we can trivially set up a situation that doesn’t work the way we
might expect it to. Like here we have 2 occluders, side by side,
and 1 object behind them, but in the middle. Since neither of these occluders fully occludes
the object behind them individually, and we don’t create a sort of union of occluders,
because that’s complicated, that object gets flagged as drawn. So that’s it, that’s how modern games
do occlusion culling? Well, you can use this approach if you need
a decent, but not absolutely state-of-the-art, system. What it does do, though, is serve as a great jumping-off point for understanding some of the other techniques. When you draw a scene, you don’t only generate
the colour that’s on the screen, you also generate what’s called a depth buffer, and
this is really useful for the GPU. What this does is keep track of the depth of every pixel on the screen, and this allows us, and the GPU, to do a variety of things. For the GPU, it means it can do things like
render things in the right order, but also discard things that aren’t visible. If we didn’t have a depth buffer, the scene
would be a mess, parts of objects would draw through others, and then you’d have to resort
to sorting the individual triangles of the entire scene in order to have any hope of
drawing things somewhat OK. With a depth buffer, the GPU can simply compare the depth of what you’re rendering against what’s already there: if it’s closer, it gets drawn, and if not, it doesn’t.
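Spelled out as plain C++, rather than the fixed-function hardware that actually does it, the depth test boils down to a couple of lines:

```cpp
#include <cstdint>
#include <vector>

// One colour value and one depth value per pixel; a pixel only gets written if
// the new fragment is closer than whatever is already stored there.
void WritePixel(int x, int y, float depth, uint32_t colour, int width,
                std::vector<float>& depthBuffer,
                std::vector<uint32_t>& colourBuffer) {
  const size_t i = static_cast<size_t>(y) * width + x;
  if (depth < depthBuffer[i]) {   // closer than what's already there?
    depthBuffer[i]  = depth;
    colourBuffer[i] = colour;
  }
  // otherwise it's hidden behind something, so it gets discarded
}
```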
For us, it’s pretty interesting too, because we have a texture with the depth of the scene available, so we can create effects
like this fire that doesn’t clip into the scene, or even cheat and make really cheap
reflections, like we’re seeing in this water. The interesting part, in the context of occlusion culling, is how it allows us to discard entire objects that aren’t visible. If we already had a depth buffer drawn, let’s
ignore the problem of HOW do we generate one, and let’s say that I can somehow magic one
into existence at the beginning of the frame, how useful would it be? The answer is of course, SUPER USEFUL. If we SOMEHOW had that, we’d be able to
read from it, and do comparisons, so if I wanted to check if I can draw an object, we
could take its bounding volume, and we could take the screen-space bounds of that,
map those to our depth buffer, and do a comparison of the object’s depth in the scene with
the values in the depth buffer. If it’s behind the values in the depth buffer,
poof, we eliminate the entire object without the GPU ever having done any work at all. In fact, we could go a step further: say I take that depth buffer and progressively downsample it to form a mip chain, or hierarchy
of these occlusion maps. Well then, when we project the screen space
bounds of our object, we could simply choose the appropriate mip level so that we only
have to read 4 texels, and bam, an even faster and easier comparison. These are called hierarchical z-buffers, often referred to as HZBs, which is what we’ll call them from this point forward. So remember, an HZB is this hierarchy of z-buffers, a chain of progressively downsampled occlusion maps.
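Here’s a rough sketch of what that per-object test could look like, written CPU-side so it’s easy to read. It assumes a depth convention where bigger means farther, that each mip stores the farthest depth of the texels beneath it, and that the object’s bounding box has already been projected to screen space; none of this is any particular engine’s code.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct HzbMip {
  int width, height;
  std::vector<float> depth;   // farthest depth per texel
};

// mips[0] is full resolution; every level after it is half the size.
bool IsVisibleHzb(const std::vector<HzbMip>& mips,
                  float minX, float minY, float maxX, float maxY,   // screen pixels
                  float objectNearestDepth) {
  // Pick the mip where the object's screen rect covers at most ~2x2 texels.
  const float sizePx = std::max(maxX - minX, maxY - minY);
  int mip = static_cast<int>(std::ceil(std::log2(std::max(sizePx, 1.0f))));
  mip = std::min(mip, static_cast<int>(mips.size()) - 1);

  const HzbMip& m = mips[mip];
  const float texel = static_cast<float>(1 << mip);
  const int x0 = std::clamp(static_cast<int>(minX / texel), 0, m.width  - 1);
  const int y0 = std::clamp(static_cast<int>(minY / texel), 0, m.height - 1);
  const int x1 = std::clamp(static_cast<int>(maxX / texel), 0, m.width  - 1);
  const int y1 = std::clamp(static_cast<int>(maxY / texel), 0, m.height - 1);

  // Read the handful of texels covering the rect and keep the farthest value.
  float farthest = 0.0f;
  for (int y = y0; y <= y1; ++y) {
    for (int x = x0; x <= x1; ++x) {
      farthest = std::max(farthest, m.depth[y * m.width + x]);
    }
  }
  // Only if even the object's closest point is behind the farthest stored
  // depth can we declare it hidden; otherwise assume it's visible.
  return objectNearestDepth <= farthest;
}
```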
But we have a chicken-and-egg situation here: all of this depends on HAVING a depth buffer, and that’s created as part of rendering the scene. But we need the depth buffer to decide WHAT to render,
to save us work. Which is a bit of a problem, isn’t it? Early GPUs were pretty restricted in what they allowed you to do, and one thing that wasn’t easy was reading the depth
buffer back so that we could play with the data ourselves. One way around that is to not bother with
the GPU at all, and implement a software rasterizer. I mean, Quake came out in 1996 and required
a Pentium 75, so there’s no reason that with vastly more powerful hardware, you shouldn’t
be able to spit out a few polygons on screen really really quickly. And you can do all sorts of optimizations
here, remember the occlusion culling doesn’t have to be perfect, it has to be good enough
and really fast. So one thing that you can do is instead of
drawing a full-resolution buffer, you could render to half resolution, or go even further and use quarter resolution; whatever works, right? And you can still hand-build your occluders,
well I mean have artists make them, and your software rasterizer is basically only responsible
for drawing out something that would look low-poly even in the PS1 era. You also have a lot of flexibility in your
occluder shapes now, so say, a wall with a window was now possible. There wasn’t any good way to do that before,
but now it’s pretty easy to handle that case, so this is huge progress. As we pan out, we can see that the window
is being taken into account for the occlusion culling, which is amazing. The approach itself isn’t that complex. The initial steps of this are really similar
to what we had before, we do basic frustum culling and such on the scene, and we also
get the list of occluders. The new part is here: we have our new HZB, and that can be whatever resolution is good enough and fast. Then you’d draw all of your occluders into
that, and once that’s done you’ll go ahead and downsample that to form the hierarchy. Then you use that filled in hierarchy to manually
test objects to see if they’re visible or not, and that’s it. Conceptually, it’s not super complex. So here we can see the HZB in action with
rendered objects. In the top right corner, we’ve got the top mip level on display, basically the depth buffer that was drawn out. We’ve also got a few more levels of the HZB underneath, at various mip levels. One of the really interesting things to notice is how, as you go towards the lowest resolution, the bright or “far away” pixels kinda take over everything. That’s because we do this conservatively: as we downsample, we always pick the furthest-away value.
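That downsample step is tiny. Here’s roughly what it looks like, using the same HzbMip struct as the earlier sketch and assuming power-of-two dimensions so we don’t have to worry about edge texels:

```cpp
#include <algorithm>

// Every texel in the next mip takes the FARTHEST (largest) of the four texels
// underneath it, which is why the far values spread as the resolution drops.
HzbMip Downsample(const HzbMip& src) {
  HzbMip dst;
  dst.width  = std::max(src.width  / 2, 1);
  dst.height = std::max(src.height / 2, 1);
  dst.depth.resize(static_cast<size_t>(dst.width) * dst.height);

  for (int y = 0; y < dst.height; ++y) {
    for (int x = 0; x < dst.width; ++x) {
      const int sx = x * 2, sy = y * 2;
      const float d00 = src.depth[(sy    ) * src.width + sx    ];
      const float d10 = src.depth[(sy    ) * src.width + sx + 1];
      const float d01 = src.depth[(sy + 1) * src.width + sx    ];
      const float d11 = src.depth[(sy + 1) * src.width + sx + 1];
      dst.depth[y * dst.width + x] = std::max(std::max(d00, d10),
                                              std::max(d01, d11));
    }
  }
  return dst;
}
```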
Now of course, doing it this way isn’t perfect, but it gives you a bunch of advantages over the previous, box-only method. Recall the example we looked at earlier, of
having an object sitting behind 2 occluders, which failed miserably because of the approach
being used. Well, using this new approach of generating
your occlusion map CPU-side, the union of occluders is not only possible, but a natural
consequence of the approach, as you can see here. We’ve got 2 occluders, side by side, and
they’re occluding the object behind it. In fact, we can change this up and now we
can have a whole bunch of thin objects, lined up side by side, and the overall net effect
of them is the same as having 1 giant wall. Look at the object behind, it now stays yellow,
the system is now reporting it as occluded. Or better yet, we can use non-box shapes: here I’ve got a disc, and the technique just doesn’t care at all; the shape of your occluders doesn’t matter in the slightest. Here, we’re using the infamous Stanford bunny, but as an occluder. The shape of your occluders no longer matters,
which is amazing, because it suddenly grants this huge amount of flexibility that was lacking
before. An example of a game that launched with this
was Killzone 3. In their SIGGRAPH 2011 talk, Guerrilla Games talks about using the SPUs on the PS3. Now if you’ve never done any PS3 development,
the PS3 shipped with what was called the Cell Broadband Engine: under the hood you had a PowerPC main CPU, an Nvidia GPU, and developers had access to, I believe, 6 of the SPUs, or Synergistic Processing Units, which were these stupidly powerful but difficult-to-use
vector processors. Anyway, programming for them was a pain, but
if you had the right workload and a lot of patience, once you got them working, boy could
they do a lot. So Killzone 3 generated their occlusion map
using a simplified version of the scene, a lower-resolution depth buffer, and some highly optimized SPU-side rendering. The Battlefield 3 developers gave a similar talk at GDC 2011, entitled Culling the Battlefield, where they went into detail on a really similar
approach for the famous Frostbite engine used in so many EA games. But let’s be real, GPUs are just way better at drawing than the CPU is, because the raw horsepower they have is absolutely
unmatched. And that gap is only growing every year, not
getting smaller. Ideally, then, you want the GPU to simply tell us what we want to know, which is exactly what started happening. We started getting support for things like
hardware occlusion queries, a mechanism for just asking the GPU, hey how much did you
draw? The idea is pretty simple, you could draw
part of your scene, like maybe just the major occluders. Then once you’ve done that, you want to
figure out all the small things, whether they’re visible or not, so you take the bounding volumes
of those objects, and issue queries to the GPU asking if they’re visible or not. If they’re not visible, you can ignore them,
if they are, you can choose to render them for real later.
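Here’s roughly what one of those queries looks like in OpenGL 3.3 or later. The Object type and the drawBoundingBox helper are stand-ins for whatever your engine has; the query calls themselves are the real API.

```cpp
#include <GL/glew.h>   // or whichever loader you use

// Assumes the major occluders have already been drawn into the depth buffer.
bool IsProbablyVisible(const Object& object) {
  GLuint query = 0;
  glGenQueries(1, &query);

  // Don't actually write colour or depth while testing the bounding box.
  glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
  glDepthMask(GL_FALSE);

  glBeginQuery(GL_ANY_SAMPLES_PASSED, query);
  drawBoundingBox(object);                 // cheap proxy, not the real mesh
  glEndQuery(GL_ANY_SAMPLES_PASSED);

  glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
  glDepthMask(GL_TRUE);

  // Reading the result right away stalls until the GPU catches up, which is
  // exactly the problem described below; real engines defer this read.
  GLuint anySamplesPassed = 0;
  glGetQueryObjectuiv(query, GL_QUERY_RESULT, &anySamplesPassed);
  glDeleteQueries(1, &query);
  return anySamplesPassed != 0;
}
```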
Articles like “Hardware Occlusion Queries Made Useful” appeared in places like the GPU Gems series, going into detail on
how you could use these queries to build very general purpose occlusion culling systems. The team on Splinter Cell Conviction went
this route at first, ditching their precomputed visibility system in favour of a new one built
around this newly exposed functionality. The problem was that issuing a lot of these
queries incurred CPU overhead, and you don’t get the answer back immediately. Both of these pose significant problems. If you want to cut down on queries, your first
idea is probably some sort of hierarchy of bounding volumes, where each node represents the bounding volume of all its children. So this root node, for example, is the bounding
volume of everything. So given a node in the tree, you’d do a
query using that node’s combined bounding volume, saving you the CPU overhead of having to run a query on each object. If that query fails, the entire subtree isn’t visible, and you can safely skip it.
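Sketched out, the traversal looks something like this. Node, QueryHandle and the two query functions are made-up stand-ins for whatever wraps the hardware queries, and notice that waiting on each result before deciding whether to recurse is exactly the stall that’s about to bite us.

```cpp
#include <vector>

struct Node {
  // the bounding volume of this node and everything beneath it lives here
  std::vector<Node*> children;
  std::vector<int>   objects;   // leaf payload: indices of renderable objects
};

void CullHierarchy(Node* node, std::vector<int>& visibleObjects) {
  QueryHandle q = IssueOcclusionQuery(node);   // draw the node's bounding volume
  if (!QueryPassed(q)) {                       // blocks until the GPU answers!
    return;                                    // whole subtree hidden, skip it
  }
  for (int obj : node->objects) {
    visibleObjects.push_back(obj);
  }
  for (Node* child : node->children) {
    CullHierarchy(child, visibleObjects);
  }
}
```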
But that, of course, runs headfirst into the other issue: you don’t get the answer immediately. So you could just stall and wait for the answer,
which is super easy, and just an awful idea. You don’t want the GPU or CPU just sitting
there idle. So then you come up with various attempts
to mitigate THAT problem, like Nvidia was proposing here, but ultimately, it’s a whole
can of worms and difficult to get working well, which is why the Splinter Cell team
just wholly abandoned all the work they did and went in another direction. Late in the project, they pivoted away from
their query based system to something that’s getting a lot closer to what’s in use today. They take their set of occluders, and remember
these can be simplified versions of existing geometry, and render them out to a depth buffer. Now they have a depth buffer, and they use
that to build a depth hierarchy, similar to what we showed before. So each level is half the resolution of the
last, and you take the maximum depth of the 4 texels from the higher resolution map that
turn into 1 in the lower resolution. Then they take everything they want to test and draw it as a set of points: each point knows the screen-space box extents of the object it’s testing and the depth it’s testing at, and it’s responsible for outputting a single pixel marked visible or not visible. Then that final, single render target is read
back from the GPU to the CPU, and now they can see the results of their queries. And the awesome thing about this approach
was that most of it happened on the GPU; there’s only this single stall at the end, reading back. But in reality, renderers have a crapload of work to do on the CPU as well, so if you can front-load this occlusion work, meaning you kick it off as early in your frame update as you can, then fill the gap with whatever other bookkeeping work you can find, let the GPU do its thing, and only then do your single read-back, it’s not the end of the world. One set of problems that we’d like to get
away from, though, is the manual work of authoring occlusion geometry and having to select occluders to use. That’s time that could be spent
on making more content, polish, etc. One observation that you can make about any
given scene is that, from one frame to the next, very little changes. In reality, to generate the current frame,
we mostly take the last frame and mutate it a tiny bit. Objects might be moving around, but at 60fps,
for example, that translates into barely any movement. Same with the camera: it will have moved, but only ever so slightly. What if we could exploit this somehow? One attempt to exploit this temporal coherence
was described by the developers from Assassin’s Creed. Full disclosure, whether or not they came
up with it first, I have no idea, but it’s where I saw it talked about. Their approach seemed to be to take a bunch
of the nearby stuff and render it into a depth buffer. Then, they’d take last frame’s depth buffer,
reproject that to the current frame, and combine those 2 together. Wait, what’s reprojection? The idea isn’t overly complex. If you know roughly where things were last frame, where the camera was, and which direction things were moving, then you can reproject them, or in other words, guess where they should be this frame. That’s all.
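For the camera-only case, a reprojection sketch using GLM-style math looks like this. Moving objects would also need their own motion applied on top, and depth conventions vary between APIs, so treat it as the shape of the idea rather than drop-in code.

```cpp
#include <glm/glm.hpp>

// Take a pixel (as a 0..1 UV) plus its depth from last frame, reconstruct the
// world position, and re-project it with this frame's camera.
glm::vec2 Reproject(glm::vec2 uvLast, float depthLast,
                    const glm::mat4& lastViewProj,
                    const glm::mat4& currViewProj) {
  // Last frame's position in normalized device coordinates ([-1, 1] range).
  glm::vec4 ndc(uvLast.x * 2.0f - 1.0f, uvLast.y * 2.0f - 1.0f,
                depthLast * 2.0f - 1.0f, 1.0f);

  // Undo last frame's camera to get back to a world-space point.
  glm::vec4 world = glm::inverse(lastViewProj) * ndc;
  world /= world.w;

  // Apply this frame's camera to find where that point lands now.
  glm::vec4 clip = currViewProj * world;
  glm::vec3 now  = glm::vec3(clip) / clip.w;
  return glm::vec2(now.x * 0.5f + 0.5f, now.y * 0.5f + 0.5f);
}
```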
Anyway, they describe using this approach of combining the old and new, creating the HZB from that, and then doing their occlusion
queries. So that’s really a neat approach, but unfortunately quite difficult to manage properly, with a lot of special-casing needed in the engine and implementation to
accommodate this approach. The basic assumption, that stuff doesn’t
change too much from frame to frame, that’s still good, but the technique wasn’t quite
right. So that leads us to more or less what’s
being used today in engines like Unreal. The basic idea is that all the stuff that
passed the visibility test last frame, they’re very likely visible this frame, so take those
probably visible objects, and use them as occluders. So we draw all of those, and then we take
the depth or z-buffer and we build the HZB. Now, you might be thinking this doesn’t seem totally right, because let’s say that an object was off-screen before, and now it’s suddenly on-screen. It should be occluding things, but that’s not happening, right? That’s OK; this approach is conservative,
you might overdraw a bit. Like I said before, that’s better than missing
things. Secondly, because it was off-screen, and we
only drew things that were on-screen, we didn’t even draw this new object. Well, that’s a problem, so we need to move
on to the 2nd phase of this algorithm. This approach is known as 2-pass occlusion
culling because, well, it has 2 passes. It’s right there in the name. It’s not a mystery. In this 2nd pass, we’re going to go over
everything that wasn’t drawn the first time, and we’re going to re-test it against the
HZB that we built in step 1. So that big shiny new object that moved into
the middle of the screen? That will obviously pass the test, and get
drawn. And because nowadays we have compute shaders, which can generate the arguments for draw calls themselves, this entire pipeline, all of these steps, takes place on the GPU, which is awesome.
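Pulled together, the whole two-pass flow has roughly this shape. This is C++-flavoured pseudocode, not Unreal’s code: every type and function in it (Scene, Object, Hzb, DrawObjects, BuildHzbFromDepthBuffer, IsVisibleAgainstHzb, EverythingDrawnThisFrame) is a stand-in, and in a real engine each of these steps is a compute shader plus indirect draws rather than a CPU loop.

```cpp
#include <vector>

void RenderFrame(Scene& scene) {
  // Pass 1: draw whatever passed the visibility test LAST frame. Those objects
  // are very likely still visible, and they become this frame's occluders.
  DrawObjects(scene.visibleLastFrame);
  Hzb hzb = BuildHzbFromDepthBuffer();   // depth buffer -> mip chain of farthest values

  // Pass 2: re-test everything that wasn't drawn in pass 1 against that HZB and
  // draw whatever turns out to be visible, like objects that just came on screen.
  std::vector<Object*> newlyVisible;
  for (Object* obj : scene.notDrawnInPass1) {
    if (IsVisibleAgainstHzb(hzb, obj)) {
      newlyVisible.push_back(obj);
    }
  }
  DrawObjects(newlyVisible);

  // Everything drawn this frame seeds next frame's pass 1.
  scene.visibleLastFrame = EverythingDrawnThisFrame(scene, newlyVisible);
}
```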
What does the future bring? I’m not sure. There’s really exciting work being presented where individual objects are now being broken apart and occlusion-culled, which is a logical next step, and then you have technology like Nanite in Unreal Engine, which is something of an automatic level-of-detail system. Very cool stuff. Be under no illusions that this is a complete
picture of how visibility is done, but it should give you a pretty decent overview of
what’s used today, and how we got there. We haven’t touched on things like precomputed
visibility, or other aspects of just making the scene cheaper in general to render. I can take this scene, and with very little
in the way of visible differences, the scene can suddenly render significantly faster. There’s so much work that goes into modern
rendering, and we’ve just started scratching the surface. Cheers