>>Nick Penwarden: There we go. All right,
a little technical difficulty. We are good to go now, though.
Good morning, everyone. Thanks for coming.
My name is Nick Penwarden. I am the Director
of Engineering for Unreal, and this morning,
I am going to be talking about some of the work that our
rendering team has been up to. This is a presentation
that we showed at GDC that Marcus Wassmer gave,
and I wanted to present it here to be able to give the same
information to you guys. Over the past year,
the rendering team have worked on a large refactor of the Mesh
Drawing Pipeline in UE4. I want to talk about
why we did it, what the benefits are, some of the benefits
that we have seen, and give you kind of an idea
of what you will need to do to upgrade
any rendering modifications that you have made to Unreal
when you update to 4.22. Talk a little bit about
where we are taking things. Why did we spend a year refactoring the Mesh
drawing code in UE4? One reason is that, I think if you were
at the keynote yesterday, I was talking about
how more people are making open world games. We want to really focus
on allowing people to build ever
more complex scenes, so supporting huge view ranges,
higher detail, stuff like that. When you are building a World,
we want to make that easier. You tend to build it modularly,
you pull in individual Meshes. You use that
to construct your World. We do not want artists
to have to worry as much about taking individual Meshes
and going and merging them to reduce draw call counts
for optimization purposes. Going forward,
dynamic lighting and shadowing are going
to become ever more important as games become more
and more dynamic. Those additional passes
are going to require drawing Meshes multiple times
for multiple viewpoints. Also, upcoming technologies
need more information about the scene
during evaluation. For instance, take DXR --
in DXR, when you shoot a ray, you can hit something
in the scene that might not be visible
in the primary view. When that happens,
you need to be able to evaluate the surface shader
for that triangle and return the data back to
the shader that cast that ray. That means that we need
all the information about the scene on the GPU, even if it was invisible
this frame. Traditional
rendering pipelines, the way UE4 used to work, the way most game engines
work -- that information is not
usually available on the GPU, because each frame,
based on the view, you only give the GPU
the data that it needs to render exactly what is visible.
Moving forward, as we want to move more
and more of the computation for rendering off of CPU
and onto the GPU, that means that the GPU
is going to need more and more information
about the scene to be able to make the decisions
that it needs to set up all the command buffers
needed for rendering. How do we get there? We started by looking
at draw call merging. What would we need to do to take multiple batches of draw calls and merge them down, to reduce the number of API calls that we make per frame? Looking at something
like D3D11, the only option for doing that,
really, is instanced drawing. In order for that to work, the only data
that you have per draw call that is going to be different
for different Objects in the scene
is the InstanceID. That means in order
to make that work, we can no longer
be setting shader parameters for every single draw call. Therefore, what we need to do is be able to have that data
all available on the GPU, and be able to access it using
only the PrimitiveID. Another way for us to get to
the point where we are doing less work
on the CPU as we are rendering is with more aggressive caching;
so things are not changing, making sure that we are
not doing duplicate work. A lot of the World
does not change every frame; it is very temporally coherent. Static scene draws, we want
to be able to build those when they are added
to the scene. Then only when they move
or update in a meaningful way should we invalidate that data. By doing more aggressive
caching, this allows the platform,
the RHI layer, as we call it, to prebuild as much as possible; things like the shader
binding table entry that I mentioned for DXR, and the graphics pipeline state
for DX12, for Vulkan, etc. Again, the goal here is to remove
all of the renderer overhead in building that static geometry
every single frame. I wanted to start by talking
about how Mesh drawing works in UE4, in 4.21 and earlier. Okay, so we start with
the Primitive Scene Proxy, and the Primitive Scene Proxy
is sort of the rendering thread's copy of the Primitive Component.
The Primitive Component, that is what lives
on the game thread, that is what game
code works with, that is what you move around
as Objects move in your game. The renderer has this copy
of the data, partly for
thread safety purposes, and partly so it
can cache information that only the rendering thread
needs to care about. Every frame, we take that
Scene Proxy and we generate a number
of Mesh Batches from it. The reason for this --
what is a Mesh Batch? It is basically
all of the high-level data that we need to figure out how to make a draw call
in a given pass. It decouples the Scene Proxy
from all of the details of the rendering implementation; the Scene Proxy does not need
to know about what passes it is going to render in,
it does not even need to know what passes potentially exist
in the renderer. We do not want it to. This is what gives us
the flexibility to implement new passes,
new algorithms, without having to go
and update every single Object that potentially
wants to render.
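As a rough illustration, the data a Mesh Batch carries looks something like this. This is a hypothetical, simplified sketch -- the engine's actual FMeshBatch holds considerably more -- but it shows the flavor of that high-level description:

```cpp
// Simplified sketch of a Mesh Batch -- not the engine's exact FMeshBatch.
// It describes *what* to draw, with no knowledge of which passes will use it.
struct FMeshBatchSketch
{
    const FVertexFactory*       VertexFactory;       // how vertex data is fetched
    const FMaterialRenderProxy* MaterialRenderProxy; // which Material to shade with

    struct FElement
    {
        const FIndexBuffer* IndexBuffer;
        uint32              FirstIndex;
        uint32              NumPrimitives;
    };
    TArray<FElement, TInlineAllocator<1>> Elements;  // usually a single element

    uint8  LODIndex;
    uint32 bCastShadow : 1;  // high-level flags that passes filter on
};
```

Now we have this Mesh Batch, and we now need to go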
and use something that we call a drawing policy
to take that Mesh Batch and convert it
into a set of RHI commands. This would be, for instance, we have a depth drawing policy
for rendering the pre-pass, we have a base pass
drawing policy for rendering the GBuffer,
and so on. This reads things
from the Mesh Batch, like the Vertex Factory
and the Material, and uses that to make decisions. Finally, the RHI command list is sort of a cross-platform set of commands for rendering, which the RHI implementation, the actual low-level API layer, turns into actual API calls. Our D3D11 implementation will turn these RHI commands into draw indexed primitive calls on D3D11, and corresponding calls
on OpenGL, on PS4, on Vulkan, and so on. I want to dig a little bit
more into this sort of part
of the renderer where we are taking Mesh batches and we are generating
RHI commands. The way this works
in 4.21 and earlier is, we traverse
the Static Mesh Draw List, or the Dynamic Meshes, and the first thing that the
drawing policy needs to do is, figure out what shader
do we render with? We use the Material,
we use the Vertex Factory, we use that to figure out
what shader to render with. Then we have to actually gather
all of the shader bindings. What are all the parameters
we need to draw for it? These are things
like the parameters that the artist has set
on a Material and its constant, things that are needed
for the Vertex Factory, information about the view. We need to bind all those
parameters. The other thing about
the Static Mesh Draw List is that it is prebuilt. We did some level of caching
in 4.21 and earlier, and so at add time, when a primitive is added to the scene, we would take the Mesh Batch and we would run it through
a part of the drawing policy, cache off some
of that information, and store it in
the Static Mesh Draw List. The Static Mesh Draw List
would actually contain all of these
semi-pre-processed Mesh batches for the entire scene, and then it would sort them
based on state so that we could minimize
the number of times we are changing shaders and
changing other rendering state, and then at runtime,
when we actually go to render the Mesh draw list,
it would actually go through and look at every single Static Mesh in the scene, looking up into a bit Array
for visibility, basically saying,
is this Mesh visible? Yes. Draw it.
Is this Mesh visible? No. Is this Mesh visible?
No. That means that a certain aspect
of drawing scaled with the entire size
of the scene, not just the number of Meshes
that were visible. That traversal
is relatively fast, so it works okay
for certain sizes of scenes. But once you get
to a certain point, it does not scale any longer. Another problem with it is, it is an entirely different
code path from dynamic drawing, so dynamic Mesh batches need
to go through the entire drawing policy
every frame. That means that we cannot
actually sort between the two. Anything that goes down
the static path, we cannot sort it with things
that go down the dynamic path. From a code point of view,
it is also kind of -- there is a lot of boilerplate
to deal with this. The static draw list
is templated by that drawing policy, and therefore, you end up with
a bunch of these drawing lists just sitting on the scene, and adding a new one is extra overhead. Long story short,
the Static Mesh Draw List prevents us from doing efficient
draw merging, partly because
the drawing policies are too tightly coupled
with the Static Mesh Draw List, and partly because of the way that we designed drawing policies: they are allowed to set shader parameters directly on the RHI
per draw call. That is something
that we need to avoid if we want to be able
to merge draw calls. Let us take a look at the new
Mesh drawing pipeline that we designed for 4.22. First of all,
we are getting rid of this traverse
Static Mesh Draw List, drawing policy bit,
and we are replacing it with what we call
Mesh Draw Commands. What is a Mesh Draw Command? A Mesh Draw Command
stores everything that the RHI needs to know in order
to issue a draw call. This is a full, stand-alone,
low-level description, so it has things like
the pipeline state Object, it has things
like shader bindings, it has things like
the index buffer that we need to render with.
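To make that concrete, here is a hypothetical, simplified sketch of the kind of data a Mesh Draw Command holds; the field names are illustrative, and the engine's actual FMeshDrawCommand carries more:

```cpp
// Illustrative sketch only -- not the engine's exact FMeshDrawCommand layout.
// Everything here is fully resolved, low-level state.
struct FMeshDrawCommandSketch
{
    FGraphicsMinimalPipelineStateId CachedPipelineId; // PSO: shaders + render state
    FMeshDrawShaderBindings         ShaderBindings;   // uniform buffers, textures, samplers
    FRHIIndexBuffer*                IndexBuffer;      // geometry to draw with
    uint32                          FirstIndex;
    uint32                          NumPrimitives;
    uint32                          NumInstances;     // > 1 after draw call merging
};
```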
It is completely stateless. There is no data here
that points you back to where this Mesh Draw Command
comes from; there is no context. That is really nice from a data
transformation perspective, because it means that as we go,
we are able to optimize, replace, merge these things
without having to worry about the implementation
of Mesh Batch, or the implementation
of a Scene Proxy. There is actually
a debug pointer on Mesh Draw Commands, so you can see where
the draw call originated from, which is really useful
for debugging. But we strip it out
in shipping builds, so you should never actually
depend on it for any code
that you are writing. Having this sort of stateless
description of what the RHI needs to render
to issue a draw call allows us
to do more aggressive caching, which is what
we are looking for. We can build
the Mesh Draw Command for static geometry at the time
we add it to the scene. I mentioned that the Static Mesh Draw List does some amount of caching; the Mesh Draw Command
does even more. It is basically
the entire execution of what the drawing policy
used to do, all cached in a single
Mesh Draw Command. This gets us closer
to what we are looking for, being able to have
the pipeline state Object,
and all of the shader bindings needed to draw anything
in the scene at the time it is added to the scene. The other thing is that
because this is purely state, we get robust draw call merging.
We can just look at the state, and if it is the same
across multiple draw calls, we can just combine them into
a single instance draw call. But more on that later. Getting back to sort of
the lifetime here, how does this actually -- what does this
actually look like? We have a Mesh Batch and we need to generate
Mesh Draw Commands for it. To do that, we have
this Mesh Pass Processor. The Mesh Pass Processor,
its purpose, again, is to build
the Mesh Draw Commands. It selects a shader,
collects all the bindings for rendering that draw call,
bindings from the pass, from the Vertex Factory,
from Materials, and so on. Now you need to write
a Mesh Pass Processor per pass that you want in your renderer. One of the nice things,
though, is that this is not templated
on the shader class, it is just a relatively
simple class to write. We use the same code path for
both dynamic and static draws, so there is no more duplication
of rendering code when you are
implementing a pass. What I want to do is take
a quick look at an example. We will take a look
at the Depth Pass. This is the simplest pass
in the renderer, probably, where we just want
to render the depth for every primitive
in the scene. There is a little bit
of copy-paste boilerplate when you are going to create
a new Mesh processor; this is about the extent of it,
so it is not too bad. This is basically
just setting up the things that FMesh
Pass Processor needs. Then the part
that you implement, the part that you really
care about is this. This is Add Mesh Batch. This is called
for each Mesh Batch in the scene that is visible. The two things you are doing
here, really, are filtering -- do I care about this
for this pass -- and selecting the shader. For instance,
for the Depth Pass, we are not going
to render translucency into the Depth Pass,
so we filter that out. We filter out anything that is
not actually going to render in the main pass, etc. In terms of selecting
the shader, we look at, in this case, whether it is
an opaque Mesh or not. If it is an Opaque Surface,
then we do some optimizations, in this case we can render with
the depth-only vertex shader, which is a bit faster
than otherwise. If it is masked, then we need
to evaluate the Material so that we can
do alpha testing. That goes through
a slightly different path. Here is sort of the second
part of selecting the shader, and finally, we call this Build Mesh
Draw Commands Function, which will gather the bindings.
This is actually code that is shared among
all Mesh Pass Processors, so you do not have to write
that function yourself. That is it.
You have this function, and this one is actually just
a convenience template function for choosing whether we are using
the depth-only shader or not. That is it.
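To give a feel for the shape of this code, here is a heavily condensed sketch of a depth-style AddMeshBatch. It follows the engine's naming conventions, but it is an illustration under assumptions, not the exact 4.22 source:

```cpp
// Heavily condensed illustration of a depth-style AddMeshBatch -- not the
// exact 4.22 source. Filter out what the pass does not draw, then pick
// shaders and hand off to the shared helper.
void FDepthPassMeshProcessor::AddMeshBatch(
    const FMeshBatch& MeshBatch,
    uint64 BatchElementMask,
    const FPrimitiveSceneProxy* PrimitiveSceneProxy)
{
    const FMaterial* Material =
        MeshBatch.MaterialRenderProxy->GetMaterial(FeatureLevel);

    // Filtering: translucent surfaces never write depth in this pass.
    if (IsTranslucentBlendMode(Material->GetBlendMode()))
    {
        return;
    }

    if (Material->GetBlendMode() == BLEND_Opaque)
    {
        // Fully opaque: a cheap position-only, depth-only shader will do.
        Process</*bPositionOnly=*/true>(MeshBatch, BatchElementMask,
                                        PrimitiveSceneProxy, *Material);
    }
    else
    {
        // Masked: the Material must be evaluated so we can alpha test.
        Process</*bPositionOnly=*/false>(MeshBatch, BatchElementMask,
                                         PrimitiveSceneProxy, *Material);
    }
    // Each Process<> specialization ends by calling the shared
    // BuildMeshDrawCommands(...) helper to select shaders and gather bindings.
}
```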
Depth drawing policy in 4.21 was somewhere around
642 lines of code. The Depth Pass Mesh Processor
is around 100. It is a lot simpler
to set up these new passes, a lot simpler to add them.
The Base Pass also went down from almost 1000 lines of code
to just over 200. I think all engineers
like deleting code, so having simpler code
is always a good thing. I want to take a closer look
at that function; I told you,
you do not need to write this, the Build Mesh Draw Commands. This is the one
that actually collects all of the shader bindings that
are needed for the draw call, and this is part of
where removing a lot of the boilerplate came from,
why implementing a new Mesh pass is simpler now
than it was before. The main change here is that
rather than directly grabbing all of the parameters that
we want to set for the shader, and directly asking RHI
to set them, you know, set this float4 at this offset --
instead, what we do is, we gather them into basically
a shader bindings table and we cache them off. After we have gathered
all of the Mesh Draw Commands, now we can sort them. This, as I was mentioning
earlier, we could never do state sorting
before between dynamic path Meshes and Static Meshes because they go down
different code paths. Now, because they are
all using the same data, it is really simple for us to just sort all of the Mesh
Draw Commands. Finally, we submit them. Now we have a list
of Mesh Draw Commands, and all we need to do
is iterate that visible list, and generate the RHI commands
for rendering it this frame. The nice properties of
Submit Mesh Draw Commands: it scales only with the number of visible primitives, so we no longer have this property where traversing the Static Mesh Draw List would have some cost for the entire set of Meshes in the scene, not just the visible ones. It is also easier to parallelize
than the Static Mesh Draw List was for two reasons; one, it is easier to split it
into an even number of tasks. When you are generating
parallel tasks, parallel jobs, you want to be able to generate
roughly even amounts of work; you do not want one job that takes 10 times longer
than another job, because that is going to end up
being your critical path. In this case, because
it is just a flat list that we can divide evenly, it is really easy. If there are 5,000
visible Meshes, we will break it into groups
of 100 or 200, and send those off
to task threads to process. With the Static Mesh Draw List,
we could not really do that. We had to have some notion
of which one is going to look at the first 100 or 200
or 300 Meshes in the frame, and then it would go over
the bit Array, and only draw what is visible
within those areas. There is kind of some
complicated code that would, based on the previous
frame's visibility, try to figure out how to break
up tasks to make them even. It worked okay in practice, but it was a lot
more complicated than just a simple parallel for. Also, all we are doing is taking
these Mesh Draw Commands, which are just data, and writing
into an RHI command list, which is just a data stream. It is completely side-effect-free; we are not touching or modifying global state, so there is no worry about race conditions.
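As a schematic sketch of what that parallel submission loop can look like -- SubmitMeshDrawCommand here is an assumed helper, and the exact task setup differs in the engine:

```cpp
// Schematic sketch of parallel submission -- not the engine's actual code.
// The visible list is a flat array, so it splits into even tasks trivially;
// each task records into its own RHI command list, so nothing is shared.
void SubmitVisibleCommandsParallel(
    const TArray<FVisibleMeshDrawCommand>& VisibleCommands,
    TArray<FRHICommandList*>& PerTaskCmdLists) // one pre-created list per task
{
    const int32 NumTasks = PerTaskCmdLists.Num();
    const int32 CommandsPerTask =
        FMath::DivideAndRoundUp(VisibleCommands.Num(), NumTasks);

    ParallelFor(NumTasks, [&](int32 TaskIndex)
    {
        const int32 Start = TaskIndex * CommandsPerTask;
        const int32 End = FMath::Min(Start + CommandsPerTask, VisibleCommands.Num());
        for (int32 Index = Start; Index < End; ++Index)
        {
            // Pure data in, RHI commands out: no global state is touched.
            SubmitMeshDrawCommand(VisibleCommands[Index], *PerTaskCmdLists[TaskIndex]);
        }
    });
}
```

Another thing that we do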
in Submit Mesh Draw Commands is, we do state
filtering above the RHI. This reduces the load,
the amount of time that we have to spend
executing the command list. I think in Fortnite we have seen maybe a 20 percent
savings by doing this alone. This also means that
by doing it at this level, there is less burden on the
per-platform implementations, to make sure that they are doing
very fine-grained state caching. It is also cache-coherent. This is just one big,
flat Array of data. Therefore,
since everything we need is in this contiguous
memory space, we are not hitting
a lot of cache misses during rendering anymore. We are able to just
run through this flat Array. We get a lot of benefit out of
pre-fetching from the CPU, and everything
goes really quickly. All right, so we have done
all this work to refactor how we are taking primitives
and generating Mesh batches and generating
Mesh Draw Commands, and so on. Now let us get to
the caching part of things. Again,
the majority of the Worlds that we are dealing with
are modular Static Meshes, so we should not be doing
all of the work, every frame. What we want to do is
just cache these on the scene, and then just select the
right ones to draw every frame. This means that anything
referenced by Mesh Draw Commands cannot change frequently,
because if it does change, then we need to invalidate it. What we needed to do
was use a level of indirection to make sure that
that was the case. For instance, let us take a look
at per view parameters. These are values
that change every single frame. The way that the renderer
used to work, every frame, we would gather
all of that data, we would
make one Uniform Buffer, the View Uniform Buffer. Then as we are issuing
draw calls, we would bind
the View Uniform Buffer that we created uniquely
for that frame and issue the draw call.
That is not going to work, because it means
that we would have to invalidate every Mesh Draw Command every frame
to update that binding. Instead, we keep the binding
stable. We create a single
View Uniform Buffer that is always bound
to every Draw Command, and then every frame when we are
about to render a view, we update the contents of
the View Uniform Buffer instead. The indirection
remains constant, even though the data
in the buffer ends up changing. In order to support that,
the RHI now supports a code path for updating the contents
of a Uniform Buffer dynamically, which it did not previously.
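A minimal sketch of that pattern, assuming the update-in-place helper added alongside this refactor:

```cpp
// Sketch of the "stable binding, mutable contents" pattern. The buffer is
// created once, and every cached Mesh Draw Command binds this same buffer,
// so the binding itself never has to be invalidated.
TUniformBufferRef<FViewUniformShaderParameters> PersistentViewUB;

void InitPersistentViewUniformBuffer()
{
    FViewUniformShaderParameters InitialContents;
    PersistentViewUB = TUniformBufferRef<FViewUniformShaderParameters>::
        CreateUniformBufferImmediate(InitialContents, UniformBuffer_MultiFrame);
}

void BeginRenderingView(const FViewUniformShaderParameters& FrameContents)
{
    // Overwrite the contents in place each frame instead of creating a
    // new buffer; cached commands stay valid because the binding is stable.
    PersistentViewUB.UpdateUniformBufferImmediate(FrameContents);
}
```

Taking a look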
at the type of data that we reference
from Draw Commands, and how we organize them
into Uniform Buffers, we have the Primitive Uniform
Buffer for instance -- this is where we store all data that is unique
to that primitive, things like the local-to-World transform. Then we have
the Material Uniform Buffer, this is where everything
unique to the Material, things like parameters that artists are setting
in Material Instance Constants, color tints and different
Textures, and so on. We have Material parameter
collections, which are unique to the scene. These are data that can be
accessed by a Material, but they are global
for the entire scene. The precomputed lighting Uniform
Buffer, for accessing data like, what lightmaps we should be
referencing, what our offsets
are into those lightmaps. Those tend to be per primitive
as well. Then we have
Pass Uniform Buffers. This is any data
that a particular pass needs to render, but that is not shared
among different passes. The depth pre-pass might need
different bindings than the Base Pass,
for instance. Once we have moved all that
data into those Uniform Buffers, now we had to make sure
that any time something really does need to change, we invalidate
the Mesh Draw Commands. This is really important,
because if you screw this up, you are going to get incorrect
rendering. Fortunately, most of this
was already handled, because the caching done
in Static Mesh draw lists cached sort of the first part
of the evaluation that we now do to generate
Mesh Draw Commands, so all of those cases
are still relevant. Whenever we would invalidate
a Static Mesh Draw List before, we now invalidate
Mesh Draw Commands. But it is not quite everything, because now we are caching
the shader bindings as well, which we were not, previously.
Therefore, every time that something
would change in the scene that would cause the shader
bindings to change, we need to remember
to invalidate cached Mesh Draw Commands. We have added a validation mode;
it adds quite a bit of overhead, so it does not run
most of the time. But if you are running into
a bug that you think is because cache invalidation
is not working correctly, you can enable this, and this will assert
when it is screwed up, which is really,
really useful for finding out what is actually going wrong. We used this a lot as we were
bringing up this code path, and looking for corner cases
and bugs that we were trying
to track down. It is very useful.
If, after modifying the renderer, you start seeing something like flickering, or some unexplained crash, this would be a good
first step for debugging. Vertex Factories, if you are
familiar with those, are basically how we generate the vertex data in the vertex shader for drawing, and how we get that data from the CPU to the GPU. Right now, we only support
caching for Vertex Factories that do not need to update
their bindings per view. For instance,
the local Vertex Factory -- and the local Vertex Factory is
what we use for Static Meshes, and we use it for
a couple of other cases. It is by far the most
commonly used Vertex Factory in the Engine. Other Vertex Factories which
do need to change per view, we do allow them
to still cache Mesh batches, so some of the work is still
cached, but not all of it. These are things like landscape
that the LOD computations are a bit more complicated
than with Static Meshes, and so need to be
evaluated per view; BSP, instancing,
stuff like that. This means we have
a couple of different Caching Code Paths
in terms of efficiency. We have the least efficient
path, which we call
dynamic relevance. This would be cases like Cascade
particles, or Niagara particles. These change every frame;
we get new data pushed to the rendering thread
from the game thread. Every frame, the Scene Proxy
needs to generate a new set of Mesh Batches. We need to take
those Mesh Batches, generate a new set
of Draw Commands, and then take those
Draw Commands and finally generate
the RHI commands for them. For static primitives, where the game code is not sending us new data but we still need to change how we create the Draw Commands based on the view, we do a little bit more work. We cache the Mesh Batches, but then every frame,
we take the Mesh Batches, generate the Draw Commands,
and then those Draw Commands will get turned
into RHI commands. Then we have
our most efficient path, the path used
by Static Meshes, for instance,
where we have cached the Mesh Draw Commands
at scene add time. Then at runtime
we just need to find -- we gather all of those visible
Mesh Draw Commands, and we issue
RHI commands for them. High-level overview of a
frame with caching enabled -- when we add a primitive
to the scene, if it is static,
we cache the Mesh Batches. Then if the Vertex Factory
does not need the view, if the Vertex Factory
supports Draw Command caching, then we also take
those Mesh Batches and we generate
Mesh Draw Commands at that time
and cache those. There are a couple of cases -- for instance, when we change the skylight, that is going to change shader bindings for pretty much everything in the scene -- where we invalidate the cached Mesh Draw Commands. When it comes time
to render the scene, we will need to go over
all of the cached batches and regenerate
those Mesh Draw Commands. Then each frame,
when we are rendering, we figure out
all of the primitives that are visible
in the scene. If they are static,
we compute their LOD, and we add the cached Mesh
Draw Commands to a visible list. If they are dynamic,
then we will go ahead, we will gather
their Mesh Batches, we will generate
Draw Commands from them, and then we take
those Draw Commands and add them
to the same visible list. After that point, after InitViews, every subsequent pass only needs to look at this one list of visible Mesh Draw Commands and draw them.
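Put together as pseudo-code, the per-frame gather looks roughly like this; the helper names (SupportsCachedMeshDrawCommands, ComputeLOD, GetCachedCommand, BuildDynamicDrawCommands) are hypothetical:

```cpp
// Schematic sketch of the per-frame gather described above -- illustrative,
// not the engine's actual InitViews code.
void GatherVisibleMeshDrawCommands(
    const FScene& Scene,
    const TBitArray<>& VisibilityMap,                 // one bit per primitive
    TArray<FVisibleMeshDrawCommand>& OutVisibleCommands)
{
    for (int32 PrimitiveIndex = 0; PrimitiveIndex < Scene.Primitives.Num(); ++PrimitiveIndex)
    {
        if (!VisibilityMap[PrimitiveIndex])
        {
            continue;
        }
        const FPrimitiveSceneInfo* Primitive = Scene.Primitives[PrimitiveIndex];
        if (Primitive->SupportsCachedMeshDrawCommands())  // static path
        {
            // Pick the cached command for the computed LOD and append it.
            const int32 LODIndex = ComputeLOD(*Primitive);
            OutVisibleCommands.Add(Primitive->GetCachedCommand(LODIndex));
        }
        else                                              // dynamic path
        {
            // Generate Mesh Batches and Draw Commands on the fly, appending
            // to the very same list so both paths sort and merge together.
            BuildDynamicDrawCommands(*Primitive, OutVisibleCommands);
        }
    }
}
```

All right, so now we have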
this nice, data-oriented layout
of Mesh Draw Commands, and we set all this up so
we could do draw call merging. What does that look like? On D3D11, we are using instanced draws. That means that the only thing that can change between merged draws
is the InstanceID; we cannot change shaders, we cannot change bindings --
nothing else. But it does mean that we can
do Dynamic Instancing. It is pretty easy
to implement at this point, with the architecture
that we now have. In the future, on D3D12,
for instance, Execute Indirect allows us to change state
in between instance draw calls. That is going
to actually allow us to take these big lists
of Mesh Draw Commands and make far, far fewer API
calls in order to render them. Well, let us take
a closer look at -- I should mention
that we have not actually done the D3D12 path yet. That will be a future endeavor
that we will embark on. For now we have done the
D3D11-style Dynamic Instancing. Let us take a closer look
at Dynamic Instancing. Now that we have these
Mesh Draw Commands, we can robustly merge them, because again, they are
just low-level RHI state. All we need to do is compare
the data that exists on there, and if they are all the same,
then we can merge the draw calls and change it into
an Instance draw call. This has the really
nice property of, content creators do not have
to worry about it; they do need to go and create
a special Instance Static Mesh and try to combine them, and worry about,
what about culling, and how do I balance draw calls
versus culling efficiency, and so on?
It just works. There is one downside,
of course. Assigning those state
buckets is slow. Doing all those comparisons
does take a good amount of time, and we do not want
to do that every frame. Again, let us cache this at scene add time. When we generate
the Draw Command to begin with, at that point,
what we can do is, we look up into
a cache of state buckets and see which state bucket this Draw Command belongs to. Then this lets us group,
early on, which Draw Commands
can be merged together. Then the actual
merging operation is just a data transformation
on the visible list. We have this big list
of visible draw calls, and any command
in the same bucket, we just replace that
with a single Mesh Draw Command. We have our sorted list of Mesh Draw Commands, and as we go through them, we look up which state bucket each one is in. As long as consecutive commands are in the same state bucket, we keep merging them into a single instanced draw call. The output of the merging pass is just the smaller list of merged Mesh Draw Commands.
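Here is a schematic sketch of that merging pass; the types are illustrative, assuming a state bucket ID assigned to each command at scene add time:

```cpp
// Schematic sketch of the merging pass -- illustrative types, not the
// engine's actual structures. Merging is one linear pass over the sorted list.
struct FVisibleCommandSketch
{
    int32 StateBucketId;   // identical RHI state => same bucket
    int32 NumInstances;    // instance count after merging
    // ...plus the underlying Mesh Draw Command data
};

void MergeDrawCommands(const TArray<FVisibleCommandSketch>& SortedVisible,
                       TArray<FVisibleCommandSketch>& OutMerged)
{
    int32 Index = 0;
    while (Index < SortedVisible.Num())
    {
        FVisibleCommandSketch Merged = SortedVisible[Index];
        Merged.NumInstances = 1;

        // Consecutive commands in the same bucket collapse into a
        // single instanced draw call.
        while (Index + Merged.NumInstances < SortedVisible.Num() &&
               SortedVisible[Index + Merged.NumInstances].StateBucketId
                   == Merged.StateBucketId)
        {
            ++Merged.NumInstances;
        }
        OutMerged.Add(Merged);
        Index += Merged.NumInstances;
    }
}
```

Now that we can merge Mesh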
Draw Commands, that is great. How do we make sure they are
as effective as possible? One of the things
that we needed to do was get rid of any sort of
per draw bindings that would have broken up
Draw Commands into multiple sub-commands. For instance, anything
that was going to be bound per pass,
per primitive, we really need to make sure
that does not happen. In order to support this,
what we did was, we created a pass
Uniform Buffer frequency, like a per pass place
to store data. For instance, things like
in the Base Pass, we potentially sample
the DBuffer Textures; we potentially sample
the eye adaptation Texture, stuff like that,
the fog parameters -- those things do not change for
the entirety of the Base Pass. Why set them on every single draw call? We just set up a single Uniform Buffer for the Base Pass for this frame, and we render with it. Then we have some remaining
per draw call bindings. We have the Global Constant
Buffer values, we have the Primitive
Uniform Buffer. This changes on every single
primitive that is in the World. If we do not change the way
we store primitive uniform data in some way, this is trivially
going to break apart all of our merge draw calls; we are not going to be able
to merge anything. Precomputed Lighting --
so again, this is the lightmaps that the primitive
needs to render with, and the distance cull data,
basically. As you are distance
culling or LOD fading, different primitives
can be in a different state, and so that can break
merging effectiveness. What we want to do
instead is, we want to upload
all of this data into a single scene-wide buffer
that we can then look up into, based on the InstanceID
in the shader. That way, when we merge
the draw calls, it does not matter if we render
them individually or merged; no matter what, we are going
to be able to look up and get that data without changing
the bindings for the shader. In order to support this,
there were a number of cases where we needed
to do something like this. We implemented
sort of a generic -- think of it like a GPU TArray implementation, a dynamically resizable Array on the GPU -- where when you want to insert data from the CPU, we track adds, updates, and removes. Then before we render, we launch
a couple of compute shaders to shuffle data around
as needed to make this possible. We did not want to just completely upload the entire primitive buffer every frame, so this lets us track just the deltas and perform only those operations.
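A rough sketch of that delta-tracked upload, with hypothetical helper names:

```cpp
// Rough sketch of the delta-tracked upload described above -- hypothetical
// names, not the engine's actual GPU-scene code.
class FGpuSceneBufferSketch
{
public:
    void MarkDirty(int32 PrimitiveIndex) { DirtyIndices.Add(PrimitiveIndex); }

    void FlushBeforeRender(FRHICommandList& RHICmdList)
    {
        // Copy only the dirty entries into a small upload buffer, then
        // dispatch a compute shader that scatters them into the persistent
        // scene-wide buffer -- instead of re-uploading everything.
        for (int32 PrimitiveIndex : DirtyIndices)
        {
            StageDirtyEntry(PrimitiveIndex);    // hypothetical helper
        }
        DispatchScatterShader(RHICmdList);      // hypothetical helper
        DirtyIndices.Reset();
    }

private:
    void StageDirtyEntry(int32 PrimitiveIndex);
    void DispatchScatterShader(FRHICommandList& RHICmdList);

    TSet<int32> DirtyIndices;  // adds, updates, and removes since last flush
};
```

One result of this is,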
when you are accessing the data in a shader,
you need to use the GetPrimitiveData() helper.
Before, you could just directly access
the Primitive Uniform Buffer, but now we are going through
this indirection where we need the PrimitiveID. Anything that was going
to access the data needs to go through this GetPrimitiveData()
helper function. Again, this is only used
by supporting Vertex Factories, and it is abstracted by this GetPrimitiveData() method,
so it is not too bad. But how do we actually get
the PrimitiveID in the shader? All we have is the InstanceID. We could do something
like have a buffer that we bind per
draw call with the primitive IDs,
and index into that. But adding a single
Global Constant Buffer value per draw actually increased
the Base Pass drawing time by 20 percent. We did not want to eat
that overhead of just having to update a single parameter
every draw call. Fortunately, the input assembler
has a much faster path for it. We can just call SetStreamSource
with a dynamic offset. This lets us set up
the PrimitiveID vertex input at per-instance frequency, and then after
draw call merging, we build the PrimitiveID buffer out of the list of Mesh Draw Commands.
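A small sketch of the idea -- the buffer and stream slot here are assumptions for illustration, but SetStreamSource with an offset is the RHI call in question:

```cpp
// Sketch of the per-instance PrimitiveID stream idea -- hypothetical
// buffer names and stream slot, used only for illustration.
void BindPrimitiveIdStream(FRHICommandList& RHICmdList,
                           FRHIVertexBuffer* PrimitiveIdsBuffer,
                           uint32 FirstInstanceOffsetInBytes)
{
    // Stream slot fetched at per-instance frequency by the vertex shader.
    const uint32 PrimitiveIdStreamIndex = 1; // assumed slot for illustration

    // Re-pointing the stream with a byte offset is far cheaper than
    // updating a constant buffer value on every draw call.
    RHICmdList.SetStreamSource(PrimitiveIdStreamIndex,
                               PrimitiveIdsBuffer,
                               FirstInstanceOffsetInBytes);
}
```

All right, so we did all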
this work to merge draw calls. Let us take a look
at the result; what does that
actually look like? I have this map
called a GPU Perf Test. It is actually -- we just
grabbed one of the cities from Fortnite's Save the World,
probably a year or two ago, and used this as just
a static representation that we could use to iterate on
and test rendering performance. Initially, it was primarily intended for GPU performance testing, but it was basically
a way for us to have a static profiling set
for optimizing Fortnite. Taking a look at this,
looking at the Depth Pass, we were able to reduce
the number of draw calls with merging by more than 2X, and same thing with Base Pass
around a factor of 2. Depth Pass ends up being
a little more efficient, just because there are
a lot more cases where you can share
the same shader for Meshes that are not alpha-masked, and can just use
the depth-only shaders. But that was not nearly enough to really tell us much; 2X is good, but we really wanted to see how it scaled. We did the programmer
art version of trying to make
a more complex scene. We took the scene and
we duplicated it three times. Then we removed distance culling to just render things
really far out. The result of that: we ended up drawing around 7,400 draw calls per frame. We were able to,
with draw call merging, reduce the Depth Pass by about
10X in terms of draw calls, and the Base Pass
by closer to 5X. Now all that said,
let me caveat that with, this was,
by far, an optimal case. We literally just
copy-pasted content, so it was pretty much perfect
for Dynamic Instancing. But it is sort of proving that if you do make
a modularized static scene, then this code can do
a really good job of optimizing that
on the back end. Some performance
gotchas to keep in mind -- things that will break
draw call merging -- lightmaps that make
small Textures. Basically, if you are using
static lighting and a lot of your primitives
end up in different lightmaps, then we are not going to be able
to merge draw calls where they do not use
the same lightmap. You can tweak the radius
that we use to pack primitives into the same lightmap Texture,
and that can help with it. It is kind of a trade-off
between Texture streaming efficiency
from memory of lightmaps, and draw call efficiency of how
much you are able to merge. Using vertex painting
on instances in the World will also
break draw call merging, because those exist in their
own independent vertex buffers; they need to be bound
separately. The Speed Tree Wind Node -- again, if you use
the Speed Tree Wind Node, we do not support merging. There is no good
technical reason for that, we just did not get to it. That could be improved
in the future. If you are still using
the Legacy Sparse Volume lighting samples
for lighting dynamic Objects with static lighting --
that will not work, either. However, the newer
volumetric lightmaps work perfectly well
with draw call merging. All right, so before, we were
looking at just draw calls; how well did merging work in terms of reducing
the number of draw calls. But what does that actually mean
in terms of milliseconds, because that is what we actually
care about at the end of the day? We tested it
on PlayStation 4, it is just nice to be able
to use a console platform where we do not have to worry
about the OS getting in the way, we do not have to worry
about differences in hardware, just a very consistent platform
to be testing on. With the base GPU Perf Test,
with the old code in 4.21, the Depth Pass was running
in about 2 milliseconds, and the Base Pass, 3.2. The new Mesh
Draw Command pipeline with Dynamic Instancing, depth drawing
to 0.3 milliseconds, and Base Pass, 0.4 milliseconds,
so much, much faster, six to seven times faster. In our programmer
art larger scene, the old path was running at around 15.7 milliseconds
for the Depth Pass, and 27.8 milliseconds
for the Base Pass, again, issuing about 7,400 draw calls.
The new path is now running in 1.2
milliseconds for the Depth Pass, and 2.4 for the Base Pass --
so massively faster, right? More than 10 times faster.
But let me caveat that. In this test case,
it worked awesome. These are, by far,
best case results. For one, this is
a heavily modular scene with very limited Mesh
and shader variety. Also, the speedup I was showing, that is only the Mesh
drawing part of the code, and that may or may not be
on your critical path for a variety of reasons. For instance, I am not talking
about init views, which is the visibility part, so if visibility is still
dominating your draw time, then taking the Mesh drawing and making that smaller
might not help you. Another reason: we were already parallelizing the renderer fairly heavily. If you have
a lot of task threads that are not doing
other useful work, we were already using them very effectively. Even though that work goes away,
you still might not see an actual frame
rate benefit from it. Also, your content, as well as ours -- what we have been shipping in Fortnite and all of our demos and other Projects -- is optimized for the previous renderer, so our artists were very careful to make sure that they kept draw
call counts down. These benefits really start
to show up as your scene scales up by factors of three, four,
five of what it is today. Some casualties
from the change, things that no longer work -- all of the deferred primitive
update mechanisms that were in the renderer
for efficiency sake -- those had to go away, because they are just
incompatible with the idea of caching all of that data
when we create the scene. These are things like,
if a primitive moves but it is not visible, we would never bother updating
the Primitive Uniform Buffer for that primitive.
We cannot do that anymore, because we need to update
some GPU data so that DXR can trace
a ray to it, or so that we can build
a Mesh Draw Command that can reference it. Also, if you have any Materials that are using
custom expressions, they can pretty much type
whatever HLSL code they want, and they might be grabbing data from the
Primitive Uniform Buffer. If they were, you need
to make sure that you go in and change
those custom nodes to use the GetPrimitiveData()
accessor function. This is something
where if you update to 4.22, and some of your Materials
are not compiling anymore, there is a good chance
that this is why. Go and look
for custom expressions that might be breaking
the new rules. The forward renderer can now
only support a single global planar
reflection, because, again, to do multiple
global planar reflections, we would have had to take into
account per view information, in this case. I already talked a little bit
about this, but current UE4 Projects
may or may not see benefits. I just want to caveat,
you might go home, you try 4.22, and you are, like,
where is that 10X speedup? Well, here is why you might not
be getting a 10X speedup just by updating to 4.22. But that said, we do have
a couple of testimonials, so when we release
the preview release, we had some developers
on Twitter who were talking. Joe Wintergreen mentioned
that in his Project, 4.22 ended up saving
about 1000 draw calls. The Shores Unknown guys
were letting us know that they had a scene that was around 18,000 draw calls; with 4.22, it is now around 2,300. So their scene
that was running at 30 FPS is now running at 60
FPS, just by updating to 4.22. Depending on
how you build your scene, you might get a massive speedup,
you might not. But at the end of the day, I think we have managed
to allow the renderer to scale much better
with a larger set of draw calls, and so going forward, you will be able to build
larger, more complex scenes. There was one caveat that I did
forget to mention around draw call
merging as well. That is, in 4.22, we do not support draw
call merging on mobile. There is no technical reason
for that, we just did not
get to it in time for 4.22. We will have it in for 4.23. The simple reason for that is,
there are different passes in the mobile renderer
versus the deferred renderer, and we needed to go in and do
some of the same transformations of pulling out data that
we were binding per draw call, and make sure that we put them
into per pass buffers, or put them into
global scene buffers that we can then access
per primitive. That is it.
That is an explanation of the new Mesh
drawing pipeline in UE4. Thank you for
coming out to the talk; I hope it was useful. I hope you learned a little bit
about why we did it. I hope when you update to 4.22, you do get a massive
improvement in frame rate. If not, add more Meshes to your
scene. Thanks. [Applause] ♫ Unreal logo music ♫