>>Nick Penwarden: There we go. All right,
a little technical difficulty. We are good to go now, though.
Good morning, everyone. Thanks for coming.
My name is Nick Penwarden. I am the Director
of Engineering for Unreal, and this morning,
I am going to be talking about some of the work that our
rendering team has been up to. This is a presentation
that we showed at GDC that Marcus Wassmer gave,
and I wanted to present it here to be able to give the same
information to you guys. Over the past year,
the rendering team have worked on a large refactor of the Mesh
Drawing Pipeline in UE4. I want to talk about
why we did it, what the benefits are, some of the benefits
that we have seen, and give you kind of an idea
of what you will need to do to upgrade
any rendering modifications that you have made to Unreal
when you update to 4.22. Talk a little bit about
where we are taking things. Why did we spend a year refactoring the Mesh
drawing code in UE4? One reason is that, I think if you were
at the keynote yesterday, I was talking about
how more people are making open world games. We want to really focus
on allowing people to build ever
more complex scenes, so supporting huge view ranges,
higher detail, stuff like that. When you are building a World,
we want to make that easier. You tend to build it modularly,
you pull in individual Meshes. You use that
to construct your World. We do not want artists
to have to worry as much about taking individual Meshes
and going and merging them to reduce draw call counts
for optimization purposes. Going forward,
dynamic lighting and shadowing are going
to become ever more important as games become more
and more dynamic. Those additional passes
are going to require drawing Meshes multiple times
for multiple viewpoints. Also, upcoming technologies
need more information about the scene
during evaluation. For instance, take DXR --
in DXR, when you shoot a ray, you can hit something
in the scene that might not be visible
in the primary view. When that happens,
you need to be able to evaluate the surface shader
for that triangle and return the data back to
the shader that cast that ray. That means that we need
all the information about the scene on the GPU, even if it was invisible
this frame. Traditional
rendering pipelines, the way UE4 used to work, the way most game engines
work -- that information is not
usually available on the GPU, because each frame,
based on the view, you only give the GPU
the data that it needs to render exactly what is visible.
Moving forward, as we want to move more
and more of the computation for rendering off of CPU
and onto the GPU, that means that the GPU
is going to need more and more information
about the scene to be able to make the decisions
that it needs to set up all the command buffers
needed for rendering. How do we get there? We started by looking
at draw call merging. What would we need to do to take multiple batches of draw calls and merge them down, to reduce the number of API calls that we make per frame? Looking at something
like D3D11, the only option for doing that,
really, is instanced drawing. In order for that to work, the only data
that you have per draw call that is going to be different
for different Objects in the scene
is the InstanceID. That means in order
to make that work, we can no longer
be setting shader parameters for every single draw call. Therefore, what we need to do is be able to have that data
all available on the GPU, and be able to access it using
only the PrimitiveID. Another way for us to get to
the point where we are doing less work
on the CPU as we are rendering is with more aggressive caching;
so things are not changing, making sure that we are
not doing duplicate work. A lot of the World
does not change every frame; it is very temporally coherent. Static scene draws, we want
to be able to build those when they are added
to the scene. Then only when they move
or update in a meaningful way should we invalidate that data. By doing more aggressive
caching, this allows the platform,
the RHI layer, as we call it, to prebuild as much as possible; things like the shader
binding table entry that I mentioned for DXR, and the graphics pipeline state
for DX12, for Vulkan, etc. Again, the goal here is to remove
all of the renderer overhead in building that static geometry
every single frame. I wanted to start by talking
about how Mesh drawing works in UE4, in 4.21 and earlier. Okay, so we start with
the Primitive Scene Proxy, and the Primitive Scene Proxy
is sort of the rendering thread's copy of the Primitive Component.
The Primitive Component, that is what lives
on the game thread, that is what game
code works with, that is what you move around
as Objects move in your game. The renderer has this copy
of the data, partly for
thread safety purposes, and partly so it
can cache information that only the rendering thread
needs to care about. Every frame, we take that
Scene Proxy and we generate a number
of Mesh Batches from it. The reason for this --
what is a Mesh Batch? It is basically
all of the high-level data that we need to figure out how to make a draw call
in a given pass. It decouples the Scene Proxy
from all of the details of the rendering implementation; the Scene Proxy does not need
to know about what passes it is going to render in,
it does not even need to know what passes potentially exist
in the renderer. We do not want it to. This is what gives us
the flexibility to implement new passes,
new algorithms, without having to go
and update every single Object that potentially
wants to render.
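As a rough illustration, the data a Mesh Batch carries looks something like this. This is a hypothetical, simplified sketch -- the engine's actual FMeshBatch holds considerably more -- but it shows the flavor of that high-level description:

```cpp
// Simplified sketch of a Mesh Batch -- not the engine's exact FMeshBatch.
// It describes *what* to draw, with no knowledge of which passes will use it.
struct FMeshBatchSketch
{
    const FVertexFactory*       VertexFactory;       // how vertex data is fetched
    const FMaterialRenderProxy* MaterialRenderProxy; // which Material to shade with

    struct FElement
    {
        const FIndexBuffer* IndexBuffer;
        uint32              FirstIndex;
        uint32              NumPrimitives;
    };
    TArray<FElement, TInlineAllocator<1>> Elements;  // usually a single element

    uint8  LODIndex;
    uint32 bCastShadow : 1;  // high-level flags that passes filter on
};
```

Now we have this Mesh Batch, and we now need to go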
and use something that we call a drawing policy
to take that Mesh Batch and convert it
into a set of RHI commands. This would be, for instance, we have a depth drawing policy
for rendering the pre-pass, we have a base pass
drawing policy for rendering the GBuffer,
and so on. This reads things
from the Mesh Batch, like the Vertex Factory
and the Material, and uses that to make decisions. Finally, the RHI command list is sort of a cross-platform set of commands for rendering, which the RHI implementation, the actual low-level API layer, turns into actual API calls. Our D3D11 implementation will turn these RHI commands into draw indexed primitive calls on D3D11, and corresponding calls
on OpenGL, on PS4, on Vulkan, and so on. I want to dig a little bit
more into this sort of part
of the renderer where we are taking Mesh batches and we are generating
RHI commands. The way this works
in 4.21 and earlier is, we traverse
the Static Mesh Draw List, or the Dynamic Meshes, and the first thing that the
drawing policy needs to do is, figure out what shader
do we render with? We use the Material,
we use the Vertex Factory, we use that to figure out
what shader to render with. Then we have to actually gather
all of the shader bindings. What are all the parameters
we need to draw for it? These are things
like the parameters that the artist has set
on a Material and its constant, things that are needed
for the Vertex Factory, information about the view. We need to bind all those
parameters. The other thing about
the Static Mesh Draw List is that it is prebuilt. We did some level of caching
in 4.21 and earlier, and so at add time, when a primitive is added to the scene, we would take the Mesh Batch and we would run it through
a part of the drawing policy, cache off some
of that information, and store it in
the Static Mesh Draw List. The Static Mesh Draw List
would actually contain all of these
semi-pre-processed Mesh batches for the entire scene, and then it would sort them
based on state so that we could minimize
the number of times we are changing shaders and
changing other rendering state, and then at runtime,
when we actually go to render the Mesh draw list,
it would actually go through and look at every single Static Mesh in the scene, looking up into a bit Array
for visibility, basically saying,
is this Mesh visible? Yes. Draw it.
Is this Mesh visible? No. Is this Mesh visible?
No. That means that a certain aspect
of drawing scaled with the entire size
of the scene, not just the number of Meshes
that were visible. That traversal
is relatively fast, so it works okay
for certain sizes of scenes. But once you get
to a certain point, it does not scale any longer. Another problem with it is, it is an entirely different
code path from dynamic drawing, so dynamic Mesh batches need
to go through the entire drawing policy
every frame. That means that we cannot
actually sort between the two. Anything that goes down
the static path, we cannot sort it with things
that go down the dynamic path. From a code point of view,
it is also kind of -- there is a lot of boilerplate
to deal with this. The static draw list
is templated by that drawing policy, and therefore, you end up with
a bunch of these drawing lists just sitting on the scene, and adding a new one is extra overhead. Long story short,
the Static Mesh Draw List prevents us from doing efficient
draw merging, partly because
the drawing policies are too tightly coupled
with the Static Mesh Draw List, and partly because of the way that we designed drawing policies: they are allowed to set shader parameters directly on the RHI
per draw call. That is something
that we need to avoid if we want to be able
to merge draw calls. Let us take a look at the new
Mesh drawing pipeline that we designed for 4.22. First of all,
we are getting rid of this traverse
Static Mesh Draw List, drawing policy bit,
and we are replacing it with what we call
Mesh Draw Commands. What is a Mesh Draw Command? A Mesh Draw Command
stores everything that the RHI needs to know in order
to issue a draw call. This is a full, stand-alone,
low-level description, so it has things like
the pipeline state Object, it has things
like shader bindings, it has things like
the index buffer that we need to render with.
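To make that concrete, here is a hypothetical, simplified sketch of the kind of data a Mesh Draw Command holds; the field names are illustrative, and the engine's actual FMeshDrawCommand carries more:

```cpp
// Illustrative sketch only -- not the engine's exact FMeshDrawCommand layout.
// Everything here is fully resolved, low-level state.
struct FMeshDrawCommandSketch
{
    FGraphicsMinimalPipelineStateId CachedPipelineId; // PSO: shaders + render state
    FMeshDrawShaderBindings         ShaderBindings;   // uniform buffers, textures, samplers
    FRHIIndexBuffer*                IndexBuffer;      // geometry to draw with
    uint32                          FirstIndex;
    uint32                          NumPrimitives;
    uint32                          NumInstances;     // > 1 after draw call merging
};
```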
It is completely stateless. There is no data here
that points you back to where this Mesh Draw Command
comes from; there is no context. That is really nice from a data
transformation perspective, because it means that as we go,
we are able to optimize, replace, merge these things
without having to worry about the implementation
of Mesh Batch, or the implementation
of a Scene Proxy. There is actually
a debug pointer on Mesh Draw Commands, so you can see where
the draw call originated from, which is really useful
for debugging. But we strip it out
in shipping builds, so you should never actually
depend on it for any code
that you are writing. Having this sort of stateless
description of what the RHI needs to render
to issue a draw call allows us
to do more aggressive caching, which is what
we are looking for. We can build
the Mesh Draw Command for static geometry at the time
we add it to the scene. I mentioned that the Static Mesh Draw List does some amount of caching; the Mesh Draw Command
does even more. It is basically
the entire execution of what the drawing policy
used to do, all cached in a single
Mesh Draw Command. This gets us closer
to what we are looking for, being able to have
the pipeline state Object,
and all of the shader bindings needed to draw anything
in the scene at the time it is added to the scene. The other thing is that
because this is purely state, we get robust draw call merging.
We can just look at the state, and if it is the same
across multiple draw calls, we can just combine them into
a single instance draw call. But more on that later. Getting back to sort of
the lifetime here, how does this actually -- what does this
actually look like? We have a Mesh Batch and we need to generate
Mesh Draw Commands for it. To do that, we have
this Mesh Pass Processor. The Mesh Pass Processor,
its purpose, again, is to build
the Mesh Draw Commands. It selects a shader,
collects all the bindings for rendering that draw call,
bindings from the pass, from the Vertex Factory,
from Materials, and so on. Now you need to write
a Mesh Pass Processor per pass that you want in your renderer. One of the nice things,
though, is that this is not templated
on the shader class, it is just a relatively
simple class to write. We use the same code path for
both dynamic and static draws, so there is no more duplication
of rendering code when you are
implementing a pass. What I want to do is take
a quick look at an example. We will take a look
at the Depth Pass. This is the simplest pass
in the renderer, probably, where we just want
to render the depth for every primitive
in the scene. There is a little bit
of copy-paste boilerplate when you are going to create
a new Mesh processor; this is about the extent of it,
so it is not too bad. This is basically
just setting up the things that FMesh
Pass Processor needs. Then the part
that you implement, the part that you really
care about is this. This is Add Mesh Batch. This is called
for each Mesh Batch in the scene that is visible. The two things you are doing
here, really, are filtering -- do I care about this
for this pass -- and selecting the shader. For instance,
for the Depth Pass, we are not going
to render translucency into the Depth Pass,
so we filter that out. We filter out anything that is
not actually going to render in the main pass, etc. In terms of selecting
the shader, we look at, in this case, whether it is
an opaque Mesh or not. If it is an Opaque Surface,
then we do some optimizations, in this case we can render with
the depth-only vertex shader, which is a bit faster
than otherwise. If it is masked, then we need
to evaluate the Material so that we can
do alpha testing. That goes through
a slightly different path. Here is sort of the second
part of selecting the shader, and finally, we call this Build Mesh
Draw Commands Function, which will gather the bindings.
This is actually code that is shared among
all Mesh Pass Processors, so you do not have to write
that function yourself. That is it.
You have this function, and this one is actually just
a convenience template function for choosing whether we are using
the depth-only shader or not. That is it.
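To give a feel for the shape of this code, here is a heavily condensed sketch of a depth-style AddMeshBatch. It follows the engine's naming conventions, but it is an illustration under assumptions, not the exact 4.22 source:

```cpp
// Heavily condensed illustration of a depth-style AddMeshBatch -- not the
// exact 4.22 source. Filter out what the pass does not draw, then pick
// shaders and hand off to the shared helper.
void FDepthPassMeshProcessor::AddMeshBatch(
    const FMeshBatch& MeshBatch,
    uint64 BatchElementMask,
    const FPrimitiveSceneProxy* PrimitiveSceneProxy)
{
    const FMaterial* Material =
        MeshBatch.MaterialRenderProxy->GetMaterial(FeatureLevel);

    // Filtering: translucent surfaces never write depth in this pass.
    if (IsTranslucentBlendMode(Material->GetBlendMode()))
    {
        return;
    }

    if (Material->GetBlendMode() == BLEND_Opaque)
    {
        // Fully opaque: a cheap position-only, depth-only shader will do.
        Process</*bPositionOnly=*/true>(MeshBatch, BatchElementMask,
                                        PrimitiveSceneProxy, *Material);
    }
    else
    {
        // Masked: the Material must be evaluated so we can alpha test.
        Process</*bPositionOnly=*/false>(MeshBatch, BatchElementMask,
                                         PrimitiveSceneProxy, *Material);
    }
    // Each Process<> specialization ends by calling the shared
    // BuildMeshDrawCommands(...) helper to select shaders and gather bindings.
}
```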
Depth drawing policy in 4.21 was somewhere around
642 lines of code. The Depth Pass Mesh Processor
is around 100. It is a lot simpler
to set up these new passes, a lot simpler to add them.
The Base Pass also went down from almost 1000 lines of code
to just over 200. I think all engineers
like deleting code, so having simpler code
is always a good thing. I want to take a closer look
at that function; I told you,
you do not need to write this, the Build Mesh Draw Commands. This is the one
that actually collects all of the shader bindings that
are needed for the draw call, and this is part of
where removing a lot of the boilerplate came from,
why implementing a new Mesh pass is simpler now
than it was before. The main change here is that
rather than directly grabbing all of the parameters that
we want to set for the shader, and directly asking RHI
to set them, you know, set this float4 at this offset --
instead, what we do is, we gather them into basically
a shader bindings table and we cache them off. After we have gathered
all of the Mesh Draw Commands, now we can sort them. This, as I was mentioning
earlier, we could never do state sorting
before between dynamic path Meshes and Static Meshes because they go down
different code paths. Now, because they are
all using the same data, it is really simple for us to just sort all of the Mesh
Draw Commands. Finally, we submit them. Now we have a list
of Mesh Draw Commands, and all we need to do
is iterate that visible list, and generate the RHI commands
for rendering it this frame. The nice properties of
Submit Mesh Draw Commands: it scales only with the number of visible primitives, so we no longer have this property where traversing the Static Mesh Draw List would have some cost for the entire set of Meshes in the scene, not just the visible ones. It is also easier to parallelize
than the Static Mesh Draw List was for two reasons; one, it is easier to split it
into an even number of tasks. When you are generating
parallel tasks, parallel jobs, you want to be able to generate
roughly even amounts of work; you do not want one job that takes 10 times longer
than another job, because that is going to end up
being your critical path. In this case, because
it is just a flat list that we can divide evenly, it is really easy. If there are 5,000
visible Meshes, we will break it into groups
of 100 or 200, and send those off
to task threads to process. With the Static Mesh Draw List,
we could not really do that. We had to have some notion
of which one is going to look at the first 100 or 200
or 300 Meshes in the frame, and then it would go over
the bit Array, and only draw what is visible
within those areas. There is kind of some
complicated code that would, based on the previous
frame's visibility, try to figure out how to break
up tasks to make them even. It worked okay in practice, but it was a lot
more complicated than just a simple parallel for. Also, all we are doing is taking
these Mesh Draw Commands, which are just data, and writing
into an RHI command list, which is just a data stream. It is completely side-effect-free; we are not touching or modifying global state, so there is no worry about race conditions.
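As a schematic sketch of what that parallel submission loop can look like -- SubmitMeshDrawCommand here is an assumed helper, and the exact task setup differs in the engine:

```cpp
// Schematic sketch of parallel submission -- not the engine's actual code.
// The visible list is a flat array, so it splits into even tasks trivially;
// each task records into its own RHI command list, so nothing is shared.
void SubmitVisibleCommandsParallel(
    const TArray<FVisibleMeshDrawCommand>& VisibleCommands,
    TArray<FRHICommandList*>& PerTaskCmdLists) // one pre-created list per task
{
    const int32 NumTasks = PerTaskCmdLists.Num();
    const int32 CommandsPerTask =
        FMath::DivideAndRoundUp(VisibleCommands.Num(), NumTasks);

    ParallelFor(NumTasks, [&](int32 TaskIndex)
    {
        const int32 Start = TaskIndex * CommandsPerTask;
        const int32 End = FMath::Min(Start + CommandsPerTask, VisibleCommands.Num());
        for (int32 Index = Start; Index < End; ++Index)
        {
            // Pure data in, RHI commands out: no global state is touched.
            SubmitMeshDrawCommand(VisibleCommands[Index], *PerTaskCmdLists[TaskIndex]);
        }
    });
}
```

Another thing that we do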
in Submit Mesh Draw Commands is, we do state
filtering above the RHI. This reduces the load,
the amount of time that we have to spend
executing the command list. I think in Fortnite we have seen maybe a 20 percent
savings by doing this alone. This also means that
by doing it at this level, there is less burden on the
per-platform implementations, to make sure that they are doing
very fine-grained state caching. It is also cache-coherent. This is just one big,
flat Array of data. Therefore,
since everything we need is in this contiguous
memory space, we are not hitting
a lot of cache misses during rendering anymore. We are able to just
run through this flat Array. We get a lot of benefit out of
pre-fetching from the CPU, and everything
goes really quickly. All right, so we have done
all this work to refactor how we are taking primitives
and generating Mesh batches and generating
Mesh Draw Commands, and so on. Now let us get to
the caching part of things. Again,
the majority of the Worlds that we are dealing with
are modular Static Meshes, so we should not be doing
all of the work, every frame. What we want to do is
just cache these on the scene, and then just select the
right ones to draw every frame. This means that anything
referenced by Mesh Draw Commands cannot change frequently,
because if it does change, then we need to invalidate it. What we needed to do
was use a level of indirection to make sure that
that was the case. For instance, let us take a look
at per view parameters. These are values
that change every single frame. The way that the renderer
used to work, every frame, we would gather
all of that data, we would
make one Uniform Buffer, the View Uniform Buffer. Then as we are issuing
draw calls, we would bind
the View Uniform Buffer that we created uniquely
for that frame and issue the draw call.
That is not going to work, because it means
that we would have to invalidate every Mesh Draw Command every frame
to update that binding. Instead, we keep the binding
stable. We create a single
View Uniform Buffer that is always bound
to every Draw Command, and then every frame when we are
about to render a view, we update the contents of
the View Uniform Buffer instead. The indirection
remains constant, even though the data
in the buffer ends up changing. In order to support that,
the RHI now supports a code path for updating the contents
of a Uniform Buffer dynamically, which it did not previously.
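A minimal sketch of that pattern, assuming the update-in-place helper added alongside this refactor:

```cpp
// Sketch of the "stable binding, mutable contents" pattern. The buffer is
// created once, and every cached Mesh Draw Command binds this same buffer,
// so the binding itself never has to be invalidated.
TUniformBufferRef<FViewUniformShaderParameters> PersistentViewUB;

void InitPersistentViewUniformBuffer()
{
    FViewUniformShaderParameters InitialContents;
    PersistentViewUB = TUniformBufferRef<FViewUniformShaderParameters>::
        CreateUniformBufferImmediate(InitialContents, UniformBuffer_MultiFrame);
}

void BeginRenderingView(const FViewUniformShaderParameters& FrameContents)
{
    // Overwrite the contents in place each frame instead of creating a
    // new buffer; cached commands stay valid because the binding is stable.
    PersistentViewUB.UpdateUniformBufferImmediate(FrameContents);
}
```

Taking a look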
at the type of data that we reference
from Draw Commands, and how we organize them
into Uniform Buffers, we have the Primitive Uniform
Buffer for instance -- this is where we store all data that is unique
to that primitive, things like the local-to-World transform. Then we have
the Material Uniform Buffer, this is where everything
unique to the Material, things like parameters that artists are setting
in Material Instance Constants, color tints and different
Textures, and so on. We have Material parameter
collections, which are unique to the scene. These are data that can be
accessed by a Material, but they are global
for the entire scene. The precomputed lighting Uniform
Buffer, for accessing data like, what lightmaps we should be
referencing, what our offsets
are into those lightmaps. Those tend to be per primitive
as well. Then we have
Pass Uniform Buffers. This is any data
that a particular pass needs to render, but that is not shared
among different passes. The depth pre-pass might need
different bindings than the Base Pass,
for instance. Once we have moved all that
data into those Uniform Buffers, now we had to make sure
that any time something really does need to change, we invalidate
the Mesh Draw Commands. This is really important,
because if you screw this up, you are going to get incorrect
rendering. Fortunately, most of this
was already handled, because the caching done
in Static Mesh draw lists cached sort of the first part
of the evaluation that we now do to generate
Mesh Draw Commands, so all of those cases
are still relevant. Whenever we would invalidate
a Static Mesh Draw List before, we now invalidate
Mesh Draw Commands. But it is not quite everything, because now we are caching
the shader bindings as well, which we were not, previously.
Therefore, every time that something
would change in the scene that would cause the shader
bindings to change, we need to remember
to invalidate cached Mesh Draw Commands. We have added a validation mode;
it adds quite a bit of overhead, so it does not run
most of the time. But if you are running into
a bug that you think is because cache invalidation
is not working correctly, you can enable this, and this will assert
when it is screwed up, which is really,
really useful for finding out what is actually going wrong. We used this a lot as we were
bringing up this code path, and looking for corner cases
and bugs that we were trying
to track down. It is very useful.
If, after modifying the renderer, you start seeing something like flickering, or some unexplained crash, this would be a good
first step for debugging. Vertex Factories, if you are
familiar with those, are basically how we generate the vertex data in the vertex shader for drawing, and how we get that data from the CPU to the GPU. Right now, we only support
caching for Vertex Factories that do not need to update
their bindings per view. For instance,
the local Vertex Factory -- and the local Vertex Factory is
what we use for Static Meshes, and we use it for
a couple of other cases. It is by far the most
commonly used Vertex Factory in the Engine. Other Vertex Factories which
do need to change per view, we do allow them
to still cache Mesh batches, so some of the work is still
cached, but not all of it. These are things like landscape
that the LOD computations are a bit more complicated
than with Static Meshes, and so need to be
evaluated per view; BSP, instancing,
stuff like that. This means we have
a couple of different Caching Code Paths
in terms of efficiency. We have the least efficient
path, which we call
dynamic relevance. This would be cases like Cascade
particles, or Niagara particles. These change every frame;
we get new data pushed to the rendering thread
from the game thread. Every frame, the Scene Proxy
needs to generate a new set of Mesh Batches. We need to take
those Mesh Batches, generate a new set
of Draw Commands, and then take those
Draw Commands and finally generate
the RHI commands for them. For static primitives, where the game code is not sending us new data but we still need to change how we create the Draw Commands based on the view, we do a little bit more work. We cache the Mesh Batches, but then every frame,
we take the Mesh Batches, generate the Draw Commands,
and then those Draw Commands will get turned
into RHI commands. Then we have
our most efficient path, the path used
by Static Meshes, for instance,
where we have cached the Mesh Draw Commands
at scene add time. Then at runtime
we just need to find -- we gather all of those visible
Mesh Draw Commands, and we issue
RHI commands for them. High-level overview of a
frame with caching enabled -- when we add a primitive
to the scene, if it is static,
we cache the Mesh Batches. Then if the Vertex Factory
does not need the view, if the Vertex Factory
supports Draw Command caching, then we also take
those Mesh Batches and we generate
Mesh Draw Commands at that time
and cache those. There are a couple of cases -- for instance, when we change the skylight, that is going to change shader bindings for pretty much everything in the scene -- where we invalidate the cached Mesh Draw Commands. When it comes time
to render the scene, we will need to go over
all of the cached batches and regenerate
those Mesh Draw Commands. Then each frame,
when we are rendering, we figure out
all of the primitives that are visible
in the scene. If they are static,
we compute their LOD, and we add the cached Mesh
Draw Commands to a visible list. If they are dynamic,
then we will go ahead, we will gather
their Mesh Batches, we will generate
Draw Commands from them, and then we take
those Draw Commands and add them
to the same visible list. After that point, after InitViews, every subsequent pass only needs to look at this one list of visible Mesh Draw Commands and draw them.
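Put together as pseudo-code, the per-frame gather looks roughly like this; the helper names (SupportsCachedMeshDrawCommands, ComputeLOD, GetCachedCommand, BuildDynamicDrawCommands) are hypothetical:

```cpp
// Schematic sketch of the per-frame gather described above -- illustrative,
// not the engine's actual InitViews code.
void GatherVisibleMeshDrawCommands(
    const FScene& Scene,
    const TBitArray<>& VisibilityMap,                 // one bit per primitive
    TArray<FVisibleMeshDrawCommand>& OutVisibleCommands)
{
    for (int32 PrimitiveIndex = 0; PrimitiveIndex < Scene.Primitives.Num(); ++PrimitiveIndex)
    {
        if (!VisibilityMap[PrimitiveIndex])
        {
            continue;
        }
        const FPrimitiveSceneInfo* Primitive = Scene.Primitives[PrimitiveIndex];
        if (Primitive->SupportsCachedMeshDrawCommands())  // static path
        {
            // Pick the cached command for the computed LOD and append it.
            const int32 LODIndex = ComputeLOD(*Primitive);
            OutVisibleCommands.Add(Primitive->GetCachedCommand(LODIndex));
        }
        else                                              // dynamic path
        {
            // Generate Mesh Batches and Draw Commands on the fly, appending
            // to the very same list so both paths sort and merge together.
            BuildDynamicDrawCommands(*Primitive, OutVisibleCommands);
        }
    }
}
```

All right, so now we have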
this nice, data-oriented layout
of Mesh Draw Commands, and we set all this up so
we could do draw call merging. What does that look like? On D3D11, we are using instanced draws. That means that the only thing that can change between merged draws
is the InstanceID; we cannot change shaders, we cannot change bindings --
nothing else. But it does mean that we can
do Dynamic Instancing. It is pretty easy
to implement at this point, with the architecture
that we now have. In the future, on D3D12,
for instance, Execute Indirect allows us to change state
in between instance draw calls. That is going
to actually allow us to take these big lists
of Mesh Draw Commands and make far, far fewer API
calls in order to render them. Well, let us take
a closer look at -- I should mention
that we have not actually done the D3D12 path yet. That will be a future endeavor
that we will embark on. For now we have done the
D3D11-style Dynamic Instancing. Let us take a closer look
at Dynamic Instancing. Now that we have these
Mesh Draw Commands, we can robustly merge them, because again, they are
just low-level RHI state. All we need to do is compare
the data that exists on there, and if they are all the same,
then we can merge the draw calls and change it into
an Instance draw call. This has the really
nice property of, content creators do not have
to worry about it; they do need to go and create
a special Instance Static Mesh and try to combine them, and worry about,
what about culling, and how do I balance draw calls
versus culling efficiency, and so on?
It just works. There is one downside,
of course. Assigning those state
buckets is slow. Doing all those comparisons
does take a good amount of time, and we do not want
to do that every frame. Again, let us cache this at scene add time. When we generate
the Draw Command to begin with, at that point,
what we can do is, we look up into
a cache of state buckets and see which state bucket this Draw Command belongs to. Then this lets us group,
early on, which Draw Commands
can be merged together. Then the actual
merging operation is just a data transformation
on the visible list. We have this big list
of visible draw calls, and any command
in the same bucket, we just replace that
with a single Mesh Draw Command. We have our sorted list of Mesh Draw Commands, and as we go through them, we look up which state bucket each one is in. As long as consecutive commands are in the same state bucket, we keep merging them into a single instanced draw call. The output of the merging pass is just the smaller list of merged Mesh Draw Commands.
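Here is a schematic sketch of that merging pass; the types are illustrative, assuming a state bucket ID assigned to each command at scene add time:

```cpp
// Schematic sketch of the merging pass -- illustrative types, not the
// engine's actual structures. Merging is one linear pass over the sorted list.
struct FVisibleCommandSketch
{
    int32 StateBucketId;   // identical RHI state => same bucket
    int32 NumInstances;    // instance count after merging
    // ...plus the underlying Mesh Draw Command data
};

void MergeDrawCommands(const TArray<FVisibleCommandSketch>& SortedVisible,
                       TArray<FVisibleCommandSketch>& OutMerged)
{
    int32 Index = 0;
    while (Index < SortedVisible.Num())
    {
        FVisibleCommandSketch Merged = SortedVisible[Index];
        Merged.NumInstances = 1;

        // Consecutive commands in the same bucket collapse into a
        // single instanced draw call.
        while (Index + Merged.NumInstances < SortedVisible.Num() &&
               SortedVisible[Index + Merged.NumInstances].StateBucketId
                   == Merged.StateBucketId)
        {
            ++Merged.NumInstances;
        }
        OutMerged.Add(Merged);
        Index += Merged.NumInstances;
    }
}
```

Now that we can merge Mesh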
Draw Commands, that is great. How do we make sure they are
as effective as possible? One of the things
that we needed to do was get rid of any sort of
per draw bindings that would have broken up
Draw Commands into multiple sub-commands. For instance, anything
that was going to be bound per pass,
per primitive, we really need to make sure
that does not happen. In order to support this,
what we did was, we created a pass
Uniform Buffer frequency, like a per pass place
to store data. For instance, things like
in the Base Pass, we potentially sample
the DBuffer Textures; we potentially sample
the eye adaptation Texture, stuff like that,
the fog parameters -- those things do not change for
the entirety of the Base Pass. Why set them on every single draw call? We just set up a single Uniform Buffer for the Base Pass for this frame, and we render with it. Then we have some remaining
per draw call bindings. We have the Global Constant
Buffer values, we have the Primitive
Uniform Buffer. This changes on every single
primitive that is in the World. If we do not change the way
we store primitive uniform data in some way, this is trivially
going to break apart all of our merge draw calls; we are not going to be able
to merge anything. Precomputed Lighting --
so again, this is the lightmaps that the primitive
needs to render with, and the distance cull data,
basically. As you are distance
culling or LOD fading, different primitives
can be in a different state, and so that can break
merging effectiveness. What we want to do
instead is, we want to upload
all of this data into a single scene-wide buffer
that we can then look up into, based on the InstanceID
in the shader. That way, when we merge
the draw calls, it does not matter if we render
them individually or merged; no matter what, we are going
to be able to look up and get that data without changing
the bindings for the shader. In order to support this,
there were a number of cases where we needed
to do something like this. We implemented
sort of a generic -- think of it like a GPU TArray implementation, a dynamically resizable Array on the GPU -- where when you want to insert data from the CPU, we track adds, updates, and removes. Then before we render, we launch
a couple of compute shaders to shuffle data around
as needed to make this possible. We did not want to just completely upload the entire primitive buffer every frame, so this lets us track just the deltas and perform only those operations.
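A rough sketch of that delta-tracked upload, with hypothetical helper names:

```cpp
// Rough sketch of the delta-tracked upload described above -- hypothetical
// names, not the engine's actual GPU-scene code.
class FGpuSceneBufferSketch
{
public:
    void MarkDirty(int32 PrimitiveIndex) { DirtyIndices.Add(PrimitiveIndex); }

    void FlushBeforeRender(FRHICommandList& RHICmdList)
    {
        // Copy only the dirty entries into a small upload buffer, then
        // dispatch a compute shader that scatters them into the persistent
        // scene-wide buffer -- instead of re-uploading everything.
        for (int32 PrimitiveIndex : DirtyIndices)
        {
            StageDirtyEntry(PrimitiveIndex);    // hypothetical helper
        }
        DispatchScatterShader(RHICmdList);      // hypothetical helper
        DirtyIndices.Reset();
    }

private:
    void StageDirtyEntry(int32 PrimitiveIndex);
    void DispatchScatterShader(FRHICommandList& RHICmdList);

    TSet<int32> DirtyIndices;  // adds, updates, and removes since last flush
};
```

One result of this is,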
when you are accessing the data in a shader,
you need to use the GetPrimitiveData() helper.
Before, you could just directly access
the Primitive Uniform Buffer, but now we are going through
this indirection where we need the PrimitiveID. Anything that was going
to access the data needs to go through this GetPrimitiveData()
helper function. Again, this is only used
by supporting Vertex Factories, and it is abstracted by this GetPrimitiveData() method,
so it is not too bad. But how do we actually get
the PrimitiveID in the shader? All we have is the InstanceID. We could do something
like have a buffer that we bind per
draw call with the primitive IDs,
and index into that. But adding a single
Global Constant Buffer value per draw actually increased
the Base Pass drawing time by 20 percent. We did not want to eat
that overhead of just having to update a single parameter
every draw call. Fortunately, the input assembler
has a much faster path for it. We can just call SetStreamSource
with a dynamic offset. This lets us set up
the PrimitiveID vertex input at per-instance frequency, and then after
draw call merging, we build the PrimitiveID buffer out of the list of Mesh Draw Commands.
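A small sketch of the idea -- the buffer and stream slot here are assumptions for illustration, but SetStreamSource with an offset is the RHI call in question:

```cpp
// Sketch of the per-instance PrimitiveID stream idea -- hypothetical
// buffer names and stream slot, used only for illustration.
void BindPrimitiveIdStream(FRHICommandList& RHICmdList,
                           FRHIVertexBuffer* PrimitiveIdsBuffer,
                           uint32 FirstInstanceOffsetInBytes)
{
    // Stream slot fetched at per-instance frequency by the vertex shader.
    const uint32 PrimitiveIdStreamIndex = 1; // assumed slot for illustration

    // Re-pointing the stream with a byte offset is far cheaper than
    // updating a constant buffer value on every draw call.
    RHICmdList.SetStreamSource(PrimitiveIdStreamIndex,
                               PrimitiveIdsBuffer,
                               FirstInstanceOffsetInBytes);
}
```

All right, so we did all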
this work to merge draw calls. Let us take a look
at the result; what does that
actually look like? I have this map
called a GPU Perf Test. It is actually -- we just
grabbed one of the cities from Fortnite's Save the World,
probably a year or two ago, and used this as just
a static representation that we could use to iterate on
and test rendering performance. Initially, it was primarily intended for GPU performance testing, but it was basically
a way for us to have a static profiling set
for optimizing Fortnite. Taking a look at this,
looking at the Depth Pass, we were able to reduce
the number of draw calls with merging by more than 2X, and same thing with Base Pass
around a factor of 2. Depth Pass ends up being
a little more efficient, just because there are
a lot more cases where you can share
the same shader for Meshes that are not alpha-masked, and can just use
the depth-only shaders. But that was not nearly enough to really tell us much; 2X is good, but we really wanted to see how it scaled. We did the programmer
art version of trying to make
a more complex scene. We took the scene and
we duplicated it three times. Then we removed distance culling to just render things
really far out. The result of that: we ended up drawing around 7,400 draw calls per frame. We were able to,
with draw call merging, reduce the Depth Pass by about
10X in terms of draw calls, and the Base Pass
by closer to 5X. Now all that said,
let me caveat that with, this was,
by far, an optimal case. We literally just
copy-pasted content, so it was pretty much perfect
for Dynamic Instancing. But it is sort of proving that if you do make
a modularized static scene, then this code can do
a really good job of optimizing that
on the back end. Some performance
gotchas to keep in mind -- things that will break
draw call merging -- lightmaps that make
small Textures. Basically, if you are using
static lighting and a lot of your primitives
end up in different lightmaps, then we are not going to be able
to merge draw calls where they do not use
the same lightmap. You can tweak the radius
that we use to pack primitives into the same lightmap Texture,
and that can help with it. It is kind of a trade-off
between Texture streaming efficiency
from memory of lightmaps, and draw call efficiency of how
much you are able to merge. Using vertex painting
on instances in the World will also
break draw call merging, because those exist in their
own independent vertex buffers; they need to be bound
separately. The Speed Tree Wind Node -- again, if you use
the Speed Tree Wind Node, we do not support merging. There is no good
technical reason for that, we just did not get to it. That could be improved
in the future. If you are still using
the Legacy Sparse Volume lighting samples
for lighting dynamic Objects with static lighting --
that will not work, either. However, the newer
volumetric lightmaps work perfectly well
with draw call merging. All right, so before, we were
looking at just draw calls; how well did merging work in terms of reducing
the number of draw calls. But what does that actually mean
in terms of milliseconds, because that is what we actually
care about at the end of the day? We tested it
on PlayStation 4, it is just nice to be able
to use a console platform where we do not have to worry
about the OS getting in the way, we do not have to worry
about differences in hardware, just a very consistent platform
to be testing on. With the base GPU Perf Test,
with the old code in 4.21, the Depth Pass was running
in about 2 milliseconds, and the Base Pass, 3.2. The new Mesh
Draw Command pipeline with Dynamic Instancing, depth drawing
to 0.3 milliseconds, and Base Pass, 0.4 milliseconds,
so much, much faster, six to seven times faster. In our programmer
art larger scene, the old path was running at around 15.7 milliseconds
for the Depth Pass, and 27.8 milliseconds
for the Base Pass, again, issuing about 7,400 draw calls.
The new path is now running in 1.2
milliseconds for the Depth Pass, and 2.4 for the Base Pass --
so massively faster, right? More than 10 times faster.
But let me caveat that. In this test case,
it worked awesome. These are, by far,
best case results. For one, this is
a heavily modular scene with very limited Mesh
and shader variety. Also, the speedup I was showing, that is only the Mesh
drawing part of the code, and that may or may not be
on your critical path for a variety of reasons. For instance, I am not talking
about init views, which is the visibility part, so if visibility is still
dominating your draw time, then taking the Mesh drawing and making that smaller
might not help you. Another reason: we were already parallelizing the renderer fairly heavily. If you have
a lot of task threads that are not doing
other useful work, we were already using them very effectively. Even though that work goes away,
you still might not see an actual frame
rate benefit from it. Also, your content, as well as ours -- what we have been shipping in Fortnite and all of our demos and other Projects -- is optimized for the previous renderer, so our artists were very careful to make sure that they kept draw
call counts down. These benefits really start
to show up as your scene scales up by factors of three, four,
five of what it is today. Some casualties
from the change, things that no longer work -- all of the deferred primitive
update mechanisms that were in the renderer
for efficiency sake -- those had to go away, because they are just
incompatible with the idea of caching all of that data
when we create the scene. These are things like,
if a primitive moves but it is not visible, we would never bother updating
the Primitive Uniform Buffer for that primitive.
We cannot do that anymore, because we need to update
some GPU data so that DXR can trace
a ray to it, or so that we can build
a Mesh Draw Command that can reference it. Also, if you have any Materials that are using
custom expressions, they can pretty much type
whatever HLSL code they want, and they might be grabbing data from the
Primitive Uniform Buffer. If they were, you need
to make sure that you go in and change
those custom nodes to use the GetPrimitiveData()
accessor function. This is something
where if you update to 4.22, and some of your Materials
are not compiling anymore, there is a good chance
that this is why. Go and look
for custom expressions that might be breaking
the new rules. The forward renderer can now
only support a single global planar
reflection, because, again, to do multiple
global planar reflections, we would have had to take into
account per view information, in this case. I already talked a little bit
about this, but current UE4 Projects
may or may not see benefits. I just want to caveat,
you might go home, you try 4.22, and you are, like,
where is that 10X speedup? Well, here is why you might not
be getting a 10X speedup just by updating to 4.22. But that said, we do have
a couple of testimonials, so when we release
the preview release, we had some developers
on Twitter who were talking. Joe Wintergreen mentioned
that in his Project, 4.22 ended up saving
about 1000 draw calls. The Shores Unknown guys
were letting us know that they had a scene that was around 18,000 draw calls; with 4.22, it is now around 2,300. So their scene
that was running at 30 FPS is now running at 60
FPS, just by updating to 4.22. Depending on
how you build your scene, you might get a massive speedup,
you might not. But at the end of the day, I think we have managed
to allow the renderer to scale much better
with a larger set of draw calls, and so going forward, you will be able to build
larger, more complex scenes. There was one caveat that I did
forget to mention around draw call
merging as well. That is, in 4.22, we do not support draw
call merging on mobile. There is no technical reason
for that, we just did not
get to it in time for 4.22. We will have it in for 4.23. The simple reason for that is,
there are different passes in the mobile renderer
versus the deferred renderer, and we needed to go in and do
some of the same transformations of pulling out data that
we were binding per draw call, and make sure that we put them
into per pass buffers, or put them into
global scene buffers that we can then access
per primitive. That is it.
That is an explanation of the new Mesh
drawing pipeline in UE4. Thank you for
coming out to the talk; I hope it was useful. I hope you learned a little bit
about why we did it. I hope when you update to 4.22, you do get a massive
improvement in frame rate. If not, add more Meshes to your
scene. Thanks. [Applause] ♫ Unreal logo music ♫