Since I started working on this Minecraft clone, I have had a few goals in mind: I wanted it to be a finished product with working multiplayer,
I wanted it to have nice shaders, and I also wanted it to run well on older hardware. So today I will finally talk about both the
smart optimizations and the simple tricks that I have made for this project. I will also
answer the question of whether many draw calls are bad by reducing 3600 draw calls to only one,
and the answer might just surprise you. We will start from around 20 FPS but with
a good rendering foundation, and I will apply 3 optimization techniques on top, including
a custom GPU geometry memory arena, to see how far we can get. So let's go.

The first thing that I wanted to optimize
very well was the block renderer, because that's the first building block of
Minecraft. All of the other algorithms and data structures
that will optimize the game will come on top. So let's start with this and then I'll add
other optimizations.

This is the geometry that I want to render.
As I explained in the first video, it is important not to render faces that are hidden by other
blocks, so you actually don't want to draw an entire cube all the time, but rather faces
of cubes. And this led to a very interesting optimization.

So normally when rendering you would send
the geometry that you want to draw to the GPU, for example the shape
of this face, and then you would use what is called a uniform to specify a position
for the entire object in the world. But we can do better. What I realized is that in Minecraft
I always have the same geometry, one of 6 possible faces, and the only things that differ
are the position and the material. So I flipped the rendering process around.

The process works like this: I can only render
quads of predefined shapes, and for each quad I have to specify the shape, the texture to
apply, the position, the torch light, the sun light, and some flags. Now for the shape and the texture I can just
use a 16-bit index for each, getting me to one int so far. For the position I need a full int each for the
x and z directions, but for the y the build limit is very small so I don't really need
the top 16 bits. For the lights, the maximum light value is
15 so I just need 4 bits for each light type, and those can be placed here, leaving me with another
8 bits for the flags.

So I managed to fit the entire information
about one face in just 4 ints. Remember this, because I will return to this number in a
second, and keep in mind that usually people would spend that much only to encode the position,
whereas I managed to encode the position, shape, material, and even some flags.

Now this data is sent as a vertex array but
I am doing instanced rendering, so I am drawing 4 vertexes for each face. This data is configured to be the same
for the entire face, so nothing changes per vertex except the vertex ID. That's why I
said that I flipped the rendering process around. And because I managed to fit all of the data
in only 4 ints, it fits in a single vertex attribute of 16 bytes, well within one cache line, so this can only be
good for performance.

I calculate the actual shape and the normals
in the vertex shader. This is just a matter of putting all of the
possible geometry in a buffer, then all I need to do is index the correct position,
taking into account the face shape and the current vertex ID. I calculate the normals on the fly, which
helps with the dynamic geometry, and to be fair I don't think it costs that much
to calculate. I will mention that I didn't use a geometry
shader because they are apparently very slow, and it turns out that I didn't need one in
the end.

Animating the geometry is pretty easy:
I just use a different face index to specify that I want to animate that face, and then
I apply some functions in the vertex shader.

This gets us to here, at around 20 FPS at
a render distance of 60 by 60 chunks.

Now there is one last hidden secret here that
I have found to be very interesting, and it has to do with the index buffer, or the way
you draw your geometry. You see, I never quite got the point of the
index buffer. Let's say we have a cube: with an index buffer you only have to send 8 vertexes
instead of 36, and this seems nice, except that in practice once you need a different
normal for each face you actually still have to send 24 vertexes. And in my case an index buffer wouldn't help
me in any way because I am not even sending different data per vertex. And here is why optimizations are difficult:
they have many hidden secrets. Apparently, whenever you use an index buffer,
each vertex that you reuse has to be processed only once, in the best case. And this also applies when using things like
triangle strips or triangle fans. So instead of rendering 2 triangles, I render
a triangle fan, and this means that the vertex shader has to run only 4 times instead of
6, reusing the shared vertexes. And this gets us to the present moment at
around 30 FPS.

So now let's apply 3 other optimizations
but before that I want to remind you that it is very important to always measure and
to have systems that help you with that. In the episode about shaders, I showed this
tool that shows how much time each system takes. Now I have also added this tool to show me how
chunks are loaded, and take a look at that: some chunks get recreated multiple times. This has to do with the light system updates
and it is clearly a thing to optimize, but I never would have noticed it without this
tool.

Ok now, the first 2 optimizations are related
in a way and they have to do with overdraw. I have an expensive pixel shader, so if I am
drawing a nice thing here and then another thing comes on top, I waste a lot of time. This is called overdraw. How do I know if my pixel shader is expensive? I test it of course: if I disable fancy shaders,
we go from 30 FPS to something like 45.

Now a trick to reduce overdraw is to sort the
chunks from the closest to the farthest. This works because things that are closer
to the player are more likely to end up on the screen, so the depth test can usually reject the fragments hidden behind them before the expensive pixel shader runs. And luckily, we already have the chunks sorted
because we need them to be sorted when drawing the transparent geometry. So we can have this optimization for free. This doesn't boost the FPS very much,
however, maybe only in some cases, so let's try something even better.

Z pre-pass. This is a very easy optimization and I want
to make a tutorial about it, so subscribe so you don't miss it. But basically, I render the entire geometry
first, but only to the depth buffer, and then I render it again, but this time I change the
depth test so that it only renders fragments whose depth exactly matches what is already in the
depth buffer. This reduces overdraw to 0, but at the cost
of having to process the geometry of the scene twice. If processing the geometry of the scene is more
expensive than the pixel shader, this can actually decrease the FPS. On its own it does seem
to increase the FPS, but when combined with the geometry sorting it does nothing, and it
even decreases it at high render distances, so it has to go.

And finally we get to what is sometimes called
unified geometry. But I will quickly mention that, as last time,
you will be able to vote on what the next Minecraft clone video should be about, so make sure
to subscribe fast to not miss that poll.

Ok, so you have probably heard that issuing
many draw calls and having many VAOs is bad for performance. A friend of mine who is very good with
GPU programming told me that nowadays that's not true. But then I saw that there are mods that allow
Minecraft to have insane render distances, at better FPS than my clone, and that felt personal. So I tried to optimize this, because right
now I have one separate VAO for each chunk, so this scene binds 3600 different VAOs and
issues 3600 draw calls. So to optimize it I created a single unified
geometry pool. Inside I store all of the chunks' data, and
this also meant that I had to create an allocator for it. This fortunately wasn't that difficult: I just have a linked list on the CPU to encode
the used memory blocks, and an unordered map that points to the list
elements for fast access. Finally, I have to bind only one VAO and issue
one big indirect draw call using glMultiDrawElementsIndirect.

And after this optimization, the performance
improvement is... none. I'm serious, it is literally 0. So how is that possible? We're talking about reducing 3600 draw calls and
VAO binds to only one, with no FPS difference. Well, these are the things to consider. First of all, yes, there is indeed a small but
visible improvement in CPU usage when using only one draw call. The FPS doesn't change though, because the CPU
is not the bottleneck here, but maybe for a slower CPU this will make a difference. Also, for an older graphics card or a less
powerful one, this could still make a difference. Now for a modern GPU the optimal way to do
things is to bind your stuff first and then issue draw calls. If the things you are drawing are very small,
the GPU will render them before you have time to send new data, wasting time waiting. But in a normal use case that won't happen. And since I use bindless textures, I don't
bind any resource in between draw calls.

So there you have it, this is the conclusion.
If you are not impressed, keep in mind that there are still many important
optimizations that I haven't talked about, like frustum culling, drawing chunks only in
a circular radius, or optimizing the very expensive screen space reflections that I
have right now. Also, this render distance is equivalent to the max render distance of 32
in vanilla Minecraft, so running Minecraft at max render settings with shaders
on at an almost playable FPS is a good achievement. But again, there are many things to be added
and many things to be talked about in the next videos. So don't forget to vote for the next
video, and until then check out another video from my channel. See you!
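
As a companion to the face packing described in the video (shape and texture in one int, x and z in one int each, and y plus the two lights plus the flags in the last int), here is a minimal C++ sketch of how such a layout could look. The struct name `PackedFace`, the helper names, and the exact bit positions inside each int are assumptions made for illustration; the video only states which fields share an int, not their order.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: the field order inside each int is an assumption.
struct PackedFace {
    std::uint32_t shapeAndTexture; // bits 0-15: shape index, bits 16-31: texture index
    std::int32_t  x;               // full 32-bit world x
    std::int32_t  z;               // full 32-bit world z
    std::uint32_t yLightFlags;     // bits 0-15: y, 16-19: torch light, 20-23: sun light, 24-31: flags
};

static_assert(sizeof(PackedFace) == 16, "one face should occupy exactly 4 ints");

// Packs one face; light values are masked to 4 bits because the maximum is 15.
inline PackedFace packFace(std::uint16_t shape, std::uint16_t texture,
                           std::int32_t x, std::uint16_t y, std::int32_t z,
                           std::uint8_t torchLight, std::uint8_t sunLight,
                           std::uint8_t flags) {
    PackedFace f{};
    f.shapeAndTexture = std::uint32_t(shape) | (std::uint32_t(texture) << 16);
    f.x = x;
    f.z = z;
    f.yLightFlags = std::uint32_t(y)
                  | (std::uint32_t(torchLight & 0xF) << 16)
                  | (std::uint32_t(sunLight & 0xF) << 20)
                  | (std::uint32_t(flags) << 24);
    return f;
}

// Unpack helpers that mirror the shifts and masks a vertex shader would apply.
inline std::uint32_t faceShape(const PackedFace& f)   { return f.shapeAndTexture & 0xFFFFu; }
inline std::uint32_t faceTexture(const PackedFace& f) { return f.shapeAndTexture >> 16; }
inline std::uint32_t faceY(const PackedFace& f)       { return f.yLightFlags & 0xFFFFu; }
inline std::uint32_t faceTorch(const PackedFace& f)   { return (f.yLightFlags >> 16) & 0xFu; }
inline std::uint32_t faceSun(const PackedFace& f)     { return (f.yLightFlags >> 20) & 0xFu; }
inline std::uint32_t faceFlags(const PackedFace& f)   { return f.yLightFlags >> 24; }
```

Calling `packFace(3, 517, -120, 64, 4000, 15, 7, 0xA5)` and reading the fields back with the helpers should return the same values that went in, which is an easy way to sanity check a layout like this before writing the matching shader code.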
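
The geometry pool allocator mentioned in the video is described only as a linked list of used memory blocks plus an unordered map pointing at the list elements. The sketch below is one possible way to build that, not the actual implementation: the class name `GeometryPool`, the first-fit search, the chunk ID type, and measuring sizes in faces are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

// Hypothetical CPU-side bookkeeping for one big GPU buffer.
class GeometryPool {
public:
    static constexpr std::size_t kNoSpace = static_cast<std::size_t>(-1);

    explicit GeometryPool(std::size_t capacity) : capacity_(capacity) {}

    // First-fit (an assumption): walk the used blocks in offset order and
    // take the first gap that is large enough. Returns the offset of the
    // new block, or kNoSpace if nothing fits or the chunk is already stored.
    std::size_t allocate(std::uint64_t chunkId, std::size_t size) {
        if (size == 0 || byChunk_.count(chunkId) != 0) return kNoSpace;
        std::size_t cursor = 0; // end of the previous used block
        auto it = used_.begin();
        for (; it != used_.end(); ++it) {
            if (it->offset - cursor >= size) break; // gap before this block fits
            cursor = it->offset + it->size;
        }
        if (it == used_.end() && capacity_ - cursor < size) return kNoSpace;
        // Inserting before `it` keeps the list sorted by offset.
        byChunk_[chunkId] = used_.insert(it, Block{cursor, size});
        return cursor;
    }

    // The map gives the list node directly, so no list walk is needed here.
    bool release(std::uint64_t chunkId) {
        auto found = byChunk_.find(chunkId);
        if (found == byChunk_.end()) return false;
        used_.erase(found->second);
        byChunk_.erase(found);
        return true;
    }

private:
    struct Block { std::size_t offset; std::size_t size; };
    std::size_t capacity_;
    std::list<Block> used_; // used blocks, sorted by offset
    std::unordered_map<std::uint64_t, std::list<Block>::iterator> byChunk_;
};
```

A `std::list` fits here because its iterators stay valid when other elements are inserted or erased, so the map entries never dangle. The trade-off in this sketch is that allocation is linear in the number of used blocks, which may or may not matter at a few thousand chunks.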
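
For the single indirect draw call, the CPU side mostly comes down to filling an array of draw commands, one per chunk stored in the pool. The sketch below shows that step only, under several assumptions that the video does not confirm: each chunk is drawn as instanced 4-vertex quads, `baseInstance` selects where the chunk's faces start in the shared buffer, and the names `ChunkRange` and `buildDrawCommands` are hypothetical. The command struct itself follows the field layout that OpenGL defines for `glMultiDrawElementsIndirect`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Field layout defined by OpenGL for indexed indirect draws.
struct DrawElementsIndirectCommand {
    std::uint32_t count;         // indices per instance
    std::uint32_t instanceCount; // number of instances
    std::uint32_t firstIndex;    // offset into the index buffer
    std::int32_t  baseVertex;    // value added to each index
    std::uint32_t baseInstance;  // first instance for instanced attributes
};

static_assert(sizeof(DrawElementsIndirectCommand) == 20, "commands must be tightly packed");

// Hypothetical description of where one chunk lives in the shared pool.
struct ChunkRange {
    std::uint32_t firstFace; // offset of the chunk's first face, in faces
    std::uint32_t faceCount; // number of faces in the chunk
};

// Builds one command per non-empty chunk: 4 indices (one quad) per instance,
// one instance per face, starting at the chunk's offset in the pool.
inline std::vector<DrawElementsIndirectCommand>
buildDrawCommands(const std::vector<ChunkRange>& chunks) {
    std::vector<DrawElementsIndirectCommand> cmds;
    cmds.reserve(chunks.size());
    for (const ChunkRange& c : chunks) {
        if (c.faceCount == 0) continue; // nothing to draw for empty chunks
        cmds.push_back({4u, c.faceCount, 0u, 0, c.firstFace});
    }
    return cmds;
}
```

The resulting array would then be uploaded to a buffer bound to `GL_DRAW_INDIRECT_BUFFER` and submitted with one `glMultiDrawElementsIndirect` call whose `drawcount` is `cmds.size()`. This relies on the per-face data being an instanced attribute, since that is what `baseInstance` offsets.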