Video games have spectacular graphics, capable of
transporting you to incredibly detailed cities, heart-racing battlegrounds, magical
worlds, and breathtaking environments. While this may look like an old western train
station and locomotive from Red Dead Redemption 2, it’s actually composed of 2.1 million
vertices assembled into 3.5 million triangles with 976 colors and textures
assigned to the various surfaces, all with a virtual sun illuminating the scene below.
But perhaps the most impressive fact is that these vertices, textures, and lights are entirely
composed of ones and zeroes that are continuously being processed inside your computer’s graphics card or video game console. So then, how does your computer take billions of ones and zeroes and turn them into realistic 3D graphics? Well, let’s jump right in.
The video game graphics rendering pipeline has three key steps: Vertex Shading, Rasterization,
and Fragment Shading. While additional steps are used in many modern video games, these three core
steps have been used for decades in thousands of video games for both computers and consoles and
are still the backbone of the video game graphics algorithm for pretty much every game you play.
Let’s begin with the first step called vertex shading. The basic idea in this step is to
take all the objects’ geometries and meshes in a 3D space and use the field of view of the
camera to calculate where each object falls in a 2D window called the view screen, which
is the 2D image that’s sent to the display. In this train station scene, there are 1,100 different models, and the camera’s field of view sections off what the player sees, reducing the number of objects that need to be rendered to 600.
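As a rough sketch of how this kind of field-of-view culling can work, the Python snippet below tests each model’s bounding sphere against the camera’s frustum planes. The plane values, model positions, and the is_visible helper are all made-up placeholders for illustration, not data or code from the actual game.

```python
# Each frustum plane is (normal_x, normal_y, normal_z, d), with the normal
# pointing toward the inside of the frustum, so a point p is inside that
# half-space when dot(normal, p) + d >= 0.
def is_visible(center, radius, frustum_planes):
    """Return True if a model's bounding sphere overlaps the camera's view frustum."""
    for nx, ny, nz, d in frustum_planes:
        distance = nx * center[0] + ny * center[1] + nz * center[2] + d
        if distance < -radius:        # the sphere lies completely outside this plane
            return False
    return True                       # not rejected by any plane, so keep the model

# Hypothetical frustum planes (near, far, left, right) and bounding spheres.
frustum = [(0, 0, 1, 0.1), (0, 0, -1, 100.0),
           (0.7, 0, 0.7, 0.0), (-0.7, 0, 0.7, 0.0)]
models = [((0.0, 0.0, 10.0), 2.0),    # in front of the camera: kept
          ((50.0, 0.0, -5.0), 1.0)]   # behind the camera: culled
visible = [m for m in models if is_visible(*m, frustum)]
```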
Let’s focus on the locomotive as an example. Although this engine has rounded surfaces and some rather complex shapes, it’s actually
assembled from 762 thousand flat triangles using 382 thousand vertices and 9 different
materials or colors applied to the surfaces of the triangles. Conceptually, the entire train
is moved as one piece onto the view screen, but in practice, each of the train’s hundreds of thousands of vertices is moved one at a time. So, let’s focus on a single vertex. The
process of moving a vertex, and by extension the triangles and the train, from a 3D world onto a 2D view screen is done using three transformations: first moving the vertex from model space to world space, then from world space to camera space, and finally from camera space through the perspective projection onto the view screen. To perform these transformations, we use the X, Y, and Z coordinates of that vertex in model space, the position, scale, and rotation of the model in world space, and the coordinates, rotation, and field of view of the camera. We plug all these numbers into different transformation matrices and multiply them together, resulting in the X and Y values of the vertex on the view screen as well as a Z value, or depth, which we’ll use later to determine which objects block others.
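To make that matrix math concrete, here is a simplified Python sketch that pushes one vertex through a model, view, and perspective projection matrix. The specific matrices and numbers are illustrative placeholders, not the transforms of any real engine.

```python
import numpy as np

# Model matrix: places the model in world space (here, just a translation).
model = np.array([[1, 0, 0,  5.0],
                  [0, 1, 0,  0.0],
                  [0, 0, 1, -2.0],
                  [0, 0, 0,  1.0]])

# View matrix: shifts the world so the camera sits at the origin looking down -Z.
view = np.array([[1, 0, 0,   0.0],
                 [0, 1, 0,  -1.5],
                 [0, 0, 1, -10.0],
                 [0, 0, 0,   1.0]])

# Perspective projection for a 90-degree field of view (near = 0.1, far = 100).
near, far = 0.1, 100.0
f = 1.0 / np.tan(np.radians(90) / 2)
projection = np.array([[f, 0, 0, 0],
                       [0, f, 0, 0],
                       [0, 0, (far + near) / (near - far), 2 * far * near / (near - far)],
                       [0, 0, -1, 0]])

vertex_model_space = np.array([1.0, 2.0, 0.5, 1.0])     # homogeneous (x, y, z, 1)

clip = projection @ view @ model @ vertex_model_space   # one chain of matrix multiplies
x, y, z = clip[:3] / clip[3]                            # perspective divide
# x and y locate the vertex on the 2D view screen; z is the depth value
# that will be used later for the Z-buffer visibility test.
```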
After three vertices of the train are transformed using similar matrix math, we get a single triangle moved onto the view screen. Then the
rest of the 382 thousand vertices of the train and the 2.1 million vertices of all the 600
objects in the camera’s field of view undergo a similar set of transformations, thereby moving
all 3.5 million triangles onto a 2D view screen. This is an incredible amount of matrix
math, but GPUs in graphics cards and video game consoles are designed to be triangle mesh
rendering monsters and thus have evolved over decades to handle millions of triangles every few
milliseconds. For example, this GPU has roughly 10,000 cores designed to efficiently execute up to 35 trillion 32-bit multiply and add operations every second, and, by distributing the vertex coordinates and transformation data among the cores, the GPU can easily render the
scene resulting in 120 or more frames a second. Now that we have all the vertices moved onto a 2D
plane, the next step is to use the 3 vertices of a single triangle and figure out which specific
pixels on your display are covered by that triangle. This process is called rasterization.
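As an illustrative sketch rather than the exact method a GPU uses, the snippet below rasterizes one screen-space triangle in Python by walking the pixels in its bounding box and testing each pixel center with edge functions. The vertex coordinates are made-up values.

```python
def rasterize(v0, v1, v2, width, height):
    """Return the (x, y) pixels covered by a 2D screen-space triangle."""
    def edge(a, b, p):
        # Signed area of triangle (a, b, p); its sign tells which side of edge a->b point p is on.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

    # Only visit pixels inside the triangle's bounding box, clamped to the screen.
    xs = [v0[0], v1[0], v2[0]]
    ys = [v0[1], v1[1], v2[1]]
    covered = []
    for y in range(max(int(min(ys)), 0), min(int(max(ys)) + 1, height)):
        for x in range(max(int(min(xs)), 0), min(int(max(xs)) + 1, width)):
            p = (x + 0.5, y + 0.5)                     # sample at the pixel center
            w0, w1, w2 = edge(v1, v2, p), edge(v2, v0, p), edge(v0, v1, p)
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0):
                covered.append((x, y))                 # this pixel belongs to the fragment
    return covered

# Hypothetical triangle already projected onto a 3840x2160 view screen.
fragment_pixels = rasterize((100.0, 50.0), (300.0, 80.0), (180.0, 400.0), 3840, 2160)
```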
A 4K monitor or TV has a resolution of 3840 by 2160,
yielding around 8.3 million pixels. Using the X and Y coordinates of the vertices
of a given triangle on the view screen, your GPU calculates where it falls within this
massive grid and which of the pixels are covered by that particular triangle. Next, those pixels
are shaded using the texture or color assigned to that triangle. Thus, with rasterization,
we turn triangles into fragments which are groups of pixels that come from the same
triangle and share the same texture or color. Then we move on to the next triangle and
shade in the pixels that are covered by it and continue to do this for each of the
3.5 million triangles that were previously moved onto the view screen. By applying the red, green, and blue color values of each triangle to the appropriate pixels, a 4K image is formed
in the frame buffer and sent to the display. You’re probably wondering how we account
for triangles that overlap or block other triangles. For example, the train is blocking the
view of much of the train station. Additionally, the train has hundreds of thousands of triangles
on its backside that are sent through the rendering pipeline, but obviously don’t appear in
the final image. Determining which triangles are in front is called the visibility problem and
is solved by using a Z-buffer, or depth buffer. A Z-buffer adds an extra value to each of the
8.3 million pixels corresponding to the distance or depth that each pixel is from the camera.
In the previous step, when we did the vertex transformations, we ended up with X and Y
coordinates, but then also got a Z value that corresponds to the distance from the
transformed vertex to the camera. When a triangle is rasterized, it covers a set
of pixels and the Z value or depth of the triangle is compared with the values stored in the
Z-buffer. If the triangle’s depth values are lower than those already in the Z-buffer, meaning the triangle is closer to the camera, then we paint in those pixels using the triangle’s color and replace the Z-buffer’s values with that triangle’s Z-values. However, if a second triangle comes along with Z-values that are higher than those in the Z-buffer, meaning that triangle is farther away, we simply throw it out and keep the pixels from the triangle that was previously painted with lower Z-values. Using this method, only the triangles closest to the camera, with the lowest Z-values, end up displayed on the screen.
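Here is a minimal Python sketch of that depth test, assuming we already know each covered pixel’s interpolated depth; the resolution, depth values, colors, and the shade_pixel helper are placeholders for illustration.

```python
import numpy as np

WIDTH, HEIGHT = 3840, 2160
# The frame buffer holds an RGB color per pixel; the Z-buffer starts at "infinitely far".
frame_buffer = np.zeros((HEIGHT, WIDTH, 3), dtype=np.uint8)
z_buffer = np.full((HEIGHT, WIDTH), np.inf)

def shade_pixel(x, y, depth, color):
    """Write a pixel only if this fragment is closer than what is already stored."""
    if depth < z_buffer[y, x]:        # a lower Z value means closer to the camera
        z_buffer[y, x] = depth        # remember the new closest depth
        frame_buffer[y, x] = color    # paint the pixel with this triangle's color

# Hypothetical fragments covering the same pixel: the nearer red fragment wins.
shade_pixel(100, 200, depth=0.80, color=(30, 30, 30))    # far, dark gray triangle
shade_pixel(100, 200, depth=0.35, color=(200, 20, 20))   # closer red triangle overwrites it
```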
By the way, here’s the image of the Z or depth buffer, wherein black is close and white is far. Note that because these triangles are in 3D space,
the vertices often have 3 different Z values, and thus each individual pixel of the triangle needs
its Z value computed using the vertex coordinates. This allows intersecting triangles to properly
render out their intersections pixel by pixel. One issue with rasterization is that when a triangle’s edge cuts across a pixel at an angle, the entire pixel is either painted with that triangle’s color or left untouched, depending only on whether the triangle covers the pixel’s center, resulting in jagged, pixelated edges.
To reduce the appearance of these jagged edges, graphics processors implement a technique
called Super Sampling Anti-Aliasing, or SSAA. With SSAA, 16 sampling points are distributed across a single pixel, and when a triangle cuts through that pixel, a fractional shade of the triangle’s color is applied, proportional to how many of the 16 sampling points the triangle covers. This results in faded edges in the image and significantly less noticeable pixelization.
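As a rough sketch of that idea, the snippet below estimates one pixel’s coverage with a 4-by-4 grid of sample points and blends the triangle’s color with the background accordingly. The inside-triangle test reuses the earlier edge-function approach, and all coordinates, colors, and helper names here are made up.

```python
def inside(v0, v1, v2, p):
    """Edge-function test: True if point p lies inside triangle v0-v1-v2."""
    def edge(a, b):
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    w0, w1, w2 = edge(v1, v2), edge(v2, v0), edge(v0, v1)
    return (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0)

def antialiased_color(px, py, tri, tri_color, background):
    """Blend the triangle color over one pixel using 16 (4x4) sample points."""
    covered = 0
    for i in range(4):
        for j in range(4):
            # Sample points spread evenly across the pixel's 1x1 area.
            sample = (px + (i + 0.5) / 4, py + (j + 0.5) / 4)
            if inside(*tri, sample):
                covered += 1
    fraction = covered / 16.0            # how much of the pixel the triangle covers
    return tuple(fraction * c + (1 - fraction) * b
                 for c, b in zip(tri_color, background))

# A triangle edge cutting through pixel (10, 10) yields a partially faded red.
color = antialiased_color(10, 10,
                          ((10.2, 9.0), (14.0, 12.0), (9.0, 13.0)),
                          tri_color=(255, 0, 0), background=(0, 0, 0))
```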
One thing to remember is that when you’re playing a video game, your character’s camera view as
well as the objects in the scene are continuously moving around. As a result, the calculations within vertex shading, rasterization, and fragment shading are redone for every single frame, once every 8.3 milliseconds for a game running at 120 frames a second.
Let’s move on to the next step, which is Fragment Shading. Now that we have a set
of pixels corresponding to each triangle, it’s not enough to simply paint by number to color
the pixels. Rather, to make the scene realistic, we have to account for the direction and
strength of the light or illumination, the position of the camera, reflections, and
shadows cast by other objects. Fragment shading is therefore used to shade in each pixel
with accurate illumination to make the scene realistic. As a reminder, fragments are groups of
pixels formed from a single rasterized triangle. Let’s see the fragment shader in action. This
train engine is mostly made of black metal, and if we apply the same flat color to every pixel of its fragments, we get a horribly inaccurate train. But once we apply proper shading, such
as making the bottom darker and the top lighter, and by adding in specular highlights or shininess
where the light bounces off the surface, we get a realistic black metal train. Additionally, as the
sun moves in the sky, the shading on the train reflects the passage of time throughout the day,
and, if it’s night, the materials and colors of all the objects are darker and illuminated from
the light of the fire. Even video games such as Super Mario 64, which is almost 30 years old, have some simple shading where the colors of surfaces are changed by the lighting and shadows in the
scene. So, let’s see how fragment shading works. The basic idea is that if a surface is pointing
directly at a light source such as the sun, it’s shaded brighter whereas if a
surface is facing perpendicular to, or away from the light, it’s shaded darker.
In order to calculate a triangle’s shading, there are two key details we need to know.
First, the direction of the light and second, the direction the triangle’s surface is facing.
Let’s continue to use the locomotive as an example and paint it bright red instead of black. As
you already know, this train is made of 762 thousand flat triangles, many of which face
in different directions. The direction that an individual triangle is facing is called its
surface normal, which is simply the direction perpendicular to the plane of the triangle, kind
of like a flagpole sticking out of the ground. To calculate a triangle’s shading, we take the cosine of the angle, theta, between these two directions. Cosine theta is 1 when the surface faces directly at the light, and 0 when the surface is perpendicular to the light. Next, we
multiply cosine theta by the intensity of the light and then by the color of the material to
get the properly shaded color of that triangle. This process adjusts the triangles’ RGB values
and as a result, we get a range of lightness to darkness of a surface depending on how its
individual triangles are facing the light. However, if a surface is perpendicular to the light or facing away from it, cosine theta is 0 or negative, which on its own would result in a pitch-black surface. Therefore, we clamp cosine theta to a minimum of 0 and add an ambient light intensity times the surface color, adjusting this ambient light so that it’s higher in daytime scenes and closer to 0 at night.
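Putting these pieces together, here is a hedged Python sketch of this diffuse-plus-ambient shading for a single triangle and a single light. The surface normal comes from a cross product of two triangle edges, and the light direction, intensity, and material color are invented values rather than anything from the actual scene.

```python
import numpy as np

def shade(v0, v1, v2, light_dir, light_intensity, material_rgb, ambient):
    """Cosine (diffuse) shading of one flat triangle, plus an ambient term."""
    # Surface normal: the direction perpendicular to the triangle's plane,
    # found with a cross product of two of its edges (the "flagpole").
    normal = np.cross(v1 - v0, v2 - v0)
    normal = normal / np.linalg.norm(normal)

    # cos(theta) between the normal and the direction toward the light,
    # computed as the dot product of the two unit vectors.
    to_light = light_dir / np.linalg.norm(light_dir)
    cos_theta = max(np.dot(normal, to_light), 0.0)   # clamp so it never goes negative

    # Diffuse contribution plus ambient light, both scaled by the material color.
    return np.clip(material_rgb * (light_intensity * cos_theta + ambient), 0.0, 1.0)

# Hypothetical bright-red triangle lit by a roughly overhead sun.
v0 = np.array([0.0, 0.0, 0.0])
v1 = np.array([1.0, 0.0, 0.2])
v2 = np.array([0.0, 1.0, 0.1])
color = shade(v0, v1, v2,
              light_dir=np.array([0.3, 0.2, 1.0]),
              light_intensity=0.9,
              material_rgb=np.array([1.0, 0.1, 0.1]),  # bright red
              ambient=0.15)
```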
Finally, when there are multiple light sources in a scene, we perform this calculation multiple times with different light directions and
intensities, and then add the individual contributions together. Having more than a few light sources is computationally intensive for your GPU, and thus scenes limit the number
of individual light sources and sometimes limit the range of influence for the lights
so that triangles will ignore distant lights. The vector and matrix math used in rendering
video game graphics is rather complicated, but luckily there’s a free and easy way to learn
it and that’s with Brilliant.org. Brilliant is a multidisciplinary online interactive education
platform and is the best way to learn math, computer science, and many other
fields of science and engineering. Thus far we’ve been simplifying the math behind
video game graphics considerably. For example, vectors are used to find the value of cosine theta
between the direction of the light and the surface normal, and the GPU calculates it using the dot product of the two vectors divided by the product of their norms. Additionally, we skipped a lot of detail when it
came to 3D shapes and transformations from one coordinate system to another using matrices.
Rather fittingly, Brilliant.org has entire courses on vector calculus, trigonometry, and 3D
geometry, as well as courses on linear algebra and matrix math, all of which have direct applications to this video and are needed for you to fully understand graphics algorithms.
Alternatively, if you’re all set with math, we recommend their course on Thinking in
Code which will help you build a solid foundation on computational problem solving.
Brilliant is offering a free 30-day trial with full access to their thousands of
lessons. It’s incredibly easy to sign up, try out some of their lessons for free and, if you
like them, which we’re sure you will, you can sign up for an annual subscription. To the viewers
of this channel, Brilliant is offering 20% off an annual subscription to the first 200 people who
sign up. Just go to brilliant.org/brancheducation. The link is in the description below.
Let’s get back to exploring fragment shading. One key problem with it is that the triangles within an object each have only a single normal, and thus every pixel within a triangle shares the same color across that triangle’s surface. This is called flat shading, and it looks rather unrealistic on curved surfaces such as the body of this steam engine.
So, in order to produce smooth shading, instead of using surface normals, we use one normal for
each vertex calculated using the average of the normals of the adjacent triangles. Next, we
use a method called barycentric coordinates to produce a smooth gradient of normals across the
surface of a triangle. Visually it’s like mixing 3 different colors across a triangle, but instead
we’re using the three vertex normal directions. For a given fragment we take the center of each
pixel and use the vertex normals and coordinates of the pre-rasterized triangle to calculate the
barycentric normal of that particular pixel. Just like mixing the three colors across a triangle
this pixel’s normal will be a proportional mix of the three vertex normals of the triangle.
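Here is a small Python sketch of that blending, assuming we already have the triangle’s 2D screen-space vertices and its three vertex normals. The numbers are placeholders, and the perspective correction a real GPU applies during interpolation is omitted for simplicity.

```python
import numpy as np

def barycentric_weights(v0, v1, v2, p):
    """Barycentric coordinates of point p with respect to 2D triangle v0-v1-v2."""
    def area(a, b, c):
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    total = area(v0, v1, v2)
    w0 = area(v1, v2, p) / total      # how strongly vertex 0 influences p
    w1 = area(v2, v0, p) / total      # how strongly vertex 1 influences p
    w2 = area(v0, v1, p) / total      # how strongly vertex 2 influences p
    return w0, w1, w2

def pixel_normal(v0, v1, v2, n0, n1, n2, pixel_center):
    """Blend the three vertex normals for one pixel, like mixing three colors."""
    w0, w1, w2 = barycentric_weights(v0, v1, v2, pixel_center)
    n = w0 * n0 + w1 * n1 + w2 * n2
    return n / np.linalg.norm(n)      # re-normalize the blended direction

# Hypothetical screen-space triangle with three different vertex normals.
n = pixel_normal((10.0, 10.0), (30.0, 12.0), (18.0, 40.0),
                 np.array([0.0, 0.0, 1.0]),
                 np.array([0.3, 0.0, 0.95]),
                 np.array([0.0, 0.3, 0.95]),
                 pixel_center=(19.5, 20.5))
```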
As a result, when a set of triangles is used to form a curved surface, each pixel will be part
of a gradient of normals resulting in a gradient of angles facing the light with pixel-by-pixel
coloring and smooth shading across the surface. We want to say that this has been one of the most
enjoyable videos to make simply because we love playing video games and seeing the algorithm
that makes these incredible graphics has been a joy. We spent over 540 hours researching,
writing, modelling this scene from RDR2, and animating. If you could take a few seconds
to hit that like button, subscribe, share this video with a friend, and write a comment below it
would help us more than you think, so thank you. Thus far we’ve covered the core steps for
the graphics rendering pipeline; however, there are many more steps and advanced topics. For example, you might be wondering where ray tracing and DLSS, or deep learning super sampling, fit into
this pipeline. Ray tracing is predominantly used to create highly detailed scenes with accurate lighting and reflections, typically found in TV and movies, where a single frame can take dozens of minutes or more to render. For video games, the primary visibility and shading of the objects
are calculated using the graphics rendering pipeline we discussed, but in certain video
games ray tracing is used to calculate shadows, reflections, and improved lighting. On the other
hand, DLSS is an algorithm for taking a low-resolution frame and upscaling it to a 4K frame using a convolutional neural network. Therefore, DLSS is executed after ray tracing and the graphics pipeline have generated a low-resolution frame. One interesting note is that the latest generation
of GPUs has 3 entirely separate architectures of computational resources or cores. CUDA or Shading
cores execute the graphics rendering pipeline. Ray tracing cores are self-explanatory. And then
DLSS is run on the Tensor cores. Therefore, when you’re playing a high-end video game with
Ray Tracing and DLSS, your GPU utilizes all of its computational resources at the same time,
allowing you to play 4K games and render frames in less than 10 milliseconds each. Whereas if you
were to solely rely on the CUDA or shading cores, then a single frame would take around
50 milliseconds. With that in mind, Ray Tracing and DLSS are entirely different topics
with their own equally complicated algorithms, and therefore we’re planning separate videos
that will explore each of these topics in detail. Furthermore, when it comes to video game graphics,
there are advanced topics such as Shadows, Reflections, UVs, Normal Maps and more. Therefore,
we’re considering making an additional video on these advanced topics. If you’re interested
in such a video let us know in the comments. We believe the future will require a strong
emphasis on engineering education and we’re thankful to all our Patreon and YouTube Membership
Sponsors for supporting this dream. If you want to support us on YouTube Memberships, or Patreon,
you can find the links in the description. This is Branch Education, and we create
3D animations that dive deeply into the technology that drives our modern
world. Watch another Branch video by clicking one of these cards or click here
to subscribe. Thanks for watching to the end!