>>Jon Holmes: I'll start off
with a little bit about me. I have been in the games
industry for about ten years. The majority of that
has been at Rare. I am an audio
programming specialist, which might seem a bit weird
for a performance talk, but audio being one
of those real-time systems, you have to be on the ball with performance. I am a member
of the Engine Team at Rare. Now, the Engine Team
is full of loads of specialists. We have got networking,
we have got physics, we have got all the core
specialties in our Engine Team. But we muck in everywhere. We are currently
working on Sea of Thieves, which is an ever-evolving title. We have been going
for a little over a year now. We have got our next
big update coming soon, and we have got a little trailer
to show you. ♫ intense orchestral music ♫ I believe the press
embargo has just lifted for some of
the reviews for this, or previews,
so do go online and check out what people
are saying about it just now. What am I
going to cover today? I am going to cover how we kept a consistent
framerate in Sea of Thieves. It is a challenging
environment to do so. I am going to do that
with lots of profiling data. There is going to be
lots of profiling data in this presentation,
through different methods. Then I am going to cover
engineering techniques that we used to gracefully
scale our game systems. We have got a lot of stuff
going on in Sea of Thieves, and we want to make sure
that the quality is delivered. I am going to go into
plenty of technical detail, so I will give some primers
on some of this stuff. Hopefully it
is all spelled out. What kind of problems
are we dealing with here? We are a client/server
architecture. We are a multiplayer game.
Got lots going on. Got many dynamic elements, and we have the ethos
of tools, not rules. We try and give loads
of stuff to the players and see how that emerges
in different behaviors. Very unpredictable,
some of this stuff. We have a large variance
in our scene complexity. You can have a case
where a single player is in a cave looking
at some cave paintings, compared with four galleons
laden with treasure, all battling it out. Maybe somebody
has fallen overboard, there is a bunch of sharks
going after them. A large variance
in scene complexity. We are multi-platform
as well. We are talking about PC,
min and low-spec PCs, we are talking about Xbox One, and we are talking about
our server architecture as well, which is a single core
in the cloud. We began development of Sea
of Thieves on Unreal Engine 4.6, which feels like an age ago,
and we actually shipped on 4.10. We were missing a lot
of the optimizations and a lot of the improvements
that came in, especially now that we are talking about 4.22. I am going to talk
about our ticking, and how we worked with ticks. I am going to give a quick
primer on what is a tick, in case some people at an Unreal
developer conference do not know what a tick is. We are essentially talking
about a virtual function call, either in native code
or in Blueprint code. This is housed inside
a Tick Function structure, which controls what group
you are ticking in, if you want to tick
asynchronously or not, the frequency of your tick, and any dependencies
that your tick might have. We also have a stage at the start of the frame where the engine decides what ticks are going to happen this frame.
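As a rough sketch, here is what that configuration looks like on an Actor in UE4. The Actor and its prerequisite here are illustrative names, not classes from our game, but the FTickFunction members are the stock engine ones:

```cpp
// Sketch: configuring an Actor's tick via its FTickFunction (PrimaryActorTick).
// AMyActor and OtherActor are illustrative, not classes from the game.
AMyActor::AMyActor()
{
    PrimaryActorTick.bCanEverTick = true;       // this Actor ticks at all
    PrimaryActorTick.TickGroup = TG_PrePhysics; // which tick group it runs in
    PrimaryActorTick.TickInterval = 0.1f;       // frequency: at most every 0.1s
}

void AMyActor::BeginPlay()
{
    Super::BeginPlay();
    if (OtherActor)
    {
        // A dependency: do not run our tick until OtherActor has ticked.
        PrimaryActorTick.AddPrerequisite(OtherActor, OtherActor->PrimaryActorTick);
    }
}
```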
Part of the reason that we investigated this early on was that we were spending
more than five milliseconds at the start of our frame. We were trying to hit 30 frames per second on Xbox One, and more than five milliseconds were spent just deciding what to tick. That is over 15%
of our frame budget. There is a little
PIX capture there. We use PIX quite a lot, and we use some of Unreal's tools as well. But when we are testing
performance, we approached it in the same way that we cover
a lot of our other testing. Specifically, testing the worst case is what matters. It does not matter if your average
is really good, if when you hit the worst,
your game basically falls apart. We used a lot of automated
testing to track, when someone makes a change, how that is reflected not just behaviorally, but in terms of performance as well. You cannot really see it on the slide very well, but this is one of our test maps, where it is simulating that four-galleon, almost worst case. Our worst case is really bad; this is almost worst case.
Galleons firing at each other. Through
our automated systems, we will get this report from
our build system that we can go and look at and trace,
as people make changes, how that affects the performance over time. I highly recommend you check out
Jessica Baker's talk later today at 4:15 on our automated
testing process, pipeline, and culture; it is really good. How do we analyze this data? Like I said,
we have a TeamCity report. TeamCity is our online build system. We get a report out of that,
and like I said before, we also use PIX for the CPU
and GPU captures on Xbox One. There will be a lot of PIX
captures in this presentation. We also use WPA, occasionally,
to track larger-scale playtests. But we will also use
Unreal stat files as well. These are really good.
You get the same kind of data out of the stat file
as we do in PIX, just aggregated
over a lot of time. This is particularly useful for
our server performance captures. Occasionally, we use
the in-game stat visualizations, but these are typically more for memory profiling than CPU profiling.
But in this presentation, I am only really going to talk
about these two here. For those of you
who have not seen PIX, PIX is freely available
for you to look at on PC. But here is a frame capture
from Sea of Thieves in our performance test map
that I just showed you. I am just going to talk you through a little bit
about what this means. Probably very similar data
to what the new Unreal Insights
stuff looks like. That looks really cool.
I am quite excited about that. This is what we have been using
for a long time. If I start annotating this, we have got our Begin Frame
here, and our End Frame there. We have the game thread
across the top on core 0. We then have
the rendering thread here and the RHI thread here. Not really going to talk about
any of the rendering stuff. I just wanted to put it
in there for completeness. It was a big body of stats
in this capture. We now have, here is
the pre-physics tick group. We have the during physics
tick group. You can see on the other cores, there is the physics
running asynchronously. Then we have our end physics
tick groups, the post-physics,
and the post update work. We do not really do much
in the post update work. But what is also interesting to
call out is in the post physics, here is some animation
that we are doing. When animation runs, it can run asynchronously, where you choose to let it. But anything you do in a tick group, like the animation here, will hold up that tick group until the next one can run. This frame
was 43 milliseconds. Pretty poor, really.
Not quite our 33 milliseconds for our 30 frames
per second budget, but Sea of Thieves is
an ever-evolving game. We are always
adding more content, so we have to keep working at making sure our performance holds up.
But we were a lot worse. Back in 2016, we were looking at
113 millisecond frame times, totally unplayable on Xbox.
We were game thread bound, which might surprise
some people. We had a lot of stuff going on. There was not really
one big thing to tackle. We could not look at one thing and then go, boom, we fixed it,
fixed our performance. That is partly because of the way that we developed our systems: they were designed
in isolation. A sail is a sail, this is
a wheel, this is a cannon, and as they grew, they just got
more and more costly. We could not apply sweeping
optimizations to these things. We could do micro-optimizations, but ultimately it was
very difficult to control our worst case frames. In 2016, we had written
a lot of our core mechanics, our core systems
for Sea of Thieves. Going back to the drawing
board was not really an option. This is data I got out of
a sampling capture during a playtest.
As you can see, hopefully you can see,
the majority of time, nearly 50% of our whole CPU time
on core 0, the game core,
was spent on ticking things. Some of the other time
was spent in physics, and in our networking,
but we soon discovered that that is because
our frame times were so bad that the physics was having
to work harder, and the networking was receiving
more data from the server in a single frame,
so it had a lot more to process. Which meant, really, we should be focusing on
looking at our ticks. What do we do with our ticks?
Before I go into ticking, I want to give
a little primer on CPU caches. Specifically,
this is on the Xbox One. The Xbox One
has two CPU modules, each with four cores. I am just going to show
one of the modules here. We have got
our four cores here. We have our main memory here. But you cannot really use main
memory directly from your CPU. Whatever you are doing
on your CPU, that either has to be in
your instruction cache or your data cache. These are both
32 kilobytes in size. These are your L1 caches. But you cannot go
from main memory directly into your L1
cache either. You have to go through
your L2 cache, which is shared
between four cores. Each module
has its own L2 cache. You go in from main memory
into your L2 cache, and then either
into your instruction cache or your data cache. If you then mutate stuff
in your data cache, it needs to read it back out
to your L2, and then back out
into your main memory. This is a problem for us,
because all the cores are doing this, and they all share the L2 cache. You are talking about
quite high amounts of latency between accessing
these sections of memory. It is a hierarchy,
so if something is in L1, it must be in L2 as well. To give a bit more
of context about this, consider these two
bits of assembly, where we are moving some memory. We are moving an integer
out of an offset into RCX, moving that integer into a register, and then we are adding that integer
to another integer. If we consider where
that could have been in memory, I am going to borrow a diagram
from Mike Acton, from his Data-Oriented Design and C++ talk. I do not mean to plagiarize,
but it is a really good talk, and the diagram is excellent at
showcasing what this looks like. If the data is in L1,
it is really quick. But if the data is in L2,
it is a little bit slower. Then if we are talking
about the data being in RAM, think about what the CPU
could be doing at this point. If your CPU
does out-of-order execution, it might be able
to do some stuff, but you cannot guarantee there
will be enough instructions in its pipe
to re-order some stuff. If it is in order, you are guaranteed
to be waiting that long before you can do
your next instruction. Now, that is important
with data. We are doing a data access here. But what happens
if these are not in your cache? These are the addresses of the
instructions you are running on. They could be in the cache, and if you are linearly zipping
through your instructions, they probably will be. Most modern x64
CPUs can pre-fetch. They can predict
where you are going, and will pre-fetch for you. But if you branch,
if you do a jump, or if you call
a virtual function, it cannot know where that memory is. You will incur
one of these latencies. And code is small, right?
Not really. Some of this code is pretty big.
I did a little bit of working out, hopefully you can read some of that, of how
big some of these functions are. I have got some of our own
functions from Sea of Thieves, and some functions from Unreal, and this is as a percentage
of your L1 cache. Doing a tick,
just the exclusive size of the tick function for UCharacterMovementComponent is over 12% of your L1 cache. That is not the callers
and the callees, that is just the function. I did start looking through
the hierarchy, the call hierarchy, but I stopped
at around 30, nearly 35%, of your L1 cache. It is quite a lot of code
that you go through. We have obviously added
a lot with our game code on top of Unreal Engine's code. What happens
when this hits scale? Here is what our
individual ticks look like. We have got our timeline.
Let us say we tick a sail. We have never ticked
a sail before. We bring in the instructions. Cool, that is fine.
We expect we have to do that. Then we tick a compass. Just figuring out
what direction north is. That is fine,
we have never done one before. We will bring that
into the instruction cache. Ooh, another sail.
That is cool. That means that is already
in the cache, hot in the cache. We can use that code again. Cannon, we have never seen
one of those before, so we have got to bring that
into the instruction cache. Oh, we have got
to tick a barrel. But our instruction
cache is full. That means
we will get rid of the compass, because we have not ticked that for a while. We probably will not
tick that again, that is fine. Oh, another barrel. That is cool,
it is in the instruction cache. Whoops, we have hit a compass.
We have just evicted that, so we need
to bring that back in. We have not ticked the sails
for a while. Let us get rid of those.
Now we can tick the compasses. Cannon. Cool,
that is already in the cache. You can see
where this is going. Oh, sail, yeah, we have got
to evict something else now. Then another barrel. This is what our frames
were like, basically. We had so many ticks, and they were all interleaving
with each other. What we wanted to do was aggregate them
together like this. Now, this particular problem
is explained in Scott Meyers's excellent talk on CPU caches and where you can find them. His example was quite theoretical, but for us this is definitely not a theoretical problem. We actually hit this, and hit it really hard. How did we aggregate
these ticks? Actors and components
have their ticks disabled, which seems straightforward. Then we register them
with a collection. I will show you some code
for this in a bit. But the collection
is essentially what houses a bunch of Actors
or components, or anything, really,
within it, and it is keyed off a UClass type. Unreal's reflection
was really useful here. It is very powerful
and very fast to look up, as well, so really good for
keying our collections off. That collection then has a
single tick function inside it, giving us all the benefits that
Unreal's ticking system has, with running asynchronously,
choosing a frequency, dependencies,
all that kind of stuff.
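A minimal sketch of that shape, with hypothetical names rather than our actual classes:

```cpp
// Hypothetical sketch: one collection per UClass, each owning the single
// tick that updates every registered object of that class in a tight loop.
class FAggregatedTickCollection
{
public:
    void Register(UActorComponent* Component)   { Components.Add(Component); }
    void Unregister(UActorComponent* Component) { Components.Remove(Component); }

    // The one tick function the engine's ticking system sees for this group.
    void Tick(float DeltaTime)
    {
        for (UActorComponent* Component : Components)
        {
            // Per-class update work goes here, back to back in the loop.
        }
    }

private:
    TArray<UActorComponent*> Components;
};

// Collections keyed off the UClass, which Unreal's reflection makes fast.
TMap<UClass*, FAggregatedTickCollection> CollectionsByClass;
```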
As I said, this gives us better instruction cache coherency. But with everything
grouped together, it means we can also reduce
our unnecessary work. I will show some more
of that soon. With everything being
in a tight loop, if you are updating
in a loop at this point, we can do some
single-instruction, multiple-data (SIMD) optimizations. The compiler might even be
smart enough to help us out here as well. I will talk
about this later, context-sensitive
prioritization, sensitive to context. It is a method that we have used
quite heavily in our game. Got a little quote for you from one of our
principal engineers at Rare. "One could argue the thing
that basically saves this game is the fact we are doing a lot
of the same thing all the time." I am going to go through an, I say, almost real-world example. I have had to edit it
a little bit for the slides, but it is basically a problem that we have
had to solve in the game. As you can imagine,
Sea of Thieves, a lot of water. We have components
for caching what the height of the water is at that component's location, which then means that systems do not have to keep looking it up at particular places.
It avoids multiple queries, and the queries
are not exactly cheap. It looks like this: we have got a global accessor here, where we are getting
some kind of interface, which allows us to query the
height at a particular location. Then we have our call
to getting the height and that is it. We cache the height afterwards,
not very much going on.
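A sketch of that unaggregated tick; the component, service, and accessor names here are assumptions for illustration, not our actual API:

```cpp
// Sketch of the individual tick: one service lookup and one height query
// per component, per frame, with the result cached on the component.
void UWaterHeightCacheComponent::TickComponent(float DeltaTime, ELevelTick TickType,
                                               FActorComponentTickFunction* ThisTickFunction)
{
    Super::TickComponent(DeltaTime, TickType, ThisTickFunction);

    // Global accessor handing back an interface we can query.
    IWaterHeightService& Service = GetWaterHeightService();

    // The query itself, then we cache the height. That is it.
    CachedWaterHeight = Service.GetWaterHeight(GetComponentLocation());
}
```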
Not really very much to actually optimize either. Maybe we do not need
to optimize it. Maybe we do. But what does this look
like in our frame? We tested this out
with 100 components in our four-ship scene. Here is where
the ticks land in the frame. We are quite lucky here. Some of them are quite
tightly grouped together, but they are not all
in a straight line, so there have been
some interruptions going on. This took 1.16 milliseconds. You might think that is alright for 100 components, but we have tweets of people
with 100 treasure chests on their ship.
As soon as that sinks, that is 100 items on top of all
the other items in the water. It is not inconceivable
to think of maybe 200, maybe 1,000. Consider that
is just water height, you could do a lot
in 1.16 milliseconds. What does it look like
when it is aggregated? We have got our aggregation,
this is the function that gets called in place
of all our individual ticks, where we have our array
of components we have collected, and we are calling
the tick functions manually. Not really anything special. The way
that we aggregate them, typically we do this
in BeginPlay and EndPlay, but you can do this
whenever you want. We unregister the tick function if we are going to aggregate
these things, and then we register it
with our collection. We do a bit of lazy evaluation here: with the first thing to be collected, you end up creating the collection with the tick inside it, which is what that lambda is for. Then on EndPlay,
we unregister the tick and unregister the component.
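Sketched out, with a hypothetical collection API, that registration flow looks something like this:

```cpp
// Sketch of registration: disable the individual tick, then register with
// the collection for this UClass. FindOrCreateTickCollection is hypothetical;
// the first registration lazily creates the collection, with the aggregated
// tick (the lambda) inside it.
void UWaterHeightCacheComponent::BeginPlay()
{
    Super::BeginPlay();
    SetComponentTickEnabled(false); // no more individual ticking

    FindOrCreateTickCollection(GetClass(),
        [](float DeltaTime, TArray<UWaterHeightCacheComponent*>& Comps)
        {
            for (UWaterHeightCacheComponent* Comp : Comps)
            {
                Comp->TickComponent(DeltaTime, LEVELTICK_All, nullptr);
            }
        }).Register(this);
}

void UWaterHeightCacheComponent::EndPlay(const EEndPlayReason::Type Reason)
{
    FindOrCreateTickCollection(GetClass()).Unregister(this);
    Super::EndPlay(Reason);
}
```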
What does that do? It is a reasonably
simple change. We have not changed
any logic yet for our tick. What does this look like?
Same hundred components. It is now tightly packed
inside our frame. But what has that done
to the time? That is already a 1.3 times
improvement. Massive, yeah. But we could
do better than that. I talked about
identifying unnecessary work. This is a zoom-in of that
aggregated tick. We have got our getting
the global service, and we have also got our query. Then we have got it again,
and then again, and again. You will notice that
the very first one in that tick, in that group, is huge compared
to all the subsequent ones. That is probably the benefit of the instruction cache. The first one is long because we have had to wait
for the instructions to come in. Then we are hot
in the cache now, so everything is much faster. But how can we now logically
change this to improve the speed?
So, we have got our tick, and that is the component
tick, internally. Let us just get rid of all that. Now we have
a more optimized tick. We have got rid of the component
tick altogether, and we have got
a single tick that, instead of looping through
the components calling tick, we take out some of the work. We have got our global state
that we are accessing. We only do that once now
per group of these things. Then we do a batched query rather than calling
GetWaterHeight individually. We break out all the locations
of the components into an array, because that is all we need
to figure out the water height, and what plane of water
we are interested in. Then when we are finished, we get that data out and assign
it back to our components. The important part of
this bit here is this call here, to building the query.
This is a bit more involved, and actually, you might think
this would be slower, because we are doing
a whole bunch more work. But the work is not too much. We are just going and getting
the data we need for this query, and putting it into data
of some sort that is easily usable by our service. This actually means that
we can do four height queries at once rather than one.
And if we had individual ticks, we would never have been able
to do that.
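Put together, the batched version looks roughly like this, again with an assumed service API rather than our real one:

```cpp
// Sketch of the batched aggregated tick: one service lookup and one query
// for the whole group, instead of one of each per component.
void FWaterHeightTickCollection::Tick(float DeltaTime)
{
    IWaterHeightService& Service = GetWaterHeightService(); // once per group

    // The only per-component input the query needs is the location.
    TArray<FVector> Locations;
    Locations.Reserve(Components.Num());
    for (const UWaterHeightCacheComponent* Comp : Components)
    {
        Locations.Add(Comp->GetComponentLocation());
    }

    // One batched query; internally it can process four heights at once.
    TArray<float> Heights;
    Service.GetWaterHeights(Locations, Heights);

    // Tight write-back loop; simple loops like this are also the shape
    // the compiler can turn into SIMD.
    for (int32 Index = 0; Index < Components.Num(); ++Index)
    {
        Components[Index]->CachedWaterHeight = Heights[Index];
    }
}
```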
Like I said, the compiler can auto-vectorize
It does not in this case, but if you can help it,
it can auto-vectorize. For example,
this section of code here takes two arrays of floats and adds the elements
together individually. Your compiler will happily
take this and decide I could do four,
I could maybe do eight. It can happily optimize
this for you. You do not have to worry about
writing hand-vectorized code in this case.
What does that do to our time? It is really small now
in the frame, quite difficult
to see, actually. This has actually
got us down quite a bit: 5.2 times faster. This is on Xbox One.
All these captures, all the captures
you will see throughout here, unless otherwise stated,
are on Xbox One. But how does this compare
with Xbox One X? We still get the same
kind of improvements, but it is already a little bit
quicker on Xbox One X anyway, because the CPUs
are slightly faster. Then how does this scale? As I said,
we have got 100 components. What happens if we had 500
components, or 1,000 components? As you can see here,
we are still scaling linearly with our aggregated
number of components, but far better than we scale,
you see, we kind of jump off a cliff
at 500. Clearly,
this is having a good effect. But we are still
scaling linearly. What happens if we had
10,000 components? Maybe that is unreasonable,
maybe it is not. And in emergent behaviors
in our games, it is really hard to quantify
what our worst case could be, so scaling gracefully,
let us consider the different workloads between our
different platforms we have got. Xbox One,
we have got eight cores. We have a fixed platform, we know
what we can do on that one. High-end PC,
we may have loads of cores, and probably
very fast cores as well. We probably do not worry
too much about that. Min spec PC,
we might only have two cores, and they might be slow cores. On server,
we only have one core. But what do they have to do?
The Xbox One and the PCs, they are displaying
somebody's view into the world, whereas the server is the
authority over the whole view. They have got different
types of workloads they have got to do. Maybe we could do
some round robin scheduling. We can decide, let us say we do
a fixed number of ticks each frame,
where the number is capped, and we know that
that is a manageable amount that we can do.
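A round robin tick is only a few lines. This sketch, with hypothetical names, caps the work per frame and resumes where the last frame left off:

```cpp
// Sketch: tick at most MaxTicksPerFrame entries per frame, wrapping around
// the list so everything gets serviced eventually.
void FAggregatedTickCollection::TickRoundRobin(float DeltaTime, int32 MaxTicksPerFrame)
{
    const int32 Num = Components.Num();
    const int32 TicksThisFrame = FMath::Min(MaxTicksPerFrame, Num);
    for (int32 i = 0; i < TicksThisFrame; ++i)
    {
        UpdateComponent(Components[NextIndex], DeltaTime); // the per-entry work
        NextIndex = (NextIndex + 1) % Num;                 // resume here next frame
    }
}
```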
This is great. We have now got fixed costs. We do not scale linearly.
We hit the max very quickly. But we get quite a bit
of increased latency when we do this. It takes a lot longer
for something to see the tick. You probably have
a higher time delta. Your animation might look
a bit jittery at this point. Not really great for quality.
Hopefully you can see this. I have got
the alphabet here. When one of these boxes flashes,
that means it has ticked. I have got two
different types of round robin ticking happening here. The one on the left side
is ticking four a frame. The one on the right side
is ticking ten a frame. You can see on the four, it takes a long time for that
to get back round again. It is quite a lot of latency. That looks like this
in terms of ticks, but when you talk about it
in latency terms, we are looking at, like, a fifth
of a second to tick everything. It is quite a lot of latency. You are going to notice
that jitter in your animation. If you are ticking your
particle systems infrequently, you are going to notice that, you are going to lose
some smoothness there. This is where we do
our context-sensitive prioritization. We have a bit more context
with all our stuff being aggregated together,
and we do this per group, so you can make sure
that different types of groups of things
are prioritized correctly. We require a reference point
to prioritize from, so not very useful
for our server. Definitely useful
for our clients, where we really want
to focus the quality, and that is where
it is needed the most. For example, we sort our ticks by priority, in this case distance to the player, tick only the closest ten, and then we can scale
the priority for things that we have not ticked as they shuffle up
through our list, and that might look
something like this. Well, it does look something
like this. The scaling is actually
the important factor here. We have a table
that we look into, which looks a little bit
like this, where the X axis is
how many frames have passed since we last ticked, and the Y axis is how much are
we going to scale our priority. Because we are scaling
by distance, the smaller the value,
the higher your priority.
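As a sketch, the whole scheme is a sort plus a capped update loop; the entry struct and the table are hypothetical stand-ins for ours:

```cpp
// Sketch of context-sensitive prioritization: priority is distance to the
// player, scaled by a table lookup on how many frames an entry has been
// starved. Smaller value = higher priority.
void FAggregatedTickCollection::TickPrioritized(float DeltaTime, int32 MaxTicksPerFrame,
                                                const FVector& PlayerLocation)
{
    auto Priority = [&](const FTickEntry& Entry)
    {
        const float Distance =
            FVector::Dist(Entry.Component->GetComponentLocation(), PlayerLocation);
        // The table maps frames-skipped to a multiplier; values below one
        // shuffle starved entries up the list.
        return Distance * PriorityScaleTable.Lookup(Entry.FramesSinceTicked);
    };

    Entries.Sort([&](const FTickEntry& A, const FTickEntry& B)
    {
        return Priority(A) < Priority(B);
    });

    for (int32 i = 0; i < Entries.Num(); ++i)
    {
        if (i < MaxTicksPerFrame)
        {
            UpdateComponent(Entries[i].Component, DeltaTime);
            Entries[i].FramesSinceTicked = 0;
        }
        else
        {
            ++Entries[i].FramesSinceTicked; // starved: higher priority next frame
        }
    }
}
```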
What this does to our latency comparisons: here is our round robin
for reference. But then we start
looking at latency, remembering we have sorted so that A is the closest and Z is the furthest away. And you can see that we have got the same quality as not doing round robin
at all for the first few, and then the quality
gradually scales down as it goes
further away from you. There is a different view
of it with the latency. The latency does look really bad
for when you are only doing
four a frame, but that is when stuff
is further away from you. If you do not really care
about that kind of stuff, then you can keep
your quality up close. We can apply different types
of priority scaling to this, to change
how we prioritize them. Here is a normal one. But then we can have
a more aggressive one that prioritizes things
further back, or evenly distributes the time
we have got to tick things. Or we could have a more gentle
scaling one, say your stuff is not very sensitive
to being latent, but you really want to have
the quality up close. For comparison,
you can see the latency here, the gentle table throwing a lot
of latency out for far away, but up close, you have got quite
a good quality bar there. This is with
ticking ten a frame. If we only tick
four a frame, for comparison, we have the round robin
in the graph as well, you have still got
a lot of quality up to the closeness
of the things that you are ticking, far away,
obviously we lose some there. For this presentation, I used
the algorithm-visualizer.org, which is fantastic. I highly recommend
checking it out. You can write your code
in the browser, and then visualize it
in the browser. It is where I captured
these videos from. It is really cool. I have talked about
an almost real-world example; now I am going to cover a few real-world examples. We have aggregated
our sail updates. This was one of the first ones
we did in our game. It is because
we have a lot of them. We have got seven sails
per galleon, we have got quite
a few galleons on the seas. It was a good test case for us, the first one we did. They have a lot
of responsibilities. They calculate the billowing
based on the wind direction and the wind speed. They update the animations
based on that billowing. They set some
dynamic materials, so lots of different things
to test out for us. Now, the cost of this
before we aggregated them, bit of gratuitous
animation there. These are all the tick functions
that the sails had, I think this was 42 sails. That includes
your Skeletal Mesh animations, the Actor ticks, and some
other components inside there. This came to a total
of 1.539 milliseconds. Which we thought was actually
unreasonable for our sails. We could do a lot more
other stuff in that time. When we looked at this,
showing here an Unreal stack capture
in the Unreal Profiler, I like using this because it gives you
a nice average over a capture, as well as showing your worst
and best frames, and then you can
inspect as well. From the sails'
average case, we are talking half
a millisecond at this point, over two
and a half times improvement. This heavily used
the prioritization methods we were doing, which looks
a little bit like this. We have 42 sails in the scene, but I am not going to ask you
to count them, and we do not have 42 animation
updates here. With the way
that we aggregated them, it also allowed us to do
some work on different cores. We have a case here where some work
is happening on one thread, and then some dependent work can immediately kick off
in another thread after we know we have finished
a few of the updates. Same with the animation here.
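In UE4 terms, that can be expressed with the task graph. A sketch, where the two update functions are illustrative names, not our actual code:

```cpp
// Sketch: kick the first batch of work onto a worker thread, then let the
// dependent work start as soon as it completes, rather than at the end of
// the whole group. UpdateBillowing/UpdateSailAnimations are illustrative.
FGraphEventRef BillowingDone = FFunctionGraphTask::CreateAndDispatchWhenReady(
    [this]() { UpdateBillowing(); },
    TStatId(), nullptr, ENamedThreads::AnyThread);

FFunctionGraphTask::CreateAndDispatchWhenReady(
    [this]() { UpdateSailAnimations(); }, // consumes the billowing results
    TStatId(), BillowingDone, ENamedThreads::AnyThread);
```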
Another system that we looked at was the particle systems. We have quite a particle
heavy game, as you can see, the 20 milliseconds we were
spending on our particles is quite excessive.
But simply aggregating them, not changing any logic whatsoever, gave us
that kind of improvement. It also improved us at
the start of our frame as well. We went from 786 ticks
down to one. We got a 25% saving at the start
of our frame, which looks
a little bit like that. Some other systems
that we aggregated. Now, not everything
is worth aggregating. Say you have only got something of which there is only one.
A lot of the audio stuff that we have in the game
is local to your client, and we only need
one of those systems, so we do not bother
aggregating that. In some cases, we found with
things that have dependencies, you might want
to pull some stuff out of an aggregate tick,
put it on its own, so it can have
its own dependency. Some time to reflect. What kind of flaws
does this system have? The tick registration
is a very manual process. Either you forget to do it,
or you do it incorrectly, and you can cause problems.
So, that is not great. By problems, we have
occasionally had cases where, while an Actor is ticking inside
an aggregate tick, it then calls Destroy on itself,
deleting itself, and then we continue ticking, but it has removed itself
from the array, resized the array,
and then we crash. There are ways
you can get around that, but if you do not know
that can happen, that is something you have
to be careful of.
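One way around it, as a sketch with hypothetical names: defer removals until the aggregated tick has finished iterating, instead of resizing the array mid-loop:

```cpp
// Sketch of deferred unregistration, so an Actor destroying itself during
// the aggregated tick cannot resize the array out from under the loop.
void FAggregatedTickCollection::Tick(float DeltaTime)
{
    bIsTicking = true;
    for (UActorComponent* Comp : Components)
    {
        if (!PendingRemovals.Contains(Comp)) // skip anything removed mid-loop
        {
            UpdateComponent(Comp, DeltaTime);
        }
    }
    bIsTicking = false;

    // Only now is it safe to resize the array.
    for (UActorComponent* Comp : PendingRemovals)
    {
        Components.Remove(Comp);
    }
    PendingRemovals.Reset();
}

void FAggregatedTickCollection::Unregister(UActorComponent* Comp)
{
    if (bIsTicking) { PendingRemovals.Add(Comp); }
    else            { Components.Remove(Comp); }
}
```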
We also found that, because it is an opt-in, very manual process, it does not really force you to
think differently about your data or your systems
that you are working on. We end up still writing
systems as individuals rather than thinking
of them as a batch. I hinted at this before: you cannot have explicit
dependencies between individual things
within an aggregated tick. But if you want that,
if you need it, you can take it out of the tick. Everything else in that aggregate still gets the benefit, but you can take that one out
and have it on its own. Some future work. Maybe we
just address all these flaws? But realistically,
aggregating by default is probably going to give us
the biggest bang for buck. Also, I talked a lot
about instruction cache, and not a sausage
about data cache. Data cache is probably a really good one for us to go into. But maybe you could help us,
or one of you could help us, two of you could help us,
because Rare is hiring. Rare is a wonderful
place to work. I really, I have been
working there for six years. I love the place. It is a fantastic place out
in the English countryside, really beautiful landscapes.
And lots of dogs. If you like dogs,
you can bring your dog to work. I have to show this guy.
Doing a thing with some friends where we have
to take this guy around and show him a good time. We have taken him around
Prague, me and my fiancé. I wanted to show him here
for you all. Thank you very much. [Applause] ♫ Unreal logo music ♫