Aggregating Ticks to Manage Scale in Sea of Thieves | Unreal Fest Europe 2019 | Unreal Engine

Captions
>>Jon Holmes: I'll start off with a little bit about me. I have been in the games industry for about ten years. The majority of that has been at Rare. I am an audio programming specialist, which might seem a bit weird for a performance talk, but audio being one of those real-time systems, you have to be on the game for performance. I am a member of the Engine Team at Rare. Now, the Engine Team is full of loads of specialists. We have got networking, we have got physics, we have got all the core specialties in our Engine Team. But we muck in everywhere. We are currently working on Sea of Thieves, which is an ever-evolving title. We have been going for a little over a year now. We have got our next big update coming soon, and got a little trailer to show you. ♫ intense orchestral music ♫ I believe the press embargo has just lifted for some of the reviews for this, or previews, so do go online and check out what people are saying about it just now. What am I going to cover today? I am going to cover how we kept a consistent framerate in Sea of Thieves. It is a challenging environment to do so. I am going to do that with lots of profiling data. There is going to be lots of profiling data in this presentation, through different methods. Then I am going to cover engineering techniques that we used to gracefully scale our game systems. We have got a lot of stuff going on in Sea of Thieves, and we want to make sure that the quality is delivered. I am going to go into plenty of technical detail, so I will give some primers on some of this stuff. Hopefully it is all spelled out. What kind of problems are we dealing with here? We are a client/server architecture. We are a multiplayer game. Got lots going on. Got many dynamic elements, and we have the ethos of tools, not rules. We try and give loads of stuff to the players and see how that emerges in different behaviors. Very unpredictable, some of this stuff. We have a large variance in our scene complexity. 
You can have a case where a single player is in a cave looking at some cave paintings, compared with four galleons laden with treasure, all battling it out. Maybe somebody has fallen overboard, there is a bunch of sharks going after them, large variance in scene complexity. We are multi-platform as well. We are talking about PC, min and low-spec PCs, we are talking about Xbox One, and we are talking about our server architecture as well, which is a single core in the cloud. We began development of Sea of Thieves on Unreal Engine 4.6, which feels like an age ago, and we actually shipped on 4.10. We were missing a lot of the optimizations and a lot of the improvements that came in, especially now we are talking 4.22. I am going to talk about our ticking, and how we worked with ticks. I am going to give a quick primer on what is a tick, in case some people at an Unreal developer conference do not know what a tick is. We are essentially talking about a virtual function call, either in native code or in Blueprint code. This is housed inside a Tick Function structure, which controls what group you are ticking in, if you want to tick asynchronously or not, the frequency of your tick, and any dependencies that your tick might have. We also have a stage at the start of the frame where the engine decides what ticks are going to happen this frame. Part of the reason that we investigated this early on was that we were spending more than five milliseconds at the start of our frame, we were trying to hit 30 frames per second on Xbox One, more than five milliseconds spent just deciding what to tick. That is over 15% of our frame budget. There is a little PIX capture there. We use PIX quite a lot. We use some of Unreal's tools as well. We use PIX quite a lot. But when we are testing performance, we approached it in the same way that we cover a lot of our other testing. Specifically for testing the worst case is what matters. 
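To make the tick primer above concrete, here is a minimal sketch in plain C++ (not the actual Unreal types) of the idea behind a tick function with a configurable frequency, and the per-frame stage where the engine decides which ticks actually run. The names `TickFunction` and `TickScheduler` are illustrative stand-ins, not Unreal's.

```cpp
#include <functional>
#include <vector>

// Illustrative stand-in for Unreal's FTickFunction: a callback plus
// scheduling settings (here just an interval; Unreal's version also
// carries the tick group, async flags, and dependencies).
struct TickFunction {
    std::function<void(float)> Callback;
    float Interval = 0.0f;    // 0 = tick every frame
    float Accumulated = 0.0f; // time since this tick last ran
};

class TickScheduler {
public:
    void Register(TickFunction Fn) { Ticks.push_back(std::move(Fn)); }

    // The per-frame "decide what ticks happen this frame" stage:
    // only ticks whose interval has elapsed actually run.
    void RunFrame(float DeltaTime) {
        for (TickFunction& Fn : Ticks) {
            Fn.Accumulated += DeltaTime;
            if (Fn.Accumulated >= Fn.Interval) {
                Fn.Callback(Fn.Accumulated); // pass the real time delta
                Fn.Accumulated = 0.0f;
            }
        }
    }
private:
    std::vector<TickFunction> Ticks;
};
```

Note that this selection stage itself costs time per registered tick, which is exactly why the five-milliseconds-before-anything-ticks problem described above scales with the number of tick functions.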
It does not matter if your average is really good if, when you hit the worst, your game basically falls apart. We used a lot of automated testing to track, when someone makes a change, how that reflects not just in behavior, but in performance as well. We have, you just cannot really see it on the slide very well, but this is one of our test maps, where it is simulating that four galleon, almost worst case. Our worst case is really bad. This is almost worst case. Galleons firing at each other. Through our automated systems, we will get this report from our build system that we can go and look at and trace, as people make changes, how that affects the performance ongoing. I highly recommend you check out Jessica Baker's talk later today at 4:15 on our automated testing process, pipeline, and culture; it is really good. How do we analyze this data? Like I say, we have a TeamCity report. TeamCity is our online build system. We get a report out of that, and like I said before, we also use PIX for the CPU and GPU captures on Xbox One. There will be a lot of PIX captures in this presentation. We also use WPA, occasionally, to track larger-scale playtests. We also use Unreal stat files. These are really good. You get the same kind of data out of the stat file as we do in PIX, just aggregated over a lot of time. This is particularly useful for our server performance captures. Occasionally, we use the in-game stat visualizations, but these typically are more for memory profiling rather than CPU profiling. But in this presentation, I am only really going to talk about these two here. For those of you who have not seen PIX, PIX is freely available for you to look at on PC. But here is a frame capture from Sea of Thieves in our performance test map that I just showed you. I am just going to talk you through a little bit about what this means. Probably very similar data to what the new Unreal Insights stuff looks like. 
That looks really cool. I am quite excited about that. This is what we have been using for a long time. If I start annotating this, we have got our Begin Frame here, and our End Frame there. We have the game thread across the top on core 0. We then have the rendering thread here and the RHI thread here. Not really going to talk about any of the rendering stuff. I just wanted to put it in there for completeness. It was a big body of stats in this capture. We now have, here is the pre-physics tick group. We have the during physics tick group. You can see on the other cores, there is the physics running asynchronously. Then we have our end physics tick groups, the post-physics, and the post update work. We do not really do much in the post update work. But what is also interesting to call out is in the post physics, here is some animation that we are doing. When animation runs, you can choose to run it asynchronously. But anything you do in a tick group, like the animation here, will hold up that tick group until the next one can run. This frame was 43 milliseconds. Pretty poor, really. Not quite our 33 milliseconds for our 30 frames per second budget, but Sea of Thieves is an ever-evolving game. We are always adding more content, so we have to keep working at making sure our performance holds up. But we were a lot worse. Back in 2016, we were looking at 113 millisecond frame times, totally unplayable on Xbox. We were game thread bound, which might surprise some people. We had a lot of stuff going on. There was not really one big thing to tackle. We could not look at one thing and then go, boom, we fixed it, fixed our performance. That is partly because the way that we developed our systems is they were designed in isolation. A sail is a sail, this is a wheel, this is a cannon, and as they grew, they just got more and more costly. We could not apply sweeping optimizations to these things. 
We could do micro-optimizations, but ultimately it was very difficult to control our worst case frames. In 2016, we had written a lot of our core mechanics, our core systems for Sea of Thieves. Going back to the drawing board was not really an option. This is data I got out of a sampling capture during a playtest. As you can see, hopefully you can see, the majority of time, nearly 50% of our whole CPU time on core 0, the game core, was spent on ticking things. Some of the other time was spent in physics, and in our networking, but we soon discovered that that is because our frame times were so bad that the physics was having to work harder, and the networking was receiving more data from the server in a single frame, so it had a lot more to process. Which meant, really, we should be focusing on looking at our ticks. What do we do with our ticks? Before I go into ticking, I want to give a little primer on CPU caches. Specifically, this is on the Xbox One. The Xbox One has two CPU modules, each with four cores. I am just going to show one of the modules here. We have got our four cores here. We have our main memory here. But you cannot really use main memory directly from your CPU. Whatever you are doing on your CPU, that either has to be in your instruction cache or your data cache. These are both 32 kilobytes in size. These are your L1 caches. But you cannot go from main memory directly into your L1 cache either. You have to go through your L2 cache, which is shared between four cores. Each module has its own L2 cache. You go in from main memory into your L2 cache, and then either into your instruction cache or your data cache. If you then mutate stuff in your data cache, it needs to write it back out to your L2, and then back out into your main memory. This is a problem for us, because every core is doing this, and they all share the L2 cache. You are talking about quite high amounts of latency between accessing these sections of memory. 
It is a hierarchy, so if something is in L1, it must be in L2 as well. To give a bit more context about this, consider these two bits of assembly, where we are moving some memory. We are moving an integer from an offset off RCX into a register, and then we are adding that integer to another integer. If we consider where that could have been in memory, I am going to borrow a diagram from Mike Acton, from his Data-Oriented Design and C++ talk. I do not mean to plagiarize, but it is a really good talk, and the diagram is excellent at showcasing what this looks like. If the data is in L1, it is really quick. But if the data is in L2, it is a little bit slower. Then if the data is in RAM, think about what the CPU could be doing at this point. If your CPU does out-of-order execution, it might be able to do some stuff, but you cannot guarantee there will be enough instructions in its pipe to re-order some stuff. If it is in order, you are guaranteed to be waiting that long before you can do your next instruction. Now, that is important with data. We are doing a data access here. But what happens if these are not in your cache? These are the addresses of the instructions you are running on. They could be in the cache, and if you are linearly zipping through your instructions, they probably will be. Most modern x64 CPUs can pre-fetch. They can predict where you are going, and will pre-fetch for you. But if you branch, if you do a jump, or if you call a virtual function, it cannot know where that memory is. You will incur one of these latencies. And code is small, right? Not really. Some of this code is pretty big. I did a little bit of working out, hopefully you can read some of that, of how big some of these functions are. I have got some of our own functions from Sea of Thieves, and some functions from Unreal, and this is as a percentage of your L1 cache. 
Doing a tick, just the exclusive size of the tick function for UCharacterMovementComponent is over 12% of your L1 cache. That is not the callers and the callees, that is just the function. I did start looking through the hierarchy, the call hierarchy, but I stopped at around about 30, nearly 35% of your L1 cache. It is quite a lot of code that you go through. We have obviously added a lot on top with our game code on top of Unreal Engine's code. What happens when this hits scale? Here is what our individual ticks look like. We have got our timeline. Let us say we tick a sail. We have never ticked a sail before. We bring in the instructions. Cool, that is fine. We expect we have to do that. We have never ticked a sail before. Then we tick a compass. Just figuring out what direction north is. That is fine, we have never done one before. We will bring that into the instruction cache. Ooh, another sail. That is cool. That means that is already in the cache, hot in the cache. We can use that code again. Cannon, we have never seen one of those before, so we have got to bring that into the instruction cache. Oh, we have got to tick a barrel. But our instruction cache is full. That means we will get rid of the compass, because we have not ticked that for a while. We probably will not tick that again soon, that is fine. Oh, another barrel. That is cool, it is in the instruction cache. Whoops, we have hit a compass. We have just evicted that, so we need to bring that back in. We have not ticked the sails for a while. Let us get rid of those. Now we can tick the compasses. Cannon. Cool, that is already in the cache. You can see where this is going. Oh, sail, yeah, we have got to evict something else now. Then another barrel. This is what our frames were like, basically. We had so many ticks, and they were all interleaving with each other. What we wanted to do was aggregate them together like this. 
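The eviction story above can be modeled with a toy LRU cache, counting misses for an interleaved tick order versus a grouped one. This is only an illustration of the principle, not how a real instruction cache works; the capacity and tick types are made up.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Toy LRU "instruction cache" that holds the code for at most
// Capacity distinct tick types, and counts how many accesses miss.
int CountMisses(const std::vector<std::string>& TickOrder, std::size_t Capacity) {
    std::list<std::string> Cache; // front = most recently used
    int Misses = 0;
    for (const std::string& Type : TickOrder) {
        auto It = std::find(Cache.begin(), Cache.end(), Type);
        if (It == Cache.end()) {
            ++Misses;                 // code not resident: fetch it
            if (Cache.size() == Capacity)
                Cache.pop_back();     // evict the least recently used type
        } else {
            Cache.erase(It);          // already resident: just refresh it
        }
        Cache.push_front(Type);
    }
    return Misses;
}
```

With a capacity of two types, the interleaved order sail/compass/cannon repeated misses on every single access, while the grouped order sail/sail/sail, compass/compass/compass, cannon/cannon/cannon misses only once per type. That is the whole argument for aggregation in miniature.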
Now, this particular problem is explained in Scott Meyers' excellent CPU caches and where you can find them talk. He talks about this as definitely not a theoretical problem; his example was quite theoretical, but we actually hit this, and hit it really hard. How did we aggregate these ticks? Actors and components have their ticks disabled, seems straightforward. Then we register them with a collection. I will show you some code for this in a bit. But the collection is essentially what houses a bunch of Actors or components, or anything, really, within it, and it is keyed off a UClass type. Unreal's reflection was really useful here. It is very powerful and very fast to look up, as well, so really good for keying our collections off. That collection then has a single tick function inside it, giving us all the benefits that Unreal's ticking system has, with running asynchronously, choosing a frequency, dependencies, all that kind of stuff. As I said, this gives us better instruction cache coherency. But with everything grouped together, it means we can also reduce our unnecessary work. I will show some more of that soon. With everything being in a tight loop, if you are updating in a loop at this point, we can do some single-instruction multiple data optimizations. The compiler might even be smart enough to help us out here as well. I will talk about this later, context-sensitive prioritization, sensitive to context. It is a method that we have used quite heavily in our game. Got a little quote for you from one of our principal engineers at Rare. "One could argue the thing that basically saves this game is the fact we are doing a lot of the same thing all the time." I am going to go through an, I say, almost real world example. I have had to edit it a little bit for the slides, but it is basically a problem that we have had to solve in the game. As you can imagine, Sea of Thieves, a lot of water. 
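A minimal sketch of the collection idea described above, in plain C++: Sea of Thieves keys its collections off UClass via Unreal's reflection, but `std::type_index` plays the same role here, and the names `ITickable` and `TickAggregator` are invented for the sketch.

```cpp
#include <typeindex>
#include <typeinfo>
#include <unordered_map>
#include <vector>

// Anything that can be batch-ticked.
struct ITickable {
    virtual ~ITickable() = default;
    virtual void Tick(float DeltaTime) = 0;
};

// Illustrative collection: one entry per concrete type, each holding
// every registered instance of that type. The real system keys off
// UClass through Unreal reflection instead of std::type_index.
class TickAggregator {
public:
    template <typename T>
    void Register(T* Obj) {
        Groups[std::type_index(typeid(T))].push_back(Obj);
    }

    // One tick function per group: all instances of a type run
    // back-to-back, keeping their code hot in the instruction cache.
    void TickAll(float DeltaTime) {
        for (auto& [Type, Members] : Groups)
            for (ITickable* Obj : Members)
                Obj->Tick(DeltaTime);
    }
private:
    std::unordered_map<std::type_index, std::vector<ITickable*>> Groups;
};
```

In the real thing, `TickAll` would itself live inside a single Unreal tick function per group, so the group still gets tick groups, frequency, async flags, and dependencies for free.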
We have components for caching the height of the water at that component's location, which then means that systems do not have to keep looking it up at particular places. It avoids multiple queries, and the queries are not exactly cheap. This looks like this. We have got a global accessor here, where we are getting some kind of interface, which allows us to query the height at a particular location. Then we have our call to get the height and that is it. We cache the height afterwards, not very much going on. Not really very much to actually optimize either. Maybe we do not need to optimize it. Maybe we do. But what does this look like in our frame? We tested this out with 100 components in our four-ship scene. Here is where the ticks land in the frame. We are quite lucky here. Some of them are quite tightly grouped together, but they are not all in a straight line, so there have been some interruptions going on. This took 1.16 milliseconds. You might think that is already a lot for 100 components, but we have tweets of people with 100 treasure chests on their ship. As soon as that sinks, that is 100 items on top of all the other items in the water. It is not inconceivable to think of maybe 100, 200, maybe 1,000. Consider that is just water height, you could do a lot in 1.16 milliseconds. What does it look like when it is aggregated? We have got our aggregation, this is the function that gets called in place of all our individual ticks, where we have our array of components we have collected, and we are calling the tick functions manually. Not really anything special. The way that we aggregate them, typically we do this in BeginPlay and EndPlay, but you can do this whenever you want. We unregister the tick function if we are going to aggregate these things, and then we register it with our collection. 
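The per-component version described above might look roughly like this. This is a plain C++ sketch: `IWaterService`, `GetWaterService`, and `GetWaterHeight` are illustrative stand-ins for the game's actual interfaces, and the flat-water implementation exists only to make the sketch self-contained.

```cpp
// Illustrative stand-in for the global water service the slide queries.
struct IWaterService {
    virtual ~IWaterService() = default;
    virtual float GetWaterHeight(float X, float Y) const = 0;
};

// Toy implementation so the sketch runs: water is flat at height 2.
struct FlatWater final : IWaterService {
    float GetWaterHeight(float, float) const override { return 2.0f; }
};

// Global accessor, as on the slide.
IWaterService* GetWaterService() {
    static FlatWater Instance;
    return &Instance;
}

struct WaterHeightComponent {
    float X = 0.0f, Y = 0.0f;
    float CachedHeight = 0.0f;

    // The whole per-frame job: fetch the service, run one query,
    // cache the result so other systems can read it cheaply.
    void Tick(float /*DeltaTime*/) {
        IWaterService* Service = GetWaterService();
        CachedHeight = Service->GetWaterHeight(X, Y);
    }
};
```

There is not much to micro-optimize in a tick this small; the cost only shows up when a few hundred of them run interleaved with everything else, which is exactly the point of the 1.16 millisecond capture.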
We do a bit of lazy evaluation here: with the first thing to be collected, you end up creating the collection with the tick inside it, which is what that lambda is for. Then on EndPlay, we unregister the tick and unregister the component. What does that do? It is a reasonably simple change. We have not changed any logic yet for our tick. What does this look like? Same hundred components. It is now tightly packed inside our frame. But what has that done to the time? We already have a 1.3 times improvement. Massive, yeah. But we could do better than that. I talked about identifying unnecessary work. This is a zoom-in of that aggregated tick. We have got our call to get the global service, and we have also got our query. Then we have got it again, and then again, and again. You will notice that the very first one in that tick, in that group, is huge compared to all the subsequent ones. That is probably the benefit of the instruction cache. The first one is long because we have had to wait for the instructions to come in. Then we are hot in the cache, so everything is much faster. But how can we now logically change this to improve the speed? So, we have got our tick, and that is the component tick, internally. Let us just get rid of all that. Now we have a more optimized tick. We have got rid of the component tick altogether, and we have got a single tick that, instead of looping through the components calling tick, we take out some of the work. We have got our global state that we are accessing. We only do that once now per group of these things. Then we do a batched query rather than calling get water height individually. We break out all the locations of the components into an array, because that is all we need to figure out the water height, and what plane of water we are interested in. Then when we are finished, we get that data out and assign it back to our components. 
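A sketch of that batched version, again in plain C++ with invented names (`BatchWaterService`, `GetWaterHeights`, `TickWaterComponents` are all illustrative): fetch the global state once, gather every component's location into one query, run it, then scatter the results back.

```cpp
#include <cstddef>
#include <vector>

// Illustrative batched water service: answering N queries in one call
// is what lets the implementation vectorize (e.g. four heights at once).
struct BatchWaterService {
    void GetWaterHeights(const std::vector<float>& Xs,
                         const std::vector<float>& Ys,
                         std::vector<float>& OutHeights) const {
        OutHeights.resize(Xs.size());
        for (std::size_t i = 0; i < Xs.size(); ++i)
            OutHeights[i] = 2.0f; // toy: flat water everywhere
    }
};

struct WaterComponent { float X = 0, Y = 0, CachedHeight = 0; };

// The aggregated tick: one service fetch per group, one query built
// from every component's location, results assigned back at the end.
void TickWaterComponents(BatchWaterService& Service,
                         std::vector<WaterComponent*>& Components) {
    std::vector<float> Xs, Ys, Heights;
    Xs.reserve(Components.size());
    Ys.reserve(Components.size());
    for (const WaterComponent* C : Components) {
        Xs.push_back(C->X);
        Ys.push_back(C->Y);
    }
    Service.GetWaterHeights(Xs, Ys, Heights);
    for (std::size_t i = 0; i < Components.size(); ++i)
        Components[i]->CachedHeight = Heights[i];
}
```

Building the flat arrays looks like extra work, but it is cheap, and it is what makes the wide query possible; with one tick per component there is never a moment where all the locations exist side by side.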
The important part of this bit here is this call to building the query. This is a bit more involved, and actually, you might think this would be slower, because we are doing a whole bunch more work. But the work is not too much. We are just going off and getting the data we need for this query, and putting it into some form that is easily usable by our service. This actually means that we can do four height queries at once rather than one. And if we had individual ticks, we would never have been able to do that. Like I said, the compiler can auto-vectorize some of this work. It does not in this case, but if you can help it, it can auto-vectorize. For example, this section of code here takes two arrays of floats and adds the elements together individually. Your compiler will happily take this and decide it could do four, maybe eight, at a time. It can happily optimize this for you. You do not have to worry about writing hand-vectorized code in this case. What does that do to our time? It is really small now in the frame, quite difficult to see, actually. This has actually got us down quite a bit. 5.2 times faster. This is on Xbox One. All the captures you will see throughout here, unless otherwise stated, are on Xbox One. But how does this compare with Xbox One X? We still get the same kind of improvements, but it is already a little bit quicker on Xbox One X anyway, because the CPUs are slightly faster. Then how does this scale? As I said, we have got 100 components. What happens if we had 500 components, or 1,000 components? As you can see here, we are still scaling linearly with our aggregated number of components, but far better than before; you can see the unaggregated version kind of jumps off a cliff at 500. Clearly, this is having a good effect. But we are still scaling linearly. What happens if we had 10,000 components? Maybe that is unreasonable, maybe it is not. 
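The elementwise add mentioned above is the textbook auto-vectorization case: a simple counted loop over contiguous floats. This is a sketch of the shape of that code, not the game's actual function.

```cpp
#include <cstddef>

// Plain counted loop over contiguous floats: at -O2/-O3 most compilers
// can auto-vectorize this into 4- or 8-wide SIMD adds, with no
// hand-written intrinsics required.
void AddArrays(const float* A, const float* B, float* Out, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        Out[i] = A[i] + B[i];
}
```

The precondition is the one the aggregation work created: the inputs have to actually sit in flat arrays, which individually ticking components never provides.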
And with emergent behaviors in our game, it is really hard to quantify what our worst case could be, so let us talk about scaling gracefully. Consider the different workloads between the different platforms we have got. Xbox One, we have got eight cores. We have a fixed platform, we know what we can do on that one. High-end PC, we may have loads of cores, and probably very fast cores as well. We probably do not worry too much about that. Min spec PC, we might only have two cores, and they might be slow cores. On server, we only have one core. But what do they have to do? The Xbox One and the PCs, they are displaying somebody's view into the world, whereas the server is the authority over the whole world. They have got different types of workloads they have got to do. Maybe we could do some round robin scheduling. We can decide, let us say we do a fixed number of ticks each frame, where the number is capped, and we know that that is a manageable amount that we can do. This is great. We have now got fixed costs. We do not scale linearly. We hit the max very quickly. But we get quite a bit of increased latency when we do this. It takes a lot longer for something to see a tick. You probably have a higher time delta. Your animation might look a bit jittery at this point. Not really great for quality. Hopefully you can see this. I have got the alphabet here. When one of these boxes flashes, that means it has ticked. I have got two different types of round robin ticking happening here. The one on the left side is ticking four a frame. The one on the right side is ticking ten a frame. You can see on the four, it takes a long time for that to get back round again. It is quite a lot of latency. That looks like this in terms of ticks, but when you talk about it in latency terms, we are looking at, like, a fifth of a second to tick everything. It is quite a lot of latency. You are going to notice that jitter in your animation. 
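The round robin idea above is simple enough to sketch directly: cap the number of ticks per frame and wrap a cursor so everything eventually gets a turn. The class name and shape are illustrative, not the game's code.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Illustrative round-robin scheduler: at most MaxPerFrame entries tick
// each frame, and a cursor wraps so every entry eventually runs.
class RoundRobinTicker {
public:
    explicit RoundRobinTicker(std::size_t MaxPerFrame)
        : MaxPerFrame(MaxPerFrame) {}

    void Register(std::function<void(float)> Fn) {
        Entries.push_back(std::move(Fn));
    }

    void RunFrame(float DeltaTime) {
        if (Entries.empty()) return;
        std::size_t Budget = std::min(MaxPerFrame, Entries.size());
        for (std::size_t i = 0; i < Budget; ++i) {
            Entries[Cursor](DeltaTime);
            Cursor = (Cursor + 1) % Entries.size(); // wrap to the start
        }
    }
private:
    std::size_t MaxPerFrame;
    std::size_t Cursor = 0;
    std::vector<std::function<void(float)>> Entries;
};
```

This gives the fixed per-frame cost the talk describes, and also its downside: with 26 entries at four per frame, an entry waits around six frames between ticks, which at 30 frames per second is the fifth-of-a-second latency mentioned above.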
If you are ticking your particle systems infrequently, you are going to notice that, you are going to lose some smoothness there. This is where we do our context-sensitive prioritization. We have a bit more context with all our stuff being aggregated together, and we do this per group, so you can make sure that different types of groups of things are prioritized correctly. We require a reference point to prioritize from, so it is not very useful for our server. Definitely useful for our clients, where we really want to focus the quality, and that is where it is needed the most. For example, sort our ticks by priority, in this case distance to the player, and tick only the closest ten; then we can scale the priority for things that we have not ticked as they shuffle up through our list, and that might look something like this. Well, it does look something like this. The scaling is actually the important factor here. We have a table that we look into, which looks a little bit like this, where the X axis is how many frames have passed since we last ticked, and the Y axis is how much we are going to scale our priority. Because we are scaling by distance, the smaller the value, the higher your priority. What this does to our latency comparisons, here is our round robin for reference. But then we start looking at latency, remembering we have sorted A is the closest, Z is the furthest away. And you can see that we have got the same quality as not doing round robin at all for the first few, and then the quality gradually scales down as it goes further away from you. There is a different view of it with the latency. The latency does look really bad for when you are only doing four a frame, but that is when stuff is further away from you. If you do not really care about that kind of stuff, then you can keep your quality up close. We can apply different types of priority scaling to this, to change how we prioritize them. 
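The prioritization scheme above can be sketched like this. All the numbers here are made up for illustration: the scale table, distances, and budget are not Sea of Thieves' values, just the shape of the technique, where the effective priority is distance scaled by a lookup keyed on frames since last tick, so starved entries gradually shuffle up the list.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Entry {
    float Distance = 0.0f;   // distance to the player (the reference point)
    int FramesSinceTick = 0;
    int TickCount = 0;
};

// Smaller effective priority = tick sooner. The longer something has
// been skipped, the more we shrink its value (an invented table).
float EffectivePriority(const Entry& E) {
    static const float ScaleTable[] = {1.0f, 0.8f, 0.6f, 0.4f, 0.15f, 0.05f};
    int Index = E.FramesSinceTick < 5 ? E.FramesSinceTick : 5;
    return E.Distance * ScaleTable[Index];
}

void RunFrame(std::vector<Entry>& Entries, std::size_t Budget) {
    std::vector<Entry*> Sorted;
    for (Entry& E : Entries) Sorted.push_back(&E);
    std::sort(Sorted.begin(), Sorted.end(),
              [](const Entry* A, const Entry* B) {
                  return EffectivePriority(*A) < EffectivePriority(*B);
              });
    for (std::size_t i = 0; i < Sorted.size(); ++i) {
        if (i < Budget) {
            ++Sorted[i]->TickCount;        // ticked this frame
            Sorted[i]->FramesSinceTick = 0;
        } else {
            ++Sorted[i]->FramesSinceTick;  // starved: boost for next frame
        }
    }
}
```

Close things keep ticking every frame, exactly as if there were no budget, while a far-away entry's priority shrinks each skipped frame until it wins a slot; changing the table's steepness gives the aggressive and gentle variants the talk compares.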
Here is a normal one. But then we can have a more aggressive one that prioritizes things further back, or evenly distributes the time we have got to tick things. Or we could have a more gentle scaling one, say your stuff is not very sensitive to being latent, but you really want to have the quality up close. For comparison, you can see the latency here, the gentle table throwing a lot of latency out for far away, but up close, you have got quite a good quality bar there. This is with ticking ten a frame. If we only tick four a frame, for comparison, we have the round robin in the graph as well, you have still got a lot of quality up to the closeness of the things that you are ticking, far away, obviously we lose some there. For this presentation, I used algorithm-visualizer.org, which is fantastic. I highly recommend checking it out. You can write your code in the browser, and then visualize it in the browser. It is where I captured these videos from. It is really cool. I have talked about an almost real world example, now I am going to cover a few real world examples. We have aggregated our sail updates. This was one of the first ones we did in our game. It is because we have a lot of them. We have got seven sails per galleon, we have got quite a few galleons on the seas. It was a good test case for us, the first one we did. They have a lot of responsibilities. They calculate the billowing based on the wind direction and the wind speed. They update the animations based on that billowing. They set some dynamic materials, so lots of different things to test out for us. Now, the cost of this before we aggregated them, bit of gratuitous animation there. These are all the tick functions that the sails (I think this was 42 sails) had. That includes your Skeletal Mesh animations, the Actor ticks, and some other components inside there. This came to a total of 1.539 milliseconds, which we thought was actually unreasonable for our sails. 
We could do a lot more other stuff in that time. When we looked at this, showing here an Unreal stat capture in the Unreal Profiler, I like using this because it gives you a nice average over a capture, as well as showing your worst and best frames, and then you can inspect as well. From the sail's average case, we are talking half a millisecond at this point, over two and a half times improvement. This heavily used the prioritization methods we were doing, which looks a little bit like this. We have 42 sails in the scene (I am not going to ask you to count them), but we do not have 42 animation updates here. The way that we aggregated them also allowed us to do some work on different cores. We have a case here where some work is happening on one thread, and then some dependent work can immediately kick off in another thread after we know we have finished a few of the updates. Same with the animation here. Another system that we looked at was the particle systems. We have quite a particle heavy game; as you can see, the 20 milliseconds we were spending on our particles is quite excessive. But simply aggregating them, not changing any logic whatsoever, simply aggregating them gave us that kind of improvement. It also improved the start of our frame. We went from 786 ticks down to one. We got a 25% saving at the start of our frame, which looks a little bit like that. Some other systems that we aggregated. Now, not everything is worth aggregating. Say you have only got something there is only one of. A lot of the audio stuff that we have in the game is local to your client, and we only need one of those systems, so we do not bother aggregating that. In some cases, we found with things that have dependencies, you might want to pull some stuff out of an aggregate tick, put it on its own, so it can have its own dependency. Some time to reflect. What kind of flaws does this system have? 
The tick registration is a very manual process. Either you forget to do it, or you do it incorrectly, and you can cause problems. So, that is not great. By problems, we have occasionally had cases where, while an Actor is ticking inside an aggregate tick, it then calls destroy on itself, deleting itself, and then we continue ticking, but it has removed itself from the array, resized the array, and then we crash. There are ways you can get around that, but if you do not know that can happen, that is something you have to be careful of. We also found that because it is an opt-in, very manual process, it does not really force you to think differently about your data or your systems that you are working on. We end up still writing systems as individuals rather than thinking of them as a batch. I hinted at this before, you cannot have explicit dependencies between individual things within an aggregated tick. But if you want that, if you need it, you can take it out of the tick. You still get the benefit for everything else that is in that aggregate, but you can take it out and have it on its own. Some future work. Maybe we just address all these flaws? But realistically, aggregating by default is probably going to give us the biggest bang for buck. Also, I talked a lot about instruction cache, and not a sausage about data cache. Data cache is probably a really good one for us to go into. But maybe you could help us, or one of you could help us, two of you could help us, because Rare is hiring. Rare is a wonderful place to work. I have been working there for six years, and I love the place. It is a fantastic place out in the English countryside, really beautiful landscapes, and lots of dogs. If you like dogs, you can bring your dog to work. I have to show this guy. Doing a thing with some friends where we have to take this guy around and show him a good time. We have taken him around Prague, me and my fiance. I wanted to show him here for you all. Thank you very much. 
[Applause] ♫ Unreal logo music ♫
Info
Channel: Unreal Engine
Views: 17,977
Keywords: Game Development, Unreal, Game Dev, Game Engine, UE4, Epic Games, Unreal Fest, UE Fest, Prague, Unreal Engine, Unreal Fest Europe
Id: CBP5bpwkO54
Length: 37min 44sec (2264 seconds)
Published: Mon May 20 2019