Automated Testing of Gameplay Features in 'Sea of Thieves'

Reddit Comments

So, I work as a computer-games tester, and while watching some videos on the subject I stumbled across this. Maybe you will find it interesting. :))

u/Bear_of_The_Forest, Oct 07 2021
Captions
I'm Robert Masella, a software engineer from Rare. Because we had such bad experiences with testing on our previous projects, when we started work on our latest game, Sea of Thieves, instead of relying on manual testing we decided to completely change our approach and use automated testing on every part of the code base, including on gameplay features, which are notoriously tricky to test. In this talk I'm going to walk through the approach we took, our learnings, and how we benefited from it.

First, a few quick details about me. I've been a gameplay engineer at Rare for about 14 years. For anyone who doesn't know Rare: we're a Microsoft first-party studio based in central England with a very long history of game development, and for context we have about 200 people in the studio at the moment. Games I've worked on are Banjo-Kazooie: Nuts & Bolts, all three of the Kinect Sports games, and, for the last four years or so, Sea of Thieves. Rare gameplay engineers don't tend to specialise too much, so on Sea of Thieves I ended up doing a bit of everything: AI, physics, character movement, animation, and whatever else needed doing.

For those who don't know what Sea of Thieves is: it's a multiplayer open-world pirate adventure game where players can join up in crews and cooperatively sail around the world, follow maps to find treasure, steal treasure from other players, and fight skeletons. That's what you can do, but if you want you can just play instruments and get drunk on grog; it's up to the players. Since we released the game about a year ago (in fact exactly a year ago now) we've added multiple updates: skeleton AI ships for the players to fight, a new set of volcanic islands to explore, and a megalodon shark that can appear and attack the players at any time.

In this presentation I'll be talking about why we thought automated tests would be a good fit for Sea of Thieves, how our testing framework worked and how we created tests, how we optimised our testing during production, and finally the benefits we got.

So first of all, why did we decide to use automated testing? Sea of Thieves was going to be a very different game for Rare, and some of those differences meant there were going to be different testing challenges than we'd had before. The first was that this was our first open-world game, and it was going to be very open in terms of gameplay: there were very few restrictions on what players could do and when they could do it. The challenge was all the complexity that added; the way features interacted meant we'd have to keep checking all those interactions to make sure they still worked. The second issue was that this was going to be our first game as a service, constantly evolving in response to player feedback, and when you're constantly changing the code base there's a lot of risk of causing regressions, breaking features you'd already implemented. The third challenge was that we wanted, if possible, to be able to release updates within a week, so we could respond quickly to player feedback or ship hotfixes.
The problem was that on previous projects it had taken at least two weeks to verify a build, and with the extra complexity Sea of Thieves had, a week just wasn't going to be enough time to have confidence that the build we were putting out to players wasn't full of bugs.

If we look at the testing process we used on previous games, across the dev timeline from the point where a developer makes a local change on their PC to the point where players get a game update, our old process essentially had only one point where we did any testing: we got lots of manual testers, gave them a build created by the build system every so often, and got them to check it and hope they found all the bugs. The problem with this approach is that it's quite slow and a fairly unreliable way of finding all our issues.

Let's look at an example failure of that approach. In this bug, an engineer made a change to the game that broke the skeleton AI's target memory, so that as soon as a skeleton lost line of sight with its target it would completely forget about it. OK, I admit: that engineer was me. I put this bug in the game during the beta, so I thought it would make a good example. Here it is in action on a test level: the skeleton starts attacking the player, but when the player goes around the corner it just wanders off. Not great. This is what we'd expect to see instead: when the player goes around the corner, the skeleton follows.

Let's go through the process this bug went through. First of all, the bug wasn't noticed by the developer (me) during development. So what happened, did I not test the game properly? Well, I did try the skeleton AI on a test level before I submitted my change, but it was in an open area with no obstacles, so the issue never showed, and I submitted the change. Then the testers took the latest changes and tested them, and because the bug was subtle enough they didn't notice it either. Eventually it ended up in a game update released to players.

Cut to a few weeks later and how the bug was eventually fixed. At this point the community team were gathering feedback from the beta players, and they noticed players mentioning that the skeletons seemed a bit dumber, less responsive than they used to be. So an engineer is taken away from their normal work, gets assigned the bug, and has to spend time finding and fixing it. Eventually a fix is put into the build, the testers spend time verifying it, and the fix goes out to the players. But of course there's every chance this bug could reoccur, because we've got nothing in place to stop it.

So rather than that scattershot process of adding a bug, finding it, and fixing it, maybe what we should do instead is regularly check that scenario and make sure it's never broken in the build. Let's say we get one of our manual testers to check that scenario every day or so. That wouldn't take too long to do, but in a game like Sea of Thieves you're probably going to end up with thousands of these checks, and it becomes a full-time job for someone, probably for several people, and not a particularly productive job: they're just doing the same mind-numbing checks all the time.
So instead of wasting a human's time doing this, we could just ask the game to do it itself with an automated test, and avoid wasting the tester's time. Automated tests also give us a few advantages that humans can't match. First, we can run tests a lot faster with automation. Second, we can be more precise: the game can test its own game state, while a human can only really check what's going on by eyeballing the game. Third, automated tests can test at different code levels of the game: they can check that an individual code function is working correctly, that with certain inputs certain outputs come out, which a human tester can't do; a human can only really play the game in its complete form.

We shouldn't just get rid of all our manual testers straight away, though; there are a few things they're better at than an automated test. Humans are a lot better at noticing defects in the visual and audio elements of the game. Humans can use their creativity to do exploratory testing and find issues we haven't considered yet. And humans are just a lot better at assessing the actual game experience, how it feels to play. Until we get very advanced AIs that can cover these more human factors, we want a mix of human and automated testing; the advantage of automated testing is that the humans don't have to do so many repetitive checks and can do more productive things.

Cool, so that was why we decided automated testing would be a good fit. Now I'll talk about how our test framework worked and how we made tests. Sea of Thieves was built on top of the Unreal Engine, using a version we took a few years ago, so what you see in this presentation might not completely match the latest version; bear that in mind. We implemented our test framework by heavily modifying the automation framework that was already there. Because we took advantage of that, to run tests in the editor all we did was go to the automation tab, select the tests we wanted to run, and they would all run and give a pass or fail result. We could run our tests in the editor like this, but we could also run them on built executables, and we also had a standalone tool, which our engineers in particular used to verify their latest code changes without having to build the game or the editor.

The simplest kind of tests we had were unit tests, and if you're familiar with well-known test frameworks like NUnit these will look very similar. A unit test is essentially a bit of code that registers with the automation system so it can be recognised as a test. Unit tests generally check a specific operation on the smallest testable bit of code, which usually means testing at the code-function level. In this example we have a test that checks a very simple maths library function: if we take two vectors, calculate the distance between them, and they're equal, then we should get zero back. Most of our tests broke down into three stages: first the setup, where we create the vector objects; then running the actual operation, where we call the distance function; and then the assertion, where we check the result is as expected. If the assertion fails, the test throws an error to the log, which gets picked up by the automation system and fails the test for us.
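As a rough illustration, here is a minimal sketch of what a unit test like that can look like when registered with Unreal's stock automation framework. This is an assumption on my part: the talk shows a slide rather than reusable code, and Rare's modified framework may register tests differently; the test name and flags below are placeholders.

```cpp
#include "Misc/AutomationTest.h"
#include "Math/Vector.h"

// Registers the test with the automation system so it shows up in the
// automation tab; flags are the stock "runs anywhere, product filter" set.
IMPLEMENT_SIMPLE_AUTOMATION_TEST(FVectorDistanceTest,
    "Game.Math.VectorDistance.EqualVectorsReturnZero",
    EAutomationTestFlags::ApplicationContextMask | EAutomationTestFlags::ProductFilter)

bool FVectorDistanceTest::RunTest(const FString& Parameters)
{
    // Setup: create two equal vectors.
    const FVector A(1.0f, 2.0f, 3.0f);
    const FVector B(1.0f, 2.0f, 3.0f);

    // Run the operation: compute the distance between them.
    const float Distance = FVector::Dist(A, B);

    // Assert: equal vectors should be zero apart. A failed assertion
    // logs an error that the automation system picks up as a test failure.
    TestEqual(TEXT("Distance between equal vectors"), Distance, 0.0f);
    return true;
}
```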
If we had unit tests covering every code function in the game, in theory we'd have testing that covers the whole game, right? But of course the way units interact can itself contain bugs, so it's a good idea to have integration tests on top. Integration tests generally cover a whole feature or action in the game, so they provide coverage for the multiple units in that feature and also for the communication between them. If a unit test fails, we know straight away that that specific unit is probably the problem. If our unit tests pass but an integration test fails, we know there's probably some other issue with that feature: it could be the communication between units, or maybe an asset problem, something like that. An integration test failure takes longer to investigate because it covers a larger scope, so we generally preferred unit tests where we could, but integration tests are still very useful for giving a high-level signal that something is broadly wrong with a feature.

To create integration tests for Sea of Thieves, we created them as maps within the Unreal editor. Each integration test map would run a fixed scenario of some kind, then report back a pass or fail depending on whether the behaviour happened as expected. To drive the logic in these integration tests we used the Unreal Blueprint system. If you're not aware of Blueprint, it's a node-based scripting system available in the Unreal Engine; to follow the flow of a Blueprint you just follow the white line, which is the execution line. In this example we start from the BeginPlay event, delay for two seconds, then set the actor rotation to all zeros. Blueprint was very good for integration tests because nodes can be latent, meaning they pause execution until a certain condition has occurred, which is something you do a lot in these tests; this delay node, for example, delays execution for two seconds. We could have written our integration test logic in code, but because of latent node support, and because of how easy it is to iterate on Blueprint inside the editor, we found Blueprint more convenient.

As an example of an integration test for Sea of Thieves, let's create one for one of the most basic actions in a pirate game: the player interacting with and turning the ship's wheel. Specifically, we're going to check that when they do that, the actual wheel angle of the mesh turns correctly. This is what it looks like in game. If we wanted to create an integration test for this, one thing we could do is load up the full world, have a player stand on a ship, and turn the wheel. But that would be very slow to load, and you'd be bringing in so many other systems that could affect your test that you wouldn't really be testing the wheel specifically any more.
So instead you can have a much simpler version, which is what we ended up doing in this example: a player standing on a platform interacting with a wheel, and that's essentially all we need. This is the Blueprint for the test, and you can see it splits into the same three phases as before: setup, run operation, and check results. I'll zoom in so you can see. In the first part we start from the BeginPlay event on the level Blueprint. A lot of the setup is already done for us just by adding the wheel and the player to the map, but we also need the player to interact with the wheel before he can turn it, so we do that first. Then we run the operation, which in this case is having the player apply a fake input to turn the wheel. Finally we do the assertion: we check that the wheel angle has gone beyond a certain tolerance, so we know the wheel has been interacted with correctly, and we finish the test. This is what the test looks like in action; I've slowed it down and made the angle larger to show visually what's happening, but in reality it runs in less than a second. Back in the automation window, we can see the test passing. So now we have a test checking that whole operation, including all the units involved and the in-game assets, and if we run it regularly we can see if something breaks this feature.

It's good practice for a test, particularly an integration test, to be robust to any change we could reasonably expect that still conforms to the behavioural contract we started with: in this case, that the player interacts with the wheel and turns it. If you end up relying on something more specific than that, you're in danger of depending on the implementation of the feature, and you'll probably find yourself constantly reworking your test as you rework the implementation. As an example, look at the test we just made. These are the three broad code steps that happen when we run it: first, a character input handler on the player character receives a negative input; that input handler sends the input to the wheel object; then the wheel applies that input to the wheel mesh, and we check the results. We're doing all of that on the same frame, which is the danger, because code changes that defer something to the next frame happen a lot. Let's say we defer the application of the wheel mesh angle to the next frame: suddenly we're checking the results at a time when the wheel hasn't actually changed yet, and sure enough the test is now failing with an error saying the angle is not correct. The most straightforward way to fix the test is to add a one-frame delay between the run operation and checking the results, and the test succeeds again.

But let's say we make another code change, this time adding an animation in the stage between the input handler sending the input to the wheel and the angle being set on the wheel mesh, and we're not sure how long that animation will be. Sure enough, the test is failing again. How would we fix it this time? One thing we could do is inspect the animation and find out how long it is, but then we're looking at the implementation again: if we take the animation out, we have to fix the test again. Maybe we could add a large delay, say 100 frames, because we're sure the animation won't be longer than that. That would probably work, but you've now added up to 100 frames of test time while expecting the animation to be much shorter. If you run the test once or twice that's not too bad, but we're going to run this test hundreds of times a day on our build system, and all those extra frames add up.

A better solution is to use a polling version. We use a "delay until" node, which again is a latent node, and it keeps spinning, waiting for the angle to be correct. Now it doesn't really matter what happens between the player interacting with the wheel and the angle changing; the test will find the right result eventually, which it does with the added animation in place, and it works again. If for some reason the wheel is actually broken, what we get is a timeout: in this case the test times out after five seconds, because the wheel is broken and we never see the angle change. That's a little wasteful, I suppose, because we're wasting five seconds, but we don't expect it to happen very often, because the feature should be working almost all the time. For that reason we tended to set our test timeouts quite high; we felt safe doing that.
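For a sense of what that polling pattern can look like outside of Blueprint, here is a sketch using the latent command support in Unreal's automation framework. The talk does all of this with a latent Blueprint node, so this is only an analogy; FShipsWheelHandle, its angle field, and the one-degree tolerance are hypothetical stand-ins.

```cpp
#include "Misc/AutomationTest.h"

// Hypothetical stand-in for the game's wheel actor; in the real game this
// would be an AActor whose angle is driven by player input.
struct FShipsWheelHandle
{
    float CurrentAngleDegrees = 0.0f;
};

// A latent command's Update() is polled every tick until it returns true,
// so the test makes progress the moment the condition is met instead of
// waiting out a fixed delay. A wheel that never turns becomes a timeout.
DEFINE_LATENT_AUTOMATION_COMMAND_ONE_PARAMETER(
    FWaitForWheelTurn, const FShipsWheelHandle*, Wheel);

bool FWaitForWheelTurn::Update()
{
    // One degree is an arbitrary placeholder tolerance.
    return Wheel && FMath::Abs(Wheel->CurrentAngleDegrees) > 1.0f;
}
```

Inside a test body you would queue this after applying the fake input, e.g. `ADD_LATENT_AUTOMATION_COMMAND(FWaitForWheelTurn(&WheelHandle));`, and let the framework's timeout handle the broken-wheel case.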
Sea of Thieves is a multiplayer game with a network client-server architecture, so we really wanted our integration tests to cover that aspect of the game as well. We changed the automation framework in Unreal to allow integration tests to pass execution between the server and the client, so we could check both sides of a network communication. This worked with built executables of an actual client and an actual server, and it also worked in the editor with virtual client and server processes.

Here's what happens in a typical networked integration test. The test begins on the server, where we do the setup, say how many clients we want in the test, and do the initial handshaking; generally we'd also set up some behaviour on the server. Then we switch over to the client and check that whatever we set up on the server has been communicated correctly over the network to that first client. Then we go back to the server. Depending on the test you might end it there, but let's say we want to test that the same, or a slightly different, communication has been sent to a second client; the test supports that too. We can go back and forth, ping-ponging between server and clients, and when we're ready to end the test we go back to the server and finish it there.
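Purely as an illustration of that ping-pong flow, here is how such a test might read if written against a driver API. Rare authors this in Blueprint with latent nodes in their modified framework; FNetworkTest and every call on it below are invented for the sketch and are not engine API.

```cpp
// Hypothetical driver, standing in for Rare's modified framework.
struct FNetworkTest
{
    void BeginOnServer(int NumClients) { /* setup, client count, handshaking */ }
    void SwitchToClient(int ClientId)  { /* hand execution to that client */ }
    void SwitchToServer()              { /* hand execution back to the server */ }
    void Finish()                      { /* report pass/fail to the framework */ }
};

// The wheel test expressed in that flow: interact on client 0, observe the
// replicated angle on client 1, then finish on the server.
void RunWheelReplicationTest(FNetworkTest& Test)
{
    Test.BeginOnServer(/*NumClients=*/2);

    Test.SwitchToClient(0);   // interacting client applies the fake input
    Test.SwitchToClient(1);   // routed via the server to the observing client,
                              // which polls until the wheel angle replicates
    Test.SwitchToServer();
    Test.Finish();
}
```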
As an example of what a multiplayer integration test looks like for real, let's modify the wheel test we just made. We're going to look for the same thing, the change in the wheel angle, but we'll check that it's actually working on a different client: a second client in the scene, this guy just standing off to the side. What we expect to see, in terms of the flow of code and network communication, is that first the player interacts and turns the wheel, just like in the previous example; then when the wheel angle changes, that gets communicated up to the server; and finally the server communicates it to the other client, the observing client.

This is the Blueprint for that test. I'll go a little faster because it's a little longer, but I'll point out the interesting parts of the networked version. We start from BeginPlay again, then we have a sync-client-server node, which starts the test with two clients. We then switch execution over to the first client; this is the interacting client. That client interacts with the wheel and applies the fake input, just as before. Then we switch to the observing client; this node does it in a shortcut way, but essentially we go to the server first and then on to the second client, client ID 1, the observing client. That client does the polling, just like in the previous version of the test. When it sees that the wheel angle has changed past the tolerance, we switch back to the server and finish the test.

Those were unit tests and integration tests, but we had a few other test types I'd like to talk about briefly. We had asset audit tests, which check the setup on assets to make sure they're compatible with what we expect in the game; most of our asset types had some kind of asset audit test that picked them up. We had screenshot tests, which in practice look a lot like integration tests, but at the end they take a screenshot, and a diff process afterwards compares it against the last known good screenshot for that test; any differences mean some kind of visual error or rendering bug. We had performance tests, again similar to integration tests but running a bit longer and collecting data, to see if we had trends or spikes in frame rate, memory use, loading times, that kind of thing. Finally we had boot flow tests, which were the closest to simulating what it's like to actually run the game, because they tested communication between the client, the server, and all the services that are vital to running the game, under common scenarios such as a client joining a new server and registering with all the services.

I'm going to talk very briefly about the infrastructure we had. It's not really the focus of this talk, but some of it is relevant; if you want to learn more, my colleague Jafar gave a talk at last year's GDC called "Adopting Continuous Delivery", so check that out on the Vault. We ran our tests as part of a build system using the TeamCity continuous integration software. TeamCity would use our build farm of PCs and allocate them to various jobs, such as building the game and running groups of tests.
How often we ran a test depended on what kind of test it was, how slow it was, and how important it was, but on average we ran every test we had about every 20 minutes or so. If the build system encountered a test failure, we set the build to red. We have screens all around the studio showing the status of the current build, and when this happens they show some information about which test job failed and who was probably responsible, as the last person to change something in that area. I've blanked out the name in this example because I didn't want to embarrass anyone at GDC.

To make sure we didn't have a broken build and that team members could carry on working, we had a three-step process for submitting changes. The first rule was that you could only submit changes if the build was green, with no tests failing. We didn't want people to continue submitting on top of a broken build and maybe make the situation worse, which made it very important that whoever was responsible for a failing test fixed it as soon as they could, because it was blocking the whole team from committing changes. Second, we expected each change to have reasonable test coverage. "Reasonable" obviously leaves some leeway, and there's a grey area: artists and designers generally wouldn't submit a test with their change, but when they submitted an asset it would get picked up and checked by an asset audit. An engineer adding new code or a new feature, though, would reasonably be expected to include a unit test or integration test with that submission. At Rare we don't have test engineers who write the test coverage for other engineers' feature work; we found it better if engineers decided for themselves, and took responsibility for, the test coverage their change needed. Third, we asked people to run a pre-commit first. Developers need to take some responsibility for checking that their changes work correctly before submitting. We could have asked them to run the automated tests locally, but that would have taken time, there's every chance they could forget, and running all the tests locally isn't really practical because there are so many of them, so they'd have to pick which tests were relevant to their change, which is prone to error. Instead we asked all our developers to run a pre-commit first, which sends their change to the build system without actually submitting it to the main branch. The build system builds the game with that change, runs the tests it decides are related to that change, and sends the results back to the developer, indicating whether the change looks good enough to submit.
Here's a summary of the full Sea of Thieves testing process we ended up with, using the same dev timeline I showed you earlier. First, when a developer makes a local change, they run the pre-commit, which runs the set of related automated tests; if the developer does this first, they can be reasonably sure that when they submit their change it won't affect the rest of the team. Second, after the developer submits a change, the build system is constantly running all our tests, checking for intermittent issues and running some longer tests to find trends. Because of this, we're fairly confident at almost all times that our build is good, and we can move on to the next stage with whatever build we have at the moment. At that point, the build system spits out a build every so often, about once a day, and we have the manual testers check it. This is obviously very similar to the previous process, but because we've done all that automated testing first, the manual testers always get a good build to test with. They don't hit that show-stopping situation that used to happen all the time, where they get the build for the day, it's completely broken, and they waste hours waiting for the next build with the fix; we're fairly sure the automated tests would have picked up those kinds of issues. Finally, occasionally, especially when we've got new features we want input on, we send the latest update to a group of Insider players to get feedback on the change before it goes out to the whole player base. Again, that's similar to what a lot of studios do; the difference is that we hopefully don't expect the Insiders to actually see many bugs, because the build has been through the automated testing and our manual testing, so they can just concentrate on giving us feedback on the feature and the gameplay itself.

So far I've shown you what our testing looked like when we started full production on Sea of Thieves. Next I'll show you how we optimised our testing during development, or in other words, how we became more pragmatic over time with our automated testing. As we entered full production we had a full team creating tests, and we were seeing the benefits in extra build quality, but we were also finding the tests to be a burden in several ways. First, we were spending a lot more time creating tests than we expected; we knew we'd be spending some, but it was a lot more than we would have liked. Second, the tests were generally quite slow to run: we had a lot of them, and some were very slow. The aim was for the pre-commit process to take about an hour, but it was creeping up to an hour and a half, which meant developers would queue up to do their pre-commits, and it could take half a day to a day to get their changes in. That was really slowing down our development. The third issue was that some tests, particularly the long and complex ones, were quite unreliable, and a lot of engineer time was going into figuring out why a test would fail occasionally, and into improving and maintaining the tests. For all three of those issues (creation time, running time, and reliability) things were much, much worse for our map-based integration tests.
Look at the running time, for example: a unit test would run in about a tenth of a second at most, while an integration test could take up to 20 seconds. This was because of all the assets these tests had to load, the network connections that took a long time to initialise, and spinning up a new world each time. We still wanted to use integration tests, but because they were so slow we really needed to find a way to improve them, or, if possible, to use far fewer of them.

So why did we have so many integration tests? We had them for gameplay features in particular. If you imagine our code base as a pyramid, with Sea of Thieves sitting on top of the Unreal Engine, unit tests sit right at the bottom: they have no real dependencies on Unreal or Sea of Thieves, so they're nice and fast to run. Integration tests, because they needed a built version of the game or the editor, were dependent on everything in Unreal and everything in Sea of Thieves, and we definitely found that the more dependencies a test had, the slower and less reliable it was. So we wanted to use as few integration tests as possible, but for gameplay testing in particular we were using a lot of them. The reason is that we were building our gameplay code on top of Unreal, using classes such as actors and components, which are integral to the engine, and we thought that to get a true representation of our gameplay features we had to use them in their proper context, which meant in an Unreal map, which meant an integration test. That was obviously very taxing; we were creating a lot of these very expensive tests. So we changed our assumption: maybe we could make our gameplay tests run at a lower level, where we just checked the logic without all the dependencies, and we wondered whether that would work.

So we came up with a new kind of test, which we called actor tests, so named because they used the actor object in Unreal so heavily. Actor tests were essentially unit tests for Unreal game code: they treated Unreal Engine concepts such as actors and components as first-class dependencies. They weren't unit tests in the strict sense, because of those extra dependencies, but engineers could treat them that way.

As an example of an actor test, here's a typical gameplay scenario we had during development. In Sea of Thieves we have a variety of skeleton types, one of which is the shadow skeleton. At night the shadow skeleton becomes all ghost-like and is virtually invulnerable, but when the time of day changes, the skeleton changes state, becomes more like a normal skeleton, and can be attacked again. So we want a test that checks that when the time of day changes, the shadow skeleton changes from its dark state to its light state. We could have done this as an integration test, but doing it as an actor test is a lot more efficient, as I'll show you. This is the actor test that checks that feature, and as you can see it looks a lot like a unit test. Behind the scenes there's a bit more going on in terms of setup: a minimal world is created for the actors to live in. But the engineer writing the test can treat it like a unit test.
Like all the other test examples, it splits into three phases. In the setup phase we create the shadow skeleton and set its current state to dark. Then we run the operation: we set the game world time to midday. Finally we check the results, verifying that the state of the shadow skeleton has changed to light, as we expect. Note the line where we tick the shadow skeleton manually: we have to do that because the tick is where it polls the world time and checks whether it needs to change state. Obviously, in the real game you would never call Tick explicitly like this; you'd expect the Unreal Engine to run it in its own engine loop. But if we did it that way we'd have to use an integration test. So doing it like this has a disadvantage, in that we're using the actor outside its normal environment, but the major benefit is that this test runs much, much quicker than the integration test.
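To make that concrete, here is a rough sketch of the shape such an actor test might take. The talk shows the real code on a slide, not reproduced here, so everything below is a reconstruction: the skeleton type is a stub, and in the real version the actor's Tick(DeltaSeconds) would poll the world's time of day rather than receive it as a parameter.

```cpp
#include "Misc/AutomationTest.h"

// Hypothetical stand-ins for the real Sea of Thieves types.
enum class ESkeletonState { Dark, Light };

struct FShadowSkeletonStub
{
    ESkeletonState State = ESkeletonState::Dark;

    // Simulates one tick; the sketch passes the world time in directly.
    void Tick(float WorldTimeOfDay)
    {
        const bool bIsDaytime = WorldTimeOfDay >= 6.0f && WorldTimeOfDay < 18.0f;
        State = bIsDaytime ? ESkeletonState::Light : ESkeletonState::Dark;
    }
};

IMPLEMENT_SIMPLE_AUTOMATION_TEST(FShadowSkeletonDaylightTest,
    "Game.AI.ShadowSkeleton.TurnsLightAtMidday",
    EAutomationTestFlags::ApplicationContextMask | EAutomationTestFlags::ProductFilter)

bool FShadowSkeletonDaylightTest::RunTest(const FString& Parameters)
{
    // Setup: create the skeleton in its dark state.
    FShadowSkeletonStub Skeleton;

    // Run the operation: move the time of day to midday and tick manually,
    // since no engine loop runs in an actor test.
    Skeleton.Tick(/*WorldTimeOfDay=*/12.0f);

    // Assert: daylight should have flipped the skeleton to its light state.
    TestEqual(TEXT("Skeleton is light at midday"),
              Skeleton.State == ESkeletonState::Light, true);
    return true;
}
```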
Now that we had actor tests, we weren't going to throw integration tests away completely; they still gave us useful coverage that even the actor tests didn't, such as checking asset setup and checking the integration with the Unreal Engine. But we didn't want to use too many, because they're so expensive. We tried to settle on a good rule of thumb for when to use an integration test and when to use an actor test, and what we came up with was this: for a given feature, you create an integration test for the golden path of the feature, the successful run of the feature, and you use actor tests for the edge cases and failure cases. We found this gave us a good balance between the time the tests took to run and reasonable test coverage. As an example, there's a feature in the game where players can give items to other players. Looking at how the test coverage breaks down for this feature, the actor tests cover the cases where giving the item wasn't successful, for example because the player didn't have an item, or because not every kind of item can be given, while the integration test covers the full successful passing of an item. Numerically, we found the ratio of actor tests to integration tests came out at about 12 to 1, which gave us coverage we were happy with.

By using integration tests for only the golden path, we reduced their number by a large amount, but we still wanted to speed up the ones we had as much as possible. One way we found to do that was to merge multiple related features into one test. This broke a general rule of testing, that you should only test one thing per test, but because of the speed-up we were getting we were happy to break it. As an example, the skeletons in the game have three different attacks, and we wanted a test that checked each of them worked and damaged the player correctly. You could have a test per attack, and that's what I originally started with, but I realised it was just as easy to have one test that ran all three attacks in sequence. That meant the initialisation of the world, and loading the skeleton and all its animations, only had to happen once, so we saved a lot of time.

Combining tests only really worked for related features, but there was one thing common to most integration tests, and that was the player. It turned out that the player was one of the most expensive objects we used: it's quite a complicated object with a lot of animations attached, and initialising it in a networked scene, for the networked version of an integration test, took a long time. In fact, in a lot of cases, just initialising the player took longer than running the actual test. So we took advantage of a feature in Unreal called world travel, where you keep the same player object but move it from map to map. We used this to transition the player between integration test maps, which meant we didn't have to keep loading the player. In this example the player starts off in an integration test map a lot like the ship's-wheel one from before; we then unload that map but keep the player, and load him into another integration test map where he's interacting with the capstan. We had to be careful here, and we had some bugs to do with state leaking from test to test, but again we were happy to fix those because of the speed-up we were getting.

The other issue I mentioned was intermittent test failures: tests that succeed almost all the time but very occasionally fail. Some level of test flakiness is inevitable when you're running such a large number of tests continuously, and it seems to happen to almost everyone; I'm quite amused by this quote on the Google testing blog, so it happens to the best of us. We looked into the reasons for our intermittent failures, and they were a mix of network issues, infrastructure issues, and sometimes state leaking from test to test. We would investigate and fix those causes, but we found that stopping the team from checking in every time we hit an intermittent failure was wasting a lot of time and was quite disruptive. So we changed how the build system behaved: when it found a test failure, it would immediately run the test again. If the test failed the second time, we turned the build red; if it succeeded on the second run, we kept the build green. That didn't mean we completely ignored intermittent test issues; we just wanted to concentrate our efforts. We kept a record of intermittent test failures, and every week or so we drew up a list of the worst-offending tests and asked engineers to look at just those, because those tests may have genuine issues with the test itself or with the feature under test, which makes them worth investigating.
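Here is a sketch of that retry policy as a build system might apply it. The talk doesn't show this code, so the interfaces below are invented for illustration; the real logic lives in the build system's test-runner configuration.

```cpp
#include <string>

// Hypothetical interfaces for the sketch.
struct ITest
{
    virtual ~ITest() = default;
    virtual bool Run() = 0;                  // true on pass
    virtual std::string GetName() const = 0;
};

struct IFlakinessLog
{
    virtual ~IFlakinessLog() = default;
    virtual void RecordIntermittentFailure(const std::string& TestName) = 0;
};

// Run a test, retrying once on failure. Only a second consecutive failure
// turns the build red; a pass on retry keeps the build green but is still
// recorded, feeding the weekly "worst offenders" list mentioned above.
bool RunWithRetry(ITest& Test, IFlakinessLog& Log)
{
    if (Test.Run())
        return true;

    Log.RecordIntermittentFailure(Test.GetName());
    return Test.Run(); // false here turns the build red
}
```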
The final thing we did to improve our testing was to handle consistently failing tests. These are tests that are generally badly written and just keep failing. Tests like this can't be trusted, and they're often worse than having no test at all, given the time they waste on the build system and the false information they give the team. If a test was failing regularly, we moved it to a quarantine area, where we'd keep running it but no longer turn the build red when it failed. At the same time we told the engineer responsible that their test was being moved to quarantine and asked them to sort it out, and we gave them a few weeks to do that. If they didn't get around to it, we just binned the test. This might seem a bit harsh, but our thinking was that if an engineer can't prioritise the time to fix a test, it's probably not giving us worthwhile coverage, and it can always be recovered later anyway.

OK, for the final part of this presentation I'd like to talk about the benefits we got from doing all this automated testing. Here's the breakdown of the tests we currently have in our code base. As you can see, 70% of our tests are actor tests, so it definitely feels like we were right that this was the sweet spot, in terms of the code level at which to test gameplay, given how many we ended up using. Only 5% are integration tests, again so as not to overwhelm our build system with them, but they gave us vital high-level feedback when something was broken in a feature; about half of those were networked tests. Of screenshot, performance, and boot flow tests we have only a very small number; they were by far our slowest tests, so we used them sparingly. Adding those all up, we have about 23,000 tests, which is quite a lot. I didn't include asset audits in that figure, because they're not quite tests and I didn't want to skew the results, but if you count them as well we end up with over 100,000. If we hadn't made all those efficiency savings I've been talking about, there's just no way we could have run all of that in a reasonable way on our build system.

With all that testing we were seeing a lot of benefits in build stability and the extra confidence it gave us, so let me go through what they were. The first was the reduced time it took to verify a build. On Kinect Sports Rivals, as I mentioned, it took about two weeks to verify a build before we sent it out as an update to players; on Sea of Thieves it took about a day and a half before we were confident enough. This was really powerful: it meant we could very quickly turn around a hotfix if we needed to, straight from the current version of the build. The second benefit is that we drastically reduced our manual testing team, going from the 50 members we had on Kinect Sports Rivals at the time of release to 17 on Sea of Thieves. What the numbers don't show is how much more productively we could use the smaller number of testers: because the repetitive checks were being done by automated testing, the QA team could work more closely with the rest of the development team and feed back on what the player experience of the current build was actually like.
The third benefit was keeping our bug count very low. On Sea of Thieves our maximum bug count over production was about 214, whereas on Banjo-Kazooie: Nuts & Bolts, admittedly a much older project but still a good comparison, we got to about three thousand, and you can see from the bug trend graphs how different they were. On Sea of Thieves, because of all the automated testing, a lot of issues never made it as far as being bugs: they were caught by automated testing early on and never made it into the build. We also had a process where we asked developers to look at fixing their bugs before doing feature work, to keep the build as stable and clean as possible. On Banjo-Kazooie: Nuts & Bolts, by contrast, we let long-running issues in the build pile up and up until, oh no, it's a few months before release and we have to fix all the bugs, and we spent a lot of time doing that.

That brings me to the fourth benefit: reducing crunch. This is really important to us at Rare, and we hoped automated testing would help us reduce crunch quite a bit. I don't have concrete stats for this, unfortunately, but anecdotally, developers at the studio definitely found they worked less overtime, and I think the bug trend graphs show why. On Sea of Thieves, because automated testing was highlighting issues all the time, there weren't those unexpected moments where issues crop up that you didn't expect or that had been around a long time, and developers could maintain more regular working hours because of that. On Banjo-Kazooie: Nuts & Bolts, when we got to that point before release with all those bugs, there was definitely a lot of crunching going on.

Just to finish up, I'd like to go through some bonus lessons we learned on our automated testing adventure. The first is that team buy-in is really important. The most common reason given for not using automated testing is the time it takes: there's no time, we've got to finish the game. My counter-argument is that you might actually spend the same amount of time or less if you use automated tests, because developers won't be spending as much time fixing recurring bugs. We were lucky on this project that everyone on the team, in fact everyone at Rare, bought into what we wanted to do with automated testing and went along with us, including producers and project managers. If we hadn't had that support at the highest level, we probably wouldn't have been able to achieve what we did.

The second lesson is that you should allow time to build up testing knowledge in the team. A lot of us were unfamiliar with testing when we started. There's a lot to think about in terms of what makes a good test, and what you have to do to make your code testable, which is an important one, because it's very easy to write code that's just very difficult to test. The bonus is that if you make code more testable, it actually makes it cleaner and better as well.
The second part of this is that you should spend time making sure your infrastructure is robust and stable, because you're going to be relying on it a lot. And if this all seems a bit intimidating, then I'd suggest not doing what we did: start testing on a small part of your game or project to begin with and build up from there, rather than going all in.

The third lesson we learned is that iterative development and testing don't really mix. If you're still working on the actual gameplay and trying to find the fun, it's probably not a good idea to be writing tests at the same time, because you'll be constantly reworking them. On Sea of Thieves, when we were working on new features we had a prototype branch that didn't require tests, so developers could work quickly without worrying about them. Even on the production branch we didn't really do a test-driven, test-first style of development, because again we'd probably have ended up reworking those tests too often. The process was more that an engineer or developer would make a change, get it working locally, and then build the tests to pin down what they'd done and make sure it didn't break in the future.

The final lesson is that pragmatism is important. As you saw, we had to constantly change how we were doing our testing when things weren't working for us, and we often weren't able to do things the textbook way. My advice is: don't be afraid to change what you're doing, and do what works for you. Remember that testing has a cost, and you can't possibly test absolutely everything. If something is trivial, maybe don't test it. If a test is going to be very costly, very complex, and hard to maintain, maybe don't write it. On the flip side, if you're writing some code with quite complicated logic, definitely concentrate your testing there. Just remember that you're never going to have perfect test coverage, so it's not worth trying. And to show that we don't have perfect test coverage, here's a video of one of my favourite bugs in our game. I just love the expression on this guy's face. And it goes all the way over the horizon as well. So, what can I say: testing is a journey, and we're constantly improving.

Cool, so that's the end of my talk. What I've talked about here was a big team effort from virtually everyone at Rare, so I'd just like to thank everyone who contributed to our test process on Sea of Thieves. And Rare is hiring, particularly software engineers, so if you're interested, check out the website link on the page, or come and find me later. Does anyone have any questions?

Q: Thanks, great talk. I have a question about the integration testing. You didn't mention mocking at all. Did you ever think about mocking away the server when you did client tests, or mocking some units away when you did integration tests?
A: We did a bit of that, I guess. The actor tests obviously didn't have network communication, but we did do something similar: we could fake replication, and we had a way of doing that. Replication is the Unreal term for sending something from server to client, so we could fake that, essentially saying "this has been replicated, now do the correct thing on the client as if you'd just picked it up". The integration tests, because there were so few of them and they were meant to give a broader, more high-level view, used actual network communication.

Q: But you always kept the units in the actor tests, right? I'm just thinking that if you mock away units you've already tested, you don't re-test them, whereas here you're re-testing them.

A: Yes, I suppose you're right that there's a bit of re-testing going on. As in the diagram I showed, the integration tests contain the units, so in a way you are repeating yourself a bit. But the unit tests are very useful because they give you that very specific point of failure you can investigate, while integration tests are high level: they give you a broad sense that something's wrong, which takes a bit longer to investigate, but tells you there's something you need to look at.

Q: Hi, I was wondering about the pre-commit tests. I found that very interesting, but is there a way to automate which tests are picked up in that pre-commit, by analysing what code was changed?

A: Yes. I left that out because I didn't have time and it didn't quite fit in, but here's how it works. Every day, as we run the tests, they record what code and which assets they touch, and we build up a map of that. When you do a pre-commit, you do a reverse lookup on that map to find out which tests were affected by your code change, and that builds the list of tests we think are related. It's not perfect, but it's good enough for what we needed, especially for a pre-commit, because it was very important to us to keep the pre-commit as fast as possible. It's been an ongoing challenge as we keep adding more and more tests; we parallelise as much as we can, but there are limits to that, so this was our latest way of making sure we were keeping it fast enough, and it gives good enough coverage for a developer to submit their changes fairly sure they're not going to break things for the rest of the team.

Q: Cool. Is there somewhere I can read more about automating that process?

A: I could probably put you in touch with someone who can tell you more about how that works; just find me later, or on Twitter.
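As a rough illustration of the reverse lookup described in that answer, here is a small sketch. The data structures are my assumption; Rare's actual tooling isn't shown in the talk.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Coverage map built from the nightly test runs:
// source file or asset path -> names of the tests that touched it.
using CoverageMap = std::map<std::string, std::set<std::string>>;

// Pre-commit: given the files in a pending change, do the reverse lookup
// and return the set of tests believed to be related to that change.
std::set<std::string> SelectRelatedTests(const CoverageMap& Coverage,
                                         const std::vector<std::string>& ChangedFiles)
{
    std::set<std::string> Tests;
    for (const std::string& File : ChangedFiles)
    {
        const auto It = Coverage.find(File);
        if (It != Coverage.end())
            Tests.insert(It->second.begin(), It->second.end());
    }
    return Tests; // the build system runs only these before reporting back
}
```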
Awesome, thank you. And one more small question: what do you think about developing with testing as a first-class concern, so instead of bolting it on at the end, designing your units for testability, so they're more robust and well structured?

Yes, I briefly mentioned that. You mean when you're actually building the code up? Right, after prototyping, of course. So, as I mentioned, we don't really do it in the strictest test-driven way. We just found that for gameplay especially it's a bit too fluid, and often you're using Unreal Engine, which itself is not tested. That's one of the limitations of what we're doing here: we're assuming the engine all works, and it does for the most part, but it's not tested, and it hasn't been built in a way that avoids dependencies; there are a lot of dependencies between everything. It's an oldish engine, right? And I'm at a Microsoft studio using Unreal now, so I'm getting myself in trouble saying that. So often when you're building gameplay you'd find, "oh, actually I have to bring this dependency in", and things like that. Being pragmatic, we found it's often good to get things working first, rather than building up your tests and then your unit. But when you had tests in mind that you knew you'd need to make later, it still meant you really had to think about those dependencies, because you knew the more you added, the more you'd have to mock later. So it's: do you really need that dependency? Is there another way of doing it? (There's a small sketch of that idea after this Q&A.) I definitely found our codebase was a lot cleaner on Sea of Thieves than it had been on previous projects because of that.

Awesome, thank you. Thanks for the talk. Were you tempted to parallelize the integration tests when you first realized they were taking so long to run, before you switched to actor tests? Were you running them all on one build agent?

Oh, we do parallelize them, actually. As I mentioned, we run groups of tests. But we could only go as wide as our limited number of agents: we have all our agents on site, and especially because we're also running on Xbox dev kits, which are a bit more work to set up, there were limits to the number of dev kits we had, and things like that. So yes, we were parallelizing them, but even with that, given the number of changes and pre-commits people send through, it's just an ongoing battle to keep that workable. All right, thank you.
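Picking up the earlier point about designing units with their dependencies in mind, here is a minimal sketch in plain C++ of the general idea: gameplay logic depends on a small interface rather than a concrete engine system, so a test can substitute a fake without mocking half the engine. All names here are illustrative, not Rare's actual code.

```cpp
#include <string>

// The narrow interface the gameplay logic actually needs.
class ITreasureValuer
{
public:
    virtual ~ITreasureValuer() = default;
    virtual int GetValue(const std::string& TreasureType) const = 0;
};

// The unit under test sees only the interface, injected at construction.
class ShipHold
{
public:
    explicit ShipHold(const ITreasureValuer& InValuer) : Valuer(InValuer) {}

    void Store(const std::string& TreasureType)
    {
        TotalValue += Valuer.GetValue(TreasureType);
    }

    int GetTotalValue() const { return TotalValue; }

private:
    const ITreasureValuer& Valuer; // injected, so tests can pass a fake
    int TotalValue = 0;
};

// In a fast actor-style unit test, a fake valuer stands in for the real
// economy system, keeping the point of failure specific and obvious.
class FakeValuer : public ITreasureValuer
{
public:
    int GetValue(const std::string&) const override { return 100; }
};
```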
Hi, I was curious about the one line you had about getting buy-in from the rest of the team, and the trade-off of a little velocity for sustainability. How did you present that argument to team leadership, and was there any particular data you pointed to? I mean that graph of Nuts & Bolts and its bug count.

Unfortunately, we didn't have that graph until we'd finished; it just worked out that way. I think the reality is that we knew what kind of game this was, and mostly that we'd had such bad experiences on previous projects. When I was researching this presentation, I was told we didn't do DLC for an old project because the testing for it would simply have been too much work, which sounds insane, but apparently that was the situation. So I think when the producers looked at it, they were thinking: if this means we can actually do an update every week if we need to, then let's give it a go. And I've not heard any pushback from our producers that the tests are taking too long.

Thank you. So when your tests are running, are the actor tests running essentially without a world?

There is a world, but we almost do "new object world" and have our own world, rather than whatever's already there. If you know Unreal, then especially in the editor there are something like two or three worlds active at the same time. We just spin up a very minimal world and use that, with the actors in it. It's one of those things that feels wrong, because you're not using things in the proper way, but if you're just testing the logic of something, it was good enough for what we needed.

And how different is that sort of testing? How much did you need to change the automation system to do what you needed?

I believe it wasn't that much work; it was more a change of thinking than changing much code in the framework, to be honest. I think the engine does support it. If you talked to Epic they'd probably say "don't do that, it's weird", but if you create a new world and spawn actors into it, it will work. There were some caveats: it meant some things didn't work, and it wasn't perfect, but it was good enough for what we needed, and it's so much faster to run. It's a big game, so we had a lot of tests.

Thanks. I'm curious whether you've ever used or heard of Hypothesis, or property-based testing, and whether you considered using it, and if not, why not?

No, I don't think I have, actually. Could you talk briefly about what that is?

Yeah, so basically the idea is that the test system generates inputs for you. The classic example is that you have an encode function and a decode function, and you just say the encode function is the inverse of the decode function and vice versa, generate a lot of string inputs, and see where it fails. Then it tries to find the minimal string that makes it fail.

I see. No, I don't think anyone mentioned trying that. It would be interesting to see how that works in gameplay scenarios.

It's definitely worth checking out. Hypothesis is a Python library that will give you a good idea of how it works.

Cool, I'll look into that. That sounds really cool.
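As a minimal sketch of the property-based idea the questioner described, here is the encode/decode round-trip in plain C++ (the Python Hypothesis library does this with far more sophistication, including shrinking failures to a minimal input). The trivial hex encoder here is a stand-in for illustration.

```cpp
#include <cassert>
#include <random>
#include <string>

// Toy encoder: each byte becomes two hex characters.
std::string Encode(const std::string& In)
{
    static const char* Hex = "0123456789abcdef";
    std::string Out;
    for (unsigned char C : In)
    {
        Out.push_back(Hex[C >> 4]);
        Out.push_back(Hex[C & 0xF]);
    }
    return Out;
}

// Inverse of Encode: each pair of hex characters becomes one byte.
std::string Decode(const std::string& In)
{
    auto Val = [](char C) { return C <= '9' ? C - '0' : C - 'a' + 10; };
    std::string Out;
    for (size_t i = 0; i + 1 < In.size(); i += 2)
    {
        Out.push_back(static_cast<char>((Val(In[i]) << 4) | Val(In[i + 1])));
    }
    return Out;
}

int main()
{
    std::mt19937 Rng(12345); // fixed seed keeps any failure reproducible
    std::uniform_int_distribution<int> LengthDist(0, 64);
    std::uniform_int_distribution<int> ByteDist(0, 255);

    for (int Iteration = 0; Iteration < 10000; ++Iteration)
    {
        // Generate a random input rather than hand-picking examples.
        std::string Input;
        const int Length = LengthDist(Rng);
        for (int c = 0; c < Length; ++c)
        {
            Input.push_back(static_cast<char>(ByteDist(Rng)));
        }
        // The property: decoding an encoded string yields the original.
        assert(Decode(Encode(Input)) == Input);
    }
    return 0;
}
```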
Hello. First of all, to address the previous concern about buy-in: something I find very interesting, and good to see from your perspective, is how you've avoided the potential for crunch.

Yeah, again, not completely. I'll get in trouble if I say we've completely abolished crunch, but it's definitely a lot better.

For me, the reduction of crunch and the more reliable production timeline trumps moving fast and breaking things, because otherwise you accumulate technical debt that you'll have to fix later, and your producers will like you for having a reliable timeline. Besides bug accumulation and technical debt, the other typical reason I see productions fail at the end is performance: you build systems upon systems upon systems, and then content hits, and hits the fan. Did you do anything to account for content hitting the systems later on? What was your performance testing?

Yes, as I mentioned, we have the performance tests. Those would do things like putting a client in the actual full game world and collecting data for five minutes or so, travelling to every island, those kinds of things. And we have graphs that we monitor; the engine team have them up on their screens, and they can see in real time when something has happened and the graph has trended down, and then they'll often jump in quickly to find what changed recently. The more difficult ones are the very slow downward trends, where you're just steadily adding content. At that point, someone often has to be assigned an optimization task to get us back a bit above water again. That's just the way it is with a continually evolving game like this.

But did you have any kind of budgeting system for that, to find these slow trends and say, okay, maybe this one is acceptable because we're still within budget?

Yes, we give the artists budgets in terms of polys and things like that, and in terms of memory consumption we have that as well. But we're always pushing it as much as we can, and it's harder to do when we're adding gameplay, where the cost is general performance rather than performance caused by the complexity of the world. That's more nebulous, and harder to give strict guidelines on.

And just one quick question to follow up on that, I think we're almost out of time: did you have anything to stabilize the performance testing? Performance can be quite noisy.

Oh, stabilize it. I think we just ran it over a long time. I can probably find out for you whether we did anything more than that. Cool, thank you.
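To make the trend-monitoring idea above concrete, here is a minimal sketch of the kind of check that could sit behind such graphs: compare a recent window of frame-time samples against a budget, and against an earlier baseline window, to catch both sudden drops and slow drift. The window sizes and thresholds are illustrative assumptions, not Rare's actual values.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Mean of the samples in the half-open range [Begin, End).
double Average(const std::vector<double>& Samples, size_t Begin, size_t End)
{
    return std::accumulate(Samples.begin() + Begin, Samples.begin() + End, 0.0) /
           static_cast<double>(End - Begin);
}

// Returns true if the most recent window exceeds the frame-time budget, or
// has drifted more than DriftTolerance (e.g. 0.05 for 5%) above the window
// before it. Averaging over windows also damps the noise mentioned above.
bool HasRegressed(const std::vector<double>& FrameTimesMs,
                  size_t WindowSize, double BudgetMs, double DriftTolerance)
{
    if (FrameTimesMs.size() < 2 * WindowSize)
    {
        return false; // not enough history to judge yet
    }

    const size_t N = FrameTimesMs.size();
    const double Baseline = Average(FrameTimesMs, N - 2 * WindowSize, N - WindowSize);
    const double Recent   = Average(FrameTimesMs, N - WindowSize, N);

    return Recent > BudgetMs || Recent > Baseline * (1.0 + DriftTolerance);
}
```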
Info
Channel: GDC
Views: 26,053
Keywords: gdc, talk, panel, game, games, gaming, development, hd, design
Id: X673tOi8pU8
Length: 65min 49sec (3949 seconds)
Published: Thu Sep 23 2021