Unite Berlin 2018 - Unity's Evolving Best Practices

Captions
Hello, hello - yes, we'll actually start now. Thank you, and thank you particularly to the people who've been waiting for over an hour for this talk; I hope I can live up to that level of expectation. Let's get right into it - this is going to be a long one.

Once again I see a lot of familiar faces in the crowd. This is my third year running giving one of these optimization talks at Unite, so thank you for coming back. For those who are new: my name is Ian, and I am the lead developer relations engineer for Unity. I'm on the enterprise support team, and what my team does is visit our customers and try to help them solve problems. Most often these problems are performance related, because of course folks want their applications and their games to run as well as possible on their target hardware. Once a year I try to gather up all the things we've learned and turn them into a talk to give back to the Unity community.

This year I'm going to go over four distinct topics. I'm going to start with one that's near and dear to my heart, scripting performance, because that's how this series of talks got started. But I'm not going to give you just a bunch of tips about how to make your scripts faster; I actually want to take a specific example and use it to dissect how to performance test things in general. After that we'll go back to talking about Unity's internal systems, focusing primarily on transforms, then talking a bit about audio and the various ways we can play animation in Unity.

But first, an important message. Who here thinks the next slide is going to say "please profile it"? I tricked you. Actually, the most important thing is this: whenever you hear a piece of performance advice, do not believe it out of hand - including when I say it, especially when I say it. I'm just some doofus on a stage. I don't know your use case, I don't know how your data is structured, I don't know what systems you're using, and I don't know what hardware you're running on. The most important thing is to understand why I'm advising you to do something - think about it, understand it, then apply it and profile to find out how your game, your application, reacts when you apply that advice.

To illustrate: who remembers this slide? This is from my very first Unite talk. On this slide I'm telling you that when the C# string APIs compare two strings, they are always doing locale-specific conversions to ensure that different characters can match when they come from different cultures, and that this is slow. If you look closely, the specific advice I'm giving is not to use the string equality operator; in particular, I'm saying don't use plain String.Equals - pass the StringComparison.Ordinal enum into String.Equals to force an ordinal comparison. This is wrong. I'm sorry, I've been fooling you for two years. Actually, it's not entirely wrong; let's add some nuance. In the specific case of String.Equals, this advice is wrong. If you closely read the "Best Practices for Using Strings" guide on MSDN, it tells you that all string comparisons in C# are culture-sensitive except for String.Equals. But don't take my word for it - let's not even take Microsoft's word for it - let's test it.
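For orientation, these are the calls being compared in what follows. This snippet is illustrative rather than from the talk, and the string values are placeholders:

```csharp
using System;

static class StringComparisonExamples
{
    static void Demo()
    {
        string a = "straße", b = "strasse"; // placeholder values

        // Ordinal by default: the == operator and String.Equals(a, b),
        // per the MSDN "Best Practices for Using Strings" guidance cited above.
        bool eq1 = a == b;
        bool eq2 = string.Equals(a, b);

        // Culture-sensitive by default: String.Compare and most other string APIs.
        int c1 = string.Compare(a, b);

        // Explicitly ordinal variants, which the tests below also cover.
        bool eq3 = string.Equals(a, b, StringComparison.Ordinal);
        int c2 = string.Compare(a, b, StringComparison.Ordinal);
        int c3 = string.CompareOrdinal(a, b);
    }
}
```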
So how are we going to do that? When testing anything, there are a couple of things you have to consider. First and foremost, consider your inputs: consider how the system - how the code you are testing - reacts as you change its inputs. How does it react to highly coherent data, data located serially in memory? How does it handle cache-incoherent data? What hardware is it running on - if you have a heavily multi-threaded system, how does it react when running on hardware with more cores versus fewer cores? What are the scaling parameters? Effectively: what exactly are you measuring with your test harness?

So let's consider this specific example of String.Equals. Mono is open source - at least, Unity's version of Mono is open source; it's on our GitHub. If you opened up String.cs from that GitHub repository and looked at String.Equals, this is what you would see: a very simple little function. All it really does is a couple of small checks before passing control to a function called EqualsHelper, which is a private function you can't call directly without reflection. The interesting thing is down at the very bottom of this function: before we go into EqualsHelper, there's a small optimization. You can see that it tests whether the two strings are of unequal length, because by definition, if the two strings are not of equal length, they cannot be equal in terms of content.

Now, how does the Equals overload that accepts an enum differ from this? Oh, sorry - first, what does EqualsHelper do? That's always good to know. It's mostly a bunch of checks that aren't that interesting. When you get down to the very meat of it, it pins the two strings to specific memory addresses using unsafe code, and then it walks through the strings several bytes at a time - it casts them to long integers, so it's comparing four characters of your strings at a time. It does that three times, and then it steps through the loop with a stride of 12 bytes. Why 12 bytes specifically? According to a comment in the code, this is more efficient on AMD processors. I have no idea if that's true or not, but if someone else wants to do a performance test, I'd love to hear the results.

Now, the other overload of String.Equals, the one that takes a StringComparison enum: the checks at the start of that function differ slightly. We still see the reference-equality checks, but there are a couple of other tests to make sure the enum is in range. Then what happens - does it go directly into EqualsHelper? No. First it checks which enum you passed in, before any of the culture-sensitive work; if you asked for a culture-sensitive comparison, it forwards control to String.Compare. It's only after all these checks on the enum that we finally get to the ordinal path, and here's where we see the length check and the call to EqualsHelper. So we can immediately see that this code might actually be a little bit slower, and we can begin to understand why.

But that's not all. There are a bunch of ways of comparing strings - we don't just want to check equality; we want to consider all the code we could possibly use. I've said that String.Compare is slow, and if you look inside String.Compare you will see very similar code to what you saw in String.Equals, but in the ordinal code path, after a giant switch statement, you will see a call to CompareOrdinalHelper, just like you see in this function here. There's another function I've illustrated on this slide, though, called CompareOrdinal: it skips that giant switch statement, does a few basic checks, and then forwards control down to CompareOrdinalHelper.
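As a rough mental model of the shape these methods share - reference and null checks, a length check, then a walk over the characters - here is a simplified, safe-code paraphrase. This is an illustration only, not the actual Mono source; the real EqualsHelper pins the strings and compares several characters per iteration using unsafe code:

```csharp
// Simplified paraphrase of the ordinal-equality pattern described above.
static bool OrdinalEquals(string a, string b)
{
    if (ReferenceEquals(a, b)) return true;
    if (ReferenceEquals(a, null) || ReferenceEquals(b, null)) return false;
    if (a.Length != b.Length) return false;   // unequal lengths can never be equal

    for (int i = 0; i < a.Length; i++)
        if (a[i] != b[i]) return false;

    return true;
}
```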
CompareOrdinalHelper is very similar to EqualsHelper, but it goes character by character, because it's actually looking for ordering differences. There's an interesting micro-optimization in here, though, that might point at a different test - I've highlighted it on this slide - for when the first two characters are equal. So now we have two interesting test cases: what happens when we bypass the length check, with two strings of random content but the same length, and what happens when we have two strings of random content and the same length where the first character is equal? Both of those might defeat some of the micro-optimizations we see in these methods.

There's actually another interesting overload of CompareOrdinal: one that takes a bunch of indices as arguments. It performs some very basic checks and then forwards control directly down to native code, so I definitely want to see how performant - or non-performant - that is.

So we're going to look at four different cases for input. We're of course going to use the worst case, where the two strings are equal, because then we have to compare every character in the two strings. We're going to look at the two edge cases I described. And then we're going to look at what I would consider the average case - what we'd expect to see when comparing two strings in the wild: two strings with different lengths and different, randomly generated content. I've actually fudged this test to ensure that the lengths are never equal.
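A minimal sketch of the kind of harness that can produce numbers like the ones discussed next: a plain Stopwatch loop over each API with the same pair of inputs. The string lengths, iteration count, and random seed are arbitrary placeholders, and in a Unity project you would log with Debug.Log instead of Console.WriteLine:

```csharp
using System;
using System.Diagnostics;

static class StringCompareBenchmark
{
    static string RandomString(Random rng, int length)
    {
        var chars = new char[length];
        for (int i = 0; i < length; i++)
            chars[i] = (char)rng.Next('a', 'z' + 1);
        return new string(chars);
    }

    static void Time(string label, Func<string, string, int> compare, string x, string y)
    {
        const int iterations = 1000000;
        int sink = 0; // keep the result live so the loop isn't optimized away
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            sink += compare(x, y);
        sw.Stop();
        Console.WriteLine("{0}: {1} ms (sink {2})", label, sw.ElapsedMilliseconds, sink);
    }

    static void Main()
    {
        var rng = new Random(12345);
        // "Average" case: random content, deliberately different lengths.
        string x = RandomString(rng, 64);
        string y = RandomString(rng, 65);

        Time("Equals           ", (a, b) => a.Equals(b) ? 1 : 0, x, y);
        Time("Equals (Ordinal) ", (a, b) => string.Equals(a, b, StringComparison.Ordinal) ? 1 : 0, x, y);
        Time("Compare          ", (a, b) => string.Compare(a, b), x, y);
        Time("Compare (Ordinal)", (a, b) => string.Compare(a, b, StringComparison.Ordinal), x, y);
        Time("CompareOrdinal   ", (a, b) => string.CompareOrdinal(a, b), x, y);
    }
}
```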
What's the data? That's a lot of numbers. The first thing you can see, at the very top of the slide, is that String.Equals is the clear winner when running on Windows under IL2CPP with the .NET 3.5 scripting runtime. String.Equals with the StringComparison.Ordinal enum passed in is roughly twice as slow - and we saw roughly twice as much code, so we can understand why that might be: the cost of the culture-sensitivity checks. We can see in the String.Compare line that it's ten times slower than any of the other methods here. Interestingly, though, if you pass the ordinal type to String.Compare, or if you call CompareOrdinal directly, or if you call the CompareOrdinal overload that passes control down to native code, we don't see a significant performance difference - they're so close that these could just be stochastic differences in my test, slight variances in the processor's performance. For reference, I've included a simple hand-coded example: it's just a little function with a length check, and then a while loop over each character in the string. In most cases it's less performant, but because I pulled the length check up to the top and omitted all the other checks, there are a few cases where it approaches the performance of String.Equals - though it never quite matches it.

If I switch over to the .NET 4.6 scripting runtime, the numbers don't change all that much. The main thing is that String.Equals remains ahead for the average and worst cases; it gets a little bit worse in those two edge cases I mentioned. The big difference is actually in string comparisons: there we see a substantial - up to 50% - performance improvement in most of our cases, just by switching scripting runtime. There's actually one really interesting case: if you look very closely at this slide, on the CompareOrdinal row there is one column where it is faster than String.Equals - that's when you have random content with identical lengths. I can't fully explain this, but it'll be interesting to dig into why. So if you're comparing two strings and you know their lengths are equivalent, you might want to try using CompareOrdinal on the 4.6 scripting runtime. It's a really tiny micro-optimization, probably not worth making.

Here's a graph. What this graph shows you is the relative performance of the different functions with different inputs on different platforms - it's kind of hard to read from the back row. The four sets of bars closest to me - sorry, on the right, depending which side of the stage you're on - are the worst case, where the strings are equal; the four sets of bars on the left-hand side are the average case, where the strings are random with random lengths. What we can see in both cases is that String.Equals remains the best, but we do see the performance improvement illustrated as we move to the 4.6 scripting runtime and, in many cases, as we switch from Mono to IL2CPP. So immediately we have validated that, yes, String.Equals is the best way to compare strings - you've all learned a little bit about that today - but if you're doing a lot of string comparisons, you may want to consider switching to the 4.6 scripting runtime. In addition, if you're comparing two strings and you know you want to compare them ordinally, instead of passing the ordinal comparison type you could use String.CompareOrdinal - in those tests you saw a tiny, tiny performance difference - but for the other string APIs we still want to use ordinal comparisons.

Now, this is on one platform, and it could actually change across different platforms. So I took this test and ported it to an iPad Mini 3, a much older device with a much lower clock rate. As you might expect, all the numbers get bigger. It's hard to compare performance across different platforms directly, so what I did was try to normalize the results. I'm not going to ask you to digest this slide; we're going to go straight to the graph. What I've done here is take the result where the strings have varying, random content and normalize that as performance index 1 for a given platform; I then divided all the performance results for that platform by that baseline. So here we can compare the differences in performance between different APIs in a relatively platform-independent manner. What we can see is that the performance differences between the APIs don't really change across platforms, for this specific type of test, for this specific API.

Another interesting thing I found while I was doing this: since I had it on an iPad, I pulled the test into Instruments and looked at what was going on inside my string comparison function. I was actually surprised at how slow it was, and what I found was that 50% of my time appeared to be going to this null check function. When you cross-compile code with IL2CPP and we're invoking an instance method, we add a null check to make sure we're not invoking a method on a null object. In this case, that appears to be taking up a large amount of time, and I'd like to be able to eliminate it - because I, at least, can believe that my code is perfect; everyone believes their code is perfect. So what I want to do is disable that.

You can do that with IL2CPP; the facility is just not built into Unity directly. Inside your Unity install folder there is an il2cpp subfolder, and inside that il2cpp subfolder you'll find Il2CppSetOptionAttribute.cs. You can drag this into your project and you'll get access to the Il2CppSetOption attribute. You can decorate a type or a method with this attribute, and it allows you to disable the automatic generation of these null checks. It also allows you to disable the automatic generation of array bounds checks, so if you're indexing into arrays multiple times, this can help speed that up. If you decorate a type with it, it will remove these checks from all instance methods on that type, whereas if you decorate a method, it's just that one method. Now, what's the benefit? It's actually quite small when you test it - I was surprised again. When I added this just to my string comparison method, it got about 15 percent faster. Not as much as the profiler would have led us to believe, but still a nice little win for just adding one attribute to one method.
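A sketch of how the attribute is applied, assuming Il2CppSetOptionAttribute.cs has been copied into the project as described; the class and method names here are illustrative, and the attribute only has an effect in IL2CPP builds:

```csharp
using Unity.IL2CPP.CompilerServices;

// Decorating the type disables the generated checks for its methods;
// decorating a single method affects only that method.
[Il2CppSetOption(Option.NullChecks, false)]
[Il2CppSetOption(Option.ArrayBoundsChecks, false)]
public class TightLoops
{
    [Il2CppSetOption(Option.NullChecks, false)]
    public static bool OrdinalEquals(string a, string b)
    {
        if (a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
            if (a[i] != b[i]) return false;
        return true;
    }
}
```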
All right, let's go into the core of Unity; let's start with the transform system. We're not talking about Megazord transformations - we're talking about the Unity GameObject Transform. Most people don't think much about this component; they just think of it as a place where you input some data, Unity uses that data, and that's about it - you structure your scenes with it. It's actually quite important from a performance perspective, and to understand why, we're going to take a walk down memory lane.

We're going to start with the olden days: Unity 5.3 and older. Originally, whenever you created a Transform - regardless of whether it was a root or you reparented it - the object representing that Transform's data could be allocated anywhere on Unity's native heap. This meant that when iterating through a transform hierarchy linearly, we were not iterating linearly through memory; we had no cache coherence. That meant that when iterating over that hierarchy, we were actually stalling the processor repeatedly, waiting for it to fetch data from main memory. This is no good - but why does it matter? You might be thinking: "In my code, I don't really iterate linearly over a lot of transforms."

Well, there is a thing inside Unity called the OnTransformChanged message. You cannot be blamed for not having heard of it, because it's not exposed to you - it's not something you can intercept on a MonoBehaviour; it's an internal message. This message was broadcast any time you changed the position, rotation, or scale of a Transform - and I want to emphasize: every time you changed the position, rotation, or scale. What would happen was this: the message would go from the Transform that was changing to all the components on the GameObject attached to that Transform, as well as to all of its children and all of their components. They would check to see whether they were interested in transform updates, and if they were, they would update themselves synchronously. This meant that if you had a rotating cube and you moved that cube around, that cube was updating its axis-aligned bounding box every time you changed the Transform's position or rotation. This was a major performance problem in older versions of Unity.
One of the most common pieces of advice I gave people when dealing with this problem was to batch up your transform updates: add up all the changes of position and rotation for a given transform over the course of a frame and apply them at the end of the frame, or after you've finished calculating your updates. The other thing you could do - because things like AI and character movement controllers often have to update both the position and the rotation of a transform in the same frame - is use the SetPositionAndRotation API we introduced in 5.6, which lets you do both simultaneously and eliminates one of the transform change messages. But this is not really a full solution to the problem.

First, we wanted to make it more efficient to broadcast this OnTransformChanged message. So what we did in Unity 5.4 was introduce a thing called the TransformHierarchy structure. Again, this is not something that's exposed to you; it's entirely internal to Unity. We create a TransformHierarchy structure for each root transform in your scene - this is important: there is one TransformHierarchy structure for each root transform in your scene. That TransformHierarchy structure contains multiple contiguous data buffers. These data buffers represent the data of that root transform plus all of its children: their positions, their rotations, and so on. That meant that when broadcasting the OnTransformChanged message, at least iterating over the transform data remained linear, and we could improve our cache coherence. But again, it's not a full solution to the problem. The core problem here is that updates were still synchronous. We had to break that, because we wanted to enable asynchronous updates - not just for performance reasons, but because this was necessary to make Unity a more asynchronous-programming-friendly engine, so we could introduce things like the ECS in the future. Yes, this is 5.4 - we were already thinking about the ECS.

So what did we do? There's a new system called the TransformChangeDispatch - I say new; it was introduced in Unity 5.4. This system allows us to delay updates to transforms, and I'm going to describe how in a moment. The key thing is that this was such a major change that we did not migrate Unity over to using it all at once. Renderers were the first system to move, in Unity 5.6, and then we completed the migration over the Unity 2017 cycle. The most important one on this list is going to be physics - I'm going to talk specifically about that one in a moment - and I'm very happy to announce that as of Unity 2018.1, the old synchronous OnTransformChanged path is dead. Oh, don't applaud yet - you haven't gotten to the good part.

So how does the TransformChangeDispatch work? As I described, a TransformHierarchy structure contains contiguous data buffers governing all of the transforms in a given hierarchy. That does mean it contains all the rotation and position data, but more interestingly, we allow each system that is interested in transform changes to subscribe to those changes. To do that, we represent each transform with a bitmask - two bitmasks, actually: an interest bitmask and a dirty bitmask. When we attach a Renderer component to a transform, that Renderer component can go into the TransformHierarchy structure, find that one transform's interest mask, and set the renderer bit in the interest bitmask. Then, when we've moved that transform around and want to dirty a system, we can set the dirty bit for all of the interested systems on that one transform.
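Circling back to the batching advice at the start of this section, here is a minimal sketch of the idea: accumulate a frame's worth of movement and apply it once through SetPositionAndRotation, so the transform change machinery runs once instead of many times. The component and method names are illustrative:

```csharp
using UnityEngine;

public class BatchedMover : MonoBehaviour
{
    Vector3 pendingTranslation;
    Quaternion pendingRotation = Quaternion.identity;

    // Other scripts queue changes here instead of writing to the transform directly.
    public void QueueMove(Vector3 delta)      { pendingTranslation += delta; }
    public void QueueRotate(Quaternion delta) { pendingRotation = delta * pendingRotation; }

    void LateUpdate()
    {
        if (pendingTranslation == Vector3.zero && pendingRotation == Quaternion.identity)
            return;

        // One write per frame: position and rotation applied in a single call.
        transform.SetPositionAndRotation(
            transform.position + pendingTranslation,
            pendingRotation * transform.rotation);

        pendingTranslation = Vector3.zero;
        pendingRotation = Quaternion.identity;
    }
}
```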
Now, here's another interesting detail: we don't go through every single transform in the entire scene every time we want to find out what's been dirtied - that would be very expensive. Instead, the TransformChangeDispatch tracks dirty hierarchies. This is important: if you have a transform hierarchy and you change one transform in that hierarchy, that hierarchy will be entered into a list of dirty hierarchies. When a system polls for changes, the TransformChangeDispatch goes through every transform in each dirty hierarchy, examining the dirty bits for the system that's asking, then returns the list of dirty transforms to the system making the request, and that system can update itself.

Glassy-eyed stares - this is a lot of background information. All right, let's bring it back down to earth. First: because these are contiguous data buffers, you probably want to ensure they are sized appropriately, so you can avoid resizing them. Obviously, if you try to reparent a transform into a hierarchy that has no space, we have to reallocate the buffers, grow them, and copy all the data from the old buffers into the new ones. Not great. The other thing you can do: when creating a new GameObject, remember that a new GameObject will by default be a transform root. This means we're going to allocate a TransformHierarchy structure and all of its associated data buffers. If you then immediately reparent that transform into a hierarchy, that allocation is wasted - that is wasted work. If you want to avoid it, you can just pass in a parent when instantiating the GameObject. Simple. It's actually a pretty good optimization; I talked about this last year.
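A minimal sketch of the difference, assuming a simple spawner component; the field names are illustrative:

```csharp
using UnityEngine;

public class Spawner : MonoBehaviour
{
    public GameObject prefab;
    public Transform container;

    void Spawn()
    {
        // Wasteful: the instance is created as a root (allocating a new
        // TransformHierarchy structure) and then immediately reparented.
        GameObject a = Instantiate(prefab);
        a.transform.SetParent(container);

        // Better: pass the parent at instantiation time, so the instance is
        // created directly inside the existing hierarchy.
        GameObject b = Instantiate(prefab, container);
    }
}
```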
Now let's get into the fun stuff: how does the actual structure of your hierarchy change things when it's governed by the TransformChangeDispatch? Well, remember that the TransformChangeDispatch operates on a list of dirty transform hierarchies, not on dirty transforms. This means we have to walk every transform in a dirty hierarchy. So if you have a single transform at the root of your scene and you've changed one thing, then every system that wants to update has to go through every single transform in your scene. This is not great. The more you split up your hierarchy, the better you make Unity's ability to track changes at a granular level, and the fewer transforms we will have to examine when looking for updates. More importantly, these checks are not single-threaded: we use Unity's internal job system to multithread the checks for dirty transforms. The problem is that the basic multithreading unit - the thing we split onto multiple threads - is the transform hierarchy. So if you have only one or two transform hierarchies in your scene, you have defeated Unity's ability to multithread those checks.

Let's test that. Two test cases; I'm going to call them the parented case and the unparented case. In the parented case, I create 100 root GameObjects with nothing else on them. Each of those root GameObjects gets a thousand empty GameObjects inside it, just for spare data, and one rotating cube - the classic Unity performance test. Then, to show you the benefit of unparenting your transforms, I create 100,000 empty GameObjects and just put 100 rotating cubes at the root. What's the difference? Well, again I turn to my old friend, the iPad Mini 3. On an aggregate basis over 10 seconds, the parented case consumes 20 times as much CPU time - 20 times, just from rotating cubes, with a hundred roots as opposed to a hundred thousand roots. And that's not just on the main thread; that's on the worker threads as well, so you're running your cores pretty warm doing all this extra work.

Now you might say: OK, it's 500 milliseconds across ten real-time seconds, that's not all that much CPU time. Let's consider it on a per-frame basis. What if I told you that you could get 1.5 milliseconds per frame back just by changing your transform hierarchy? That changes the equation. The nice thing is that this is just from updating renderers, and you can actually see it in the Unity profiler: the profiler marker in 2018.1 is PostLateUpdate.UpdateAllRenderers. Do remember that markers change between versions, so you may need to do a little digging to find it. That's a pretty substantial change just for changing the way your scene is structured.

OK, physics. In older versions of Unity, physics updates were synchronous: you moved a transform with a collider attached to it, and we immediately updated the physics scene to ensure that the collider change was reflected in the physics world. This could of course create performance problems, and often did; it's one of the reasons we often asked people to batch up their transform changes, to avoid repeatedly re-indexing the physics world and doing a bunch of expensive calculations. What we did in Unity 2017.2, when physics migrated to use the TransformChangeDispatch, was add the capability to delay updates to the physics scene. But we don't want to force all projects to use that behavior. You've already written your games to assume that physics updates are synchronous. If we suddenly changed it so that you have a transform with a collider, you move the transform, we don't update the physics scene, and then you cast a ray and don't get a hit where you expect one - that's no good; we'd break a lot of existing games.

How did we solve that? We introduced a new setting called autoSyncTransforms - my colleague Kari is going to be speaking more about this on Thursday. When this setting is true - and it defaults to true in 2017.2 through 2018.2 - it effectively gives you what appears to be the same legacy behavior: any time you perform a physics query - Raycast, SphereCast, whatever - the physics system will query the TransformChangeDispatch to see if there have been any updates. When it's false, that is not the case: the physics system will only automatically update before running a simulation step. You can force it to update if you want - if you want to move all your transforms around, then force physics to update, and then perform a bunch of raycasts, you can do that with the Physics.SyncTransforms API.

How does this change things in terms of performance? Let's test it. I'm going to create two cases. The first case I'm going to call the batched case - the good case. This is where I do what I just told you to do: make all of your transform changes at once, and then allow physics to synchronize when doing the raycasts. In theory, when I'm doing all of these raycasts, there should be a very low cost from the TransformChangeDispatch, because we're only updating the physics scene once. In the immediate case, we do what you see in more classic ways of building Unity games, where you have a lot of disparate MonoBehaviours moving transforms around and casting rays all over the place.
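A minimal sketch of the batched pattern (the "good case" described above), assuming autoSyncTransforms has been turned off; the component, fields, and movement are illustrative:

```csharp
using UnityEngine;

public class BatchedRaycaster : MonoBehaviour
{
    public Transform[] movers;                               // objects with colliders that we move
    public Vector3 frameOffset = new Vector3(0.1f, 0f, 0f);

    void Start()
    {
        // With auto-sync off, physics only picks up transform changes before a
        // simulation step or when we explicitly ask it to.
        Physics.autoSyncTransforms = false;
    }

    void Update()
    {
        // 1. Apply all transform changes for the frame first...
        foreach (Transform t in movers)
            t.position += frameOffset;

        // 2. ...then synchronize the physics world once...
        Physics.SyncTransforms();

        // 3. ...and only then run this frame's queries against up-to-date colliders.
        foreach (Transform t in movers)
        {
            RaycastHit hit;
            if (Physics.Raycast(t.position, Vector3.down, out hit, 10f))
            {
                // react to the hit
            }
        }
    }
}
```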
So I've interleaved the transform updates and the physics queries. In theory, this should force physics to update every time I call Raycast, and it should also cause us to invoke the TransformChangeDispatch every time I call Raycast. In theory that's slower, right? Yes, actually, it is - there's a substantial difference between these cases. The interesting thing is that whether you have a whole bunch of transforms parented to one another, or you have unparented transforms, there's not a big difference in the cost of the raycasts when you're in immediate mode: when you're mixing transform updates and raycasts, most of the cost comes from updating the physics world itself. When you actually batch up your updates - when you only update the physics world once instead of multiple times - we see a 50 to 75 percent performance improvement, just in this simple case. So: batch your transform changes, and use Physics.SyncTransforms. One note: in 2018.3, autoSyncTransforms is going to default to false, so you are going to need to adapt your code to this different behavior.

All right, let's switch over to talking about audio. This is an unusual thing to cover in a performance talk; I realize most people don't think of audio as a performance problem in their game - you usually don't see a lot of CPU time going to audio. I heard some people sucking their teeth in the audience, so I know some of you have seen this before. All right, let's review. Internally, Unity uses a system called FMOD. FMOD runs on its own threads, and those threads are responsible for decoding audio and mixing audio together - they actually run those mixers that you design. Every frame, Unity iterates over all of the active AudioSources in your scene - every single AudioSource. It gathers a bunch of parameters together, doing things like calculating the distance between the AudioSource and the AudioListener for volume attenuation. It packages that up into what we call a voice, and we send that to FMOD. FMOD then processes the voice: it decodes the audio clip if necessary, mixes it together, multiplies it by the volume parameter you've passed in, and then it mixes a certain number of the most important voices - generally the loudest ones - to produce the final waveform that it submits to the audio hardware. The number of voices it actually mixes is governed by an audio setting: the real voices setting.

The key thing here is that all of this is done in software - Unity does not use audio hardware decompression or hardware decoding. So if you have a bunch of MP3s being decoded, your background thread might be churning away just decompressing those MP3s. Every voice that you play - every AudioSource that is active - is going to be evaluated and mixed by FMOD.

There's another trap here, though; it's not just the fact that the voice is active. If you set volume to zero, you might understand that that's just a parameter: we're still packing it up, sending it to FMOD, and allowing FMOD to handle it. The problem is that people see a mute checkbox on an AudioSource and assume that if the mute checkbox is set, Unity will avoid doing any computation on that AudioSource. That is wrong. All the mute checkbox does is clamp the volume parameter to zero when we submit the voice to FMOD - which means that if you have a bunch of muted voices in your scene, you are paying a lot of extra CPU time to have no effect on your user.
And this is not directly visible in the Unity CPU profiler. What you will see in the Unity CPU profiler is just the main-thread cost; it shows up as the AudioSystem update - the cost of iterating over those AudioSources and calculating their parameters. If you actually want to see the cost of audio decoding, scroll down and go to the audio profiler. There are a couple of little lines in the audio profiler that tell you roughly how much CPU time is being consumed by FMOD's background threads.

Now, there are a couple of different things that might change the performance of audio, and we're going to test them all. What I'm going to do is take an identical four-minute piece of music, copy it four times, and compress it with the four major compression formats that Unity supports: PCM, ADPCM, Vorbis, and MP3. I'm going to set these all to the Compressed In Memory load type, so we can be sure we're incurring the decompression cost for any active voice, and then I'm going to vary the number of active AudioSources in the scene. To eliminate stochastic differences - because especially with a variable-bitrate compression like MP3 or Vorbis, the amount of CPU time we have to spend per frame varies widely - I'm going to capture 10 seconds of CPU time and sum up all the time consumed by both the main thread (the AudioSystem update) and the FMOD audio threads.

What do we see? First, on the main thread, we see roughly linear scaling as the number of clips being played back increases. In fact, at 500 clips we're consuming roughly 20% of our CPU time on the main thread just updating these AudioSources - quite a bit of time. But on the background threads we see a very different story. On the background threads it's not the number of clips that is the key performance indicator; it's actually the compression format. Regardless of format, the amount of CPU usage scales almost logarithmically with the number of voices, but even with a very low number of voices, we consume a significant amount of time decompressing the heavier compression formats - decompressing MP3 or Vorbis. The interesting thing for me here was that I originally thought that on iOS, MP3 would be cheaper; it turns out that Vorbis is cheaper. So I'm sorry if I ever advised anyone to use the MP3 format on iOS.

So, some principles. First, avoid having too many active AudioSources. If you're muting your AudioSources to get rid of them - no, don't do that. Disable the GameObject, or call AudioSource.Stop to keep them from playing; that is the only way to completely eliminate their CPU utilization. You can also trade memory for CPU usage, if you set your clips to Decompress On Load or use a lighter-weight compression format - switching from MP3 or Vorbis to ADPCM or PCM. This is best for short clips, things that you play frequently. Especially on mobile platforms, it's very hard to fit many music tracks into memory, so you end up having to keep those compressed; it's unfortunate. The main thing is to try to keep the number of compressed clips that you're playing low, monitor the CPU usage using the audio CPU profiler, and ensure it's acceptable for your game.
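A small illustration of the stop-versus-mute point above; the component and method names are illustrative:

```csharp
using UnityEngine;

public class OneShotEmitter : MonoBehaviour
{
    public AudioSource source;

    // Muting only clamps the submitted volume to zero; FMOD still decodes and
    // mixes this voice every frame, so the CPU cost remains.
    public void SilenceButKeepPaying() { source.mute = true; }

    // Stopping the source (or disabling its GameObject) removes the voice
    // entirely, which is the only way to eliminate its CPU cost.
    public void SilenceForFree()
    {
        source.Stop();
        // or: source.gameObject.SetActive(false);
    }
}
```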
One of the things I've noticed is that, because audio background-thread CPU usage isn't apparent in the CPU profiler, people will have framerate stutters that don't appear to have any actual cause in their own code - they just see random operations inside Unity mysteriously taking longer on certain frames. If you go to the Timeline view in the CPU profiler, you can often see mysterious gaps when, say, updating all renderers or running the animation system. What this may be is the background threads interrupting the main thread: the FMOD threads run with real-time priority, whereas Unity's main thread does not, which means the audio threads can interrupt the main thread and the rendering thread and show up as odd pauses in the profiler.

Another thing you can do, though, is clamp the voice count. You probably don't want to write a whole bunch of different code to vary the number of AudioSources that are playing on all these different platforms, so instead you can change a setting - I changed the virtual voice setting. As you can see, it does not fully eliminate the overhead of playing all these clips - this is summed over both the FMOD and main threads - but as we increase the number of playing clips, the marginal performance cost increases at a very low, almost logarithmic rate when we have a lower virtual voice count. How do you do this? It's actually quite simple: you go to the AudioSettings singleton and call GetConfiguration. The configuration includes the virtual voice count and the real voice count. You can then send it back to the audio settings by calling the Reset API. Note that the Reset API is not cheap - it's not something you want to be doing at runtime. It interrupts the audio system and stops everything from playing, so you may need to restore the playing state of AudioSources after calling it. In general, you really only want to do this during a loading screen or at startup time.
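A minimal sketch of clamping the voice counts through AudioSettings, as described above; the numbers are illustrative, and this is meant to run from a loading screen or at startup, because Reset interrupts all playing audio:

```csharp
using UnityEngine;

public static class AudioVoiceClamp
{
    public static void ClampVoices(int realVoices, int virtualVoices)
    {
        AudioConfiguration config = AudioSettings.GetConfiguration();
        config.numRealVoices = realVoices;        // e.g. 32
        config.numVirtualVoices = virtualVoices;  // e.g. 128
        // Reset restarts the audio system and stops everything that is playing,
        // so any AudioSources that should keep playing must be restarted afterwards.
        AudioSettings.Reset(config);
    }
}
```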
All right, we're going to finish by talking about animations and animators, using Unity's various animation systems. This might be a bit of a confusing section - the problem is I'm going to be saying the words "animation", "animator", and "animate" a lot - so I want to start with a definition of terms, so we're all on the same page.

First, the Animator. If I say the Animator component or the Animator system, what I mean is the system represented by the Animator component that you attach to a GameObject and assign an AnimatorController graph to. For the Unity old hands among you, this might be familiar as Mecanim. This is effectively a system that relies heavily on multi-threading to mix together a bunch of different animation clips. In the AnimatorController you define states; these states can have an animation clip or a blend tree. All the animations in a blend tree are mixed together, all the animations on the active states for a given layer are then mixed together, all the active layers are then mixed together, and that is output to the final model - to the bones that are skinned in Unity.

There's another way of animating, though: the Animation component - specifically, the component called Animation that you attach to a GameObject. In some of our documentation we call this the legacy animation system. Let me dispel a rumor right now: you should not be afraid of using the Animation component. We have no plans at this time to deprecate it or remove it. It is an important part of your toolbox, and I'm going to show you why. You see, this Animation component is a very, very simple little system: all it does is iterate linearly over the curves in the animation clip that you give it. There's nothing else underneath; it is effectively a for loop.

Now let's see what the performance difference is between these two things. Quick test: 100 animated objects - again, spinning cubes. I created a bunch of animation clips with different curve counts, and a MonoBehaviour that allowed me to animate a whole bunch of different parameters. For the Animator system I used a simple AnimatorController with one state holding an animation clip set to loop; the Animation system I set to just use that animation clip and loop. Then I varied the animation clips in order to vary the number of curves being evaluated each frame.

First I threw this at an iPad Mini 3, once again, to get our low-end, pathological case. What do we see? The first thing is that in some cases the Animation component is significantly faster than the Animator. The Animator has a relatively high base overhead cost - you can see that at the very right of the x-axis there. This is because the Animator component must set up a bunch of buffers each frame; it has to copy a bunch of animation data around so that it can be evaluated in a multi-threaded manner. But because it pushes the animation work out onto worker threads, it scales far better with high numbers of curves than the Animation component.

The interesting thing is that if you vary the clock speed but not the number of cores, this graph changes slightly, but not in the way you might expect. This data is from an iPad Mini 3; now imagine you took an iPhone 7 - the iPhone 8 was the first iPhone that let you use more than two cores, so on an iPhone 7 you're using the best possible iPhone that still only lets you use two cores. What happens? The crossover point actually moves up to around 600 or 700 curves - that's almost the amount you would use on an actual animated character. Now, remember how I mentioned that the Animator is heavily multi-threaded? How does the number of cores affect performance? As you might expect, paying the overhead cost of the Animator becomes less significant when we can spread the actual animation work across multiple cores: suddenly it scales a heck of a lot better, and the crossover moves down from about 400 curves to about 150 curves. This is on a relatively modern PC; again, the clock rate and the number of cores will affect this heavily. So if you are using animations in your game, test on your lowest-end hardware, your minimum target spec. Consider the number of curves in a given animation and the device you are attempting to run on, and use that to inform your choice. If you are playing back a simple animation, like a squash-and-stretch, if you're just trying to animate a little button that's bouncing, try not to use the Animator system - it's too heavyweight for that simple case; the Animation component is probably what you're looking for.
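A minimal sketch of the simple case described above - a small bounce driven by the legacy Animation component rather than an Animator. The component and clip names are illustrative, and the clip must be marked as Legacy for the Animation component to play it:

```csharp
using UnityEngine;

[RequireComponent(typeof(Animation))]
public class ButtonBounce : MonoBehaviour
{
    public AnimationClip bounceClip; // a Legacy clip

    Animation anim;

    void Awake()
    {
        anim = GetComponent<Animation>();
        anim.AddClip(bounceClip, "bounce");
    }

    // Call this when the button should bounce; the Animation component just
    // walks the clip's curves each frame, with no controller graph behind it.
    public void PlayBounce()
    {
        anim.Play("bounce");
    }
}
```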
On the other hand, the interesting thing is that if you take both of these systems and hold the content constant - use the same animation and just add more and more animated characters to your scene - they both scale roughly linearly. So the multi-threading in the Animator does not actually save you if you're simply throwing 500 characters at the screen.

OK, what about all those other cool features inside the Animator component? There are a lot of them in there - things like masking and retargeting of animations. Well, one of the ones I see people getting into trouble with a lot is layers. Be careful with animation layers. Every AnimatorController can have more than one layer, and each layer is effectively an independent state machine. The states on each layer will all be evaluated each frame: the active state will be evaluated, its animation clips or blend trees will all be evaluated, and then the result of each layer will be multiplied by a layer weight and combined. Yes, I said multiplied - which means that if a layer's weight is 0, you are getting no result but still spending CPU time on it. You should be using layers sparingly. If you have debug layers, demo layers, or cinematic layers on your AnimatorController, consider trying to refactor them: remove the unnecessary ones and merge layers together if you can. This has a substantial performance impact, and you can see it here. If I add another layer - again on my iPad Mini 3 - the performance of an Animator degrades by about 20%; by the time I have 5 layers, we have nearly doubled the amount of CPU time spent just animating my characters. This is not the same test as before: this is 50 Ellen characters from the 3D Game Kit, all playing their idle animation on multiple layers, with layers 2 through 5 having their weights set to 0.

The other thing: there are multiple rigs - multiple ways to animate your character in Unity. By default Unity uses the generic rig, thankfully, but people often switch to the humanoid rig because they think: I'm animating a person, I should use the humanoid rig, it's probably got some special code paths or something. Actually, as of modern versions of Unity, there's not a huge difference between them aside from features. The humanoid rig adds two additional features to the Animator system: inverse kinematics, and animation retargeting, which allows you to reuse animations across different avatars. The problem is that you pay for that: switching the same character, playing the same animations on the same number of layers, from generic to humanoid increases the cost of the Animator by 50%. If you are not using the humanoid rig's features, you should be using the generic rig.

One final thing. In the old days - or still, in current days - people want to pool Animators. Object pooling is a common way people try to reduce spikes when instantiating and destroying characters, but Animators were never friendly to pooling. Every time you enabled a GameObject that had an Animator on it, that Animator would allocate all the buffers it uses to mix animations and copy animation data into them. This is called an Animator rebind, and it was a common cause of performance spikes. Similarly, if you just wanted to disable an Animator because it had gone too far out of the player's area of interest, disabling its GameObject would actually dump its state - all of its buffers, including its finite state machine state - which meant that when you re-enabled it, the thing just started back from the beginning. You couldn't save the state or pause an Animator. The only workaround used to be disabling the Animator component, not the GameObject. The problem is that this had side effects: if you had MonoBehaviours on your character, you had to disable those as well, and if you had MeshColliders or MeshRenderers on your character, you wanted to disable those too, to really save all the CPU time being used by the thing you were pooling. We fixed that - thank you, animation team. In Unity 2018.1 there is now an Animator.keepAnimatorControllerStateOnDisable property. It defaults to false and is only addressable through script, but if you set it to true, Animators will no longer dump their buffers when you disable their GameObjects. As of 2018.1, you can finally pool your Animators.
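A sketch of the pooling pattern this enables. The property name used here, Animator.keepAnimatorControllerStateOnDisable, is my reading of the script-only, default-false setting described above; the pooled component itself is illustrative:

```csharp
using UnityEngine;

public class PooledCharacter : MonoBehaviour
{
    Animator animator;

    void Awake()
    {
        animator = GetComponent<Animator>();
        // Keep the Animator's buffers and state machine state when the GameObject
        // is disabled, so re-enabling it does not trigger a full rebind or restart
        // the state machine from the beginning.
        animator.keepAnimatorControllerStateOnDisable = true;
    }

    public void Despawn() { gameObject.SetActive(false); }
    public void Respawn() { gameObject.SetActive(true); } // resumes where it left off
}
```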
OK, one last thing. I know this has been a long, kind of grueling talk, and many of you waited a long time, so I want to say one last thing: this talk is not my talk - it's my team's. A lot of the things I talked about today, a lot of the data I showed, a lot of the tests I ran originated from investigations that these people did with real customers on real customer projects. So this is not my talk, it's theirs. I would ask you not to applaud me, but to applaud them. Thank you. [Music]

All right, we have about ten minutes; I can take a few questions if anyone would like to ask any - we've got microphones there and there. If not, I'll be out on the show floor for most of the day today and tomorrow; feel free to grab me and ask anything you'd like. Or you can ask that guy - that's Simon, he's on my team, he knows everything too; he got the animation data. Questions? Oh, we've got one.

"Hi, thanks for the information on transforms and how they lay out memory better when placed at the root of the scene. But the benefit of transforms is that they form a hierarchy, of course, and that helps move stuff around. So can you somehow tell a specific child transform that it forms a new hierarchy?" No, unfortunately you can't. The only way to do that would be to split that one specific child out onto its own root. What you could do - and I realize this is difficult - is use some kind of post-processing script: after you've authored your content in a master scene, generate another scene that has a transform hierarchy of static or unchanging objects and a transform hierarchy of dynamic or changing objects. For example, if you have a bunch of fans in your level geometry and that's the only thing that's changing, have a post-processing script that takes their positions, copies them into transforms, and moves them out to the root. "Okay, thanks."

"Hi there. Some time back you spoke very passionately about using the Optimize Game Objects option in model importing." Yes, that was some time ago. "Then 2017.1 came around and we were told it doesn't do that much anymore, so we were told to skip it. Now, with all the changes to the transform hierarchy, which is it - on or off?" It's very hard to say; I actually don't have performance test data. Intuitively, I would assume that it's still better to have fewer transforms in your scene. "Thanks. Also, good to see you again."

Anyone else? "Hello, hi - sorry, quick question: the flag you mentioned for physics, is it only for 3D physics or also for 2D?" Check Kari's talk on Thursday - I'm not sure off the top of my head. I believe it's for 3D physics only; I think 2D physics might have its own flag. Check out that talk. "Thank you."
"Hey - Chris, sorry, I meant to..." Sorry - it's okay. "So, just for all of you: we're using a bunch of GameObjects for our roots, as you told us. Could we use some scenes instead - would that help, like having three scenes, just to have some overview in the hierarchy?" Splitting - whether you have three root transforms in one scene or three scenes each with one root transform - really doesn't change how the TransformChangeDispatch operates; it's all about the root transforms. "Yeah - and animation of the UI system?" Oh yes, you'll note that I omitted that this year. "Go on - well, what can we possibly do? Because it actually forces deep hierarchies." Yes, UI. So the UI system is very complex, and it does force you to have deep hierarchies; there's not really much you can do about that on your own. The best thing you can do, if you have to animate simple things in UI, is to use a script to do it and not use either of the animation systems. If you want to allow an artist to animate something, you could at least use the Animation component, not the Animator, because that is at least a little bit cheaper to start and stop.

Anyone else? Nope. All right, thank you very much for coming - I hope you enjoy the rest of the conference. [Applause]
Info
Channel: Unity
Views: 23,620
Keywords: Unity3d, Unity, Unity Technologies, Games, Game Development, Game Dev, Game Engine
Id: W45-fsnPhJY
Length: 46min 48sec (2808 seconds)
Published: Wed Jul 11 2018