Unite Austin 2017 - Writing High Performance C# Scripts

Reddit Comments

Seems like Unity is getting a real ECS in 18.1. This is exciting.

— u/MaikKlein, 7 points, Nov 16 2017

Interesting! It's going to be hard to switch to this when the language allows reference types and encapsulated functions. But I will definitely try it.

— u/davenirline, 1 point, Nov 16 2017

Friggen. Amazing.

— u/00jknight, 1 point, Nov 16 2017
Captions
All right, thank you for coming today. Yesterday in the keynote we showed a demo, done together with Nordeus, about how you can do massive-scale simulation in Unity. This talk is about digging into the details of how we write such code and how we get this massive leap in performance. In that demo we can have a hundred thousand units simulating every frame; we can instantiate objects; when I blow units up we're instantiating new units and destroying old ones, we're not allocating any GC memory while doing it, and it's barely noticeable in the frame rate. Today we're going to dig into smaller examples of how we do those things.

Let's start with why performance is important. Performance is an enabler. If you want to make a real-time strategy game, you can have a massive number of units on screen. You can use it to do more detailed simulations, more complex code, more complex AIs. Maybe you just want to run at 60 frames per second and spend less time running your game code on the main thread. Or maybe you're making a tiny mobile game and want to save battery: instead of a frame taking 30 to 33 milliseconds, it runs in 6 milliseconds, the CPU is idle the rest of the time, and that reduces battery consumption.

The basics of getting to great performance come down to four things. We need good control over memory layout, and we want to walk our data linearly. We want to write multithreaded code so we can use all the cores modern machines have. We want code that is automatically vectorized, so that we take advantage of the instructions that let a CPU process multiple pieces of data at once; we can do that with our C# job compiler, but to take advantage of it we have to change the way we write our code. That's what we're going to talk about today.

So let's talk about the very basics of achieving performance. Most programmers grow up writing object-oriented code, and I think that when you write object-oriented code there is a limit to the performance you can achieve. If we want to reach the upper end of performance, we cannot write our code that way any more, and I'm going to try to explain why. To do that, we have to understand a little bit about how a CPU actually works, so we can work our way back to how to write code that is super efficient.

CPUs read data in cache lines. Say I just want to read an int that lives somewhere in a class or a struct: what actually happens is that the CPU reads 32 or 64 bytes, depending on the size of the cache line. It can't read anything smaller than that from RAM.
So if you read a single float from one random location in memory and then another from somewhere else, you're effectively loading 64 bytes each time, and you're doing nothing with the rest of the data that has to be loaded along with it. None of that is visible when you write code today — it happens behind the scenes — but if you don't know it's happening, it's very easy to write slow code.

The second effect of modern CPUs is that reading a cache line from RAM takes roughly 100 to 300 cycles. That means when we access a float in a class, if it's not already in cache, we wait 100 to 300 cycles before we can use that data, and if you do that for pretty much every float you access, you're going to get pretty poor performance. What you want instead is to take advantage of the hardware that can fetch memory in parallel with your game code doing actual work. CPUs have a hardware prefetcher built in: if we read data linearly, the CPU detects "I'm reading a float here, then one right after it, then another", and it keeps prefetching the next couple of cache lines while your code is running. Taking advantage of that can give you some pretty massive speed-ups.

So what does that mean for the code we write? What does object-oriented code actually look like to the CPU? Take a very simple example: a class with a string, an array of ints, a quaternion and a position. That all takes memory: the string reference is essentially a pointer, eight bytes; the array reference is another eight bytes; the quaternion is four floats, 16 bytes; the position is 12 bytes. The problem is that this is a class, and when you create a class in C# it is allocated at an essentially random location in memory. When you put these objects into an array, the pointers are laid out linearly, but they point to different locations: one player here, another over there, a third somewhere else.

Now say we want to find the player that is within a 0.5 radius of some point, so we calculate the distance to each one and return the first match. At each array index we take the pointer and jump to a completely random location in memory, and that's where a cache line has to be loaded just to read the position. In fact we're not only loading the position — we're loading 32 bytes, basically the whole object — but the loop doesn't use any of the rest of it. It just sits in cache and wastes bandwidth. From a bandwidth perspective that is really inefficient code. So what's a better way of writing it?
Fortunately, in C#, if you put structs in an array, that array is one linear block of memory. Our struct is 40 bytes: there are still some pointers in there, eight bytes each, 16 bytes for the rotation and so on, so it's a fairly fat piece of data, but at least the memory is laid out linearly. If we run the same loop over these players and check the distance to each one, we read a position, do some math, and the next position sits at a fixed interval, 40 bytes later. It's a completely predictable memory layout, which means the CPU's hardware prefetching kicks in automatically. You don't do anything yourself; it happens because you laid your data out linearly.

There's still a problem, though: waste. The position is the only thing this loop cares about, but we're also pulling in the rotation, the value, the name, all of it. So we can do what we call a hot/cold split. We separate the data into two arrays: an array of player positions, just 12 bytes per element, and a parallel array of the cold data — the data our inner loop never touches. We keep the two arrays in sync, and iterating over the positions becomes really fast: we read 12 bytes, then the next 12 bytes right after, completely linear access with no waste at all. Assuming a 32-byte cache line and 12 bytes actually used, that's roughly a 2.6x improvement in the bandwidth we consume compared to the fat struct.

That's the basic idea: we want control over our memory, and that's how we get performance. It's a very different way of programming from the usual object-oriented approach. There are a couple of other things we want to avoid too. We want to avoid virtual functions: an object with a virtual Update method means jumping through a vtable to figure out which function to call, and the code can't be inlined. We want to think in batches: how do we process a whole group of units and make that fast? We want our memory tightly packed, our memory access linear, and no GC allocations at all — in the demo, spawning 20,000 arrows does not cause a single GC allocation. Essentially, to get great performance we have to stop thinking in object-oriented programming patterns and start thinking in data-oriented design, as game engine programmers call it.
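In code, the hot/cold split described above looks roughly like this — a minimal sketch with illustrative type and field names, not code from the talk:

    using UnityEngine;

    // Cold data: fields the distance query never touches.
    public struct PlayerColdData
    {
        public string Name;
        public int[] Values;
        public Quaternion Rotation;
    }

    public class PlayerStore
    {
        // Hot data: 12 bytes per element, read linearly by the inner loop.
        public Vector3[] Positions;
        // Cold data kept in a parallel array; Positions[i] and ColdData[i] describe the same player.
        public PlayerColdData[] ColdData;

        // Returns the index of the first player within `radius` of `point`, or -1.
        public int FindPlayerWithin(Vector3 point, float radius)
        {
            float sqRadius = radius * radius;
            for (int i = 0; i < Positions.Length; i++)
            {
                // Only Positions is touched here, so the CPU streams one tightly packed array.
                if ((Positions[i] - point).sqrMagnitude <= sqRadius)
                    return i;
            }
            return -1;
        }
    }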
The new entity component system is all about making that style as easy and as modular as possible, so you don't have to spend a lot of time writing that kind of code yourself. So let's look at how this philosophy applies to writing code with the entity component system, using a very simple example: we want to rotate a cube. With a MonoBehaviour we'd have a speed value and an Update method that builds a quaternion and multiplies it in, and now we have rotating objects, each with its own speed. Say we want to do that at massive scale without paying much for it.

In the new system, the first step is separating things into two parts: a component contains data, and a system is responsible for behaviour. We essentially pull the Update method out of the MonoBehaviour and move it into a system, and the system iterates over all of these components. In the MonoBehaviour version there is a GetComponent call, and the key question is what replaces that when we only deal in data. The idea is that the system populates two arrays, where each index refers to the same game object or entity: at index 0 is my first entity, with its Transform and its rotation-speed component. These arrays are populated for you automatically by some really efficient code; you don't write any of it. Then you just write the system code: a loop over all the transforms, and at each index you take the rotation speed and use it to rotate the transform.

At this point we're still storing fairly object-oriented data — the rotation speed is still a class — but at least we're doing things in batches, there are no virtual calls into an Update method, and that alone gives us a speed-up. Personally I also think it's a nicer, more flexible way to write code, separating the behaviour from the data you want to define.

The second step is a new component type we can create: IComponentData. An IComponentData is a struct, and in this case the only thing in it is the speed value, because we want no overhead at all from the entity system around it — one float right after another, tightly packed. The only thing that changes in the system is that instead of a ComponentArray for the rotation speed we use a ComponentDataArray; otherwise it's exactly the same code. Now the speed data is completely packed, and we're accessing those values in the most efficient way we can.
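A rough sketch of that split, reconstructed from what the talk describes. The names (ComponentSystem, ComponentDataArray, the InjectTuples attribute) are from the 2017 preview package and were replaced in later Entities releases, so treat this as illustrative rather than the shipped API:

    using Unity.Entities;   // preview-era package
    using UnityEngine;

    // Data only: a single tightly packed float per entity.
    public struct RotationSpeed : IComponentData
    {
        public float Value;
    }

    // Behaviour lives in a system that processes all matching entities in a batch.
    public class RotatorSystem : ComponentSystem
    {
        // Injection-style fields as shown in the talk; both arrays are kept in lockstep,
        // so index i refers to the same entity in each.
        [InjectTuples] ComponentArray<Transform> m_Transforms;        // classic component, still a class
        [InjectTuples] ComponentDataArray<RotationSpeed> m_Speeds;    // tightly packed struct data

        protected override void OnUpdate()
        {
            for (int i = 0; i < m_Speeds.Length; i++)
            {
                m_Transforms[i].rotation *=
                    Quaternion.AngleAxis(m_Speeds[i].Value * Time.deltaTime, Vector3.up);
            }
        }
    }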
So how does that work when you're working with scenes in Unity? For the time being, the entity component system is written entirely in C#, because we still want to experiment a lot with these concepts, come up with new APIs, and keep changing them. When we ship it, we're going to ship it first as an open-source preview project and allow contributions, just because we want more time to really figure out these new concepts. What you see here is a GameObjectEntity component, and it is responsible for reflecting the data from the old component system into the new one: everything on that game object becomes visible to the component system and shows up in these component arrays. So if you say "I'd like everything with a Transform and a RotationSpeed", any game object that has a GameObjectEntity and those components will show up. The RotationSpeed component just stores the speed at which we rotate.

The most important thing is iteration, and we offer a couple of different ways of iterating over these arrays. One is using tuples: there's an InjectTuples attribute that Unity knows about, and it expresses "I'd like the group of entities that have both a Transform and a RotatorData on them". You can add as many requirements as you like — say you also require a Rigidbody — and the arrays will only contain entities that have all three components. Another way is to use a ComponentGroup directly: you specify "give me everything that has RotatorData and a Transform", you get back a group, and from the group you get the arrays — the ComponentDataArray and the ComponentArray — and you iterate over those and do whatever you want. The rotator data is laid out completely linearly in memory, while MonoBehaviours are still classes scattered all over memory, but this is a good way to transition code gradually from the old way of doing things: move your game code into systems one by one, and once everything works, move some of the data into IComponentData to get an even better memory layout.
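A sketch of the ComponentGroup style of iteration described above, using the names from the 2017 preview (these APIs were removed in later Entities releases, so this is a reconstruction, not the current API):

    using Unity.Entities;   // preview-era package
    using UnityEngine;

    public class RotatorGroupSystem : ComponentSystem
    {
        ComponentGroup m_Group;

        protected override void OnUpdate()
        {
            // Normally created once when the system starts up; done lazily here to keep the sketch short.
            if (m_Group == null)
                m_Group = GetComponentGroup(typeof(RotationSpeed), typeof(Transform));

            var speeds = m_Group.GetComponentDataArray<RotationSpeed>(); // linear IComponentData
            var transforms = m_Group.GetComponentArray<Transform>();     // classic components

            for (int i = 0; i < speeds.Length; i++)
            {
                transforms[i].rotation *=
                    Quaternion.AngleAxis(speeds[i].Value * Time.deltaTime, Vector3.up);
            }
        }
    }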
That's iteration — that's what you do most of the time, and where most of a frame is spent. Then there's creating and destroying entities. For that you use the EntityManager: it owns all the entities and components in the system and manages the memory for you. You can tell the EntityManager "I'd like to instantiate this prefab", and that is actually a fast path: it extracts all the IComponentData components on the prefab's root game object and instantiates those. So you can use the prefab workflow to set up all your settings, and then take advantage of this insanely fast path for instantiating a lot of objects.

We've done some profiling on that. With four components and 320 bytes of data per entity, calling that function a hundred thousand times takes 26 milliseconds — orders of magnitude faster than the game object system. If you can represent the thing you want as a single entity, you can put everything into component data and get massive speed-ups, and that's exactly what we do in this demo: every unit and every arrow you see on screen is a single entity, drawn with instanced rendering; there are no game objects at all for those characters running around. You just call Instantiate, it extracts the components for you, and you can also tell the EntityManager to set the component data of an entity to some other value later — the equivalent of calling GetComponent and setting a value in the object-oriented way of doing things.

One thing we really care about is scale, and that often means providing APIs that are accessible in batches. Everywhere we can get speed-ups like that, we provide batched APIs: if you want to instantiate a hundred thousand entities at once, we can make that several times faster than the already really fast one-by-one path. The way we measure how fast it is is against the fastest thing you can possibly do, a memcpy — purely copying memory, nothing can beat it — and since we're rewriting this for performance, we might as well measure against the theoretical limit. Instantiating this amount of data takes nine milliseconds on my computer, and the memcpy takes seven, so we're within about 20% of what's feasible on any CPU. There's extremely little overhead, which means you don't have to worry about spawning pools with this approach: you just instantiate and destroy objects and get great performance. There are all kinds of other APIs too. Destroying entities is also really fast: 7.6 milliseconds if you call it one by one, 0.9 milliseconds with the batched API — barely noticeable. You can add and remove components at runtime with a different API but the same concepts, and since an entity is just a struct — an ID — you can call SetComponentData on it to set a component to a different value.
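A short sketch of the batched instantiate/destroy calls described above, assuming the EntityManager batch overloads that shipped with the Entities preview (the wrapper class and names are hypothetical):

    using Unity.Collections;
    using Unity.Entities;   // preview-era package

    public static class MinionSpawner
    {
        public static NativeArray<Entity> SpawnCopies(EntityManager entityManager, Entity prefabEntity, int count)
        {
            var entities = new NativeArray<Entity>(count, Allocator.Persistent);
            // One call clones the prefab entity's component data onto `count` new entities.
            entityManager.Instantiate(prefabEntity, entities);
            return entities;
        }

        public static void DespawnAll(EntityManager entityManager, NativeArray<Entity> entities)
        {
            // Batched destroy: far cheaper than destroying the entities one by one.
            entityManager.DestroyEntity(entities);
            entities.Dispose();
        }
    }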
Let's talk about memory layout. By default, the memory for these components is laid out in chunks: we allocate 64-kilobyte blocks and lay component 1, component 2, component 3 out one after another. That's a really good default. Sometimes, though, when you're iterating over lots of data, you actually want to merge several components back together: because components now have very little overhead you tend to create lots of tiny ones, and you can end up touching too many separate component arrays for each unit you process, which can be bad for performance. So we give you full control: you can say "I want components 2 and 3 packed together", so entity A has components 2 and 3 side by side, and then comes entity B. You can control everything in the system with regard to memory layout, which is a good optimization opportunity, and most importantly you can do it without changing your code — at the end of your project you just tweak the layout and get a bit of extra speed. The component data is also always kept tightly packed for iteration: when we destroy an entity, we move another entity into its place so the data stays dense. Because we can copy the data so quickly, destroying and creating units stays extremely efficient, and iteration is always over tightly packed memory.

That's the basic overview of the entity component system. The next part is about the job system and the container library, because sometimes you want arrays, hash tables, those kinds of things, and you want to jobify your code so it can take advantage of multiple cores. The first thing to look at is native containers. As I said, we don't want code that allocates GC memory; we want to manually control memory when we write multithreaded, high-performance code. So we can create a NativeArray — say a NativeArray of int with 100 elements — and specify which allocator to use. There are three, the same three allocators we use on the C++ engine side. Temp is for allocations that are released when the function exits: a stack pattern that constantly reuses the same memory and is really, really fast. TempJob is for memory that must not live longer than one frame; a very common pattern is that one system allocates data, a couple of jobs use it, and the last job deletes it, and because it lives at most one frame we can use a specialized lock-free allocator that is very fast for that pattern. Persistent goes through a normal memory allocator, basically like the built-in malloc.

Using the array is the same as using a normal C# array: you just assign values. What's different is that you have to dispose it — this is manual memory management, and you're responsible for disposing your native arrays. We do have leak detection, though: in the editor, every NativeArray allocation is tracked, and if an array is no longer referenced anywhere but was never disposed, you get a nice error message pointing at exactly where it was allocated. Leaks usually don't fall through the cracks unless you never look at your console. There are several useful container types: NativeArray, NativeList, and NativeSlice, which lets you take a sub-range of a list or array.
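A minimal sketch of allocating, using, and disposing a NativeArray with the allocators described above (the helper method is illustrative):

    using Unity.Collections;

    public static class NativeArrayExample
    {
        public static float SumSquares(int count)
        {
            // Allocator.Temp: very fast, for allocations that don't outlive this call.
            // Allocator.TempJob: lock-free, for memory that lives at most one frame (typical job data).
            // Allocator.Persistent: regular allocation, for long-lived data.
            var values = new NativeArray<float>(count, Allocator.Temp);

            for (int i = 0; i < values.Length; i++)
                values[i] = i;            // reads and writes look like a normal array

            float sum = 0f;
            for (int i = 0; i < values.Length; i++)
                sum += values[i] * values[i];

            // Manual memory management: you must dispose, otherwise the editor's
            // leak detection reports where the allocation came from.
            values.Dispose();
            return sum;
        }
    }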
NativeList works basically the same as a normal generic List: you can add and remove elements and get all the functionality you expect from a list. There's a NativeHashMap for really fast hash-table lookups, and a NativeMultiHashMap so you can store multiple values under the same key. These containers are written so that you can use them from multiple jobs; for example, multiple jobs can write to a multi-hash-map at the same time, and we use atomic operations to insert the keys very efficiently.

The next question is: what does a job look like? You create a struct, and into that struct you put the data you're going to access in the job. That's really important, because the job system has to know what data you will touch inside the job in order to make it safe for you. One of the key properties of this job system is that it makes it impossible to create race conditions, and that simplicity is what makes it feasible to write a lot of game code this way. Because of that, all the native arrays you access from the job have to be fields of the struct: you can't access global variables, and you can't put a reference type in there — the job system cannot guarantee that its use would be safe. You also declare how you're going to use the data: you can mark an array as read-only, and by default everything is readable and writable.

In this example we're simply copying an array of floats — maybe the first simple job you'd want to write. We have a source array and a destination array, we loop over the array, assign from source to destination, and multiply by a value. You fill out the struct, implement the Execute function, and then on the main thread you create the copy-floats job and fill in the arrays you allocated yourself. If you need a delta time, you copy that into the struct as well — you can't call Time.deltaTime from a job, because a job has no concept of a frame; it might run next frame, when the delta time has changed. Then you schedule the job, and scheduling gives you back a JobHandle: something you can either wait on or use as a dependency (we'll get to dependencies in a second).

Right after scheduling, the job is running on another thread and there's no guarantee when it will execute — it might be now, it might be next frame. So if I then write to the source array on the main thread, I'm overwriting data a job is reading from; that's clearly a race condition, and we can't allow it. If you do this, Unity gives you an error message telling you the code is invalid and needs to change. If you actually do want to write to that data, that's fine: you call jobHandle.Complete(), which guarantees the job has finished, and then you can write to it, knowing the destination array has been filled with the same data that was in the source array. That's the guarantee we make.
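A sketch of the copy-floats job roughly as described, using the IJob interface from the job system (the job and field names are illustrative):

    using Unity.Collections;
    using Unity.Jobs;

    struct CopyFloatsJob : IJob
    {
        [ReadOnly] public NativeArray<float> Source;   // declared read-only for the safety system
        public NativeArray<float> Destination;
        public float Multiplier;

        public void Execute()
        {
            for (int i = 0; i < Source.Length; i++)
                Destination[i] = Source[i] * Multiplier;
        }
    }

    // Scheduling it from the main thread:
    //   var job = new CopyFloatsJob { Source = src, Destination = dst, Multiplier = 2f };
    //   JobHandle handle = job.Schedule();
    //   handle.Complete();   // only needed if the main thread must touch the arrays right away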
We also have different kinds of jobs. IJobParallelFor, for example, can be parallelized over multiple cores: it splits the work into chunks, under the assumption that the loop doesn't make random accesses into the arrays. Here we're doing the same thing — copying at the index, reading from the source and writing to the destination — so the code actually gets simpler, and it gets distributed over multiple cores. What we can't do is, say, always write to index zero: multiple jobs would be writing to the same index, and again the safety system catches that and gives you a nice exception telling you what you did wrong. You fill out the job data, schedule it, and specify how many elements the parallel-for should iterate over — the Execute function will be called for every index between 0 and, say, 500. The second parameter is a batch size: the code generated for you has an inner loop of at least a hundred elements or so, unless the array is smaller, so that you can have very little code in the Execute function and it's still really fast, because that tight loop exists for you.

IJobParallelFor is a really common one to use — almost all of the game code in the Nordeus demo uses it, because we want all the game code to go wide across all the cores. We had 18 worker threads running at the same time, and we want every one of them to always have work to do.

The next really important thing is dependencies. When people start writing multithreaded code, what commonly happens is: schedule a job, wait on it, do some more work on the main thread, schedule another job. With that approach you do a burst of work in parallel, then the worker threads sit idle while the main thread waits, then you wake them all up again — and you can never get to full multi-core utilization, just these isolated blocks of parallelism. To solve that, we need dependencies: when you schedule a job you get back a JobHandle, and instead of calling Complete on the main thread, you can schedule a second job and say it should only run once the first one has completed.
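A sketch of a parallel-for job and of chaining two of them with a dependency instead of waiting in between (names are illustrative):

    using Unity.Collections;
    using Unity.Jobs;

    struct CopyFloatsParallelJob : IJobParallelFor
    {
        [ReadOnly] public NativeArray<float> Source;
        public NativeArray<float> Destination;

        // Called once per index; the safety system only lets us write to the index range we were given.
        public void Execute(int index)
        {
            Destination[index] = Source[index];
        }
    }

    // Chaining two parallel-for jobs with a dependency:
    //   var first  = new CopyFloatsParallelJob { Source = a, Destination = b };
    //   var second = new CopyFloatsParallelJob { Source = b, Destination = c };
    //   JobHandle h1 = first.Schedule(a.Length, 64);       // 64 = inner batch size per worker
    //   JobHandle h2 = second.Schedule(b.Length, 64, h1);  // runs only after the first completes
    //   h2.Complete();                                      // waits on both, in the right order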
So in this example, we copy a source array to a destination array, and then copy the destination array into a third array, and we don't want the second job to start executing until the first has completed. We fill in the arrays, schedule the first job and get jobHandleA; we create another array, fill in the second job with the destination of the previous job and the final destination, and schedule it with a dependency on the first. When we then wait on jobHandleB, it waits on both jobs and guarantees they completed in the right order. With that, you can express very efficient multithreaded code in a very simple language, so to speak. In the demo we showed on Tuesday we have roughly 18 or 19 different jobs and, I believe, only two sync points on the main thread. Sometimes you need them — sometimes you just don't have enough time to make it perfect — but the point is you can express these things with dependencies, and it's always better to prefer a dependency over waiting on the main thread and let the cores do what they're best at.

So let's take a look at what happens when we use these jobs in the entity component system. The last version of the rotator used a ComponentDataArray of rotation speeds and did things in batches, and once the code is written that way, turning it into multithreaded code is actually pretty simple. We use a JobComponentSystem, and instead of a ComponentArray of Transform we use a TransformAccessArray, which is the way to use the transform component from a job. The same injection logic applies — we just swap the array type — and the rotation speed already uses a ComponentDataArray, so it can be used on a job directly. Then we write the job. First we put the delta time into the struct, because of course we can't call Time.deltaTime from a job — that's a global. Then we have the array of rotation speeds we're going to execute over, and for each iteration we're passed a TransformAccess we can rotate and the index of the element we're supposed to process, just like IJobParallelFor. We change the rotation, multiplying by the angle-axis rotation — it's literally the same line of code as before, just targeting the TransformAccess. When we schedule it, we give it the array of transforms and put the rotation speed array onto the job data, and that's it: the work now runs in a job.
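A sketch of the jobified rotator using the IJobParallelForTransform interface from UnityEngine.Jobs. For simplicity it takes a plain NativeArray of speeds rather than the preview's ComponentDataArray, and the JobComponentSystem glue around it is omitted:

    using Unity.Collections;
    using Unity.Jobs;
    using UnityEngine;
    using UnityEngine.Jobs;

    struct RotatorJob : IJobParallelForTransform
    {
        public float DeltaTime;                        // copied in; jobs can't call Time.deltaTime
        [ReadOnly] public NativeArray<float> Speeds;   // one speed per transform, same index

        public void Execute(int index, TransformAccess transform)
        {
            transform.rotation = transform.rotation *
                Quaternion.AngleAxis(Speeds[index] * DeltaTime, Vector3.up);
        }
    }

    // Scheduling against a TransformAccessArray built from the rotating objects:
    //   JobHandle handle = new RotatorJob { DeltaTime = Time.deltaTime, Speeds = speeds }
    //                          .Schedule(transformAccessArray);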
There are two more things. In the earlier dependency example we tracked the job handles manually — schedule a job, pass its handle to the next one, and so on. But we want to write modular code: maybe another system modifies the rotation speed in a job of its own, and of course we want that job to finish before our rotator job reads the rotation speed and uses it to animate the transform. Those two systems might not know about each other; one of them might have been written by somebody else. So the JobComponentSystem gives you two functions, GetDependency and AddDependency. Because you declared these injected tuples, the system knows your job reads the rotation speed, so GetDependency collects the handles of anyone writing to the rotation speed, and your job waits on those before it executes. After you've scheduled your job, you tell the system via AddDependency: "I scheduled a job that reads the rotation speed and writes to the transform", so the next system that wants to use the transforms can depend on it. That's what a job looks like with the new JobComponentSystem, and the data is really simple: just an IComponentData containing our speed value.

Let me show you one more thing: what happens when we make a mistake. I have a collision system here where we do raycasts to figure out where these guys hit. Let's remove a dependency on the second job — just forget a dependency that was actually necessary... and, wow, it crashed on stage. The point is what it printed before it crashed: the previously scheduled batch-query job reads from the native array of raycast commands, and you're trying to schedule a new job — the collision minion system's integrate-positions job — which writes to that same native array via its raycasts field; to guarantee safety, you must include the batch-query job as a dependency of the newly scheduled job. When you get your hands on it, it's not going to crash right after printing that, which would be amazing — I'm actually really surprised it crashed now, it's been pretty solid while we were making the demo, but I guess that's how it goes in stage demos.

So we have these safety systems, and I think the safety system is really one of the key features, because writing multithreaded code is difficult, and it's difficult because creating race conditions is so easy. If we want a lot of developers to start writing multithreaded code, they have to be able to actually trust the system. This system detects every possible race condition you can create, so you can change your code at the last minute. For example, a couple of days before the keynote the demo was running at 45 frames per second, and there were several sync points left on the main thread creating big holes in the frame because we were waiting on jobs too early. Nobody knew why a particular Complete call was there — at some point the code got refactored and it wasn't necessary any more — so we removed it, and it turned out we had forgotten a couple of dependencies. The system told us exactly where they were needed, we added them, and it worked, as opposed to the code just crashing at a later point because some memory got overwritten. That's a really key part of getting more people to write this kind of jobified code.
There's one more thing I wanted to show you: the simplest way to write a job in the new system. Say we're exploding these rigid bodies — when they explode we make them fly through the air, and we raycast to figure out where they hit the ground, and this is the job that performs the gravity calculation. The idea here is that you don't even have to schedule the job yourself: you just write the job code using IAutoComponentSystemJob. It gives you a Prepare method on the main thread, where we can extract the delta time, the gravity, or any other values we need in the job struct, and then an Execute function that hands you a reference to the specific minion being processed. You can list two, three, four, five components there, and they all refer to the same entity, so you're expressing "I'd like all entities that have a MinionVelocity", and if you also add a RotationSpeed, "all entities that have a MinionVelocity and a RotationSpeed — call my Execute function for each one". It gets us a little closer to the object-oriented way of thinking while giving exactly the same performance as fully data-oriented code, and we'll probably put more effort into this style so that it gets really, really easy.

A slightly more complicated example, similar to the minions in the bigger demo exploding when a fireball hits them: we destroy the old entities, create new ones, and run a tiny rigid-body system — when you're running a hundred thousand objects you don't want full rigid bodies, and we're not trying to make them fall accurately, we just want them to hit the ground, stick around for a couple of seconds, and disappear. Keep it simple; we want massive scale. So we have a job that fills out an array of raycast commands: from the position and the velocity of each minion we know how far it will travel, and that's the length of the raycast. We schedule that job and get a dependency back; we fill out a second array and let the raycast system fill it in a job, with a dependency on the first one, which gives us back another dependency. So now we have a job that fills out what raycasts to do, a job that performs the raycasts, and then a job that reads the raycast results and applies them — this is the one where I removed the dependency earlier — integrating the positions in a job. It just checks: did the raycast hit? If it didn't, move the minion forward along the direction and distance we wrote into the input raycast; otherwise, stick the unit at the position where the raycast hit. Then we apply the result back into our array.
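The raycast step of a chain like this can be expressed with Unity's batched RaycastCommand API, which shipped alongside the job system. This is a sketch of the shape of that chain, not the demo's actual code; the job and field names are hypothetical:

    using Unity.Collections;
    using Unity.Jobs;
    using UnityEngine;

    struct PrepareRaycastsJob : IJobParallelFor
    {
        [ReadOnly] public NativeArray<Vector3> Positions;
        [ReadOnly] public NativeArray<Vector3> Velocities;
        public float DeltaTime;
        public NativeArray<RaycastCommand> Commands;

        public void Execute(int i)
        {
            // Cast along this frame's movement; the distance is how far the minion will travel.
            Vector3 step = Velocities[i] * DeltaTime;
            Commands[i] = new RaycastCommand(Positions[i], step.normalized, step.magnitude);
        }
    }

    // Chaining the jobs with dependencies, no waiting on the main thread:
    //   JobHandle prepare   = new PrepareRaycastsJob { /* fill in arrays */ }.Schedule(count, 64);
    //   JobHandle raycasts  = RaycastCommand.ScheduleBatch(commands, hits, 64, prepare);
    //   JobHandle integrate = new IntegratePositionsJob { /* ... */ }.Schedule(count, 64, raycasts);
    // The integrate job inspects each RaycastHit (for example hit.distance > 0, since colliders
    // aren't readable from a job): on a miss it moves the unit forward, on a hit it pins it to the ground.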
This way we've written code that runs completely asynchronously: the system's OnUpdate function only schedules jobs, it doesn't wait on anything at all. If some other system mutates the velocities before us, that's fine — we take a dependency on that system and simply run a bit later on the worker threads. So that's what jobified component system code looks like.

The next part is the C# job compiler. This is where we get really significant speed-ups on top of the multi-core speed-ups and the linear data layout. We want a compiler that generates code that can significantly beat comparable C++ — assuming you're writing normal C++, maybe with a high-level math library with SIMD types; if you do that, our compiler will generate better code. The way it works is that you put a compute-optimization attribute on the job, and you can also specify, for example, what precision you want. This compiler is only for C# jobs, and that's what makes the problem much simpler for us: inside these jobs there are already a lot of rules — no virtual functions, no reference types, no garbage collector, only native containers — so all we care about is making that code run incredibly fast. We basically have a team that is focused purely on performance, as opposed to making everything work.

These are some of the results we get from those optimizations. Part of it is automatic vectorization; part of it is inlining all your code into essentially a single function and optimizing it as a whole, which exposes a lot of opportunities to vectorize automatically. Generally we've seen something between a 5x and a 215x speed-up so far, which is pretty crazy. There's a lot of cool stuff that comes with it. For example, you can control the precision your math functions run at: if you call a sine function, maybe you don't need 32 bits of precision — maybe you're just animating a massive army of units and being off by a degree is fine. You make that choice at the compiler level: you put the compute-optimization attribute on the job, say "precision: medium", and it compiles code with less precision and more speed.

A big part of getting that performance is that we're also introducing a new C# math library. It looks a lot like HLSL: we have float2, float3, float4 and half types to make sure your data is really tightly packed, plus all the utility functions you're used to if you know how to write HLSL — abs, pow, min, max, clamp and so on. You also have functions like math.select, which you can use to write branch-free code: select takes a bool and picks between two values, and that's one of the ways you can get quite significant speed-ups out of automatically vectorized code.
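A sketch combining the two ideas just described. The attribute name [ComputeJobOptimization] is the one used in the preview (later releases call it [BurstCompile]), and the Unity.Mathematics types are assumed from the package as it eventually shipped; the job itself is illustrative:

    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;

    // Preview-era compiler attribute; later renamed [BurstCompile].
    [ComputeJobOptimization]
    struct DampSmallVelocitiesJob : IJobParallelFor
    {
        public NativeArray<float3> Velocities;   // float3 mirrors HLSL-style packed vectors

        public void Execute(int i)
        {
            float3 v = Velocities[i];
            // math.select replaces a branch with a per-element select,
            // which keeps the loop friendly to auto-vectorization.
            Velocities[i] = math.select(v, float3.zero, math.lengthsq(v) < 1e-4f);
        }
    }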
So there will be a new math library you can use; you can also keep using the old one, but with the new one there's a lot more we can do for performance.

The last question, of course, is when all of this ships. The C# job system ships in 2018.1, and that includes the native container library. We're also releasing the entity component system and the math library with full source code — we'll put it out as a project folder on GitHub so you can check it out and play around with it, because we really want a lot of feedback and to iterate a bit more before we bake all these new concepts in. It's a completely new way of writing code in Unity, and we want to make sure it's right and get a lot of feedback before it's fully baked into the engine. The entity component system code itself is pretty robust, though; it just isn't as integrated yet — mainly there's this GameObjectEntity component you have to put on each game object for it to show up in the component system. The C# job compiler will ship later in 2018. So now, if anyone has questions, we'll take questions.

Q: I'm a fan of the entity framework even in its current state, but not so much of forming dependencies of one class on another — currently I use dependency injection frameworks, and I've gone through several. This is a lot to take in and I'll probably watch this again, but has anything changed in terms of interdependencies between multiple classes in more classic OOP-style programming?
A: One thing we definitely want to bring to MonoBehaviours — a new version of MonoBehaviour for the people who want to continue writing object-oriented code — is dependency injection, because when you start writing more of this kind of code you need a lot of managers and systems, and dependency injection works really well for that.
Q: So you're planning on implementing that natively?
A: Yes, later in 2018. For now we've integrated it into the entity component system, so all the component systems have dependency injection; shipping it in a new MonoBehaviour is something we'll figure out.

Q: I'm from Audiokinetic, we make Wwise. I have some concerns about native plugins and how we'd be able to take advantage of all this.
A: The great thing about this new compiler tech is that it doesn't need a GC, and because it doesn't use the GC there will never be any stalls, which makes it really feasible to write audio code or audio plugins in C#. We're also working on a new lower-level API for audio that lets you plug these C# jobs in as audio jobs, so to speak.
Q: I was thinking more about accessing the native arrays and native containers from native code.
A: For interop it actually makes things really easy, because if you want, you can get an unsafe pointer to the data, so you get direct access without copying, and that makes interop with C++ much easier.
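As a sketch of the pointer access mentioned in that answer, using the NativeArrayUnsafeUtility helper that ships with the collections package (the plugin function and class names are hypothetical):

    using Unity.Collections;
    using Unity.Collections.LowLevel.Unsafe;

    public static class NativePluginInterop
    {
        // Hypothetical native entry point, shown only to illustrate the call shape:
        // [DllImport("MyAudioPlugin")] static extern unsafe void ProcessBuffer(float* samples, int length);

        public static unsafe void Process(NativeArray<float> buffer)
        {
            // Direct, copy-free pointer to the memory backing the NativeArray.
            float* ptr = (float*)NativeArrayUnsafeUtility.GetUnsafePtr(buffer);
            // ProcessBuffer(ptr, buffer.Length);
        }
    }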
Q: You have a TransformAccessArray — is there going to be similar functionality for other custom classes?
A: We do want to expose more and more of the engine through native arrays and these primitives where there's less copying going on, but that's still somewhat of an open question. My prediction is that we will rewrite a lot of components in C# and expose really low-level functions from C++ that aren't component-based at all. The NavMesh is an example of how we do that: we have a NavMesh query that you can do pathfinding operations on, and you can use it from a job, while the NavMesh agent is written as an IComponentData on top of it. That gives the user a lot of flexibility — if you want a formation system or something specialized like that, you really need to control the pathfinding and how units stay on the navmesh polygons yourself, in a very tight way — and I think that approach makes sense for almost every subsystem. So we'll probably move more of the component implementations into C# so that you can actually modify them.

Q: You said this ships in 18.1, but if we want to familiarize ourselves with the workflow, is there a build we can play around with before that?
A: Yes, we'll ship preview builds this year, and we'll start on that pretty soon.
Q: You also mentioned earlier that the enemies don't have game objects — could you talk a bit more about that?
A: The idea is that the new component system we wrote is basically a parallel universe, and the GameObjectEntity component is what injects the old game objects into the new one so they're visible to it. In the new system there is no game object — it's just an entity; we don't really deal in game objects any more. If you want really massive-scale performance, and you also want load times to get massively better, we have to keep the data compact: basically no serialization at all, just memcpy some data over, and that's your component data. Creating these super lightweight components is the way forward for us to get really big speed-ups, and for you to write code that is just as fast as — or sometimes faster than — C++ code doing the same thing.

Q: My question is about using components like SpriteRenderer on an entity — can you use basically any Unity component, or only specific ones?
A: The component system supports any type of component — for now, as long as you put the GameObjectEntity component on the game object, it will be visible, and that gives you main-thread access to it in a component system. But since SpriteRenderer is a class, you can't use it directly from a job. What you can do is write to your own data in the job and then apply it back on the main thread. Right now the only built-in component you can talk to from the job system is the Transform.
Q: First of all, thanks for working so hard and making a really awesome system — this new API looks amazing. I'm really interested in the integration with the physics system. You talked about the custom raycasting for ground detection; if one were so inclined to write a complete rigid-body simulation that accurately solves collisions, do you think it would be faster and better than NVIDIA's physics?
A: I think so, and I'd love for someone to take up that challenge. Physics is actually extremely well suited to this: there are many algorithms in there that are very difficult to vectorize perfectly in C++. Physics engines use a vector library, but it's far from the most optimal way of writing vectorized code, and I'm pretty certain our compiler would do a much better job of generating fast code than you'd get from C++ in this particular case. The other thing is that you could develop it relatively quickly, because there's a powerful math library, a job system, and a component system, so you basically just have to worry about the algorithms, whereas usually with physics you also have a lot of memory management and infrastructure to build. And then there's another really killer reason: say you wanted a physics engine that is deterministic across multiple platforms. We already have these attributes for lower precision — sometimes you care about high-precision physics, sometimes low, and you just change an attribute and it runs faster — and one thing we want to add is deterministic compilation, so that across machines, say on an Intel machine, we generate bitwise-exact results. You could use that for a networked game doing massive-scale lockstep simulation, and it would make that really simple. No physics engine today does that, because it's practically impossible in C++, but with our compiler it will eventually just be an attribute.
Q: That's incredible. I've written a GPU-powered physics solution in Unity — do you think a CPU-based one with the C# job compiler would exceed a GPU-based one, since you wouldn't have to transfer data between the CPU and the GPU?
A: It depends. If you do GPU-based physics, I guess the best way is to connect the rendered objects directly on the GPU and not go back to the CPU at all.
Q: One last question — I noticed StarCraft in your Applications directory. What race do you play, and do you want to play some 2v2s?
A: I play Terran.
Q: Awesome. Do build that physics system, it will be awesome.

Q: This is really interesting stuff — as somebody else said, a lot to take in. I work in the non-game simulation space, and this has some real interesting potential there. My question is with respect to dependencies.
This is really interesting stuff; as somebody else said, a lot to take in. Working in the non-game simulation space, this has some really interesting potential. My question is with respect to dependencies, and what I'm seeing from the way the dependency injection happens. When it comes to debugging that in terms of performance, fairly complex nested trees of dependencies can emerge in the job system, and seeing and visualizing those across multiple files of code could get fairly complex. Have you given any thought to where you're going with respect to two things? One is static visualization: some kind of tool that looks at the code statically and renders a graphical representation of where your dependencies are. Part two, and I realize that at the performance scale we're talking about deep instrumentation can run into a Heisenberg problem, where instrumenting the code influences its performance: is there a plan for some sort of instrumented execution mode where you give up speed, maybe scale time down, and have some way of visualizing where the wait states are, when you hit one of those Complete calls or an implied complete, so that you can figure out, if your code is not as fast as you think it should be, where that is coming from?

Let me just bring up the timeline profiler. The timeline profiler basically shows you that already: if you do a wait on the main thread, it shows you that it is waiting. Right now it doesn't visualize exactly what it is waiting on, but we will pretty soon start drawing lines to the job that it is actually waiting on. That's one thing, and extracting that data is pretty simple, so exposing it in a graph view, where you see all the dependencies between jobs, would be really powerful. What would be even better is if it showed you both the dependencies that you declared and the dependencies that are actually necessary based on the data, because sometimes what happens when you write game code is that you declare too many dependencies, and that can reduce your performance because you add extra waits. We could visualize when you declare too many or too few and give you an overview, and because of the way we have built it, that data is actually really easy to extract and visualize. Thank you very much.
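As a concrete picture of what a declared dependency looks like in job code, here is a minimal sketch: the JobHandle passed into Schedule is the dependency the discussion above is about. The job and variable names are made up for illustration.

```csharp
using Unity.Collections;
using Unity.Jobs;

struct FillJob : IJob
{
    public NativeArray<float> Values;
    public void Execute()
    {
        for (int i = 0; i < Values.Length; i++)
            Values[i] = i;
    }
}

struct SumJob : IJob
{
    [ReadOnly] public NativeArray<float> Values;
    public NativeArray<float> Result; // length 1
    public void Execute()
    {
        float sum = 0f;
        for (int i = 0; i < Values.Length; i++)
            sum += Values[i];
        Result[0] = sum;
    }
}

// Usage, e.g. inside an Update():
//   var values = new NativeArray<float>(1024, Allocator.TempJob);
//   var result = new NativeArray<float>(1, Allocator.TempJob);
//   JobHandle fill = new FillJob { Values = values }.Schedule();
//   JobHandle sum  = new SumJob { Values = values, Result = result }.Schedule(fill); // declared dependency
//   sum.Complete(); // this is the wait the timeline profiler shows on the main thread
//   values.Dispose();
//   result.Dispose();
```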
To follow up on the physics queries: would we get a parallel-universe version of all the old physics calls, obviously all the raycasting stuff, but also sphere casts, line casts, even ComputePenetration and the overlaps, there are tons of them, and would we also get parallel-universe versions of the old colliders? Maybe we don't even need a Collider component; we could just have a struct that acts like a sphere collider.

In terms of how we move those things over: right now we have only done it for NavMesh, where we basically just have very simple C# sample code, so to speak. It is a built-in component, but it acts as sample code for how to use the low-level API, and all of that code is available. I think that approach, exposing the low-level physics API and then building the component system in C# on top of it, is the most powerful way of doing it. Thanks.

And when you release the first versions of this, will you have some of the examples, maybe not the RTS demo specifically, but examples drawn from it? Yes, we will: driving animations and so on. We will absolutely have a lot of samples. I think that's very important in making this a success, so that people can see exactly how to write this kind of code. Excellent, thank you.

In C#, structs are value types. Are you boxing all the entity data when you derive from the interface? No, that would kind of defeat the purpose, right? What we do is use generic extension methods in C#, and with those you can do everything we did without any boxing.

It sounds like, from everything you have been saying, that the new managed system is beating C++ code and that kind of thing, but is there anything in this new system that is being hindered by the fact that it is still managed C# code? I actually think it is better that it is in C#. If you write multi-threaded code in C++, I don't know how I would approach building a safety system for that, and I think that safety system brings a lot of the simplicity that makes this approachable, even more so for multi-threaded code. Because we can restrict usage in C#, because it is a more controllable language, and because we can do post-processing and write our own compilers against the bytecode, it gives us more control to enforce a model that really enforces performance and safety in a good way. So I really do think it is just better to write high-performance code like this than in C++. Awesome, can't wait to try it out.

A lot of this stuff reminds me of how I wrote code in CUDA, like converting arrays of structures into structures of arrays. My challenge there was whenever I had to shuffle or sort, and I'm curious how you handle that in the solution you have come up with. Was it just copying to temporary buffers? Right, but the big difference is that we are still running this on a CPU. So for sorting, and shuffling, you just do it exactly like you do it now; you can write normal CPU code. Just because you have the option of writing branch-free code with math.select and these kinds of things doesn't mean you have to. It just won't vectorize very well if you have a lot of branches, but it is running on a CPU, so it still works, and even in cases like sorting the compiler will still find a lot of opportunities and get some pretty big improvements over other C# code. Thank you, I think it's really cool, and I like what you guys are doing. That's awesome.
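To make the branch-free point concrete, here is a small sketch using math.select from the Unity.Mathematics package. The function and scenario are made up for illustration: select(a, b, c) returns b where the condition c is true and a otherwise, so the loop body avoids an unpredictable branch and stays easier for the compiler to vectorize.

```csharp
using Unity.Mathematics;

static class BranchFreeExample
{
    // Move a point toward a target by at most maxStep, snapping to the target
    // when it is close enough, written without an if/else.
    public static float3 StepTowards(float3 position, float3 target, float maxStep)
    {
        float3 delta = target - position;
        float dist = math.length(delta);
        float3 stepped = position + delta * (maxStep / math.max(dist, 1e-6f));
        return math.select(stepped, target, dist <= maxStep);
    }
}
```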
I have read some things about data-oriented design, I think that is what it is called, and in my mind it is amazing, but I always think it is only amazing if you have a big team; for a small team, managing all of that sounds hard. It sounds like you have tried really hard to make it manageable for a small team, but what made you take this approach rather than something like a MonoBehaviour with a linked update, where the transforms of all the game objects of the same type could still be laid out in memory sequentially, like you were saying?

I mean, I tried that for years. Well, maybe I didn't try hard enough, but I think it is difficult to go halfway with performance, because it is very difficult to know when you are actually done or when something is good. The problem is that when you have this object-oriented style of writing code, there are always some holes: your memory layout is all over the place, so you are always at least two to three times slower, and that makes it very difficult. I just think it is better to go all the way with performance. We basically started out by writing the C# job system, when we just had native containers, and then we built the entity component system. I think we are pretty good at figuring out how to make components easy to use and how to make it easy for game developers to write code, and that is what we can do now: we take the approach with which you can get the best performance, and then it is about how we make it really easy and simple so that anyone can do it. The safety system is the first part; the second part is the more automatic jobs, where you basically just have the system as the job and you don't think about dependencies and those kinds of things. Makes a lot of sense, cool, thank you.

I have one kind of special question. In any game with RPG elements there will be complex skills and spells. RPGs with complex spells in them? Yes, for example even StarCraft II has complex skills for its units. There are ways to do data-driven design with graph editors and that kind of thing, so this is basically calling for a system that handles that kind of data, which might not be a natural fit for structs. Do you think this kind of thing can benefit from this system?

Generally I would say any problem can be expressed in an object-oriented style, and any problem can be expressed in a data-oriented design style. In some cases writing object-oriented code is a lot easier, much less code, and much easier to understand; in some cases data-oriented code can be simpler. It is a question of looking at the specific problem. Sometimes you just rewrite it in a slightly different, data-oriented way and you get great performance, and it is actually not that hard. Sometimes things do get harder: when you code in a data-oriented design style you use a lot of indices, you don't use pointers, and sometimes that can be harder. But at the end of the day you are trying to solve a problem. You don't have to use data-oriented design for everything, but if you want great performance, then you will have to do something for it. Sure.
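The point about indices rather than pointers is the main mechanical shift in this style. Here is a minimal sketch of what that can look like; all of the type and field names are hypothetical: entities live in tightly packed arrays and refer to each other by index instead of by object reference.

```csharp
using Unity.Collections;
using Unity.Mathematics;

// Hypothetical data-oriented layout: a unit's "target" is an index into the same
// position array, with -1 meaning "no target", instead of a reference to another object.
struct Targeting
{
    public int TargetIndex;
}

struct MoveTowardsTargets
{
    public NativeArray<float3> Positions;
    public NativeArray<Targeting> Targets;

    public void Run(float speed, float deltaTime)
    {
        for (int i = 0; i < Positions.Length; i++)
        {
            int t = Targets[i].TargetIndex;
            if (t < 0)
                continue;
            float3 dir = math.normalizesafe(Positions[t] - Positions[i]);
            Positions[i] += dir * speed * deltaTime;
        }
    }
}
```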
Another thing: does the job compiler also support IL2CPP? The C# job system can run on Mono or on IL2CPP, but for the code that runs in a job, the compiler replaces the IL2CPP code. Okay, thank you.

I guess my question is about the job compiler, the Burst compiler. Is the code that you write for it going to have to be in a separate folder or something, as a first pass that goes through that compiler, while any MonoBehaviour code goes back to the Mono compiler, or is basically everything going to go through the job compiler? No. The great thing about it is that it is all just C# code; it just happens to be a struct with an IJob interface on it, and when we schedule it, that is when we make the decision whether to run it through the C# job compiler or through Mono. Of course you very often need Mono, because then you can single-step through the code, and that is super important. So we can decide at schedule time which compiler we use, and at build time you can choose as well, but it is basically completely automatic. You put your struct wherever you like; in most of these component systems I actually put the job structs inside the component system as sub-structs, so it builds up just the same way. Oh, thank you.

For machine-learning purposes, like self-driving cars using lidar, simulation usually revolves around either raycasting or using a depth map from graphics to get a picture of the world. At my company we found GetPixels was quite slow for reading back a texture, so we started using an asynchronous texture readback from an open-source GitHub repo, and we are trying to have a C++ plugin pull textures directly out of GL or DirectX. When I heard about the programmable graphics pipeline and the C# job system, and that it could actually do raycasting, I thought that in principle you could go back to raycasting every pixel, with enough cores and throughput, and write it that way, right? Yes, and there is also an asynchronous GPU readback function coming in 2017.3 or 2018.1. That makes my day, because I have been using an open-source GitHub repo to try to do that, so it is going to be built into Unity? Yes. Fantastic, thank you.

I'm curious how you specify the memory layout. You said you could switch between different modes: either keeping your whole component contiguous or splitting parts of it out separately; you had the visuals on the slide. How does it actually work, how do you specify the layout of the component data? Essentially you define stream groups: you specify components A and B to be in group zero, and components C and D in group one, and they get grouped together. Is that part of some configuration, not part of the code? Well, what I showed was creating entities from prefabs, but you can also create what we call an archetype. An archetype is a set of components, and it defines the layout for all the entities with that same set of components; on the archetype you can specify the layout of the data.
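As a rough picture of the archetype idea, here is a minimal sketch assuming the Unity.Entities preview API that followed this talk; the Position and Velocity component types are hypothetical placeholders, not part of what was shown on stage.

```csharp
using Unity.Entities;
using Unity.Mathematics;

// Hypothetical component types; in ECS, component data is plain structs.
public struct Position : IComponentData { public float3 Value; }
public struct Velocity : IComponentData { public float3 Value; }

public static class ArchetypeExample
{
    // An archetype is the set of component types; entities created from the same
    // archetype share the same memory layout and are stored together.
    public static void Spawn(EntityManager entityManager, int count)
    {
        EntityArchetype archetype =
            entityManager.CreateArchetype(typeof(Position), typeof(Velocity));

        for (int i = 0; i < count; i++)
            entityManager.CreateEntity(archetype);
    }
}
```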
Have you thought about a process where you would automatically profile what the best layout is and optimize based on that, like a magical button? I bet someone on the Asset Store will do it. But it sounds like you can get that information, so you could optimize on it; that would be brilliant, right? Yes. Something else that is pretty cool: we have the player loop fully exposed, so you can reorder the player loop, and that is of course super important, because when you write jobified code the important part is when you schedule a job and when you wait on it, and that is really driven by the player loop. Say the animation system mutates transform data and then the bounding volume update reads that data before rendering: you want that span of time to be as large as possible within the frame, and often in a game you can change the player loop quite a bit without breaking the game, so that could also be automated. I think profile-guided optimization would be a really great area for improving performance without a huge lift. Great, thank you.

So we have this new entity component system, but these things are no longer game objects. In terms of how they are rendered, do you have a new entity component renderer? For the demos, what we did was write one in C#: we have an API called Graphics.DrawMeshInstanced, and from the CPU we just scheduled all the jobs and issued the draws manually every frame. That is actually a really efficient way of rendering, and we will think about how to do a built-in renderer that fits into the new system, just like we have to for physics. As a starting point we are starting with C# only, and then we will move more and more into that system so that you can write these really tight, entity-only components.

Besides moving to data-oriented design, do you have any tips for teams that might be starting production, or are in the early stages, and would like to jump on this as soon as it is available? Hmm, I don't know. I think we will probably have a build out in one or two months; I'm not sure what to do in the time in between. That's a good one.

How about debugging: the entity component system, how do you debug it visually? In general you can still step through code and that kind of thing, but if they are pure entities and not game objects, then you can't use game objects in the Inspector to look at what the state of things is, so we will make tools that show entities even though they are not game objects. And there are interesting ways of visualizing that data, because we can look at it from the perspective of each component system, how many entities there are in each component system and so on, so I think we can actually make these tools even more powerful than what you have for debugging now. Thank you.

So when do we get nested prefabs? [Applause] We actually have a solid team working on it, and they have been gathering a lot of feedback, and that is shipping in 2018.

All right, it appears that that was all the questions. Thank you so much. [Applause]
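For reference on the rendering answer above, here is a minimal sketch of the Graphics.DrawMeshInstanced approach: compute the instance matrices (in a real setup, from job-computed entity data) and issue instanced draws from the main thread each frame. The component and field names are illustrative, and the material is assumed to have GPU instancing enabled.

```csharp
using UnityEngine;

public class InstancedRendererExample : MonoBehaviour
{
    public Mesh mesh;
    public Material material;                              // needs "Enable GPU Instancing" ticked
    readonly Matrix4x4[] matrices = new Matrix4x4[1023];   // DrawMeshInstanced draws at most 1023 instances per call

    void Update()
    {
        // In a real setup these matrices would come from job-computed entity data.
        for (int i = 0; i < matrices.Length; i++)
            matrices[i] = Matrix4x4.Translate(new Vector3(i % 32, 0f, i / 32));

        Graphics.DrawMeshInstanced(mesh, 0, material, matrices, matrices.Length);
    }
}
```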
Info
Channel: Unity
Views: 146,065
Keywords: Unity, C#, Game Development, Game Dev
Id: tGmnZdY5Y-E
Length: 100min 13sec (6013 seconds)
Published: Fri Oct 27 2017