Unity at GDC - C# to Machine Code

Captions
[Music] Okay — now, from the beginning. It is indeed time for the talk. Hi everyone, my name is Andreas, and this talk is titled "From C# to Machine Code," which is a dramatic title. About me: my background before joining Unity is AAA game and engine development — I've been at DICE working with the Frostbite guys, and I've been at Insomniac Games, and if you don't know me, I'm one of those performance nut cases that maybe you've heard about. I'm at Unity now, working with a pretty awesome game code team and our compiler team, and my job is basically to make this dream we have — performance by default — a reality. If you've looked at the other sessions here at GDC and seen all the work we're talking about, it's very much about performance. Why are we doing all this? It's the direction Unity is going in, and it's easy to see why. Even now, one of the most common reasons people return mobile games or review them poorly is that they run poorly on their devices. This affects developers, it affects you, it affects everyone who consumes games. If you read game reviews, people have really started to pick up on things like "the frame rate wasn't stable — I think this was poorly optimized." So gamers are demanding it and developers are demanding it, and we have to deliver — both us as an engine team and you as developers. We all have to care. So with that in mind: who optimizes? If you're on a team developing stuff, who's going to make this happen? In theory, everyone should care. We know how important this is — it can be the difference between success and failure, between people being impressed by the game or not. So we would hope that when our teams are putting new assets in and writing the key elements of the game, performance is at the front of their minds.
We'd hope they'd be thinking carefully: how can I get more enemies? How can we do more animation? What's going to set us apart — how can we look good here? And maybe you're thinking, "I don't care about any of those things." Well, maybe you care about better battery life. It turns out that to get better battery life on the new generations of phones coming out, you actually have to use more than one core; running on one core is going to become history. What about smoother frame rates? Maybe we care about that. And if you care about none of those things — how about saving the planet? Come on. Anyway, there are plenty of reasons why performance should be front and center in everything we're doing, and we'd hope everyone would think about it actively. But in practice it's people like us in this room, people who take time out of their schedule to go to a session titled "From C# to Machine Code" — specialists. Let me tell you a story about what it means to be a specialist. It's not unlike driving a manual transmission in the United States. If you're a car enthusiast, maybe you meet another car enthusiast and you're talking: "Have you driven that road? What gear were you in?" — and people start looking at you funny. It turns out less than 3% of cars sold in the United States have manual transmissions. It's a thing that's going out of style; people don't care, and no one really knows how to drive one anymore, so you're surprised when someone still speaks the same language. It's all a big mystery to everyone else, and that's what it can feel like to be a specialist. Anyway, this is a room full of stick drivers. You're here because you care, and I think that's cool.
It can be tough being that performance specialist, like I said, and that's one of the big reasons we're pushing on the tech side to make that job easier and more fun. Hopefully you've seen the announcements of the first round of tech coming out of Unity that has better default performance: we're talking about the job system, we've been talking about native collections — where you don't have to use garbage-collected memory — and I'm talking about the ECS. That's great; it makes the specialist's job easier. You saw Mike's part just now: things are set up by default so you can go in and get pretty dramatic speedups. To sum it up, this enables you and your teams to build optimizable code, in context. We've basically removed most of the clowns at the factory, so when you get the car it has fewer clowns in it. If you paint within the lines and follow this stuff, you should expect better performance by default. But it leaves one big unanswered question: what happens to the actual code — the thing that actually runs on the machine? That's what this talk is about, and the initiative here we're calling Burst. It's a compiler. This is the second most important slide of this talk, so please take in this one and the next one. Let's review the current options you have when you're writing C# and delivering it across different platforms: they have some performance problems, and we know that. And I think the problem extends beyond what we usually talk about with C#. If you look at any game — any AAA game, any indie game — there's always tons of performance left on the table. As an industry we are poor at leveraging everything the hardware can give us, because we have funky tools and deadlines and all kinds of things, and we get it working and we ship it.
I think this is true well beyond just Unity and C#. This sort of problem — suboptimal code generation, a poor mapping to what the hardware actually does — is tough to fix. If you're the specialist trying to fix it, what are you going to do? You have thousands of files that all get compiled to somewhat suboptimal machine code, and that changes day to day as your team makes changes. How are you going to fix it? You can only babysit the compiler so much. Another problem we want to solve is that it's really, really difficult to make determinism a guarantee for your code base. If you're worrying about deterministic network simulation and you support more than one compiler and platform, it's a nightmare, and it's a problem we want to fix. Here's the most important slide of this talk: what problems are we not trying to solve? This should put you in the right frame of mind. We're not going to become some language standards committee, and we're not going to sit around supporting esoteric hardware. My least favorite thing about discussing languages and optimization is that someone in the back of the room always raises their hand and says, "That sounds reasonable, but what happens if there are nine bits in a byte?" Someone always has to be that guy — the standards guy, the language lawyer who kills all the joy of trying to optimize something. We are not trying to solve that problem. We're going to say that characters are eight bits, the address space is linear on every device anyone ever cares about, and pointers always have the same size — all those things that otherwise throw a brick into any attempt at making something reasonable and optimal.
Perhaps the most important point of this talk is that we're not making a general-purpose compiler. You should not expect to use Burst to compile your database server or something completely non-Unity; we're making this tool for Unity and nothing else. With that said, I mentioned the hard problem of ensuring good code quality. That's a tough problem, and you've probably been there, trying to track those issues down — and there are other problems, because you're using general-purpose compiler tech to do your job. In this space there are tons of problems that no one is really looking to solve in a game context. One of them is aliasing, which we'll get to in this talk. Vectorization is another: there are no guarantees, you don't know what you're getting, it's basically hoping for the best. Another is support for actively moving things around in memory so you get better layout. And I mentioned determinism — reasoning about that is a big, messy, sticky problem. One more we'll get back to is controlling precision trade-offs. Those are some of the problems, and our strategy for attacking them is kind of simple: take control of the compilation pipeline, move it into Unity, and teach our compiler to solve problems in this space in the way we care about for games, by bringing Unity knowledge into the compiler. We'll explore that in the talk and you'll see what I mean. But first, let's review the options you have today to deliver C#. There's Mono, of course, which we all know and love — it's one of the reasons people get into Unity, in no small part. The primary focus of that technology, as it pertains to Unity, is to let you quickly iterate in a sandbox, to be flexible — all that good stuff, which is great.
It's probably a big reason why some of you are in this room. But if you disassemble what that JIT is doing, you just want to go home and cry — it's not good code, and that's not the point of it. It's called "scripting" in Unity for a reason, even though it's a compiled language. Then there's IL2CPP. A lot of people think this is an offline-compiled thing whose focus is performance, but the focus of IL2CPP was never performance. The focus of IL2CPP was to let you deploy games on platforms that don't allow JIT code generation — like the iPhone, for example. So sure, it's statically compiled offline and you don't have the Mono JIT, but there are still plenty of problems. If you look at the code you actually get out the other end, it's a little better, but there are plenty of cases where it's still not good enough. Furthermore, because it works one additional layer away from the actual hardware, with a C++ compiler in between — and different C++ compilers, to boot — it's difficult to iterate on code quality. If you play a musical instrument, you'll probably find this analogy accurate: it's like putting on gloves and then touching your instrument with extra padding in between — it's hard to get the effect you want. So those are the two options we have today. The third option we're adding is Burst, our in-house compiler stack, and we're engineering it to take advantage of those better defaults we're putting into Unity. Make no mistake: the primary focus of this compiler tech is performance. That's why we're doing this. We're not doing it for convenience — those are all nice things we want to have, but the leading focus of this effort is performance. And not only that: we want to make it a companion.
When you're in that specialist role, looking to get that perf, this should be your trusted ally — something you can pull out and iterate with, the same way you iterate on game code and behaviors with Mono. We want this to be your performance workbench. In the big picture, you've got your game and all your code. As you may know, we're rolling out packages, so in the future you'll depend on a bunch of packages to do a bunch of heavy lifting, which is great. One of those packages we're very excited about is the ECS. All these modules, and your own game, are going to be producing job kernels — things you've seen in Mike's talk and other talks — because this is the way we're going to be writing performant code in the future. So we've got hundreds and hundreds of these different types of workloads; they all sit on top of the job scheduler in the runtime, they're all supported by these garbage-collection-free containers, and that's all backed by the Unity runtime. Now there's this big empty square, and you may be guessing what goes in there: that's where the compiler slots in. You can see we're taking a very deep integration approach. The Burst compiler is a package, and it's deeply integrated with our jobs and the way we type them up. It's also integrated with the job scheduler and with the native containers — it has awareness of these things — and it cooperates with the Unity runtime to deliver a completely, deeply integrated compiler stack. It's not like we're taking some executable and bolting it onto the side; this is a Unity effort. Now, if you know me and follow what I've said, here's a quote of mine from three years ago: "Compilers are good at performing mediocre optimizations over and over." And here I am telling you I'm making a compiler — I must be a liar! Well, in fact I stand by that statement, because I said it about general-purpose language compilers: tools that try to solve every problem for everyone.
A tool that does so while accounting for the guy with nine bits in a byte, and every possible thing you could ever hope to run code on — that implies you accept all of those constraints and make all those trade-offs, and when people ask "why can't the compiler do X?", that's often the answer. But look outside that space, at specialized compilers and transformation tools with much tighter constraints — like a shader language — and suddenly you see a lot of optimizations that just make a ton of sense, optimizations we'd hope to get in C++ but often can't, because reasons. So we asked ourselves: how can we get our C# code generation to rival the best handwritten code and the best output of the most special-purpose compilers? We realized that setting constraints is what we have to do to make that a reality, and that's why we're introducing High-Performance C#. This is a subset — a proper subset — of C# with some pretty tough constraints if you're used to the full language. We're taking away class types, we're taking away boxing, we're taking away GC allocation, and we're taking away crazy things like using exceptions for control flow, which you honestly shouldn't be doing anyway. There are some other things in here too — you can't really access statics — so we're making this a very tightly constrained subset. We're doing that not because we like to punish people; we're doing it because it means we can optimize the code really, really hard. And because of the constraints in the language, it also means we can analyze a lot of things offline and at runtime: we can safely tell you that this is going to work, without race conditions and things like that.
So maybe you're feeling uncomfortable at this point: why are we taking away the C# you know and love? Well, it still looks like C#; it still is C#. You don't need to read this entire slide, but I'll call out a couple of things. First, you can still do plenty of abstraction with structs. Here's one that implements an interface, which is fine because generics and interfaces play nice as long as you take care — and if you make a mistake and happen to introduce boxing, the compiler will tell you. You've got attributes here, and some of these attributes are actually pretty advanced abstractions that are themselves written in this high-performance subset of C#. You've probably seen them in the other talks: things like a component data array, an entity array, and even a command buffer abstraction. That looks sort of like a class, but it's all written in this high-performance subset. And your regular straight-line code is pretty much what you'd expect: you put in methods, and they just live on structs instead of classes. So it's not too scary. It is still C#, and we like it because it keeps that sandbox: you've got all the basic types, you get structs and enums and generics and interfaces, and if you make a mistake and introduce boxing, the compiler will tell you. We're retaining all of that good stuff about iteration and safety — you're not going to take down Unity by using it. The most common question we get when we talk about this is: why can't you just fix C#? Just make it better? I wish I could wave some magic wand, but think about Microsoft and other companies — they have hundreds of extremely talented engineers who've been working on this problem, making garbage collection algorithms faster and all those things.
But I haven't seen a garbage collection algorithm that can actually deliver hitch-free gameplay at 90 Hz in a VR title. It's just never going to be a thing — don't try to solve that problem, because there's no solution. Instead, we're focusing on the subset we can guarantee we can optimize. All right, with that out of the way, let's talk about how the compiler works. Currently we compile your assemblies normally, using the C# compiler, which means they can run on Mono or they can run on Burst — and that's good when you're debugging, for example. Then, in the editor, when you have Burst enabled, we consume the IL that comes out of the C# compiler — this is currently opt-in, you have to tag your jobs to enable it — transform it to LLVM IR, feed in a lot of metadata that we happen to know because we control the whole ecosystem, run the LLVM optimization suite, JIT the result, and swap out the code dynamically in the editor. This is your performance sandbox, where you can iterate and measure performance as you work. What the team is working on right now is fully integrating the ahead-of-time workflow for this: everything tagged for Burst compilation will blend together with your IL2CPP output, so you can deliver a fully optimized build that has no JIT. And when I say performance sandbox, I mean it. We want to show you what's going on under the hood and give you — a performance specialist — the tools you need to do your job. The Burst Inspector is the first of those tools, and we're going to build even deeper integration in the future. The Burst Inspector is a tool where you can look at all the kernels currently tagged for compilation, pick one, and iterate on the code: what if I compile this for AVX? How does it look with SSE?
You can look at all these different hardware features, see how they transform your code, and check that it makes sense to you — is that what I expected? You can also see the .NET IL that we consume — the input to our compiler — and see whether it lines up with your expectations: why is it compiling this? I thought this was lifted out — why is it here? It's a very, very transparent way to work. You also get the LLVM IR, both unoptimized and optimized, so if you suspect a compiler bug, you have all the tools here; it's easy to copy these things out and show them to us, and we can figure it out. You can also play with the different compiler options — for example, you can enable the safety checks, which you'd normally have disabled while profiling, but it's convenient to see just how much overhead they add to your code. So again: we're trying to make Unity better by having a compiler, but we're also trying to make the compiler way smarter by knowing things about Unity. I'll talk about three things in this talk. One is context-aware alias analysis. The second is precision and determinism — what are we doing there? And we'll also look at some research we're doing on higher-level data layout changes you can make to arrays. But let's start with aliasing. If you're not familiar with the concept, you might go to Wikipedia and read, in your best Wikipedia voice: "In computing, aliasing describes a situation in which a data location in memory can be accessed through different symbolic names in the program" — which means you have more than one pointer or reference to a thing. "As a result, aliasing makes it particularly difficult to understand, analyze and optimize programs" — emphasis mine.
So let's walk through a simple example. I'll use C++ for this five-line function, but you'd have the same problem in C# with a managed array, for example. You have two inputs coming in, a and b, which are blocks of floats, and an output block where I'm going to write some floats. I pick eight floats from each input array, sum them pairwise, and write them to the output array. Now, if you're anything like me and you like to keep up on how hardware works, you're thinking: wait, I know hardware can do more than one thing at a time — there's this SIMD thing. This looks like the perfect example: eight elements, a perfect match for the SIMD width, so I'd expect this to be really efficient. You compile it with the best compiler on the market, Clang, and you get this. I'll excuse you if you don't read assembly, so I prepared some slides. This is a scalar move — a red flag right away. Wait, we're fetching one float? That doesn't look like a good beginning. And then we're doing a scalar add. Now it all makes sense: we've got scalar loads and scalar adds, and the compiler unrolled the loop eight times. Why did it do that? We scratch our heads, go back to the code, and realize: it's aliasing. The compiler cannot know that your output block isn't one of the input blocks — and what if it is? Then you're actually mutating what you're about to read next, and the compiler has to be defensive and assume that's what's going to happen. What you can do in C++ and C is slap the restrict keyword on a pointer, which tells the compiler: it's okay, you can break the rules, I promise — we'll get back to that — this output array does not in fact overlap those input arrays. And what do we get instead? We get this.
You've got a packed load operation from memory — it moves four times as much data, four floats at a time — and this add instruction adds four floats together, each independently of the others. This code is four times as fast. So why am I boring you with all this? Why don't we just expose restrict to C#, let you sprinkle it everywhere, and call it done? No — because, first of all, I don't know how many people in this room have even heard of restrict, and if you look at a lot of code bases, people don't take the time to put restrict in everywhere to get these simple speedups. Why not? Remember, we had that function up and we could reason about it — eight floats and everything, great. But while the keyword applies to the function, you're really making a promise to your compiler and your tool suite about every possible call to that function, present and future. You're promising the compiler that the junior guy you hire in three months will totally be on board with this — he'll get it, he'll never make a mistake, he's never going to alias the arrays. And if he does, "undefined behavior" doesn't even begin to describe what will happen. And in C++, restrict is an extension that isn't even standardized, so good luck with that. And if it does happen, how do you even know? You don't — there are no tools; you're left scrubbing the object files looking for things that look wrong. So it's a difficult tool to use correctly. Another serious problem with restrict is that it's unhelpful if you can't get to the pointer. In a C# context, a List<T> or a managed array has a pointer inside it that I can't get to — how are you going to put this information into that type? And in C++, std::vector also contains a pointer you can't get at, unless you fish it out and throw away the vector.
Here's another way this fun problem can materialize. Again I'm using C++ for the example, but you'd have the same problem in C#, so bear with me. I've made a very simple std::vector-like abstraction — an integer array. It has a pointer to a block of ints, a count, and a member function that says: go through the array and set every element to a particular value. Seems straightforward. Then maybe it's you, maybe it's someone else on your team who swings by and says: hey, let's not get all spendy here and spend eight bytes on that count — we're never going to store more than four billion elements, so let's switch it to an int. That's the only change that gets made; you figure it's reasonable and check it in. Then one day you, being the performance specialist, get this bug report: hey, this code is four times as slow. What happened? Let's look at the disassembly. On the left-hand side, where we had a size_t, we're getting blocks of 256 bytes moved directly into cache at the speed of the machine — it pumps data out as fast as possible, taking 0.6 microseconds for an 8K-sized array. But what's going on in the right-hand case, where we changed it to an int? First of all, we're moving elements one by one to memory — that seems not awesome. And there's this guy in the loop: a needless reload of the count of entries. On every iteration, the compiler reloads the count from memory, and this is in fact the slowest part of the loop — the latency of that instruction on x64 is the bottleneck. What's happening is that the C++ compiler, using type-based alias analysis, has to assume you've set things up so that the pointer to your first integer actually points at the count variable.
That's what it has to assume, because someone could do that. And you can see one potential fix: load the count into a local variable, which cannot possibly be aliased — how could you have a pointer to it when it didn't exist yet? — and use that local copy; then we're back to the optimal codegen. Needless to say, this is fine for a simple example — you can sit down and work it out like this — but how are you going to do that across your entire code base? That's the real problem with the current state of the art in aliasing. Okay, so now you know the problem space; I think we're ready to play Will It Alias. Things get interesting when you start combining things into structs. I have struct Foo, with an unsigned int and a float in it. You're looking at some code — these could be references, it doesn't really matter — and you have two pointers: a Foo pointer and a float pointer. Does the compiler have to assume they alias? Yes, because you could have a pointer to the float inside the struct, and then all bets are off, so we have to be defensive. All right, here's another case: a Foo pointer and an int pointer. Clearly these should not alias, you think — but alas, they do, because under the type-based rules C++ uses, unsigned int and int are sibling types of the same size, so they're allowed to alias. And the problem doesn't stop there. It's not like you can look at every abstraction and just see what's in it; quite often we abstract further by putting things into ever deeper boxes. So here I've put the Foo struct twice into a Bar, and now we come dragging in that int pointer, and at face value you think: I don't see an int in there, so this should be fine.
But in fact it aliases anyway, because you could have a pointer to something deep inside that hierarchy — which just shows how difficult this is to analyze at face value. All right, to wrap up the problem statement: what actually is the problem with aliasing? It's not that the code is incorrect; it's that it prevents the compiler from using SIMD instructions, as you saw in the simple example — often a 4x performance loss, and that's nothing to sneeze at. I'm not going to bore you with the super low-level architecture details, but if you're really interested, you can go back to my talks from a couple of years ago and dig into why that's true. We also saw that reload of a variable in the inner loop, and quite often those reloads are the silent performance killers — they really slow down your loops. This is a problem because compilers working without context cannot possibly solve it: a C++ compiler or a C# compiler can't reason about every possible call you might make to a thing and then make a global choice. So we need to allow the compiler to break the rules, and that's the current state of the art. Here's what we're doing instead: we're putting that context into the compiler, and we can, because we know much more. For each job you compile, we know it cannot get pointers from anywhere besides its inputs, and for all the native arrays and component data streams you set up, we actually guarantee, when we schedule those jobs, that they cannot have aliases — they're guaranteed to be unique. And because we own the compiler, we tell it: hey, this is true. In effect, it's as if a performance specialist had visited every part of the code and made sure those annotations were there, automatically. Now, we haven't actually arrived at a fully perfect solution here.
If you want to be bored to tears, I can talk to you about those — but what we're seeing, even with this very first, just-scratching-the-surface implementation, is that most loops have a much better chance to auto-vectorize, and we get rid of those needless reloads of data. And the gravy is that it costs nothing for you to maintain when you're trying to keep your systems in perfect shape. So, to go back to that initial example where we're adding 8 floats: here's the same sort of idea expressed as a C# job. Again I have two input arrays that are read-only and an output array that I'm writing to, and I'm doing the same thing — pulling out eight floats and writing them. There are no annotations, no crazy markup; this is just the way you write it in the C# job system, and it generates SIMD code by default. You don't have to do anything. So that's cool, and that's where we're at today — and we can do better for more complicated use cases, when you start copying these things around and moving them, and we think we have a really good shot at doing that. It's also missing precision control. The reason we think precision control is important is that quite often you can get a good speedup, if you don't need that last bit of precision, by computing things with lower precision and lower accuracy. But that has to be a decision that you make as the author, because it's context dependent. If you have a visual effect in the far distance — it's a smokestack and it's going to be 15 pixels high — who actually cares if it was computed using doubles or halves? It's gonna look the same, so you just want it to go fast. But maybe you have another case: it's also a visual effect, but it's the one game-defining thing and it has to look good at 4K — then you don't want that stuff wobbling. Or maybe it's your terrain pathfinding code, where if you take away too much
precision, it actually starts dividing by zero and generating paths that don't make sense. Or maybe it's some completely custom game thing. The point is: only you know, and a general-purpose compiler or system can't make those trade-offs for you. If you're trying to do this today — it's very difficult in C#, and in C and C++ you can sort of do it, but you have very clumsy controls. Say you need to work with half-precision data — which, if you're not familiar, is just a 16-bit float, so it's got a very small range, but most hardware nowadays comes with dedicated support for half. You can do twice the work with the same bandwidth — that's cool, and I can think of hundreds of cases where you'd want to use that in a game. But if you need to do this today, you need intrinsics, and you need to hand-code it for a specific platform, so a lot of people don't. And even if you're just trying to make floating-point operations more flexible — saying "I don't really care about this particular thing, you can fold the expressions here" — to give the compiler more leeway, the only tool you really have is a blunt instrument, like treating the whole body when you have a headache: there's this fast-math option, you apply it per object file, it's kind of messy, and it's really hard to ensure the compilers are doing what you hope when you give them that license. I'm talking about things like: instead of dividing, you multiply by the reciprocal. The reason the compiler can't just do that on its own is that under the IEEE rounding rules the results will not be exactly the same — you can lose a couple of bits of precision, one ULP or something like this. But again, only you as the author know whether that's okay or not; the compiler can never know that. Square roots and trig functions are also good candidates.
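The reciprocal rewrite mentioned above is easy to show in C. Note the two functions are mathematically equivalent but not guaranteed bitwise identical — they can differ in the last ULP, which is exactly why a compiler needs your permission to make the swap:

```c
#include <math.h>

/* IEEE-exact: one correctly rounded division. */
float div_exact(float x, float y)  { return x / y; }

/* Fast-math style rewrite: one rounding for the reciprocal, one for
   the multiply — potentially off by an ULP or so from div_exact. */
float div_approx(float x, float y) { return x * (1.0f / y); }
```

For a batch of divisions by the same `y`, the reciprocal can be computed once and reused, which is where the speedup comes from.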
These are all cases where you want to reduce the precision and get away with faster execution. Tied to this — related, but orthogonal — is determinism. If you're not familiar with the concept: a lot of network models send just the inputs, and then rely on everyone to compute the same frame. It's sort of compression — instead of sending hundreds of thousands of units around, you're just sending the inputs from all the players, which are tiny, and then it works itself out. But the downside is: as soon as you have something non-deterministic going on, everyone's gonna desync — and if you talk to battle-scarred RTS developers at GDC, they will tell you endless stories about how much fun that was. So what needs to happen is that everyone computes the same bitwise results; all the computation that goes on needs to do exactly the same thing. And it's important to realize that this is orthogonal to precision: as long as everyone computes the same thing, it doesn't matter if they used low precision to do so. It's not necessarily true that determinism has to mean the highest possible precision. Now, if you're doing this with floating-point math and you're trying to target more than one compiler or platform, it is very, very difficult — compilers reordering things, or even the slightest change to something, can mean you differ by a couple of ULPs, and then things start to slowly diverge, and your players hate you, and I don't know what happens next. You can fix this by using integer math, for sure — you can use fixed point, and those things are always going to be deterministic — but who's gonna rewrite all of their stuff to use integer math? Or maybe you have a game that's not an RTS and you want to target that space anyway — that's a lot of code to rewrite to make sure it's deterministic. And it seems to us that no one's really interested in solving this problem.
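The fixed-point route just mentioned can be sketched in a few lines of C. Integer math like this is bitwise deterministic on every platform, which is why RTS engines have historically leaned on it (the 16.16 format here is one common choice, not something from the talk):

```c
#include <stdint.h>

/* 16.16 fixed point: 16 integer bits, 16 fractional bits. */
typedef int32_t fix16;
#define FIX_ONE (1 << 16)

fix16   fix_from_int(int32_t v)   { return (fix16)(v << 16); }
int32_t fix_to_int(fix16 v)       { return v >> 16; }

/* Widen to 64 bits for the product, then shift back down. */
fix16   fix_mul(fix16 a, fix16 b) { return (fix16)(((int64_t)a * b) >> 16); }
```

The trade-off, as the talk says, is that rewriting existing floating-point game code into this form is a huge amount of work.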
You don't see a lot of discussion on the Clang mailing lists, or language extensions, to solve this problem. So where we want to go with these two areas — precision and determinism — is to give you control per job, or maybe per project or per module, whatever makes sense; we haven't figured that out yet. We want to expose these attributes, and expose that they're orthogonal, so you get to make the choice. To go back to that RTS example: I'm gonna try to make a case for all four settings. A bane of my existence is that in every game I've ever worked on, basically, some junior programmer makes a bird system — it spawns a bunch of pigeons, they all fly around on paths, it takes five milliseconds, and I just cry into my console. Like, why are these birds here? If I had a really quick way of saying "wait, these don't affect the game simulation, and they're ALU bound" — not that they would be, because they're gonna be cache-miss bound, but let's say they were — it would be really nice for me to just say: OK, reduce the precision here, make them slightly cheaper, maybe that's good enough. On the other hand, in an RTS maybe you have a bunch of catapults. The player needs to aim them, and at long distances you need pretty good precision — but you also need them to be bitwise exact, because when you let go of that fire button, everyone has to do exactly the same thing, or the simulation is gonna desync. And then on the other hand you have units — maybe hundreds of thousands, or tens of thousands — and their pathfinding is gonna be quite a bit of CPU work. It would make sense to compute all of that with lower precision, because how much do you really care about this individual guy and the path he took? But what you do care about is that he's in the same position on every machine, even though we're computing his pathfinding with lower precision. And
you can make the case for fancy effects or something that don't affect the simulation but where you still want good precision — so you can make all of these different cases. Now, we have some research to figure out before we can deliver all this. For example, the math library: you can't just use the FPU's basic sine function, because it's going to differ on different platforms, and you can't call out to the system math library, because it's different on different platforms. So we have to basically internalize all of that to be able to give you the guarantee that this is not gonna change. And there are options here — there's a library called SLEEF, and others, that can also SIMD-ify things like sine and cosine, so we're looking at those. What we have to figure out is how we guarantee determinism, and what the test matrix looks like — what the test cases are. If you have feedback there, it lives on the forums now, but we think we have a pretty good plan for how to set up this testing framework. We're also not sure what the performance hit is going to be. We think it's gonna be reasonable — it's definitely going to be slower to run deterministic code than fast code, that's accepted, but we don't exactly know how big the hit is gonna be. And I'm sure there are gonna be fun out-of-spec things and hardware bugs we're gonna find. All right, so the final thing I want to talk about is some of the future research, and what the promise of this Burst technology really is. I've mentioned SIMD a couple of times in this talk, and it really is important. If you haven't done SIMD, or you're just peripherally aware of it, trust me when I say that this is how you get the most out of a CPU today. And it's not only computation — doing more than one thing at a time — it's leveraging the hardware resources that you have, like the caches and the bandwidth. All of these buses are 128 bits wide, and if you're
doing a 32-bit load, you're throwing away 75% of the cache bandwidth. SIMD is the only real way you have to use all that hardware — but you have to lay out your data correctly to do it. So I'm gonna give you a two-slide introduction to SIMD, to set up the next section of the talk. How a computer works is that it does things in sequence. Say we're gonna add two values — floats b and c — into a, and these things are all in memory. Here's what has to happen: we load b from memory, we load c from memory (because how else are we gonna do it?), we add them, and we store the result back to memory. Great — that's just how a computer works. The only thing you have to understand about SIMD, in that example, is that SIMD simply makes those operations wider. Instead of loading b, we load b0, b1, b2, b3 — completely independent values, as if you had done that thing four times. Same thing with c: we're loading four completely independent c values. And when we sum them, it's still a single instruction that does the sum, but we're doing four completely independent sums — as if you had done the same thing four times. And when we store, we store not one value but four — and the key thing is, they're consecutive in memory. That's really all you need to know about SIMD. So the constraints, then: look at the load step — we load b0, b1, b2 and b3, great — but that's not the complicated part. The complicated part is what we, as performance specialists, have to do before we get to that point: we have to make sure that all the data is already in memory in that form. Because the workarounds, if it isn't, are usually not that great — you have to shuffle things around, and you waste a bunch of infrastructure cost just to get to the data.
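The two-slide SIMD introduction above — the same load/add/store sequence, just made four wide — looks like this in C. Modern compilers will turn the second function's loop into single vector instructions when the layout allows it:

```c
/* The scalar sequence: load b, load c, add, store. */
void add_one(float *a, const float *b, const float *c) {
    *a = *b + *c;
}

/* The SIMD idea: exactly the same sequence, four lanes wide. Each lane
   is an independent sum, and the four values of each operand are
   consecutive in memory — which is the whole layout constraint. */
void add_four(float *a, const float *b, const float *c) {
    for (int i = 0; i < 4; ++i)
        a[i] = b[i] + c[i];
}
```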
It's worthwhile calling out that if you have basic arrays of primitives — distances or something homogeneous in nature — you already have your data in that format, so you don't have to do anything. (That's true as long as you don't do anything crazy between elements, which would require different algorithms.) But the real confusion usually arises when you start thinking about structs and abstractions — then it becomes much more difficult. Why is that? Well, as programmers we tend to think of things and give them identity — and I blame Pascal and C and Ada and every language ever invented over the last 50 years for this — because we tend to think of a thing's identity as a pointer or reference to it. The pointer to the first byte of the struct, or a reference to it — we tend to think of that as the thing. And so we conflate the concept of what the abstraction is with how it's laid out in memory, and we're trained to do this. It's sort of comfort food for programmers: "oh, now I know what you're talking about — you have a pointer to the first byte." So when we make an instance by grouping things together, we're satisfying that human behavior of abstracting things by naming groups of things — humans are excellent at that; it's one of the few things that allow us to function in the real world. I don't refer to my car as "that pile of mechanical parts, one of which is the steering wheel" — I say it's my car. I name the thing. But that doesn't mean the people who make the car have to think like that. So it's really unfortunate that this connection — our brains working in a particular way, and us picking memory layout — has this performance ramification. Because what happens when we do this? Say I've made
a thing here — some grouping of two things: I have a float3 position, and I have hit points. Maybe it's an exploding barrel — that would make it funnier, but I can't edit the slides now — and we have an array of these that we're gonna do some work on; let's blow up all the barrels. Because we abstract things this way and group them together this way, here's how they actually lay out in memory. Our comfort-food logical instance is this box here: you've got your X and your Y and your Z, and you've got your hit points. It's important to realize at this point that these are heterogeneous types — three floats and an int — and they all go in one long stream in memory; that's what you get with an array. But remember what we said about SIMD: in order to actually work with SIMD on this array, you're going to load four things at a time. So when was the last time you wanted to do the same thing to X, Y, Z — which are floats — and hit points, which is an int? The only case I can think of is setting them all to zero, and that's maybe not the most common thing you do. But that's basically all we've left the hardware able to do with this layout — and still it's the default, which just shows that those abstractions from 50 years ago are getting in the way of what hardware does today. This is called AoS, array of structures. On GPUs, and in high-performance programming generally, you can do different things. If you ask what the optimal layout for the hardware is: you tilt it around — you keep the same abstraction in your head, but you don't actually use that struct in memory; you use something else. You have increasing addresses, and you take your logical instances and split them into a number of different streams — four, in this case. Now I've put all my X's in one array, all my Y's in one array, all my Z values in one array,
and all my hit points in one array. Suddenly two things happen. One: most programmers freak out, because now you can't use a reference or pointer to a thing anymore — your logical instance is purely a thing of your mind; it doesn't look like that in memory anymore — and that can be confusing. It's an awkward step that you have to accept. But the other, beautiful thing that happens is that if you look across any one of these streams in memory, you'll find like data — X coordinates, or hit points — which means this is now super amenable to SIMD. You can load four hit points, or four X values. For example, say you're doing a 2D projection and you don't care about the Y value — you won't even touch that stream. That makes for very trivially parallelizable SIMD. So that was full SoA, but you can also do a hybrid, because the traditional SoA layout has some problems: you end up with a bunch of separate arrays, which is fine for a handful, but if you start getting up to eight or so, you start to push the cache hardware, because it only has so many prefetching streams. So what you can do instead is tile this around and store it in one long stream that's interleaved — that's called AoSoA, or chunked, or striped layout; you'll see different names for it. Again we have increasing addresses in memory, and in our single stream we interleave: X, Y, Z, then hit points — X, Y, Z, hit points for the second guy, third guy and fourth guy. (These are so small that I couldn't really draw the boxes for a logical instance.) It's very similar to the SoA case — our logical instance is split into four different places in RAM — and then we start over with a new batch. So you can see it's like you took that SoA layout and basically spliced it back together into chunks.
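The three layouts from the slides can be written down in C with the barrel example — type names here are illustrative, not Unity API:

```c
/* AoS (the default): x,y,z,hp, x,y,z,hp, ... — heterogeneous fields
   interleaved in one stream, awkward for 4-wide loads. */
typedef struct { float x, y, z; int hit_points; } BarrelAoS;

/* SoA: one homogeneous stream per field — ideal for SIMD, but each
   extra field costs another array (and another prefetch stream). */
typedef struct {
    float *x, *y, *z;
    int   *hit_points;
    int    count;
} BarrelsSoA;

/* AoSoA / chunked: one stream, interleaved in batches of 4, so each
   field is still SIMD-friendly but the cache walks a single stream. */
typedef struct { float x[4], y[4], z[4]; int hit_points[4]; } BarrelChunk;
```

An array of `BarrelChunk` holds four logical barrels per element, which is the "start over with a new batch" structure described above.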
All right — so why have I bored you to tears with all this layout stuff? It's to set up how important good memory layout is for performance. You've probably seen the other talks here at GDC about the ECS, and the ECS honestly gets a lot of its performance just from better defaults for memory layout — things are already packed in arrays. This is the next step: we want to be able to say, for this array, actually lay it out in SoA form, or chunked AoSoA form — which means the fields are now scattered inside that array. It still looks like an array to you: if you read from it, we'll actually go gather all those pieces and put them back into struct form for you, and if you write to it, we'll scatter it back out. That sounds like we just made it slower — but you're forgetting that we have our own compiler. What we can now do is teach the compiler: SoA arrays work like this, here's the layout. And all the strides and loads it computes, when it's trying to auto-vectorize, will now be embarrassingly simple to vectorize. So let me show you in practice what that looks like. Here's a very simple culling job: I have a sphere, and I have an array of spheres, and those are the two inputs of this job. The purpose is to see: does the input sphere intersect any of the spheres in the array? A very common sort of test-all-the-things problem. In my head I'm thinking of the sphere like this: it's got X, Y, Z and a radius — I don't know how you think about spheres, but that's how I think about them. And the loop we're gonna do here just walks through all of the spheres and intersects them one by one. I'm writing scalar code here — this is not some magic, it's not SIMD or anything — testing each one and keeping track of whether we saw an intersection. And the intersect routine is pretty much what you'd expect.
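The scalar culling test just described — the talk's version is a C# job; this is a plain-C sketch of the same textbook check:

```c
#include <stdbool.h>

typedef struct { float x, y, z, radius; } Sphere;

/* Two spheres intersect when the squared distance between their
   centers is no larger than the squared sum of their radii.
   Comparing squared quantities avoids a sqrt. */
bool spheres_intersect(Sphere a, Sphere b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    float r  = a.radius + b.radius;
    return dx * dx + dy * dy + dz * dz <= r * r;
}

/* Test one sphere against an array of spheres, one by one. */
bool any_intersection(Sphere probe, const Sphere *spheres, int n) {
    for (int i = 0; i < n; ++i)
        if (spheres_intersect(probe, spheres[i]))
            return true;
    return false;
}
```

With the AoS layout (`x, y, z, radius` repeating in memory), this stays scalar; the SoA change described next is what makes it vectorizable.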
There's really only one way to write it: we subtract the positions and we add the radii, and then we compare — are the positions closer, squared, than the sum of the radii, squared? It's what you'd find in any textbook. Now, that previous example will generate scalar code, even today with Burst — we're hoping to make that untrue at some point — but it's crippled by that memory layout. Remember, if you want to run wide on this, the layout in memory is X, Y, Z, radius, X, Y, Z, radius — and that's really awkward to work with. With this one-line change in the prototype we've made, we're simply saying: oh, by the way, this NativeArray is SoA. That's the only change that needs to happen, and we generate this code — full-on SIMD all the way through, like you would hand-write with intrinsics — and it warms my heart. This is especially interesting for performance specialists; I would have loved to have a tool like this doing triple-A optimizations over the years. Because normally, for this to work, not only do you have to go touch your containers and your arrays to declare that they're SoA, you also have to go touch every place you access them to do the gathering-and-scattering transformation — and that's a ton of busy work. In practice you just have to pick the most important ones, hope you guessed right, and fix just those. But with this sort of tooling you can do it on a much, much bigger scale. So that's where we're hoping to go with memory layout. All right, so in conclusion: we're trying to make this compiler smarter by teaching it about Unity. We want to teach it about our jobs, our scheduling, how our containers work, how memory layout works — and we can do that because it's our compiler, and we're only gonna be compiling Unity code; you won't find a download of this that you can compile a database service with. And you've seen examples here in this
talk and in other talks of High Performance C#, which is the optimizable subset that enables all of this to work. So that's it for my prepared talk, and I think we have a little time for questions. [Applause]

Q: Thanks for the talk. You mentioned this Burst compiler is coming out in 2018.2, but it's also based on the new HPC# — so how is that transition gonna happen? Is there a transition period where it's opt-in for a while, and you can translate certain parts to HPC#? How does that work?

A: OK, so the question is: since this is coming out in 2018.2, and we're not going to rewrite all of our code in HPC# by then, what's the plan? That's fair. Well, the compiler is not changing — this is the language subset it is going to compile, and this is the direction we're headed with the performance initiative at Unity. But we have plenty of features that aren't ready to be moved into this ecosystem yet, so it's gonna take a little while. Burst embodies the future; this is where we want to go. I can't give you a perfect date or timeline, but it certainly will be production-ready this summer, so you can start moving your game code and other things onto it, and if you write in the subset you can actually use this compiler. So it's a matter of which features to bring over first — you can start translating parts of your code into the subset, or using the specific patterns. And you can even do things like: if you have a mostly data-oriented system, but a few tendrils reaching out into the object-oriented world — like the GameObject stuff you can't access from jobs — you can still do that part on the main thread in regular Unity space, set up the data, and then move parts of your work over to this space and still get most of the benefit. So yeah, that's available this summer — it will be available
for release, is what we hope, so you can actually ship games with it. Cool — other questions?

Q: If you manually stride your structures today, will Burst vectorize it for you?

A: It's hit and miss — good question. The question is: what if you were to do that SoA transformation yourself? In C#, in an unsafe context, you can use the fixed keyword to make small blocks — so I could have written fixed float buffers and done it that way all the way through, with an array of those. And the answer is: it depends. LLVM's auto-vectorizer is not particularly great with those kinds of patterns — sometimes it works, sometimes it doesn't, and when it doesn't it's very frustrating. For this particular example I was actually trying to get it to auto-vectorize using exactly that pattern, and what I found was: if you write it the way I wrote it here, it doesn't work — but if you take the loop over those four things in the inner loop and split it into a separate function, you get vector code. It's super brittle, and we think the only way to make this work reliably is to force-feed that information to the compiler.

Q: One question about how structures are laid out in memory: what about empty structs that are just tags, as one of the previous speakers discussed? Is there something like the empty-base-class optimization for that, or is it just a byte or something?

A: So the question is: for structs that are empty, are we gonna do anything to magically optimize those away? I think the answer is twofold. In general, no — but in the special case of marker types in the ECS, we're definitely gonna do that, because as we're developing these demos and trying to show you what the best practices should be, we're finding that we
have a lot of uses for those, and it's kind of dumb to reserve a byte that you're never gonna put anything in. So yes, in the ECS that's definitely on the roadmap — we want basically free tags, where all we need to know is that the tag is there; since it doesn't carry any data, we don't need to store it.

Q: Will it be possible to fill a compute buffer with those optimized arrays? Because when I saw that I thought: if I could just point at that directly and wouldn't have to build my own array and then send it over to a compute shader, that would be super nice.

A: Right, so to sum it up, you're asking: can I put one of those arrays in a compute buffer and then just kick a compute shader on it? Today you can't — well, you can make one and copy it, but it's not at all built in at this point. We're definitely exploring that option, and we think there are cases where it can make sense — especially on PC, where you typically have a little more GPU wiggle room compared to a phone or a console, it makes a lot of sense to be able to target a subset of the compute work that way. But that's research; I can't promise when it's coming. Like you pointed out, though, it's not a very difficult leap to make, so we're definitely thinking about it. Thanks. All right, one more? All right — I'm all good, thank you. [Applause] [Music]
Info
Channel: Unity
Views: 18,879
Keywords: Unity3d, Unity, Unity Technologies, Games, Game Development, Game Dev, Game Engine, Machine Code, C#
Id: NF6kcNS6U80
Length: 56min 40sec (3400 seconds)
Published: Fri Mar 30 2018