How To Render 13,086,178 Objects At 120 FPS

Video Statistics and Information

Reddit Comments

Good stuff I'm (mostly) familiar with, really cool you brought up Time.time caching as well. Really irks me the engine doesn't actually cache that value considering it's used quite often.

👍︎︎ 19 👤︎︎ u/_Auron_ 📅︎︎ Mar 26 2023 🗫︎ replies

Perfect video that shows how performance is dictated by the developer and not the game engine used. I often hear about how horrible Unity's performance is compared to Unreal Engine. At the end of the day, the performance is really based on the developer's experience with the engine.

👍︎︎ 71 👤︎︎ u/Dreamerinc 📅︎︎ Mar 26 2023 🗫︎ replies

Thanks mate. As a noob dev I've learned a lot from your videos.

👍︎︎ 5 👤︎︎ u/NomyDev 📅︎︎ Mar 26 2023 🗫︎ replies

Very cool, I was wondering how to do this a while ago and couldn't find a good explanation before. Thanks for sharing man, much appreciated!

👍︎︎ 5 👤︎︎ u/tharky 📅︎︎ Mar 26 2023 🗫︎ replies

Did you mention your test machine specs anywhere?

👍︎︎ 5 👤︎︎ u/digidomo 📅︎︎ Mar 26 2023 🗫︎ replies

That's so COOL

👍︎︎ 7 👤︎︎ u/jubin_jajoria 📅︎︎ Mar 26 2023 🗫︎ replies

I also always cache the Camera.main, however in the example code surely the compiler is just optimizing the cache loop out completely? My guess would be the extern call version (even if cached somewhat by Unity) causes the compiler to not be able to optimize out the loop. It would be interesting to create a build and decompile to IL and see what has happened to the code.

I do love your videos and explanation style though, great work.

👍︎︎ 3 👤︎︎ u/samoatesgames 📅︎︎ Mar 26 2023 🗫︎ replies

wow your whole channel is amazing. learned lots today from your 10 things you should be doing in Unity. subbed for life 😃👍

👍︎︎ 4 👤︎︎ u/Fl333r 📅︎︎ Mar 26 2023 🗫︎ replies

Nitpick: on the indirect draw example, I believe you meant to say resubmitting “draws” every frame, not “meshes.” Graphics APIs do be confusing.

In any case, fantastic demonstration!

👍︎︎ 5 👤︎︎ u/frizzil 📅︎︎ Mar 26 2023 🗫︎ replies

Captions

All right. Hello, friends. Let's see how far we can push Unity. So this is my test scene: it's just a whole bunch of cubes dancing to Perlin noise, and if we get closer we'll see that they're all rotating based on their offset height. In this scene specifically we are using individual MonoBehaviours, so that means that every single one of these cubes has its own script, and it is managing its own position and rotation. I will show you how that's all set up.

So in our level 1 script (and I know this looks a little bit weird, but it is just a function which makes it so I don't have to keep typing this nested loop in every single level), for each grid position we are instantiating a cube, we are adding the level 1 cube script, and we're initializing it. In the level 1 cube script, which is on every single cube, obviously, we are setting its new position and rotation every single frame. For this I'm using a little helper function here which is doing an inverse lerp and a quaternion slerp depending on its height, and it's also calculating its y position. Depending on whether we're using the Burst compiler or not, I'm using Mathf versus Unity.Mathematics; a bit of a spoiler for the Burst compiler there.

So back in the build, you can see that with 10,000 cubes we're getting about 75 FPS. If we ramp that up a bit to 60,000 cubes, it's starting to really chug. You would never do this in a game. And up at 90,000 cubes you're looking at 8 FPS, so obviously not a good way to run your game.

So how can we improve this? Well, an easy way is to use a manager script. Instead of having a script on every single one of these cubes, just have one script, the manager script, which will loop through all of these cubes in an array and update their position for them. So let's have a look at the code. We are spawning them all just like we did in level 1, but this time I'm adding them to an array, and then in the update function I'm looping through the array and setting their position for them. So there is now not a script on any of these; it's just one manager script.

And if we compare the two: 75 FPS, 90 FPS, which is already enough reason to do it. But a good test here is if we ramp it all the way up to 90 and look away from these cubes, to take away the rendering. On one, with individual scripts, so 90,000 update loops, we're getting about 20 FPS. If we go across to two, with the manager, we're getting 31, 32. So yeah, there is a difference. Basically it's more of a data-oriented approach, keeping all your data in the same place and looping over it in one manager, as opposed to having them all handle it themselves, which is more of an object-oriented approach. And obviously you get the benefit of reducing all of those update calls to just one.

Okay, so this next one is just avoiding an extern call. This is more of a personal experiment for me. I just wanted to see if there was any difference between grabbing the transform position each frame as opposed to caching the position and not actually doing that extern call. So in this one you can see that I'm spawning them, but I'm also saving their positions in this last-positions array, and then in the update function I'm simply reading from the array instead of transform.position, and then I've also got to write to the array. So we're exchanging an extern call for a read and a write operation on an array. Let's see if there's actually any difference. If we do 10,000 cubes on two, we've got like 89 FPS maybe. If we go to three, a few FPS different; definitely not worth doing at this scale anyway. So let's actually take the rendering out: on two, with the extern call, I've got about 69, 70 FPS. If we go to three, I've got about 75.
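The manager approach and the cached-positions experiment described above can be sketched outside Unity. The following is a minimal Python illustration, not the code from the video: `CubeManager` and `wave_height` are hypothetical names, and a sine product stands in for the Perlin noise so the sketch has no dependencies.

```python
import math

def wave_height(x, z, t):
    # Stand-in for the Perlin-noise height used in the video (assumption:
    # any smooth function of x, z and time shows the same pattern).
    return math.sin(x * 0.3 + t) * math.cos(z * 0.3 + t)

class CubeManager:
    """One update loop for every cube, instead of one script per cube."""

    def __init__(self, side):
        # Grid positions are stored once in a flat list (the "array").
        self.positions = [[float(x), 0.0, float(z)]
                          for x in range(side) for z in range(side)]

    def update(self, t):
        # Single pass over the data: read x/z from the cached array and
        # write the new y back, rather than asking each object for it.
        for p in self.positions:
            p[1] = wave_height(p[0], p[2], t)

manager = CubeManager(side=100)  # 100 x 100 = 10,000 cubes
manager.update(t=0.0)
```

The point of the layout is that all per-cube state lives in one place and is touched by one loop, which is the data-oriented idea the speaker contrasts with per-object scripts.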
But let's continue with our performance test. So how do we push this further? That would be with GPU instancing. Now, as you can see, we have climbed massively in performance, going from 90 with 10,000 cubes to over triple. So if you don't need an actual game object, with a transform component and possibly colliders and all that, and you just need the visual representation, you can just have the GPU directly render it and skip the overhead of a game object.

So let me show you the code. Here's our level 4 script. In our loop we are just setting some positions; we're not even instantiating anything now. In our update function we are looping through all of our positions and updating a matrices array with the new position, rotation and scale, so a matrix is comprised of three components. And then once per frame we are now calling RenderMeshInstanced. We are sending in our material (this can also have a bunch of other properties), we're sending in the actual mesh that we want to render, and this is our submesh index, so depending on how many materials your mesh actually has, you'll need to play with this number. Mine only has one, so I'm leaving it at zero. And then of course the updated matrices every single frame, and the GPU will just render that directly. So that is absolutely a nice performance boost. Let's ramp this up a little bit: still getting, you know, 70 at 50 FPS. Go all the way to 90.
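As a rough illustration of the per-frame matrix update described above, here is a small Python sketch that builds one position/rotation/scale matrix per instance. It is not Unity code: `trs` and `transform_point` are hypothetical helpers, rotation is restricted to the y axis for brevity, and the matrices use a column-vector convention with translation in the last column.

```python
import math

def trs(position, yaw_radians, scale):
    """4x4 matrix = Translate * RotateY * Scale, as row-major nested lists."""
    c, s = math.cos(yaw_radians), math.sin(yaw_radians)
    sx, sy, sz = scale
    px, py, pz = position
    return [
        [c * sx,  0.0, s * sz, px],
        [0.0,     sy,  0.0,    py],
        [-s * sx, 0.0, c * sz, pz],
        [0.0,     0.0, 0.0,    1.0],
    ]

def transform_point(m, p):
    # Apply the matrix to a point (implicit w = 1).
    x, y, z = p
    return tuple(m[r][0] * x + m[r][1] * y + m[r][2] * z + m[r][3]
                 for r in range(3))

# One matrix per instance; in the video this array is rebuilt every frame
# and handed to the renderer in a single call.
positions = [(float(x), 0.0, float(z)) for x in range(3) for z in range(3)]
matrices = [trs(p, yaw_radians=0.0, scale=(1.0, 1.0, 1.0)) for p in positions]
```

No game objects are involved here at all: the only per-instance state is the matrix, which matches the speaker's point about skipping that overhead.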
Still getting 40 FPS, which is not too bad. This is a lot of cubes, all doing their own little computations here. So how can we push this even further? Currently we're doing this synchronously: we're in one update loop, going through each cube one by one and updating their matrices, and that's slow. So why not take advantage of our multi-threaded systems? And as you can see, there is a huge performance boost here. This is utilizing two components of Unity DOTS. It's utilizing Jobs, which allows us to unlock the full multi-threaded capabilities of modern systems, as opposed to just the single-threaded operations we were doing before, and Burst, which takes your IL, or .NET bytecode, and translates it down to native code, which is super, super fast to run. So let's see how this scales: up to 60K cubes, still over 100 FPS; up to 90,000 cubes, and we're still over 60 FPS here, which is absolutely crazy.

So I'll just show you the code and how that looks using Jobs and Burst. As usual, we're setting the initial positions. We're creating this job here, and then down in update, instead of directly setting it all synchronously as we did here, we are scheduling this job. We're sending in the native matrices, and then the job is saturating all the threads and doing all of these calculations in parallel. And because we're using Burst compile here, we do need to use the Unity.Mathematics library as opposed to the Mathf library. This will work in the editor for some reason, but once you compile it, it will break, so you have to use the Unity.Mathematics library. And as you can see, massive performance boost.

So how do we take this further? That is by adding the final component of Unity DOTS, which is ECS, the Entity Component System. This is now pure DOTS, and as you can see, the performance is getting quite up there, and this scales quite well. So now we have 90,000 cubes at over 150, 160 FPS, which is insane. Also, because we're using ECS and not GPU instancing, this now gives us the benefit of having more control over our objects. We could add collisions here. Obviously adding any kind of physics or collisions is going to tank your FPS a little bit, but...

All right, so if you were to ask me what my opinion is of Unity DOTS, I would say that I love it. I would say jump into using Jobs and Burst immediately, because the performance increase that you can get from it is insane. If there are any tasks that you think can be done in parallel, it's so easy to put them into a job. As far as ECS goes, the amount of breaking changes that it goes through, especially in just the last two years, is insane. Basically every article is deprecated; you've got to hunt through obscure sources to find out how to do things. I would say it will stabilize this year, as 1.0 just got released, but holy, it has been a wild ride just trying to learn enough to put some decent tests together.

Anyway, so where do we go from here? How can we improve it even further? And would you believe me if I said that what I've already shown you is not even close to the performance that we can push? This next one is truly mind-blowing, so let's jump into it.

Okay, so GPU instancing indirect. Now, I know this is not the Perlin cubes that you've been looking at, but holy, I was getting bored of the Perlin cubes, and I'm sure you were too. So I've changed it to this, and this also allows me to show you how it scales better than those Perlin cubes. So there is one main difference between indirect and direct GPU instancing. We've been using direct previously, and with direct you have to send the mesh data to the GPU every single frame, which is a massive overhead. With indirect you send it once, right at the start, and then the GPU can cache that mesh data and reuse it every frame. So I'll show you how to do that. Basically we're just filling up this uint array with some mesh data, then we're setting it on an args buffer, and then when we're rendering it I'm just sending it in, and the GPU will actually cache this and reuse it every single frame. So that's obviously removing a lot of overhead.

But the most significant change between the previous demos and this one is that I have offloaded all the computation to the GPU. Previously we were here on the CPU calculating all the matrices and then sending that data up to the GPU, but on this one I'm doing it all in the shader. So at the start I'm just creating their initial positions. I've got these two position buffers, position one and position two. Basically, for each cube I'm selecting a point close to the center of the sphere and then another point which is further out, and then I'm sending all that to the shader, which I'm grabbing here, and I'm doing all the calculations directly in the shader. I'm just lerping between the close and far positions, as well as changing the color depending on how close they are to the center, to give it kind of like this burning sun visual.

And with just those two changes, let's see how much we can render on the screen without my computer completely dying. So, fifty thousand, ninety thousand, and it looks pretty crazy when you get all close. Looks really trippy; it's like a 90s wallpaper or something. But then we've also got this multiplier here, because this was just not enough. This goes up to 30, and I'm going to pump it all the way up to 30.
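The per-cube work that the speaker moves into the shader is a linear interpolation between a near point and a far point, plus a color that depends on closeness to the center. A minimal Python sketch of that arithmetic follows; it is an illustration only, `cube_state` is a hypothetical name, and the `heat` value is an assumed stand-in for whatever color ramp the actual shader uses.

```python
def lerp(a, b, t):
    # Standard linear interpolation: t = 0 gives a, t = 1 gives b.
    return a + (b - a) * t

def cube_state(near, far, t):
    """Position and 'heat' for one cube at blend factor t in [0, 1]."""
    position = tuple(lerp(n, f, t) for n, f in zip(near, far))
    # Assumption: color intensity is highest at the near point (t = 0)
    # and fades toward the far point (t = 1).
    heat = 1.0 - t
    return position, heat

position, heat = cube_state(near=(0.0, 0.0, 0.0), far=(10.0, 0.0, 0.0), t=0.25)
```

Because each cube depends only on its own two points and a shared blend factor, the calculation is independent per instance, which is why it maps well onto the GPU.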
2.7 million beveled cubes, all rushing in and out, calculating their position and their colors, and we're still sitting at 20 FPS. If I wasn't recording, this would be about 30, 35 FPS; you'll have to take my word for it. But it's absolutely ridiculous. There's no culling going on whatsoever: every single one of these, even at the back, is being rendered, and it's still at 20 FPS. It's just absolutely insane; it's such crazy performance.

Now you may be thinking, all right, this is cool for visual things, but what if I want to interact with these cubes? What am I going to do here? Well, that's what the next test is about. So if I go to eight, you'll see I've got all this spinning around, and I've got this little pusher here, and I can control it with my keyboard. The way this works is I'm still using GPU instancing indirect, but this shader here is actually a lot simpler. I'm sending in some mesh data with their base positions, and all I'm doing in this shader is setting their position according to this data here, and also their color. But I also have this compute shader here. This compute shader is receiving that same data per cube, but it also has this pusher position, which is this little sphere here, and this pusher position is being updated every single frame. So every single frame I'm sending the position of that object to the compute shader, and depending on how far away the cubes are from that pusher, I determine how far we need to push them away. And then I actually overwrite the data in this mesh data: I'll set a new matrix with the position and also record how much we've pushed it, so from zero to one, the amount that we've pushed. And then back on this shader, this shader doesn't even realize that this mesh data is changing. It's just performing the same simple routine, which is setting its position and color. And that is how I am able to actually manipulate these objects via a shader.

It is worth noting that both this and the previous one will only work on systems that support compute shaders. So pretty much every new mobile phone will work, old mobile phones may not work, and every computer should work. Just thought I would note that.

Now, this very last one is just to settle a little debate that takes place on every single freaking Unity Discord and YouTube video. Every time I cache the main camera in a video, you'll have someone in the comments saying, "Unity caches that for you, you don't need to do it." I assume they sound exactly like that. But I'm here to tell you that there is indeed a difference. Let me show you the code. So we've got three modes here: extern, cache, and property. If it's extern, we are making a million calls to Camera.main. If it's cache, we'll cache it once and then call that cache. And just to ensure that people don't come back and say, "Oh, it's because it's a property, and there's a bit of overhead with a property compared to a field," I'm also grabbing it from a property here.

So let's see how it goes. It's already on extern, so one million calls per frame, and we're getting about 43 FPS. If we go to cache, however, 1.2K FPS, so obviously a substantial difference. With property you're looking at 600. And by the way, this is just caching Camera.main; you get the same kind of results for any extern call, even Time.time or transform. You will get the same results across the board. Basically, avoiding extern calls wherever you can does actually make a difference. For everyone that says there is absolutely no difference: you are wrong. Feel free to test it yourself. Do it at a large scale like this, one million calls, and there is a difference.

Whether this difference matters is another story. I personally think it doesn't. Nobody is going to do one million extern calls per frame; they're going to do one or two, so actually caching it makes no discernible difference. The only reason I do it is because Rider yells at me, and it just feels better to do it. Like, I know that there is a performance benefit there, as negligible as it is, so I do it. So yeah, hopefully this was fun, hopefully you learned something. Like the video, subscribe, and I'll see you next time.
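The caching comparison above boils down to how many times a costly lookup is crossed. The Python sketch below illustrates that with a call counter instead of frame timings; it is not the benchmark from the video, and `Scene` and `find_main_camera` are hypothetical stand-ins for an engine and its extern call.

```python
class Scene:
    """Hypothetical stand-in for an engine with a costly lookup."""

    def __init__(self):
        self.lookups = 0
        self._camera = object()

    def find_main_camera(self):
        # Plays the role of the extern call (for example Camera.main).
        self.lookups += 1
        return self._camera

def uncached(scene, n):
    cam = None
    for _ in range(n):
        cam = scene.find_main_camera()  # pays the lookup on every iteration
    return cam

def cached(scene, n):
    cam = scene.find_main_camera()      # pay once, keep the reference
    for _ in range(n):
        _ = cam                         # reuse the stored reference
    return cam

extern_scene, cached_scene = Scene(), Scene()
uncached(extern_scene, 1000)
cached(cached_scene, 1000)
```

Counting calls rather than timing them keeps the sketch deterministic. As the speaker says, the difference only becomes visible at very large call counts.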

Info

Channel: Tarodev

Views: 39,161

Keywords: RenderMeshInstanced, DrawMeshInstanced, DrawMeshInstancedIndirect, RenderMeshIndirect, unity dots, ecs, burst, job system, gpu instancing, compute shader, unity shader, data-oriented design, benchmarking, millions of objects, framerate, performance

Id: 6mNj3M1il_c

Length: 14min 56sec (896 seconds)

Published: Sun Mar 26 2023