Destiny's Multithreaded Rendering Architecture

Captions
Hello, we're going to try to start on time. I have a lot of material, and hopefully you'll last through the whole talk. Welcome to Destiny's multithreaded rendering architecture talk. My name is Natalya Tatarchuk and I'm a graphics engineering architect at Bungie. Today's talk, I want to warn right away, is not about shader tricks; it is not about specific graphics techniques or other sexiness. We covered a fair amount of that in previous presentations at GDC and SIGGRAPH, and you can download a bunch of them from the Advances website. These slides will also be on the Advances website, so if you miss some diagrams or some information you can look it up there.

For this talk I'm going to assume that you have some familiarity with task-based parallelism concepts, so I'm not going to focus on job manager design, synchronization primitives, threading, etc. I also recommend that you check out Barry Genova's talk on the foundations of the Destiny threading model, because he covers a lot of the base concepts that I rely on in the design of our renderer.

Today I will cover the Destiny core rendering architecture. Specifically, I'm going to talk about how the engine data flows all the way from game state to the GPU, how we approached renderer jobification and all of the concerns for task- and data-parallel execution, including dynamic workload balancing, latency reduction techniques and jitter, and also how we kept the GPU fully saturated at all times; getting rid of GPU bubbles was a continuous topic throughout development. I'll also touch on architectural principles that helped us encapsulate some of the complexity of the multithreaded design and let a lot of graphics engineers focus on what they do best: writing graphics features. If there is one thing I want you to walk away from this talk with, it is that creating a data-driven, pipelined architecture gives you a lot of flexibility for creating graphics features as you go, and also helps you optimize performance for a variety of platforms and create cross-platform implementations of features.

The outline of this talk is as follows: we'll do a quick breakdown of coarse-grained parallelism, we'll talk about the goals we set out for the Destiny architecture, we'll dive deeply into the details of that architecture, and then draw conclusions about what we achieved.

I know most of you are familiar with this, but let's do a super quick overview of what happens in a frame of a game. Of course we run our simulation at the beginning of the game tick: AI, physics, animation, and any code that affects gameplay computations. Then we use the simulation data to determine what is visible in our view; this can be visibility computations or just a raw list of visible objects, for example in UI or in cinematics. We use these elements to generate GPU commands; this is the traditional CPU submit operation. The result of this is GPU command buffers that we flush to the GPU, which crunches on them to output the back buffer, which we flip to display to the player at the very next available vsync. That's pretty straightforward, and it maps pretty well to what's called the system-on-a-thread pipelined architecture. We used that approach for the Halo engine. In Halo: Reach we mapped major game functions to a small number of threads, which is what you see here: we had unique threads for rendering, audio and simulation, where each CPU thread was essentially mapped to a hardware thread.
A different way to view this execution is by looking at the frame tick diagram. Here you see that as we go through the game tick, simulation computes game tick data, and in the meantime we're rendering the previous frame; that's the run of the CPU submit that I mentioned. Once the simulation has finished, we copy the game state, the entire data set, to start rendering this frame during the next game tick. In Halo games, all output systems' processing began only after the simulation had completed its work for the current tick; we then copied the full game state and ran serialized visibility computation at the start of each render submit operation. Until visibility is done we can't start generating draw calls, so we can't feed the GPU.

One of the problems with system-on-a-thread is that you have suboptimal workload distribution. The heavy workhorses, for example the rendering and simulation threads, saw very heavy utilization consistently, but a lot of other threads saw a huge amount of idle time, which is of course bad. This approach also doesn't scale well to heterogeneous platforms such as PS3: different core layouts and additional hardware cores would not actually be supported by this model. We also pay for a full copy of the game state, which includes a ton of data we, the rendering folks, don't care about; I don't need the pathfinding data or the animation state machines. On memory-constrained platforms this was quite a bit of an issue. Serialized visibility computation also means that if we didn't pipeline it correctly we would see GPU bubbles in certain frames, and potentially longer latency. The plus side of this design is that it's darn simple to use, it is super extensible, and it's convenient to code for: with its serialized execution for rendering, it doesn't exhibit any complex concurrency problems, because the threading model is super straightforward. And we overlap simulation and rendering, which means we can do more work in each phase; when you have complex AI, physics and animation, your simulation can take up the full workload budget, and with heavy rendering workloads this pipelining is very important.

So with all of this in mind, what were the goals we set out for the Destiny rendering engine? The most important goal is, well, we want to ship the bloody game. We want to ship a fun game with great visuals and responsive gameplay, which of course translates to low latency. Destiny's worlds are complex, alive and beautiful; our players explore large destinations with diverse environments, lush vegetation and a lot of features, which means that our renderer needed high-quality lighting, dynamic time of day, real-time shadows, high-resolution rendering, weather elements such as rain and snow, tree rendering and many other features. To make this happen we implemented a variety of them, and here is a teaser in case you haven't seen Destiny.

With all this graphics sweetness we needed an engine with rock-solid performance. This was super important for all of our platforms, and so was consistency of that performance, so we needed an engine that would be scalable across several console generations, ideally lasting beyond what was currently available. We see quite a bit of difference in CPU design, as you see in some of the specs here, and GPU designs across console generations are even more varied, so this introduced a fair amount of challenges that our architecture had to cope with. At the same time, as I mentioned, Destiny is a fun,
fast-paced sandbox game with highly responsive gameplay, which means consistently low input latency is required; that's latency from the time the player presses a button to when they see the response on the screen. Any sandbox game can have a lot of variability in workloads: you can have frames with very little CPU simulation and very heavy CPU submit and GPU workload, as you see here, mixed with frames where the CPU and GPU might take a little break, and then during the big battle scenes you'll see both of them brought to their knees. Our engine and renderer have to keep up with these dynamic workloads, which required very efficient load balancing. To ship on last-gen consoles without sacrificing the gameplay meant that we had to squeeze every millisecond from every available execution unit. To do that, we moved all of the renderer and visibility workloads into jobs, and used dynamic load balancing and smart job batching to get the best occupancy from both CPU and GPU and to keep our latency low.

An equally important goal of the rendering architecture was to keep the API simple. All of this dynamic load balancing introduces a lot of complexity in the threading design, and we wanted, as I mentioned, to have graphics engineers write leaf features without worrying about all of that complexity. We had a ton of features to write for a lot of platforms, and we wanted to create them quickly and automatically jobify them. We also wanted to decouple game state traversal from rendering, which allows us to improve latency and helps make the rendering submission a data-driven, streamlined kernel processor. This data-driven rendering pipeline allows us to have render passes executed via jobs that operate on individual render elements arranged in cache-coherent arrays, rather than directly on game objects.

When all of this is put together, the execution of the renderer looks something like this. This is an actual job diagram from the shipping Cosmodrome level on Xbox One. We start by simulating game ticks, so you see a few simulation jobs; of course it's fairly challenging to jobify the simulation's secret-sauce workloads. Then we run our visibility workloads for a variety of views to determine what should be rendered in this frame. We use Umbra visibility in the Destiny engine, and a talk at last year's GDC covered a lot of the details of how we implemented visibility computation in our engine. Next we extract the current game state, only for visible elements, into a dynamic data structure we call the frame packet. We convert this data to GPU-friendly formats, ready to be slammed into GPU registers, during the render prepare phase. Finally, using this GPU-friendly data, we generate the GPU command buffers, in other words the draw calls, during render submit; this also goes wide to keep the latency low. It was important to have well-defined synchronization points, which are necessary to ensure that we have safe and well-defined concurrent data access across each phase of the execution.

Here's an example of job execution telemetry from PS3. Note the distribution of jobs on the PPU, and then we also have a number of jobs running on the SPUs. Mapping to the previous diagram, this is the set of jobs that runs the simulation; you see that it's very heavily occupying the PPU. We have our visibility workloads, which run solely on the SPUs. We extract game state, and during this phase we occupy every unit that we have available. We then run
render simulations and prepare GPU-friendly data, again going wide on any unit we can get our hands on, and then finally generate GPU commands to render this frame. So we had a pretty solid wall of jobs executing continuously throughout the entire frame. Today I'm only going to focus on this part of our rendering engine; I will not be able to give you all the details, especially for the advanced topics, but hopefully I'll give you a taste of what a pipelined, data-driven architecture is all about.

So how did we decouple game state traversal from rendering? We drive all the workloads by the results of visibility, to decouple game objects from rendering, and in order to do that we also need to decouple visibility objects from game objects, providing a very thin interface to allow them to communicate. This is what allows us not to traverse game objects for visibility computations. What we store in a game object is not necessarily what we need for rendering: as I said, we don't need AI components, we don't need the animation data. So we only cache the data that we actually need in the renderer. Typically this is static data such as mesh references, for example for static objects, their transforms, their material handles, etc. Most of this data does not change until the object is unloaded from the engine, so we can cache it in a cache-coherent representation in memory and ensure that access to it is read-only for the duration of its lifespan. For dynamically generated data we don't need the full object, so we can extract just a small slice of data every frame. We also take care to grab data only for what we'll actually render, in other words visible objects alone; if the object is not visible, don't bother extracting its data. This reduces the amount of data we need to double-buffer, and it also reduces the amount of data we need to actually copy, reducing the workloads. If we bring back the Halo engine diagram and modify it for this approach, you'll see that we have simulation, followed by visibility, followed by extract, which copies out much smaller per-frame data out of game state, which is then used for rendering this frame.

To decouple game systems and visibility, we use separate representations: game objects, visibility objects and render objects, with thin interfaces to allow them to communicate across the different phases of the engine. We start with a game object, just your general biped for example. Each component will map to a particular render object on the renderer side. When a new game object is added to the world, it registers itself with the renderer, which caches all the static data, as I mentioned, and also caches the game object component; this is the interface that we'll use to communicate during extract. The renderer then returns a handle to the render object back to the game object, which caches it locally in the component. Using the render object handle we then register with visibility, and visibility caches the render object handle; this is exactly how we're going to drive the visibility and extract pipeline. We actually store the visibility handle inside the game object component, which means that if we want to hide an element we simply unregister it from visibility and it stops rendering. The render object doesn't know anything about visibility at all; only visibility knows about the render object. So with the game object system, the visibility layer and the render layer decoupled, this is what allows us to drive the entire system from the results of visibility without having to access game objects.
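To make the registration handshake concrete, here is a minimal C++ sketch of the idea just described. All names and types are invented for illustration; this is not Bungie's actual API, just one way the "renderer caches static data plus a thin extract interface, hands back a handle, and visibility only knows about handles" pattern could look.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using RenderObjectHandle = std::uint32_t;

class IGameComponentExtract;            // thin interface used only during the extract phase

struct CachedRenderData {               // static data cached once at registration
    std::uint32_t meshId;
    std::uint32_t materialId;
    float         worldTransform[16];
};

class Renderer {
public:
    RenderObjectHandle registerObject(const CachedRenderData& staticData,
                                      IGameComponentExtract* extractInterface) {
        m_staticData.push_back(staticData);          // read-only for the object's lifetime
        m_extractInterfaces.push_back(extractInterface);
        return static_cast<RenderObjectHandle>(m_staticData.size() - 1);
    }
    const CachedRenderData& staticData(RenderObjectHandle h) const { return m_staticData[h]; }

private:
    std::vector<CachedRenderData>        m_staticData;
    std::vector<IGameComponentExtract*>  m_extractInterfaces;  // only touched during extract
};

class VisibilitySystem {
public:
    void registerObject(RenderObjectHandle h) { m_registered.push_back(h); }
    void unregisterObject(RenderObjectHandle h) {
        // Hiding an element is simply unregistering its handle; it stops rendering.
        m_registered.erase(std::remove(m_registered.begin(), m_registered.end(), h),
                           m_registered.end());
    }
    // The renderer is driven purely from lists of handles returned here.
    std::vector<RenderObjectHandle> queryVisible(/* frustum, occluders */) const {
        return m_registered;                          // placeholder: real code culls here
    }
private:
    std::vector<RenderObjectHandle> m_registered;
};
```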
Now, we store all of the dynamic data being extracted or converted to a GPU representation in a custom data structure called the frame packet ring buffer. This is a multithreaded, allocation-safe data structure that is fully stateless: we generate it every frame and we throw it away every frame. Each frame's worth of data is referred to as a frame packet. It has a fairly small footprint: we stored about one megabyte on last-gen consoles, which is only a tiny percentage of the total game state on those memory-constrained consoles. For comparison, for Reach we copied about ten megabytes every frame to mirror the game state.

Once we do that, how do we go about the actual jobification? The render pipeline operates on the concept of views. A view is defined by the usual frustum and camera parameters; this is pretty standard, and we have a number of views in our frame, for example the player view, shadow views, overhead map views, reflections, and so forth. A view in our system also defines an end-to-end render job chain, so the view becomes the jobification unit. If you look at a simplified form of how the architecture handles views: we start simulating the game tick, we then determine what views we might have for the current frame, and we generate the frame packet data structure for this frame. We figure out what elements are visible in each view, and we populate internal renderer data structures for that view for efficient iteration in subsequent jobs. We extract the game object data into the frame packet, and at that point we can tell simulation: hey, you're free to go and do the work for the next game tick. Then we prepare the GPU-friendly data representations; we also run render-specific simulations, cloth, particles, etc. Then we tell the core systems: hey, now we're ready to submit to the GPU. We run our high-level submit scripts, which generate draw calls in individual submit jobs for each view, and once we've completed all the draw call generation we fire the present signal, which tells the core system that we've completed this frame's submit.

We run a few global render system jobs that execute regardless of what data we have in the frame; these act as the main synchronization points in our system. However, for every single view that we have in a frame, we create a separate view job chain. These job chains are data-driven: if you don't have data, for example for a local light shadow view, we don't generate a view chain for that specific view. Here is an example of the job chain; each of these chains is bracketed by the global synchronization jobs. These synchronization jobs are used by the core renderer and the simulation system to control access patterns to the underlying data containers: your game objects, the visibility data and the render objects. The first job we execute is a global job for the frame, to compute which views we have; like I said, this is to figure out the first-person view, etc. This job also reserves the frame packet and the view packets within it, and here is an example of that representation. Note that for all memory management in the frame packet and the view packets within the view job chains, we don't perform any runtime allocation; in fact, in our engine we don't do any dynamic allocations at runtime, there are no heavyweight heaps. We simply allocate frame entries and render nodes within the lock-free frame ring buffer. Once the views are determined, this job sets up the job chains that execute all operations in each view.
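As an aside on that allocation model: here is a minimal sketch of what a per-frame, lock-free linear allocator for such a frame packet might look like. It is an assumption-laden illustration (the shipped allocator is certainly more sophisticated), but it shows the core trick: a single atomic bump offset, no frees during the frame, and a reset at frame boundaries.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

class FramePacketAllocator {
public:
    explicit FramePacketAllocator(std::size_t bytesPerFrame)
        : m_capacity(bytesPerFrame), m_memory(new std::uint8_t[bytesPerFrame]), m_offset(0) {}
    ~FramePacketAllocator() { delete[] m_memory; }

    // Called once per frame, after the GPU is done consuming the previous frame's packet.
    void beginFrame() { m_offset.store(0, std::memory_order_relaxed); }

    // Lock-free: any job on any thread can allocate; nothing is ever freed mid-frame.
    // (For simplicity this rounds the size up to a common alignment; a real allocator
    // would align the offset itself.)
    void* allocate(std::size_t size, std::size_t align = 16) {
        size = (size + align - 1) & ~(align - 1);
        std::size_t start = m_offset.fetch_add(size, std::memory_order_relaxed);
        assert(start + size <= m_capacity && "frame packet budget exceeded");
        return m_memory + start;
    }

    template <typename T> T* allocateArray(std::size_t count) {
        return static_cast<T*>(allocate(sizeof(T) * count, alignof(T)));
    }

private:
    std::size_t              m_capacity;
    std::uint8_t*            m_memory;
    std::atomic<std::size_t> m_offset;   // the only shared mutable state
};
```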
Next we begin the extract phase, to copy data out of the game state. Now that we have views defined, we run the visibility job for each view. This job operates on the visibility objects, the representation I showed you earlier, and returns a list of visible objects. Note that we can actually skip the Umbra computation and run a bypass job that simply gives us a list of visible elements. Since we only need the visible list until the next job we run, we store it in a temporary visibility ring buffer that only lives until extract is done. The next job we run for this view is the populate render nodes job. Essentially, this is the job that sets up the cache-coherent arrays that the entire pipeline will operate on throughout its execution; we call those elements render nodes. This is an efficient representation that is very small and very cache-friendly, and from that point onward that's what we're using. The populate render nodes job runs through the visible elements for that view, sorts them by render object type for coherency, and then reserves and populates the individual arrays that you see here. The visible nodes within each view are called view nodes. They store some view information: your bounding sphere, distance from camera, etc. This data is stored in that tiny render node for the view for cache coherency, since we run a lot of tight iteration loops on the render nodes during extract, prepare and submit. This may introduce a small amount of redundancy, because for example the bounding sphere is the same across a variety of views, but we keep it in the render nodes in order to have the data already in cache when we're iterating and computing different operations. The view node can also allocate data dynamically in the frame packet if it needs some very specific data for that frame. Each view node maps to a frame node. A frame node allows our system to share data across multiple views: for example, we can have multiple views that contain a skinned character, and we don't want to copy skinning matrices multiple times, so by putting them in a frame node we only do this work once and we have one copy of the data. We take this even further by allowing the same model to share per-object data across different game objects, and here is an example of this layout. The reason I'm telling you about all these nodes is that later we're going to map them to a small set of functions that the graphics engineers implement, and all of the coherency, all of the data access and jobification, falls out of the fact that they're accessing the right elements in the frame.

For example, when rendering this frame, a cinematic shot with the Raider character that you see here: if the Raider is present in the player view, then the view node holds our reference to the render object; this is the example pointing to that. Each frame node links back to that render object, the Raider in this case, which allows us to access the statically cached data.

So how do we extract data out of game state? For each view node in the view packet, we extract data out of game state by iterating coherently over the view nodes, which are sorted by render object type. We reach into the individual object types' extract entry points, and those jobs operate only on the individual view node and frame node, and they write results solely into the frame packet.
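A rough sketch of this node taxonomy follows. The field choices are illustrative guesses, not Destiny's actual layout; the point is the shape of the data: tiny, cache-friendly per-view nodes that deliberately duplicate a little data, each referring to a shared per-frame node, which in turn links back to the cached render object.

```cpp
#include <cstdint>

struct ViewNode {                      // one per visible object, per view
    std::uint32_t frameNodeIndex;      // shared per-frame data for this object
    std::uint32_t renderObjectHandle;
    float         boundingSphere[4];   // duplicated across views on purpose, for cache coherency
    float         distanceFromCamera;
    std::uint32_t perViewDataOffset;   // optional dynamic allocation in the frame packet
};

struct FrameNode {                     // one per visible object, per frame (shared by all views)
    std::uint32_t renderObjectHandle;  // links back to the cached static render object
    std::uint32_t perFrameDataOffset;  // e.g. skinning transforms, extracted exactly once
    std::uint16_t featureType;         // which feature renderer owns this node
    std::uint16_t lodBucket;
};
```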
We also allow reaching into the game object during extract, using the handle cached in the render object. Coming back to our friend the Raider: we extract and store its per-view attributes in the view node, and we also extract its skinning transforms in world space and any particular parameters about its state and store them in a frame node. We get this data by reaching out to the game object, so here's how we walk over to the game object. While all this sounds kind of basic, like I was saying, this is simply the core background for how we structure the individual jobs in order to support safe threaded access. The extract phase is the only time we allow crossing the wall between the object system and the renderer, which in our world looks kind of like this: it's a tiny little hole, and we don't allow any other access through it. During extract we generate jobs that go wide across this phase; here is an example of multiple extract jobs for the view, and the jobs are generated with smart batching for different object types. Once the extract jobs are finished, we tell the simulation: we're done, go start simulating the next game tick.

It's really important to mention latency at this point. As you noticed, we can't start the next game tick until we finish extract, so we need to keep this phase to the minimum in order to reduce latency. Here's an example: during the extract phase, all of the output systems, and this is not just rendering, this is networking, UI, audio, are all reading game state. Simulation is paused while they're extracting data, which means game state is not modified; it's read-only during that phase. So we have to run all of these systems' extract as quickly as possible in order to unblock the simulation. During our development this was actually our most latency-critical phase. The extract window for rendering consists of visibility computations for all views plus the actual raw copy-out of data. Since visibility is CPU work, what can we do to move it out of extract? If you look at visibility, most of the computational workload is in static visibility for the main player view; this is typically where we see the most workload, and it's also the most expensive view, where we want the most complete and accurate visibility computations; shadow views, for example, are a little bit lighter. So let's move all the static environment visibility out of the extract window. If we stagger static visibility with simulation, we can achieve that. We start by polling the controller during simulation; it's important to do that as late as possible so you reduce your input latency. As soon as we have the controller information, we can figure out what views we have in the frame, and once we have that, we run static visibility computations predictively for the player view using that information. The static environment doesn't get affected by simulation, it's static by definition, so we can run this computation staggered with simulation at the same time, and that's this example of the player view. Once we've computed the predictive static visibility, we run dynamic visibility. This includes dynamic objects, my bipeds, my AI combatants, my players, but it also includes visibility for dynamic views: maybe we have a light for which we want a local light shadow view, and we're only going to do that if we know that light is actually visible in the main player view. Moving this computation out of extract
saved us about 3 to 4 milliseconds on last-gen consoles and about 2 to 3 milliseconds of latency on current-gen, so it's a very worthwhile effort. Next, we also slim down extract by only focusing on taking data out of game state and ignoring everything else during that phase: copy out the minimal amount and don't do any complex transformations, to unblock the simulation.

We then need to convert this data into a GPU-friendly format, and we do this during the prepare phase. We do this computation only for visible elements, because we're running all of this after visibility has run, so we avoid any extraneous work for elements that won't be seen. Another optimization is to use LOD to skip computations that won't be perceivable by the player; for example, I don't want to simulate finger bones for somebody far in the background, because we probably won't notice them. Here is a video where we can look at some of the computations we do in prepare. We convert the extracted data to GPU representations: for example, we take the local node transforms for our skinning bones and convert them to world-space dual quaternion data that's ready to be shoved into GPU registers. We also run any simulations that don't affect game state; cloth is an example of that, and particles. We don't want to run those simulations for elements that won't be visible. Another thing that was really important for Bungie was keeping very predictable budgets for the execution of different workloads: for example, cloth had to simulate in two milliseconds consistently, and even on the GPU we maintain a consistent one millisecond for skinned data. So during prepare we designed our systems to bucket elements into different LOD levels, and this is also where we do all of the bucketing and sorting work for all the characters that you see here; all of that workload happens during prepare. Once you do that, you are basically preparing the render-ready data while the simulation is running for the next game tick, which gives us much better pipelining. So now, while the simulation is running in parallel, we're executing the prepare phase. Similar to extract, prepare also operates by iterating through the view nodes; this is again on the core render architecture side. It uses the frame node corresponding to each view node, it uses the cached render object data, and then it executes the prepare entry points for different objects. Prepare also writes its results into the frame packet. There is an important difference from extract, though: it can no longer access game state; remember, simulation is ticking. Bringing back our friend the Raider: during prepare we fetch its view node and we run cloth simulation for its tattered cape, and we compute all of the non-deterministic computations for the animation of its bone pose if the Raider is really close to our view; if he's far away, we skip all that work. We also bucket him based on his vert count into a skinning LOD bucket; remember that we want predictable skinning GPU performance. All of that is done in prepare jobs, which run wide across different render object types. Once we've completed all of the prepare work, we publish this data, ready to be submitted and turned into GPU draw calls.
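To illustrate the budget-driven bucketing described above, here is a hypothetical sketch of the kind of work a prepare pass over a view's skinned nodes might do: sort by importance and hand out LOD buckets until a fixed per-frame budget is spent. The cost proxy (vert count), the class names and the cost ratios are all invented for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

enum class SkinLod : std::uint8_t { FullSimulation, Reduced, GpuSkinnedOnly };

struct PrepareNode {
    float         distanceFromCamera;
    std::uint32_t vertCount;   // stand-in cost proxy for the per-frame skinning budget
    SkinLod       lod;
};

void bucketSkinnedNodes(std::vector<PrepareNode>& nodes, std::uint32_t vertBudget) {
    // Closest objects get the most expensive treatment.
    std::sort(nodes.begin(), nodes.end(),
              [](const PrepareNode& a, const PrepareNode& b) {
                  return a.distanceFromCamera < b.distanceFromCamera;
              });
    std::uint32_t spent = 0;
    for (PrepareNode& n : nodes) {
        if (spent + n.vertCount <= vertBudget) {
            n.lod = SkinLod::FullSimulation;            // finger bones, cloth solve, etc.
            spent += n.vertCount;
        } else if (spent + n.vertCount / 4 <= vertBudget) {
            n.lod = SkinLod::Reduced;                   // cheaper rig; assume ~1/4 the cost
            spent += n.vertCount / 4;
        } else {
            n.lod = SkinLod::GpuSkinnedOnly;            // over budget / far away: no CPU-side work
        }
    }
}
```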
The reason we need that explicit publish synchronization point is that we might have multiple jobs generating data that we need to use across different views, so we need to make sure all of that data is complete before we can start submitting it; that's the synchronization job that runs here. Next we execute jobs to generate GPU commands; I'll get back to that in a second, because it's a fairly complex topic. When all the submit jobs are done, we signal to the GPU that we've finished, and this is the call to present. This synchronization point indicates that we're ready to flip the moment the GPU has finished its workloads, and, again for those memory-constrained platforms, it's also the sign that we're now done with the GPU command ring buffer and can start overwriting it with the next frame's worth of GPU commands. Of course we can repeat this jobification process for all of the view chains; here's an example of expanding into three views, and again the underlying system does it automatically.

With all this complex job chaining, we need to make sure that writing the leaf features is actually easy. This is the bread and butter of graphics code, what the majority of our graphics team was working on; every one of us wrote features, myself included, so "create the dog food, consume the dog food" was a very important architectural principle. We didn't want to change job dependencies, occupancy or synchronization points any time we added a new feature, or any time we added a new platform for that matter, and we really didn't want to muck with the leaf features either. So the jobification had to be transparent and automatic, and the way we've done it is via the feature renderer abstraction. So let's talk about a frame: we have many different types of objects in it. For example, in this shot we have static environment instances, we have gear objects (this is the customization system), we have terrain, we have trees and the decorator system that populates our environment, and then we have skydome objects, etc. All of these elements build up the actual frame, and we want to express this taxonomy in the core render architecture by using the render feature encapsulation. You can think of it very simply: the same data representation plus the same code path equals a render feature. For example, most skinned elements will need skinning transforms; they're going to need to convert them, shove them into GPU registers in a very specific format, and iterate over mesh parts: great, this is one feature. A particle system has a very different data representation and very different logic for draw call generation: that's our second feature. Each feature is implemented with what's called a feature renderer interface. This is the main interface for implementing graphics features in Destiny. It defines how we extract the dynamic data, how we convert it into a GPU-friendly representation, and what simulation we might run on it; it also defines the actual data format we store in the frame packet, all of the nodes that I mentioned to you, and it also defines, of course, the code for how we render the bloody thing. It exposes a number of entry points, simply think functions with very explicit inputs and very explicit output structures, which provides the data encapsulation that we need for safe multithreaded access. A game object can register with multiple features.
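Below is a condensed, hypothetical rendition of what such a feature renderer interface could look like: a small set of per-phase entry points with explicit inputs and outputs, so the core architecture can jobify them without the feature author ever touching threading code. The names are illustrative, and the real interface exposes more entry points (per-view and per-frame finalize hooks, block begin/end state setup, and so on, as described below).

```cpp
struct RenderObject;      // cached static data (registered once)
struct GameObjectView;    // thin, read-only window into game state (extract phase only)
struct FrameNode;         // per-frame dynamic data in the frame packet
struct ViewNode;          // per-view dynamic data in the frame packet
struct CommandBuffer;     // platform command list being built by a submit job

class IFeatureRenderer {
public:
    virtual ~IFeatureRenderer() = default;

    // Extract: the only phase allowed to read game state; writes only to the frame packet.
    virtual void extractPerNode(const RenderObject& object, const GameObjectView& game,
                                FrameNode& outFrame, ViewNode& outView) = 0;

    // Prepare: convert to GPU-friendly data, run render-only simulations (cloth, particles).
    // Game state is off limits here -- simulation is already ticking the next frame.
    virtual void preparePerNode(const RenderObject& object,
                                FrameNode& frame, ViewNode& view) = 0;

    // Submit: generate GPU commands for one node in one render stage.
    virtual void submitPerNode(const RenderObject& object, const FrameNode& frame,
                               const ViewNode& view, CommandBuffer& commands) = 0;
};
```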
The features a game object needs to register with can be determined statically, or, if we add a dynamic component, we can register dynamically with the appropriate feature. So if we look at this Hunter lady: we register with the simulated cloth feature renderer, we register with the skinned feature renderer, and we register with the customizable gear renderer. We provide an interface for those entry points, as I've mentioned, with very strict rules for what data they're allowed to read and write. This encapsulation is what allows us to ensure safe access: they can only touch render node data, the view and frame nodes, and the cached static data. This was done to automatically ensure that we don't overwrite any data, because again these entry points run in parallel. Since you only write output to the frame packet, the feature writers automatically get double buffering and synchronization and don't have to worry about it themselves. The core architecture generates jobs for every phase, extract, prepare and submit, and this jobification is done automatically. The reason this is so important is that throughout the development of our game we continuously added and removed features, we added platforms fairly late in development, and we also had to go back and rejigger the load balancing for existing platforms, but none of the leaf feature code was affected, which was awesome.

Here is the set of entry points our feature renderer interface provides; if you can't read all the text, don't worry about it, you can look up the slides later, it's just to give you an idea of the level of detail. These entry points map to each phase: extract, prepare and submit. For extract and prepare we have entry points that operate only on individual render nodes, and these are the bread-and-butter extract and prepare entry points: essentially one render node at a time for one specific render object, in this case a simulated cloth. There were cases, however, when you had to do operations on all of the visible elements of the same feature for a given view, such as the LOD bucketing, let's say for cloth or for skinned elements, so we added entry points that operate on individual view packets to allow that. Then of course you also want to do global operations: look at all the visible elements across all the views for a specific feature and operate on that data to do more frame-global work. All of these run as individual jobs, concurrently, at once. So how do we deal with the concurrency issues at the feature renderer level? We split the operations out by frequency of access: per view, per frame and per object. This is what allowed us to share data across different render objects, to save memory in the frame packet; like I said, we only need one copy of the skinning matrices, and we also allow sharing across objects. It also allowed us to save performance, since we only want to compute the complex data once, at the frequency at which we use it. The core architecture uses sync primitives to ensure safe access. When feature renderers fill out the function called "extract per frame", they don't need to worry about anything related to the jobification; they don't need to worry that that job might execute, say, for shadow view number one concurrently with the player view and maybe shadow view number two. It is important, however, to use a very performant synchronization method, because as you remember we're operating on a ton of visible elements at once.
We can't afford any heavy locking scheme for the high-frequency per-frame operations, and in our case this frequency is controlled by the individual render nodes; we can have tens and tens of jobs vying for the same element. Here is an early-in-development example of using simple locks: the wall of red is your locks, and that is how much performance you waste by doing that synchronization in a very non-performant way. This is of course not what we shipped with. Instead we developed custom lockless synchronization primitives for the render nodes: we use an interlocked bit vector, and we hash the render node to a key during synchronization; the key varies by frequency of access. One word of caution: although lockless primitives are superb for performance once you get them right, getting them right gives you Schroedinger bugs. This is your "oh no, I can't catch this", because the moment you start debugging or add a printf, the bug disappears. Timing-related bugs due to lockless primitives can be quite interesting to debug and get correct. In our case, for the extract and prepare per-frame operations that do expensive computations, we run the work only once if that element is in any of the views; for the lockless primitives we hash on the per-frame node, which is unique for each object registered in the renderer. For example, in the case of the Hunter that I was showing you, we wanted to do a lot of those computations once, even though we had a number of different features for that game object: again skinning, cloth simulation, but we also did dynamic AO for the object, we were computing forward light probes, and, like I was saying, we were running a bunch of animation jobs for visible elements. In that case we computed that work using lockless sync primitives keyed on the skeleton hash, since that skeleton was unique for that specific game object.
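Here is a small sketch of the "do this expensive work at most once per frame" idea built on an interlocked bit vector keyed by a hash, as described above. This is an illustration of the concept, not Bungie's primitive; in particular, a real implementation has to deal with hash collisions (two keys landing on the same bit), which this sketch ignores.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

class OncePerFrameBitVector {
public:
    explicit OncePerFrameBitVector(std::size_t slots) : m_bits((slots + 63) / 64) {}

    void resetForNewFrame() {
        for (auto& word : m_bits) word.store(0, std::memory_order_relaxed);
    }

    // Returns true for exactly one caller per key per frame; everyone else skips the work.
    bool tryClaim(std::uint64_t key) {
        std::uint64_t slot = key % (m_bits.size() * 64);
        std::uint64_t mask = 1ull << (slot & 63);
        std::uint64_t prev = m_bits[slot >> 6].fetch_or(mask, std::memory_order_acq_rel);
        return (prev & mask) == 0;
    }

private:
    std::vector<std::atomic<std::uint64_t>> m_bits;
};

// Hypothetical usage inside a per-frame extract/prepare job, where skeletonHash identifies
// the shared data (e.g. the skeleton of a character registered with several features):
//   if (onceGuard.tryClaim(skeletonHash)) { computeSkinningTransformsOnce(); }
```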
So let's look at a couple of feature examples. A simple one is a static projective decal; here is one, you can kind of see it appearing in the image. This is very simple: they're static, they never change, they don't need to extract anything, they don't need to prepare anything, so we simply render them using the submit-node entry point. As an optimization, because you know you're going to render a bunch of decals at once, we can also set up all the render states in the block-begin and block-end entry points. The next level up is the skinned feature renderer. Of course this one needs to extract a bunch of data from game state, my transforms, my object properties, in the extract phase. In prepare, like we were mentioning, we do all the per-object computations for bones based on the object LOD, we generate the GPU register data, dual quaternions, etc., the dynamic AO, the light probes. Both extract and prepare use render node entry points, and they rely only on per-object data in their computation, to save on representation and work, and of course we render the skinned element using the submit-node entry point. One of the most complex features is the cloth feature renderer. This feature renderer ended up implementing almost every single entry point we have in the engine. It iterates over all of the possible cloths at the start of the frame to reset its simulation state; it then extracts the skinning data, object transforms and collider transforms for the cloth elements in the extract-per-node entry points. In the finalize entry point for the entire frame, we run the LOD bucketing to constrain the cloth simulation elements into high, low and GPU-skinned buckets, because again, remember, we only want to do two milliseconds' worth of simulation. Then in the prepare entry points we update the cloth colliders, we gather the static and dynamic environment shapes and dynamic objects to collide the cloth elements with, we launch and run the Havok cloth solver jobs, we synchronize on the final cloth world updates for the whole frame, we update the cloth vertex buffers, and so forth. So you can see that a feature renderer can map to a very complex execution pattern, and of course to render the individual cloth elements we run the submit entry point.

At this point we have a job pipeline that submits general views. This is pretty far from the real workload, of course, because we don't render everything with the same shader into the same set of render targets. What we really want is to selectively pick which objects we're going to render, using the right shader techniques, when we want to render a given pass, and we want to make this process automatic and super extensible. So this is the CPU submit: the point of CPU submit is to generate GPU commands for the frame. The first workload we run is the player view submit script; I call it a script, but it's not actually executing Lua or any specific scripting language. Before we break down how that works, let's take a quick look at how the GPU frame is structured. If we look at a breakdown of the frame, you'll see that we build it from a number of passes; here are the passes on the right. Note that these passes are actually different depending on what data is visible in the frame. We typically start by computing GPU updates for all of the particles and the GPU-driven elements, we render the atmosphere pass, we do the G-buffer passes, this is the depth prepass, the G-buffer opaque pass, the decals, we do shadows, generating the cascade and local shadow views and applying them, we do the lighting pass, the additive decals, the transparents pass, we render geometry to the light shaft occlusion buffers for the sweet, sweet god rays, and then we punch everything up by rendering lens flares, and finally we have the complete frame and that's our final composite.

So what does it take to submit this? The high-level submission is fairly branchy, complex code. We need to determine, based on game state: are we in first-person mode or third-person mode, do we have shadows, do we have cinematic high-quality subsurface scattering or special post-processing passes? The script executes high-level rendering directives. It's important to say that it interleaves global state, this is your render target binds, clears, expensive state for aliasing, decompressing and resolves, and those go directly into the main GPU command ring buffer; we also store the global state for per-frame registers, views, etc. in the global ring buffer, this is the direct insertion. At the same time, every time I want to issue one of these passes that you see on the left here, we issue a submit view directive. This is a call from the high-level script to the core architecture that is designed to generate GPU commands for that view. From the perspective of the people who write the high-level script, it's cross-platform, it is not jobified, and of course it's very simple to create; it's literally just a couple of function calls.
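To give a flavor of what such a submit "script" might look like, here is a sketch in plain C++ under the assumptions described above: branchy, single-threaded-looking code that interleaves direct global-state insertion into the main command ring with per-stage view directives that the core architecture expands into submit jobs behind the scenes. Every function and type here is a hypothetical stand-in, not Destiny's actual API.

```cpp
#include <vector>

enum class RenderStage { DepthPrepass, GBuffer, ShadowGenerate, Lighting, Decals, Transparents };
struct View {};
struct CommandRing {};
struct FrameContext {
    View              playerView;
    std::vector<View> shadowViews;
    CommandRing       mainRing;
    bool              shadowsEnabled = true;
};

// Stand-ins for the two kinds of work the script does:
void insertGlobalState(CommandRing&, const char* /*what*/) {}   // direct ring insertion
void submitViewForStage(const View&, RenderStage) {}            // expanded into submit jobs

void submitPlayerViewScript(FrameContext& frame) {
    insertGlobalState(frame.mainRing, "bind G-buffer targets / clear");
    submitViewForStage(frame.playerView, RenderStage::DepthPrepass);
    submitViewForStage(frame.playerView, RenderStage::GBuffer);
    submitViewForStage(frame.playerView, RenderStage::Decals);

    if (frame.shadowsEnabled) {
        for (const View& shadow : frame.shadowViews) {
            insertGlobalState(frame.mainRing, "bind shadow target");
            submitViewForStage(shadow, RenderStage::ShadowGenerate);
        }
    }

    insertGlobalState(frame.mainRing, "bind lighting targets");
    submitViewForStage(frame.playerView, RenderStage::Lighting);
    submitViewForStage(frame.playerView, RenderStage::Transparents);
    insertGlobalState(frame.mainRing, "post-processing: light shafts, lens flares, composite");
}
```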
This means that if we need to adjust the job granularity, the script doesn't need to change. Under the hood, the core render architecture converts each directive into a set of submit view jobs, and the rules for converting to jobs vary per platform; they are tuned to give optimal GPU occupancy and keep the game latency low, and I'll talk about that in a few minutes. If we come back to our previous design for the pipeline execution: where does the GPU fit into this picture? What we do is thread the rendering in order to start feeding the GPU right away, to have the GPU crunch on the data as quickly as possible, to get a frame to flip at the very next vblank that we're able to hit. Ultimately, what we want to do is figure out how to fire up the submit view jobs, using the feature renderer submit-node entry points I mentioned, from the high-level submit pipeline. What we need is to distinguish which visible elements from the frame packet need to be submitted to, let's say, a decal stage or a transparents stage from this high-level script view. As I mentioned, the jobification of the high-level script is transparent, and the script is organized in phases, where each phase is a sequential combination of required passes, your shading passes, tone mapping, resolves, etc., and data-driven dynamic passes; these are the render stage directives. Render stages are essentially a filtering mechanism for submits: at their core they are about finding the right shader to submit for that phase, and finding the right mesh set to submit for that phase using that shader. So how do we go about it? Each object can subscribe dynamically or statically to a render stage, either at registration or at any later point. For example, here we activate our super: at that moment the object registers with the transparents render stage for that character, and we start rendering that element to the transparent effects pass dynamically. To filter the list of visible elements in each stage, we use the view's subscription to the render stage; individual views will say, hey, I support this stage, or I do not. For example, a shadow view only supports the shadow generation render stage, which means it will only select the shadow generation shaders and it will only render shadow casters. A player view supports a variety of render stages, so visible elements will render to G-buffer, decals, transparency, etcetera. When we execute those high-level directives I mentioned earlier, the test we do at high-level job creation is: does this view subscribe to the render stage, and do we have any visible elements that also subscribe to the same render stage? So how do we get that information about the elements? We have a custom shader language that we developed for Destiny called TFX, for Tiger Effects, and we have each technique specify what render stage it belongs to. Once this is specified, we can compute offline, for every single mesh that has an assigned shader, containers for each render stage; for example, meshes one, two and three will render to the transparents stage, and meshes five, six and eight will render to the G-buffer. This gives us constant-time access at runtime and a very small data representation, which means super cache-coherent execution in the submit jobs. Once visibility returns the visible list, we go and populate the set of render stage nodes.
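One way to picture the subscription test described above is as a pair of bitmask checks plus an offline-built per-stage mesh container. The layout below is an invented illustration of that idea (the real data comes out of the TFX compile), not the shipped format.

```cpp
#include <cstdint>
#include <vector>

enum : std::uint32_t {
    kStageShadowGenerate = 1u << 0,
    kStageGBuffer        = 1u << 1,
    kStageDecals         = 1u << 2,
    kStageTransparents   = 1u << 3,
};

struct RenderObjectStages {
    std::uint32_t stageMask;                         // built offline from the shader techniques
    std::vector<std::uint32_t> meshesForStage[4];    // mesh indices per stage (indexed by stage slot)
};

// A shadow view's mask would contain only kStageShadowGenerate; a player view's mask
// would contain G-buffer, decals, transparents, and so on.
inline bool viewNeedsStage(std::uint32_t viewStageMask, std::uint32_t stage) {
    return (viewStageMask & stage) != 0;
}

inline bool objectRendersInStage(const RenderObjectStages& obj, std::uint32_t stage) {
    return (obj.stageMask & stage) != 0;             // constant-time test at runtime
}
```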
What we do, as I mentioned, is run the populate render nodes job, and on top of the view nodes that I mentioned we also use the render stage subscription data to create blocks for each individual submit stage. This is what the submit jobs will iterate on, super cache-coherent and quick to do, and again this is selected for the specific view that needs it. These submit nodes are also how we run all the sorting. Remember, transparents need to render in a very specific order, back to front: great, we just sort the submit nodes and we're good to go. If we want to sort the G-buffer submission per feature for coherency: great, we create a submit sort metric and use it to control the order per render stage. Any object can register as many submit nodes as it wants, essentially up to the limits of memory. So if you had an object with, say, five transparent elements and you wanted them sorted individually per mesh part: create five submit nodes and you're good to go, you're now sorting and submitting each individual mesh part in the right order. This was super useful; our artists loved it for creating higher quality transparency. The sorts themselves are super fast because they operate on tiny elements in cache-coherent submit blocks, and here is an example of the byte sizes for the individual render nodes: as you see, we kept them really lean and mean, and that gives us really fast, cache-coherent core render workload execution for our sorts, our allocations, and all of the traversal for extract, prepare and submit. We do duplicate some of the data, as I mentioned, across the view, frame and submit nodes, so that the data is already in cache when we're iterating on them.

Now, when we say at the high level that we're going to render the first cascade shadow view, the core system takes that function, which is literally "submit cascade view for render stage", and generates submit jobs for the cascade shadow view job chain using the view packet's render stage submit node block. Internally it grabs the appropriate submit node block for that view, here's an example, and then we execute the appropriate render objects using the view node, the frame node and the cached static data; and, as I mentioned, we don't allow access to game state because simulation is running concurrently. The core system will actually split the high-level directives into separate jobs: for example, we'll group sets of nodes, here you see a color-coded example of the batching, one set of, say, two submit nodes goes into one job, three go into a separate job, within a render stage and across render stages, and we'll talk about that batching in a second. Feature submits are already grouped by code path similarity and by data representation similarity; we can sort the submit blocks by feature type if the render stage allows it, and we're only executing using local state, using stage-specific kernels. I realized this fairly late in the architecture design, but this looks very similar to a GPU architecture; I guess I came from the GPU hardware group at AMD, so I must have been influenced a little bit by that. All of this allows us to run coherent execution in stages, and in some ways we have a lot to thank the SPU for, because being able to fit into those precious 256 kilobytes is what forced us to go down this path, but it maps very well to data-parallel processing.
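A sketch of what one of these submit-job "kernels" might boil down to follows: walk one cache-coherent block of tiny submit nodes for a single view and render stage, sort them by a stage-specific metric (back-to-front for transparents, feature/state coherency for the G-buffer), and dispatch into the owning feature's submit entry point. Types and names are illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct SubmitNode {               // kept as small as possible -- these are iterated in bulk
    std::uint32_t viewNodeIndex;
    std::uint16_t featureType;
    std::uint16_t shaderTechnique;
    float         sortKey;        // distance for transparents, a state hash for G-buffer, ...
};

struct CommandBuffer {};
void submitNodeForFeature(const SubmitNode&, CommandBuffer&) {}  // dispatch into the feature renderer

void runSubmitJob(std::vector<SubmitNode>& block, CommandBuffer& commands, bool backToFront) {
    std::sort(block.begin(), block.end(), [&](const SubmitNode& a, const SubmitNode& b) {
        return backToFront ? a.sortKey > b.sortKey : a.sortKey < b.sortKey;
    });
    for (const SubmitNode& node : block)
        submitNodeForFeature(node, commands);    // streamlined, cache-coherent kernel loop
}
```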
Our submit jobs just became fully streamlined kernel processors. When you put all of this together, across all views, all render stages and all render features, you come back to that jobified renderer pipeline. Now that we have the core render architecture established, we can start adding bells and whistles to it. The nice part about getting to this point is that from now onward none of the leaf features are touched, and you can reconfigure whatever you want in the core. Through our Destiny ship cycle we tweaked, pruned, bashed and otherwise massaged job dependencies, synchronization, occupancy rules and the distribution across CPU and GPU continuously, to make sure we had super low latency and fast execution, and we didn't have to rewrite any of the features. I'm personally very happy about that, since I own a bunch of leaf features and I'm lazy.

So now that we have this, what other cool stuff can we do with our architecture to make it even more effective? Reducing render jitter and rendering latency comes to mind. We had to implement a number of these techniques; unfortunately, due to time, I won't be able to dig super deeply into this, but I added some slides with more details in the published version. To keep the GPU maximally occupied and our game latency minimal, we multithreaded our submit, as I mentioned earlier. We also implemented custom CPU and GPU dynamic load balancing to flush out the GPU bubbles. Efficient command buffer flushing was crucial: once you've generated a command buffer, you need to get it to the GPU as soon as possible. We also used asynchronous swapping on platforms that supported it (make sure not to block your CPU if you don't have to), and we implemented a jitter reduction technique that kept a constant present interval but recovered latency, when it drifted too far, by reducing that present interval. With all this happening under the hood, we were able to do whatever we needed.

GPU occupancy is a very interesting subject, and like I said this will be somewhat high level. We need to keep the following in mind about GPU occupancy: when did we generate the GPU command buffer on the CPU (if we take too long, we obviously don't start the work in time), and when did the GPU finish working on the previous frame. To help answer these questions, we implemented custom tracking of GPU idle time; there are a number of counters, all included in the published presentation, that your engine can query to get information about the idle blocks on the GPU. These exposed counters are a little tweaky and not super reliable on all platforms, but I still highly recommend you use them: if you track this throughout your performance runs and you start seeing big GPU idle reports, you know you're starving your GPU and you need to address it.

Here's a timing capture from the Xbox profiler using a development mode. In this case we start CPU submission here and we start executing, let's say, a serialized submit; in our case that actually meant a few jobs, but every one of these jobs is using a single-threaded device. You can see that we start processing these GPU command buffers fairly early, we're able to feed the GPU quickly here, but we're not done generating the GPU commands until all the way over here; this took a fairly large window, and you right away know that the serialized submit takes a very long time to execute on the CPU.
It's also bad for game latency, because we can't unblock the submit loop for the next frame, and this is where we also call present. You can see that the GPU starts working on the submitted draw calls here, and you can see that there's a ton of bubbles in the GPU timeline, this is the GPU execution. Of course this timeline means you're starving your GPU, you're wasting milliseconds, and it's negatively affecting the rendering latency. Obviously we can't ship with that mess, so what do we do to keep the GPU occupied? It's a fairly complex subject, but at the very least you want to multithread your submit; you want to make sure to flush your command buffers as soon as you've got them, which can be quite tricky on some platforms where you essentially need to run a watchdog thread that, whenever a GPU command buffer is ready, flushes it down to the GPU; you also need to muck with your submit jobs, and I'll talk about that in a second, to make sure they're executing a proper tandem of CPU and GPU workloads; and lastly, vblank synchronization is very important: you want to set up your computations based on vblank. In order to get the GPU work complete, we want to make sure that once that work is complete we're ready to flip it and display it to the player. The good news is that a lot of consoles allow you to track vblank events, so you know exactly when they occurred, which is wonderful. Other platforms, and PC in particular, do not allow that, which is terrible, and they require manual work; for example, we implemented a separate thread that sits and waits for the vblank event. On PC, watch out for a deadlock whenever you resize a window or do a full-screen change, because both the present thread and the vblank wait thread are going to want to change your window; that little tip can save you a week.

So how do we get our GPU occupancy in the optimal way? The quicker we can get commands to the GPU, the better off we are. In this case we start executing a multithreaded submit using deferred command lists or command buffers, and as soon as we're done we flush these commands along to the GPU. Now the game latency is reduced, look how much quicker we're able to complete the work here, and we present the frame a lot faster. We might even run so quick that we actually start the GPU a little late, but it's a solid wall of work; if you notice, here we have zero bubbles in this execution, and that's exactly what we want: the GPU is continuously working and it's not resting until we're done.

So how do we go about creating this wall of solid work? The important thing about multithreaded submit is not just, hey, you need to generate a bunch of command buffers; you need to submit them to the GPU in a particular order. You can't flush the transparents jobs to the GPU before your G-buffer jobs have completed, that's going to generate visual errors, so we need very explicit synchronization for the flushing of the command buffers. We allocate the command buffers from the high-level submit code, we execute the submit jobs, and we flush them to the GPU in the same order that we allocated them; because the high-level submit code runs in the order the GPU expects, we ensure that the GPU processes them in order. Remember that high-level script I mentioned, where we do a render-view-per-stage for each feature renderer? What we actually do for each of these directives is map the request to a set of asynchronous command buffers, or deferred command lists as some platforms call them.
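The ordering constraint just described can be expressed, under some simplifying assumptions, as a slot array in allocation order plus a flush cursor that only advances over finished slots. The sketch below is one way to do it, not the shipped mechanism; it also assumes slots are allocated by the single-threaded script before the submit jobs are kicked, so the slot vector never grows concurrently with flushing.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <vector>

struct DeferredCommandList {
    std::atomic<bool> recorded{false};
    // ... platform command list handle would live here
};

class OrderedFlusher {
public:
    // Called from the high-level submit script, in GPU-expected order, before jobs run.
    std::size_t allocateSlot() {
        m_slots.push_back(std::make_unique<DeferredCommandList>());
        return m_slots.size() - 1;
    }

    // Called by any submit job, in any order, when its command list is fully recorded.
    void markRecorded(std::size_t slot) {
        m_slots[slot]->recorded.store(true, std::memory_order_release);
    }

    // Called by the watchdog/flush thread: pushes finished lists to the GPU strictly in
    // allocation order, never skipping ahead of an unfinished one.
    void flushReady() {
        while (m_cursor < m_slots.size() &&
               m_slots[m_cursor]->recorded.load(std::memory_order_acquire)) {
            kickToGpu(*m_slots[m_cursor]);   // hypothetical: insert into the main GPU ring
            ++m_cursor;
        }
    }

private:
    static void kickToGpu(DeferredCommandList&) { /* platform-specific flush */ }
    std::vector<std::unique_ptr<DeferredCommandList>> m_slots;
    std::size_t m_cursor = 0;
};
```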
What we actually do is map each request to a set of asynchronous command buffers, or deferred command lists as some platforms call them. These commands are either directly inserted into the GPU command ring buffer, or we use a custom indirection; on Xbox 360, for example, there's a lot of trickery for the actual flushing, including in the published slides, but essentially think of it as each request generating an asynchronous command list in concurrent jobs. So we can jobify, as I mentioned, using the submit node blocks, and we group them essentially by render feature type and render stage. As I said, it's important that they're flushed in the order of creation, and each submit node block here maps to a single submit job and a single command buffer for execution.

As I mentioned, we need an optimal cadence of CPU and GPU workloads. What I mean by that is we want to run really light CPU workloads first to get the GPU working. A good example is an environment depth prepass: it's super easy to submit terrain and environment instances, it requires very little logic, but the GPU is now busy. Then we run a more complex workload, say the G-buffer pass for characters, which might take a little more time on the CPU, but the GPU is still busy, so we're OK. And then we render the heavy GPU work and submit all of the computations for the rest of the pipeline. So in our case we allow the features and render stages to supply a cost metric, essentially mapping each to a cost on CPU and GPU, and then we sort by that cost metric when converting submit node blocks into submit jobs. You can basically automatically fine-tune the dynamic load balancing for both GPU and CPU occupancy on each platform, because that cost metric can be switched per platform, per feature, per stage. It's a little array of descriptors, essentially heavy CPU, medium CPU, light CPU, heavy GPU, light GPU, and so on. It's coarse-grained, but it really allows you to fine-tune and create that wall of GPU work.

Which brings us to another topic: dynamic load balancing. We needed to attack the excess cost on the CPU, because in a data-driven pipeline we can easily be doomed to a very high job overhead cost. Of course the naive approach to all of this jobification is to jobify the hell out of it, move everything into a separate job, yay. We started with that, and we then discovered that we had about seven milliseconds of job overhead, and we cried a few crocodile tears, because seven milliseconds out of, let's say, a thirty-millisecond workload is a fairly high percentage. That was bad, and we needed to get rid of it. Also, as I was saying, some of the features do a lot of work and others are pretty light: my terrain doesn't really do anything in extract, my skinned object does a lot. So again, using the same concept, we allow specification of the granularity of workloads per phase for each feature. Each feature renderer supplies a simple heuristic metric that says, for extract, "I'm a light one" or "I need to run in a standalone job," and the core render architecture uses that information: it batches light features for a bunch of different render objects together, and it may take some heavy features, my skinned object or my cloth object, and move them out to separate jobs. Essentially this gives you similarly automatic load balancing, and again you can tweak this cost per platform: if I suddenly start doing a lot more work for terrain on PS4, fine, I'll just change the metric reported on PS4.
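As a rough illustration of the per-feature granularity metric and the batching described above, here is a hypothetical C++ sketch. The types and threshold (ExtractRequest, JobGranularity, max_light_per_job) are made up for this example and are not Bungie's API; the point is simply that light workloads get folded into shared jobs to amortize job overhead, while heavy features get standalone jobs that the scheduler can balance independently.

```cpp
#include <cstddef>
#include <vector>

enum class JobGranularity { BatchWithOthers, Standalone };

struct ExtractRequest {
    const char*    feature;      // e.g. "terrain", "skinned", "cloth"
    JobGranularity granularity;  // reported per feature and per phase, tweakable per platform
};

struct ExtractJob {
    std::vector<ExtractRequest> work;   // one job = one unit of scheduled CPU work
};

// Batch light extract workloads together to amortize job overhead; heavy features
// (skinned characters, cloth) each get a standalone job so they can be load-balanced.
inline std::vector<ExtractJob> build_extract_jobs(const std::vector<ExtractRequest>& requests,
                                                  std::size_t max_light_per_job = 8) {
    std::vector<ExtractJob> jobs;
    ExtractJob light_batch;

    for (const ExtractRequest& r : requests) {
        if (r.granularity == JobGranularity::Standalone) {
            jobs.push_back(ExtractJob{{r}});          // heavy feature: its own job
        } else {
            light_batch.work.push_back(r);            // light feature: accumulate
            if (light_batch.work.size() == max_light_per_job) {
                jobs.push_back(std::move(light_batch));
                light_batch = ExtractJob{};
            }
        }
    }
    if (!light_batch.work.empty())
        jobs.push_back(std::move(light_batch));       // flush the final partial batch
    return jobs;
}
```

Because the granularity value is just data reported per feature and per platform, re-balancing for a new platform is a data change rather than a code change, which is the flexibility the talk is describing.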
Now another nice thing about that metric is that, similarly, you can have the feature report which unit it wants to execute on. It can say, hey, for phase A I'm going to execute on SPU, or on CPU, or PPU in this case, and then the core render architecture schedules the heterogeneous execution based on that metric (there's a small sketch of this idea after these notes). We mostly used this functionality for the extract phase, because that's when we can go wide across all of the units; some simulation ran fairly heavily on the PPU, for example, so we left it alone for those phases. Here's an example, again from that PS3 execution, with our PPU and our SPUs: if we look at the extract phase, you see that the workload is spread equally across all of the units. That's an example where we allowed some of the heavier features to say whichever unit they want to run on, and then we looked at the availability of jobs scheduled on that unit and simply moved that workload to that unit.

So I know this was a really easy topic; you guys lasted a good amount, and I'm very proud of every one of you, and of my ability to speak quickly. It's not a complete system description, it's not a spec, and we barely touched some of the more complex aspects of our system, but you should have a pretty good idea of the overall design.

So back to our goals: how did we do? We did deliver a low-latency, stable, highly scalable renderer that executed on all of our shipping platforms across multiple console generations, yay. It enabled us to ship the game, and remember, that was the big goal. The vast majority of our feature code, everything that isn't DRM ugliness and the like, was cross-platform code, and the vast majority of optimizations for the core render architecture didn't touch that code. So I'd say: ship it. Multi-threading the renderer was key to shipping Destiny; you saw that thick wall of occupancy on PS3, and without it we wouldn't have been able to support some of these more challenging platforms. We can also take this kernel processing and extend it in the future: if we had some spare GPU, we could run on the compute units, since we already have kernel processing. Of course we don't really have spare GPU, but if we could, we would. So the architecture's scalability, dynamic load balancing, cache-coherent data access, and heterogeneous support are what allowed us to achieve low latency despite very heavy workloads across the board.

Again, coming back to that takeaway: if there's anything I want you to walk away with, it's that if you build a similar data-driven, pipelined architecture and hide all the complexity from the individual feature writers, they can write a ton of features, you can re-optimize whatever you want, and both sides will be happy and not bothered with that complexity.

A little plug: if you like what you heard and you're excited about the work that we've done, we're hiring. Many people contributed to the graphics of Destiny, many people contributed to the different aspects of the architecture, and we have an incredible team of engineers, artists, and designers that created this game. So stop by and let us know if you're interested. We're out of time, but we're the last talk, so if you have questions, feel free to ask them; while you're asking, the slides will be loaded up there, and my email and Twitter are up there as well.
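Here is a small, hypothetical sketch of the "preferred unit" idea for heterogeneous scheduling, in the spirit of the PS3 PPU/SPU discussion above. None of these names come from Destiny's code, and the greedy spill rule is an assumption; it only illustrates how a per-feature unit preference plus a rough cost estimate could drive placement of phase jobs.

```cpp
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

enum class Unit { CPU_PPU = 0, SPU = 1, GPU_Compute = 2, Count };

struct PhaseJob {
    const char* feature;     // e.g. "skinned", "cloth", "terrain"
    Unit        preference;  // reported by the feature renderer for this phase (e.g. extract)
    float       cost_ms;     // rough cost estimate, also tweakable per platform
};

// Small greedy scheduler: honor the feature's preferred unit unless that unit is already
// far busier than the others, in which case spill the work to the least-loaded unit.
inline std::vector<std::pair<PhaseJob, Unit>>
schedule_phase(const std::vector<PhaseJob>& jobs, float spill_threshold_ms = 2.0f) {
    std::array<float, static_cast<std::size_t>(Unit::Count)> load{};   // accumulated ms per unit
    std::vector<std::pair<PhaseJob, Unit>> placed;

    for (const PhaseJob& job : jobs) {
        const std::size_t preferred = static_cast<std::size_t>(job.preference);

        std::size_t least = 0;                       // find the least-loaded unit
        for (std::size_t u = 1; u < load.size(); ++u)
            if (load[u] < load[least]) least = u;

        std::size_t target = preferred;
        if (load[preferred] - load[least] > spill_threshold_ms)
            target = least;                          // preferred unit is overloaded, move the work

        load[target] += job.cost_ms;
        placed.emplace_back(job, static_cast<Unit>(target));
    }
    return placed;
}
```

A real scheduler would also have to account for data locality and for which code is even compiled for each unit, but the shape of the decision, preference first and then spill to the least-loaded unit, is roughly the load-balancing behavior described above.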
Info
Channel: GDC
Views: 34,951
Keywords: gdc, talk, panel, game, games, gaming, development, hd, design
Id: 0nTDFLMLX9k
Length: 66min 37sec (3997 seconds)
Published: Tue Nov 10 2015