CppCon 2017: Nicolas Guillemot “Design Patterns for Low-Level Real-Time Rendering”

Video Statistics and Information

Reddit Comments

Is this the guy who asked a question in Mike Acton's 2014 Q&A and got rekt? XD

That aside, this presentation is good.

👍︎︎ 2 👤︎︎ u/TheBuzzSaw 📅︎︎ Oct 19 2017 🗫︎ replies
Captions
- Alright, so hi everyone, and thank you for coming today. I'm going to be talking about Design Patterns for Low-Level Real-Time Rendering, and my name is Nicolas.

One big reason I made this talk is the introduction in recent years of brand new graphics programming APIs like Vulkan, DirectX 12, and Metal. These APIs give you control that wasn't possible before, and I think we need to recognize that the rules have changed for what we can do with graphics programming. So I want to take a step back and rethink how we can program GPUs using the new primitives we have access to. A lot of this talk boils things down to a level that can be understood quickly, so there may be times where you disagree with the way I present the information; I encourage you to look at the nitty-gritty details yourself and see whether you can think of an even better way of framing these problems.

One of my big motivations is to find generic solutions to domain-specific problems, where the solution can also be expressed in domain-specific terms. As an example, here's a little project I did: a ray tracer implemented in DirectX 12. It's nothing super amazing, but I was really happy with the results, because I implemented it based on a design from a paper by Samuli Laine; you can see a diagram of the design on the right. The ray tracer is essentially a pass graph, with data flowing from one node to the next through queues. When I started implementing it I just wrote the code directly: here's the function for the logic node, here's the function for the material node, all hard-coded DirectX 12, and it was really tedious. Then I realized I was wasting my time typing the same thing over and over, so I built a small pass-graph system of my own, where I could add nodes and declare queues between them, and it all just worked automatically. That let me focus on the things I really cared about, which were adding new materials and customizing the nodes. That's the general kind of solution I think these new APIs are great for: using a low-level API to implement a very high-level interface, so you can work at the level at which you are most productive. So that's the motivation, a lofty goal to try to achieve.

This talk is mostly split into two parts. In the first part, I'm going to talk about some general GPU programming basics, from a fresh, modern perspective, I think. In the second part, I'm going to talk about how these primitives can be used to implement various data structures that are useful for designing a renderer.

So let's get on with part one: GPU Programming Basics. The first thing I want to talk about is memory management, because it's such an important part of everything we do that it feels important to address it properly first. I'll start with memory management on a discrete GPU, where the CPU and GPU are relatively separate.
As far as the CPU is concerned, it has its own region of memory, which we refer to as system memory; this is basically your RAM. The CPU accesses this RAM through a memory management unit, and the MMU does many fancy things. One of them is virtual memory. In this example, the system memory has two pages allocated inside of it, as you can see at the bottom, and these two pages are not contiguous. But when the pages are accessed through virtual memory, shown at the top in the MMU block, they appear to be contiguous. They're not physically contiguous, but they are contiguous in virtual memory. This is a theme we'll see over and over again in this slide and the next one, which is why I want to dwell on this point. Another fancy thing the MMU can do is automatically fault pages out to the file system. That is, if you have too many processes running at the same time and you exhaust the total amount of RAM in your computer, pages of RAM can be written out to the file system so another process can run in the meantime. This happens more or less automagically. I bring it up because it's relevant to GPUs as well; or rather, the fact that it's not available is what's relevant.

Moving on to the GPU: in this respect it's very similar to the CPU. It also has its own region of memory, which in this case we'll call video memory, though it has many names. This region is also accessed through an MMU, on the GPU side, and that MMU supports much of the same functionality, such as virtual memory. So physical pages of video memory are allocated, possibly not physically contiguous, and those discontiguous pages can be accessed through a virtual address range that appears contiguous. GPUs support this kind of functionality as well.

Now we get to the part where the two devices start crossing over. One thing you can do is have pages of physical memory in system memory which the CPU refers to through its own virtual address space, and which the GPU also refers to through its own virtual address space. That's one way to share memory between the CPU and the GPU. You can go one step further and do it the other way around: looking at the purple boxes that just appeared, the physical memory is in video memory, and again the CPU and the GPU each have virtual addresses that point to the same physical memory, but this time that memory lives on the GPU side. All of these scenarios are possible. One more thing worth noting is DMA engines: GPUs have special hardware for doing copies very quickly between system memory and video memory, the so-called DMA engines.
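As a concrete aside on the shared-memory scenario above, here is a minimal D3D12-flavored sketch of my own (not code from the talk, error handling omitted): it allocates a buffer in an upload heap, which is CPU-writable system memory the GPU can also read. The CPU sees the pages through Map(), and the GPU sees the very same pages through a GPU virtual address.

```cpp
#include <windows.h>
#include <d3d12.h>

struct SharedAllocation {
    void*                     cpuAddress;  // CPU virtual address of the pages
    D3D12_GPU_VIRTUAL_ADDRESS gpuAddress;  // GPU virtual address of the same pages
    ID3D12Resource*           resource;
};

SharedAllocation AllocateUploadBuffer(ID3D12Device* device, UINT64 sizeInBytes)
{
    D3D12_HEAP_PROPERTIES heapProps = {};
    heapProps.Type = D3D12_HEAP_TYPE_UPLOAD;      // system memory: CPU-writable, GPU-readable

    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension        = D3D12_RESOURCE_DIMENSION_BUFFER;
    desc.Width            = sizeInBytes;
    desc.Height           = 1;
    desc.DepthOrArraySize = 1;
    desc.MipLevels        = 1;
    desc.SampleDesc.Count = 1;
    desc.Layout           = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;  // required layout for buffers

    SharedAllocation alloc = {};
    device->CreateCommittedResource(&heapProps, D3D12_HEAP_FLAG_NONE, &desc,
                                    D3D12_RESOURCE_STATE_GENERIC_READ, nullptr,
                                    IID_PPV_ARGS(&alloc.resource));

    // The CPU writes through this pointer...
    alloc.resource->Map(0, nullptr, &alloc.cpuAddress);
    // ...and the GPU reads the same physical pages through this address.
    alloc.gpuAddress = alloc.resource->GetGPUVirtualAddress();
    return alloc;
}
```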
One thing I want to highlight is that I deliberately drew the DMA engine block at the bottom, connected directly to the physical memory of one device and the physical memory of the other. That's to say that, in general, you can't assume a DMA engine understands what virtual memory is, which means it can only work at the granularity of pages: it can't walk from one page to the next through virtual addresses, it has to go from one physical page to another.

There's also the case of the integrated GPU, where the CPU and GPU are roughly one unit. Things get a lot simpler, but not that different. Both devices share the same memory, in system memory. They don't necessarily share the same address space, although that kind of thing is possible with some extensions; in the most general case they each have their own virtual address space, with different addresses referring to the same locations in physical memory. And one thing I think I forgot to mention on the last slide: in these diagrams I deliberately did not connect the memory being accessed by the GPU, the yellow blocks, to the file system. The point is that memory accessible to the GPU generally does not support that fancy mechanism of faulting pages back to the hard drive. It may be physically possible to support it, and some configurations do, but in the most general case you can't assume you can do that.

Moving on, I want to talk about command lists a little bit. The way I see command lists from a big-picture point of view is that we have two processors: the CPU, which is an out-of-order processor in most usual cases, and the GPU, which is an in-order processor. If you think about what "in order" means, it means the GPU runs the work it's given in the order it was given, rather than in whatever order would be more efficient, which is the kind of thing a CPU can figure out at run time with all its fancy speculation and reordering machinery. The important consequence of being an in-order processor is that performance depends mainly on whether the commands were given in an order that is actually efficient. So one of the big-picture ways I think about this relationship is that it's very beneficial to use the CPU to order the commands sent to the GPU in the most efficient way we can.

If you want an analogy, here's one: the train is the GPU, and the dog is the CPU, very carefully and nimbly laying down the train tracks, the command list, for the GPU to run on. And while I'm on this topic, I'll take the analogy one step further and introduce the concept of a fence. A fence is kind of like an operating system object, and it also represents a hardware object, I suppose. What it does is let you keep track of the GPU's progress through the list of commands you've given it. The fence has a current value, and that value might initially be zero.
If we assume the train on the right is the GPU and the tracks are the command list, then as the train moves along and reaches the first fence post marked one, the fence's value changes to one. As it keeps moving along the commands and reaches two, the fence's value changes from one to two, and so on as it reaches number three. This mechanism, as I'll show you, is how you can implement various kinds of synchronization between the CPU and the GPU.

Let me elaborate on that with a slightly more concrete example. In this scenario we have a CPU, a GPU, and a fence object in the middle. The fence currently has the value one, just to be arbitrary. Suppose the CPU writes a list of commands it wants to send to the GPU, and at the end of that list it writes a special command, a signal command, which is what changes the value of the fence. After it writes this list of commands and sends it off, the GPU takes the list and executes the commands. When the GPU reaches the signal command, it changes the value of the fence from one to two. Using that fact, we can now synchronize on the completion of this list of commands, using the knowledge that completion is indicated by the fence having reached two. For example, the CPU can wait for the GPU to finish by checking that the fence is greater than or equal to two. Similarly, the GPU itself can also wait on the fence reaching two, because in the general case it's possible to have more than one list of commands running concurrently on the GPU, and you sometimes need to synchronize between those different lists.

So I've talked a lot about commands, but I haven't really said what the commands are, so I'm going to give you a taste. Again, this is very high level and I'm hiding a lot of details, but hopefully it's easier to think about this way; it is for me, anyway. For example, there are commands that abstract the DMA engines: a command that simply copies memory from a source location to a destination location. There are also commands for executing programs on the GPU, like the sequence I'm showing at the bottom left here: we start by setting the program, to say this is the program I want to use the next time we run one; then we have a command that sets the parameters to that program; and then we actually invoke the program, generally with a draw call or a dispatch call. Those are the two main ways to launch a program. You can take this trio of commands and boil it down, from a C++ programming perspective, to a function call. When you make a function call in everyday C++ code, you have a function, you pass the parameters, and then you do a branch, right? This is pretty much exactly that; it's just spread out over multiple statements.
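As an illustration of the signal-and-wait pattern described a moment ago, here is a minimal D3D12-flavored sketch of my own (not from the talk): the CPU submits a command list, appends a signal, and then waits for that fence value before reusing anything the GPU might still be reading. The queue, fence, command list, and event are assumed to have been created elsewhere.

```cpp
#include <windows.h>
#include <d3d12.h>

// Wait until the GPU has consumed everything submitted before Signal(fence, value).
void SubmitAndWait(ID3D12CommandQueue* queue, ID3D12GraphicsCommandList* cmdList,
                   ID3D12Fence* fence, HANDLE fenceEvent, UINT64 value)
{
    ID3D12CommandList* lists[] = { cmdList };
    queue->ExecuteCommandLists(1, lists);   // GPU starts running the commands, in order
    queue->Signal(fence, value);            // "set the fence to `value`" after the last command

    if (fence->GetCompletedValue() < value) // GPU hasn't reached the signal yet
    {
        // Ask the fence to set an event when it reaches `value`, then block on it.
        fence->SetEventOnCompletion(value, fenceEvent);
        WaitForSingleObject(fenceEvent, INFINITE);
    }
    // Everything recorded before the signal has now finished on the GPU.
}
```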
So sometimes it helps to look at that set-program, set-parameters, dispatch sequence and think of it as one function call rather than as three separate steps. From an API perspective you can make it look like that, and it can be convenient.

GPUs also have built-in hardware support for various kinds of rendering operations, so there are commands that exist just to control how this hardware renderer works. For example, there's a command that sets the render target, which says where the renderer should be drawing pixels to. There's a command that sets the graphics pipeline, which configures the way the renderer should render triangles, for example; there's a wide variety of configurations possible.

And finally, there's the most abstract set of commands, where I'm trying to pull a bit of a fast one on you: commands for managing the memory model and what I call the object model. For example, to deal with data hazards you may need a memory barrier, to say that the writes that happened previously must be done before you move on; those are the kinds of things you have to care about when you write certain kinds of programs. I also have this command here that I call transition, which transitions an object from state A to state B. Some operations on some objects can only be done when the object is in a certain state, so to perform an operation you may have to transition the object into the required state first. I think of this as a slightly more flexible way of doing a memory barrier, in the sense that the barrier does more than just wait for memory operations to finish. And then there are the sneakiest ones: construct and destruct. If you look in the documentation, you will probably not find anything that looks like a command that constructs and destructs; I think about them a little bit like placement new and placement delete. I'll talk more about that topic later.

Another thing I want to mention is a very common pattern related to commands: whether a command's argument should be read at the time the command is recorded, or at the time the command is executed. There's a trade-off between the two. On the left, if you pass the number of threads you want to run by value, then that number is stored within the command, and that hopefully gives the driver and the GPU more opportunity to know what's coming ahead of time. On the other hand, there's the so-called indirect way of making draw calls and dispatch calls, where instead of passing the number of threads by value, you pass it by reference, and just before the command is executed the GPU reads the number of threads from memory, at run time, and runs that many threads. So it's a performance-versus-flexibility trade-off.
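To tie these ideas together, here is a small sketch against a hypothetical, deliberately simplified command-list interface; names like `CommandList`, `setProgram`, and `transition` are mine, mirroring the abstract commands on the slides, and are not DirectX 12 or Vulkan API names. It shows the "function call" trio, a transition, and the by-value versus indirect argument choice.

```cpp
#include <cstdint>

// Hypothetical, simplified command recording interface (illustrative only).
struct GpuAddress { uint64_t value = 0; };

struct CommandList {
    void setRenderTarget(GpuAddress)              { /* record "draw pixels here" */ }
    void setProgram(GpuAddress)                   { /* "which program runs next" */ }
    void setParams(GpuAddress)                    { /* arguments for that program */ }
    void dispatch(uint32_t, uint32_t, uint32_t)   { /* thread count baked into the command */ }
    void dispatchIndirect(GpuAddress)             { /* thread count read from memory at execute time */ }
    void transition(GpuAddress, int, int)         { /* object state A -> state B */ }
    void memoryBarrier()                          { /* prior writes must finish before later reads */ }
};

// The set-program / set-params / dispatch trio reads like a function call:
// "function", "arguments", "branch".
void recordFilterPass(CommandList& cmd, GpuAddress program, GpuAddress params,
                      GpuAddress inputTexture, GpuAddress indirectArgs,
                      bool threadCountKnownWhenRecording)
{
    cmd.memoryBarrier();                               // earlier writes to the input must be visible
    cmd.transition(inputTexture, /*writable*/ 0, /*readable*/ 1);

    cmd.setProgram(program);
    cmd.setParams(params);
    if (threadCountKnownWhenRecording)
        cmd.dispatch(64, 64, 1);                       // by value: stored in the command itself
    else
        cmd.dispatchIndirect(indirectArgs);            // by reference: read just before execution
}
```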
I bring this up because, if you look around, you'll find this is a very common trade-off: preparing things ahead of time versus deciding them at the last second, and the performance and flexibility trade-offs that come with each choice.

One last thing I want to cover under GPU programming "basics" is descriptors. The way I think about a descriptor is that it's something like the hardware's view of what an object is. By object I mostly mean buffers and textures, although there are various kinds of buffers and various kinds of textures, so it's a bit more complicated than that. Mainly, it's a small struct containing something like an address plus some metadata relevant to how that object should be accessed. As a concrete example, from the documentation, here's what a texture descriptor looks like on AMD's GCN processors. It's 128 bits, and it's not the only texture descriptor format they support, but you can clearly see that it begins with the address of the resource this object represents, and later on it describes the width and height of the texture being referenced. That's roughly what I expect to see when I think about a descriptor. These map fairly closely to the assembly code: if you look at the assembly, you'll see the descriptor being passed as an argument to an instruction that, say, samples from a texture. So this is a concept that is very closely tied to the architecture, I would say.

That's all for the first part of this talk, which was about GPU basics. Now I'm going to talk about designing a renderer at a high level using the primitives from the first part. Before I get too deep into the nitty-gritty, I want to introduce what I see as a decent way to architect a real-time renderer at a high level. I usually start from the simulation. If you're making a video game, this is where the game logic happens, and the game logic happens in terms of so-called game objects. The term game object isn't tied to any specific implementation; every game engine has its own idea of what a game object is, so it's a very general term. The point is that on every update of the game, you somehow take the state of all the game objects and move them around based on the game logic. I see that as two separate parts: on one hand you have objects that are updated purely as time passes, in a relatively continuous way, and on the other hand you have things that are more like events, such as receiving a message from another object, or a keyboard press, which is a discrete event you have to handle on that one frame. That's just one way of breaking the problem down that you might like. As the simulation runs, the way I see it, it should make updates to a data structure that represents the current state of the scene.
So as objects are moved around, you would take, for example, the position of a mesh and move it forward based on some game logic, and this all happens at the level of what I call the scene. The scene exposes high-level concepts such as geometry, materials, instances of meshes, cameras, and lights. These are not strictly tied to any GPU concepts; they're a high-level interface for the simulation to describe what it wants to render. From there, the renderer itself is in charge of reading the data in the scene and turning that description of what needs to be done into a representation that makes sense for talking to a GPU: this is where we bring out the buffers and the textures and the shaders and the rendering passes. It's a bit lower level, and at the same time it gives you a separation between how the scene is described and how the scene is rendered. The separation can get a bit blurry, so it's not always straightforward; it's just a mental model for how to divide things up, if it helps you.

So the renderer reads the data in the scene and then, presumably, creates a list of commands that it sends to the GPU to actually render the scene. That list of commands makes its way to the GPU, and the GPU runs those commands. At the end of running all of them, it will probably output an image to the screen, which we call presenting, and this presented image gets put in the so-called swap chain; that's the DirectX vocabulary, anyway. The swap chain represents the double-buffering mechanism supported by the operating system: what you put in the swap chain gets composited onto your desktop along with the rest of what's on your desktop, or goes directly to your screen if you're in full-screen mode. Finally, it makes its way to your display. In the context of real-time rendering, we want this whole procedure to happen within roughly 10 to 100 milliseconds. If you're working on a shooter or a virtual reality game, you probably want to be at the lower end of that spectrum; if you're making something more like an editor, maybe 100 milliseconds is acceptable if it makes it easier to get a good visualization of what you're working on. So depending on what kind of game or project you're making, you can afford different amounts of latency.

The next thing I want to talk about is ring buffers, because I think they're a very useful primitive for implementing all sorts of things. I'll give you a very basic example of what a ring buffer is. In this case I'm showing a ring buffer in the context of streaming data from the CPU to the GPU, although it could go the other way around; it's a very general data structure, so you don't have to be tied to this exact use case. But for the sake of example, let's think about a CPU streaming data to the GPU. You have a block of memory, which is the ring buffer itself. The CPU allocates a range of that memory and writes data into it. That newly written data then gets handed to the GPU, and the GPU reads it. While the GPU is reading that data, the CPU can write more data, so you get this back and forth where the CPU and GPU are both able to work in parallel.
And this pattern just repeats itself: what the CPU just finished writing gets handed to the GPU, and the CPU keeps writing the next batch of data.

Now, I think the API you expose for allocating out of this data structure can be beautifully simple. What I'm proposing here is a small extension of the typical allocators you see in C++ code: you take your ring buffer and call Alloc on it, passing the type of the object or struct you want to allocate, and instead of returning just a CPU virtual address, which is what an everyday C++ allocator does, this allocator returns both the CPU virtual address and the GPU virtual address. With that kind of allocation, you can use the CPU virtual address to write the data you want to upload, and then take the GPU virtual address and pass it on to the renderer through a command. Now, some people will look at this and suggest that you should probably consider copying that memory into proper video memory before issuing the draw command; some people are quite strict about that. Depending on how you feel, there are trade-offs.

At the bottom of this slide, I'm applying the same idea, but instead of allocating plain memory I'm allocating descriptors. You can make a ring buffer of descriptors and have a function that allocates N descriptors; you get back a CPU virtual address to the descriptors you allocated and a GPU virtual address to them as well, so you can write the descriptors' metadata through the CPU address and then pass the GPU address along for rendering.

Now, there are a bunch of tricky cases with ring buffers, a lot of "buts". For example, what happens when you run out of memory? That's a tricky case, and ideally you want to avoid it, but I'll talk about it a little. What happens if the buffer is full? The GPU is still reading the yellow blocks on the right, but the CPU wants to write the next batch of data, and it can't, because the GPU is still reading that memory. In this case the CPU might just have to wait for that block of memory to become available, which it can do using a fence. Once the GPU finishes processing that data, the CPU can take control of it and start writing again. So that's one way of handling the out-of-memory situation: using a fence to synchronize that access.

Another big problem is how to handle wrap-around, which, you might have noticed, I conveniently avoided in all of my animations. There are some tricks that make it a little saner. One thing I highly recommend is to keep something like a virtual offset into the ring buffer. In reality your ring buffer probably has a limited size, but with the virtual offset you just pretend the ring buffer has infinite size: even when you get past the end, you keep incrementing your current offset and keep going. Then, when you actually want to access the memory corresponding to one of these virtual offsets, you mask it by the size of the buffer. We're assuming here that the buffer size is a power of two, which is why this masking trick works. This simplifies a lot of the math relating to ring buffers.
Another convenient thing is that it simplifies working with fences, too, because fence values increase monotonically and the virtual offset also increases monotonically, so it's easier to match them up when you use fences and the ring buffer together. One thing you can do, and I do this almost all the time, is simply disallow wrap-around: you allocate enough memory that you know you're not going to need any more and just run with that, and if you actually do run out of memory you assert, and when you hit the assert you double the amount of memory. It's kind of dumb, but it will get you very, very far, so it can be a very worthwhile engineering trade-off. If you're implementing this in the wild, I recommend reading this article by Fabian, which has a really nice breakdown of the invariants of a ring buffer and some proofs about how the virtual offset, for example, makes things a lot simpler.

Moving on, another neat thing about ring buffers is that you can do allocations from them in a lock-free fashion. Suppose the virtual offset from the last slide is now an atomic; then it's very straightforward to allocate memory lock-free. For example, if the ring buffer's offset represents an index into an array, all you have to do is a fetch-add: it increments the current offset by the count you asked for, which allocates that many entries, and it returns the old value, which tells you the start of the range you just allocated. So it bumps the offset forward by that amount; that's a relatively straightforward lock-free allocation. The more complicated case is when you're allocating bytes. In the earlier example with Alloc of Camera, that's supposed to be an interface that can allocate any struct of any size, and it tends to be that different interfaces for working with a GPU require different alignments for the data you pass in. So you need to align your allocations, which makes it a little more complicated than simply bumping a pointer forward. What I suggest is to just pick the worst-case alignment relevant to the platform you're working on, which for DirectX is basically 512, and pad all your allocations up to that. Maybe there's a lot of waste in this approach; I haven't really measured it. But it's convenient, especially because it's very easy to forget that something needs alignment if you don't handle it automatically like this.

So there are some pros and cons, as I mentioned. One of the biggest pros for me is that this interface takes memory management, which is a relatively complex problem where you'd otherwise create a bunch of buffers everywhere, and boils it down to a very, very simple API. It also avoids fragmentation, because all your allocations are tightly packed in the ring buffer, so it can be a very nice way to use memory. I'd say it's a powerful building block for all sorts of higher-level abstractions; once you have this ring buffer, you find all sorts of uses for it.
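Here is a minimal sketch of the kind of ring-buffer allocator described above: a monotonically increasing virtual offset held in an atomic, a fetch-add to claim a range lock-free, power-of-two masking to find the physical offset, and a single worst-case alignment. The struct and member names are mine, and the fence-based reclamation of old ranges is left out to keep it short; the "disallow wrap-around" assert stands in for it.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Sketch of a lock-free upload ring buffer (illustrative; names are made up).
struct RingAllocation {
    void*    cpuAddress;  // where the CPU writes the data
    uint64_t gpuAddress;  // what gets passed to the GPU in a command
};

class UploadRing {
public:
    UploadRing(void* cpuBase, uint64_t gpuBase, uint64_t sizeBytes)
        : cpuBase_(static_cast<uint8_t*>(cpuBase)), gpuBase_(gpuBase), size_(sizeBytes)
    {
        assert(size_ != 0 && (size_ & (size_ - 1)) == 0 && "size must be a power of two");
    }

    RingAllocation Alloc(uint64_t bytes)
    {
        // Pad every allocation to the worst-case alignment the platform needs.
        const uint64_t kWorstCaseAlign = 512;
        const uint64_t padded = (bytes + kWorstCaseAlign - 1) & ~(kWorstCaseAlign - 1);

        // fetch_add returns the old virtual offset: the start of our range.
        const uint64_t virtualOffset = head_.fetch_add(padded, std::memory_order_relaxed);

        // "Disallow wrap-around" policy: assert instead of recycling memory here.
        // A real implementation would wait on a fence until the GPU is done with old ranges.
        assert(virtualOffset + padded <= size_ && "ring buffer exhausted: make it bigger");

        const uint64_t physicalOffset = virtualOffset & (size_ - 1);  // power-of-two mask
        return { cpuBase_ + physicalOffset, gpuBase_ + physicalOffset };
    }

    template <typename T>
    RingAllocation Alloc() { return Alloc(sizeof(T)); }

private:
    uint8_t*              cpuBase_;
    uint64_t              gpuBase_;
    uint64_t              size_;
    std::atomic<uint64_t> head_{0};
};
```

In use, the CPU would write the payload through `cpuAddress` and then pass `gpuAddress` into a draw or dispatch through a command, as in the Alloc-of-Camera example from the slides.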
Some uses I'll suggest: procedural geometry, if you want to generate some stuff on the fly; texture streaming, where for example you allocate pixels into your ring buffer and then kick off a copy into a texture object on the GPU; and the so-called sprite batch abstraction for rendering sprites in a 2D game, which is something you can implement very easily if you have a ring buffer handy. The main cons I can think of are that it's hard to decide how much memory to give it: if you make it too small, then either your performance is bad because you're synchronizing on those fences all the time, or you just crash because you put an assert there. And if you make it too big, you're wasting memory. So it's not straightforward to solve this sizing problem in a general way, I think, although I'm open to ideas; if anybody in the audience has a solution, please write a blog post about it. Another big problem is that the memory configuration itself is not one-size-fits-all. This whole time I've been talking about this ring-buffer memory region, but we haven't talked about whether that region should be in system memory or in video memory, and what the cache properties of the memory should be. Should it be write-combined? Write-back? Write-through? These things matter a lot: for example, if you run a procedural geometry algorithm in place in a buffer you've allocated as write-combined memory, you're going to have a really bad day. So it's important to think about these details when you're implementing or using a ring buffer.

The next thing I want to talk about is parallel command recording, which is a really big feature of a lot of these new APIs. Here's the big idea: it turns out that writing command lists is actually quite CPU intensive, and one of the ways that overhead has been mitigated in recent APIs is by making it possible to record commands on multiple threads in parallel. In this scenario, the CPU creates multiple threads, and each of these threads records command lists in parallel. When they're all done, you bundle the command lists up and submit them to the GPU all at once. This has been shown to give some pretty big advantages when you have a lot of work to do.

If we want to apply this, let's first approach the easy case. Actually, before I even talk about this: if you don't need to make it multi-threaded, don't. But if you have, say, 10,000 objects, then maybe it starts getting interesting. My point is, don't over-engineer it. Anyway, this is the relatively simple case of multi-threading the command writing. Say we have a whole bunch of asteroids to draw, we want to draw them as fast as possible, and writing the commands is expensive because there are so many asteroids. What we can do is split the list of asteroids into several jobs and send each job to a different thread. Each thread writes its commands independently: each one starts by setting the render target, then submits the changes of geometry and material and makes the draw calls for each of the objects it's responsible for. At the end of this parallel for loop, all the lists are grouped together and submitted at once.
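Below is a minimal sketch of that easy case, assuming a hypothetical `recordAsteroidDraws` helper and opaque `CommandList` handles; the helpers' names and signatures are mine, not any real API. The structure is the point: one list per worker, one submission at the end.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

struct Asteroid { float position[3]; float radius; };  // whatever the draw needs
struct CommandList;                                     // opaque command list handle

// Hypothetical helpers, assumed to exist for the sake of the sketch:
CommandList* beginCommandList();                                    // create/reset a list
void recordAsteroidDraws(CommandList*, const Asteroid*, size_t n);  // set RT, geometry, material, draw
void submitAll(CommandList* const* lists, size_t count);            // one submission for all lists

void drawAsteroidsInParallel(const std::vector<Asteroid>& asteroids, unsigned workerCount)
{
    std::vector<CommandList*> lists(workerCount);
    std::vector<std::thread>  workers;

    const size_t perWorker = (asteroids.size() + workerCount - 1) / workerCount;
    for (unsigned w = 0; w < workerCount; ++w) {
        workers.emplace_back([&, w] {
            const size_t first = std::min<size_t>(w * perWorker, asteroids.size());
            const size_t count = std::min(perWorker, asteroids.size() - first);
            lists[w] = beginCommandList();                     // one command list per thread
            recordAsteroidDraws(lists[w], asteroids.data() + first, count);
        });
    }
    for (auto& t : workers) t.join();         // wait until every list is fully recorded

    submitAll(lists.data(), lists.size());    // bundle them up: a single submission to the GPU
}
```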
So that's the easy case. In the next few slides I'm going to explore the difficult case, which is a bit adventurous and almost academic, so if you cringe, that's okay; I'm a bit in the clouds right now. The difficult case is: what if you're not working with regular work? Maybe you have a renderer that you want to be extremely general: you want to handle blob-shaped objects, which require something like marching cubes; you want subdivision surfaces, which require some kind of triangle subdivision algorithm; and you want particles, which require sorting and all sorts of other work to make them look right. If you want to render all these different kinds of objects, then the amount of CPU work required to render one object changes depending on what type of object it is, and on other things that are hard to measure, like how far it is from the camera and how much of it is actually visible. In these cases, we have to adopt a slightly more sophisticated approach than a parallel for-each if we want to be efficient, I suppose.

The solution I'm going to propose for this problem is fork/join parallelism. To summarize fork/join in a relatively approachable way, I have this image on the right. The idea is that we start with a big list containing all the objects in the scene, and we first check: is this list big enough that it's worth splitting across multiple cores? If it is, we split it in half, and we apply this recursively: every time a list still looks big enough to be worth splitting across cores, we split it again. At the end of the day, we get a bunch of tasks that are hopefully roughly well-sized to run on their own cores. If this interests you, I highly recommend the talk by Pablo Halpern from CppCon 2015; I thought it was a brilliant talk, and it explains everything on this slide in more detail.

But what I'm going to do here is go one step deeper and talk about a solution to another problem with this fork/join approach. Zooming into the part of the fork/join graph I showed earlier, where there was a cube and a green cloud: in the scheme from the last slide, the rendering commands for drawing the cube are written into their own command list, and the commands for rendering the green cloud are written into their own command list as well. Now, you generally want to avoid creating too many command lists, because there's some overhead associated with submitting many of them. One case where this is an especially obvious loss is when the two tasks created for these two objects end up running on the same CPU core. If they both run on the same core, we don't need two separate command lists; we can have one command list that holds the commands for both objects, and that reduces the total number of command lists we need.
So what I'm saying is: if the second task happens to run on the same core, why not automatically reuse list number one for the second object as well? This is actually possible to do in a relatively elegant way using the so-called hyperobject idea. With hyperobjects implemented, the end result is that whenever the same CPU core is used for subsequent draw calls, the two lists of draw calls get combined into one single list. I won't go into too much detail about the key idea, but I think it's really neat, and if it interests you I highly recommend reading this paper. Actually, I'd say the paper is interesting in general, because it explains a lot about the internals of how the Cilk++ scheduler works, and how this whole hyperobject idea integrates very nicely into its scheduling. So it's a very interesting paper if you're into this topic. Another point, specific to rendering, is that applying the hyperobject idea keeps the order of the draw calls intact. That's actually very important, because if the order of draw calls changes between frames, you can get aesthetic issues, for example objects flickering, and changing the order of draw calls can also change performance. It's important to have stable performance from frame to frame, so keeping the draw order intact despite the parallelism you've added is a nice bonus.

Now, one of the last parts of this talk is about scheduling GPU work and memory. Going back to the big picture: you can decompose a frame of real-time rendering into a bunch of passes, where each pass has a certain set of inputs and a certain set of outputs, and you can take a step back and visualize the whole thing as a big graph. When we build a scheduler, what we conceptually want to do is take this graph as input and somehow find a way to execute the work in a way that is, first of all, valid, in the sense that dependencies are respected, but also as efficient as we can make it. Another thing: if you look at the middle of this graph, there's a Texture 2 object which is the output of Pass 1 and the input of Pass 2. That Texture 2 object is no longer relevant after those two nodes are done with it. So what I mean by managing allocation and lifetime is that some resources exist for only a small window of the graph, and it's interesting to think about how best to do those allocations and deallocations as we execute the graph. Finally, there's what I'm calling respecting the dynamic nature of real-time rendering. What I mean is that it's very tempting to look at something like this and make an API where you build the whole graph of the frame up front, as a big data structure in an init function somewhere, and then just let that run. That can work, and it has some nice advantages, but I want some respect given to the fact that in a real-time scenario, the contents of what's shown on screen can change a lot from one frame to the next.
A game can change a lot, which makes it hard to come up with one big, static data structure that represents a frame in these kinds of scenarios. So there has to be some thought put into how much flexibility you want to allow in the generation of this graph, and in the most general case, which I think is actually a very common case, you just regenerate the whole graph every frame. That's a common approach, but it's not required, so make your own decisions.

The pseudocode here is roughly how you might have done this in OpenGL or DirectX 11, the old-school approach to the problem, and it illustrates a lot of what we have to deal with when we implement a more general solution. The code on this slide is split into two parts, and each part represents a pass of rendering. The first pass outputs a depth buffer in the form of a shadow map; the word shadow map that I've highlighted in the first part is the output of that pass. If you don't understand what a shadow map is, that's fine; just think of it as the output of that pass of rendering. Meanwhile, the bottom part of the code is another pass of rendering, and this pass takes the shadow map as an input. So the shadow map went from being an output to being an input, and more precisely it went from being a depth buffer, with read-and-write access, to being a texture that is only read, as a depth texture. If you're working with OpenGL or DirectX 11, this kind of change, an object going from being written to being read, and from being a depth buffer to something like a depth texture, is handled automatically for you, and the driver may even be able to build a graph very similar to the one I showed on the last slide from what you're submitting. But with the new APIs these responsibilities are now on our plate, so we have to figure out how to handle them ourselves. If you are using OpenGL or DirectX 11, you can probably make your real code look very much like the code on this slide and it will be just fine; I actually think this is a nice way to architect OpenGL code in general, if you're not going to go for a fancy architecture-astronaut solution like the rendering graph I'm about to talk about.

When it comes to scheduling GPU work and memory, there is some previous work on this topic, which I think is very interesting. If you're interested, I definitely recommend looking at the work on the frame graph by Yuriy O'Donnell from Frostbite, and the work on render graphs with Vulkan by Hans-Kristian; if you're interested in Vulkan I especially recommend the second one, which is open source, so you can go and look at how it's implemented. I don't want to repeat everything they said, but I'll try to give you a summary, boiling it down to the core ideas and some of the fundamental challenges, to make the details a bit easier to understand. First I want to talk about submitting work. There are various operations, various problems, that we can solve in the process of submitting work using the kind of work scheduler I'm proposing.
For example, in general you can implement compiler-like optimizations, and one example is dead code elimination. Suppose you have the graph shown on the right, with Pass 1 and Pass 2, and Pass 1 outputs a texture that isn't connected to anything else. We're creating this texture and then just throwing it away. The obvious move is to have the work scheduler realize that this output is useless and remove that entire part of the graph. At this point you might ask, well, why did you even write that code? Just delete it. But it happens that, for example, a node might output some debug information that you want to display only conditionally. There are various cases like that where you have optional outputs, and ideally you don't want to produce them when nobody needs them, but you don't know that nobody needs them until the end of the frame. The point is that it's not always obvious ahead of time that some code is dead.

Another very important thing your scheduler should handle is data hazards: both memory barriers and the transitions between reads and writes, when an object alternates between being written and being read. There are also barriers relating to the renderer, more advanced kinds of barriers that relate to the way the rendering pipeline works. For example, in the area I'm showing, one blue node writes to some memory, which is output into this texture, and the next node reads that texture. So we have a read-after-write data hazard, and between these two nodes we have to make sure we insert a barrier saying that all writes before this point must have finished before we can move on and read the data we just wrote.

Another, I guess tricky, aspect is managing the latency between passes. I've shown two schedules here. On the left we have A followed quickly by B, and C followed quickly by D. On the right, we start A as early as possible and only run B very late. The trade-off is that if you run A and then B immediately, you hopefully get some cache coherency from producing the outputs and consuming them as inputs right away, but you may have to wait for A to finish before you can start B, so there may be some extra cost to that. In the other case, we start A early and do B very late, so it's very likely that A has mostly finished by then and we won't have to wait for it long, but the memory A produced may no longer be in cache. So it's a difficult trade-off. I'll also add that on the right side there are more tasks running at the same time, which means the total amount of memory needed is greater. So a lot of non-obvious, heuristic trade-offs are necessary here. And finally, when your scheduler has decided which tasks actually need to run, it has to traverse the graph and invoke some code that records the commands relevant to that node, that pass of rendering.
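To make the scheduler's job more tangible, here is a hypothetical sketch of what declaring such a pass graph might look like. Names like `FrameGraph`, `addPass`, and `Handle` are mine, loosely inspired by the frame graph and render graph work cited above; this is not the API of either project, and the scheduling itself is only described in comments.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical frame-graph-style interface (illustrative only).
struct Handle { uint32_t id; };          // names a transient texture or buffer
struct CommandList;                      // whatever records the actual GPU commands

struct PassDesc {
    std::string         name;
    std::vector<Handle> reads;           // inputs:  barriers/transitions derived from these
    std::vector<Handle> writes;          // outputs: lifetimes and dead-code info derived from these
    std::function<void(CommandList&)> record;   // invoked once the pass is scheduled
};

class FrameGraph {
public:
    Handle createTexture(const char* /*debugName*/) { return Handle{nextId_++}; }
    void   addPass(PassDesc pass)                   { passes_.push_back(std::move(pass)); }

    // A scheduler would: drop passes whose outputs are never read (dead code),
    // order passes so dependencies are respected, insert barriers between a writer
    // and its readers, and allocate/free transient resources around their first
    // and last use. All of that is omitted here.
    void compileAndExecute(CommandList& cmd);

private:
    uint32_t              nextId_ = 0;
    std::vector<PassDesc> passes_;
};

// Example: the shadow-map scenario from the slide, rebuilt every frame.
void buildFrame(FrameGraph& graph)
{
    Handle shadowMap  = graph.createTexture("shadowMap");
    Handle backBuffer = graph.createTexture("backBuffer");

    graph.addPass({"ShadowPass", /*reads*/{}, /*writes*/{shadowMap},
                   [=](CommandList&) { /* draw occluders into the shadow map */ }});
    graph.addPass({"MainPass", /*reads*/{shadowMap}, /*writes*/{backBuffer},
                   [=](CommandList&) { /* draw the scene, sampling the shadow map */ }});
}
```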
Now I want to get a little more concrete about the algorithms you can use to schedule this work. The closest thing I could find to describe this is so-called list scheduling. The algorithm itself is very simple; the heuristics are the tricky part, basically. The algorithm is something like: assign priorities to all the tasks, then, in a loop, take the highest-priority task that can actually run and schedule it, and repeat until you've hopefully scheduled everything. To illustrate the concept, I've made a small task graph here and ordered the tasks by priority: A, B, C, D, E. Suppose we want to build the schedule. We begin by scheduling the highest-priority task, which is A. Then suppose the next highest-priority task is B, but B needs to allocate some memory that can't be satisfied right now; there isn't enough memory available to run B, so we skip B, since it's not a valid runnable task, and move on to the next highest-priority task, which is C. Maybe C can run, so C gets scheduled. And maybe after C has finished, it becomes possible to allocate the memory for B, because C's input has been freed up, so we can schedule the rest of the graph. That's roughly how I see it, and again, the tricky part is coming up with the right heuristics, which I can't claim to have solved, although the blog post I linked suggests some.

Another aspect, which I touched on before, is managing the lifetime of objects as the tasks flow through. For example, take a very simple task graph where Texture 1 is an input to Pass A, Pass A outputs Texture 2, which is an input to Pass B, and Pass B outputs the final output of the frame. One potential schedule looks like this: before running A, first allocate Texture 1 and Texture 2, because both a task's inputs and its outputs need to be allocated before it can run. After A has run, the memory for Texture 1 may no longer be necessary, so it can be deleted. And again, as earlier, when I say new and delete I'm talking about something a bit abstract; concretely it's a sequence of specific commands, which you'll find in the documentation. Moving on: before running B we have to make sure its output is also allocated, then run B, and finally we can delete Texture 2 because it's no longer necessary, and we have the final output. Part of what your scheduler can do is optimize scenarios like this one; for example, it may be that the same memory can be reused for Texture 1 and Texture 2. If Texture 1 and Texture 2 have the same size and format, and the operation done by Pass A takes every pixel of Texture 1, does a calculation on it, and outputs a new pixel in the same place, it may be possible to reuse the same texture for both rather than copying from one region of memory to another. In that case you get a more compact schedule, where you allocate only one texture and reuse it for both Texture 1 and Texture 2.
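Here is a small sketch of the list-scheduling loop described above, with a made-up `Task` type; the priority ordering and the memory-budget check stand in for whatever heuristics and bookkeeping a real scheduler would use (including freeing inputs after their last use, which is only noted in a comment here).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative list scheduling: highest-priority *runnable* task first.
struct Task {
    int                 priority;        // assigned by some heuristic
    uint64_t            memoryNeeded;    // transient allocations this task requires
    std::vector<size_t> dependencies;    // indices of tasks that must run first
    bool                scheduled = false;
};

std::vector<size_t> buildSchedule(std::vector<Task>& tasks, uint64_t memoryBudget)
{
    // Visit candidates from highest to lowest priority.
    std::vector<size_t> byPriority(tasks.size());
    for (size_t i = 0; i < byPriority.size(); ++i) byPriority[i] = i;
    std::sort(byPriority.begin(), byPriority.end(),
              [&](size_t a, size_t b) { return tasks[a].priority > tasks[b].priority; });

    std::vector<size_t> order;
    while (order.size() < tasks.size()) {
        bool progress = false;
        for (size_t i : byPriority) {
            Task& t = tasks[i];
            if (t.scheduled) continue;

            const bool depsDone = std::all_of(t.dependencies.begin(), t.dependencies.end(),
                                              [&](size_t d) { return tasks[d].scheduled; });
            const bool fits = t.memoryNeeded <= memoryBudget;
            if (!depsDone || !fits) continue;        // skip, like B in the example above

            memoryBudget -= t.memoryNeeded;          // allocate its transient memory
            // A real scheduler would also free inputs whose last use was this task,
            // returning budget so a previously skipped task (like B) can run later.
            t.scheduled = true;
            order.push_back(i);
            progress = true;
            break;                                   // restart from the highest priority
        }
        if (!progress) break;  // nothing runnable: unsatisfiable dependencies or memory
    }
    return order;
}
```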
So those are the kinds of things a task scheduler can do for you: identify resources that can be merged or reused, and manage lifetimes as it goes. One of the places where you can find the most concrete details on how to implement this, I'd say, is the documentation for CreatePlacedResource in DirectX 12, which turns out to have quite a beefy amount of documentation in that article.

So, in summary: in part one today I talked about three main topics. First, memory management, where I talked about the separation between system memory and video memory, and how virtual memory can be used to bridge between them, as well as DMA. Then I talked about command lists, where I proposed the big idea of the CPU building the schedule for the GPU to consume, in an order that is efficient and, ideally, correct, and how we can use fences to synchronize the flow of commands between the two processors. Finally, in that part of the talk, I talked about descriptors, very briefly, just to say that a descriptor is, I'd say, the hardware's view of an object: something like an address into memory plus some metadata that describes how the resource should be accessed. Then in part two, I talked about ring buffers, which I proposed as a CPU-to-GPU streaming primitive, although they have more uses than that; you can go the other way too, GPU to CPU, and use them in lots of different kinds of code, I would say. Then I talked about parallel command recording, to say that with these new APIs it's possible to record command lists in parallel and submit them all at once to the GPU, and I went a little deeper into the differences between regular and irregular workloads and proposed some, maybe academic, solutions for the irregular cases. Finally, I talked about scheduling work and memory: how we can take a graph that represents a frame being rendered and efficiently convert it into a list of commands to send to the GPU, in a way that both optimizes the frame globally and manages the lifetimes of objects throughout the frame. Some people will say that if you don't optimize the frame globally, you're not using these APIs to their full potential, so I think this is relatively important.

Before I finish this talk, I'd like to acknowledge the people who helped me prepare it: Scott and Mauricio mainly, who reviewed early versions of this talk that I pitched to them, and then a whole bunch of people here who answered various questions and inspired various ideas that I've shared today, so I'd like to thank them all for their time. If you look at the slides later, you'll be able to skim through the references if you want to see all the papers and slides I've talked about. But for now, that's all for today. Thank you for listening, and I'll take any comments or questions you may have. (clapping)

- [Speaker] Hello. So, despite everything you said about data in general, do you think there is room today for abstractions in a game engine over how data is organized on the GPU? Or is it always the case that, for example, if I know I have a game with a lot of trees, I have to make all my rendering code consider that? Do you think there's room for an abstraction layer where a game might not need to care about how data is laid out on the GPU?

- I mean, it's a very situational thing to consider, depending on what kind of game you're working on, I suppose.
Some things, for example the frame graph I showed, exist within Frostbite, so people do have, I would say, some level of abstraction there. And it's almost, to quote Agora, a negative abstraction sometimes: an abstraction that actually improves performance in a lot of ways. So I think that's something we can aim for, creating abstractions that actually improve performance. Although, you know, it's hard to say: rendering trees might involve a lot of tricky work, and it might be hard to organize all these different concepts, rendering trees and rendering all sorts of other objects, within the same framework. So it may not be straightforward; hard to say. But I think it's a noble goal.

- [Speaker] Alright.

- Actually, I'd say in practice one thing to consider is also the cost, the time, of even implementing such an abstraction. If you sit down today and start programming something like that, you'd probably be better off not trying to abstract it until you realize it may actually be useful. I'm just saying it's sometimes better to keep things simple; you don't need a frame graph to code something simple.

- [Speaker] Right.

- It's one of those trade-offs.

- [Speaker] The goal is not to write a new abstraction on every project, right?

- Yeah.

- [Speaker] The goal is, for example, if my company has ten projects, I should write the rendering code once for all my projects, in the best case.

- Right.

- [Speaker] I guess, so anyway, thank you.

- Yeah, thanks. Any other questions? So I guess we're gonna call it. (clapping)
Info
Channel: CppCon
Views: 37,312
Rating: 4.9449735 out of 5
Keywords: Nicolas Guillemot, CppCon 2017, Computer Science (Field), C++ (Programming Language), Bash Films, conference video recording services, conference recording services, nationwide conference recording services, conference videography services, conference video recording, conference filming services, conference services, conference recording, conference live streaming, event videographers, capture presentation slides, record presentation slides, event video recording
Id: mdPeXJ0eiGc
Length: 54min 32sec (3272 seconds)
Published: Thu Oct 12 2017