I FIXED VULKAN!

Video Statistics and Information

Captions
Hey, what's up guys, my name is The Cherno, and welcome back to another Hazel devlog. So last week I put up a video (right there). Yes, last week; I'm actually so excited that I've managed to make two devlogs in two weeks. But anyway, last time we talked about how I was having a bunch of Vulkan issues. Basically, the driver would crash, Vulkan wouldn't really give me much information, I tried debugging it and couldn't really get anywhere, validation layers were giving me nothing, and all of that. And today I'm here to tell you that everything's working.

So the first thing that I want to do is thank all of you for all of your suggestions. To be honest, when I explained the problem in the devlog I didn't really expect as much helpful information as I got. Not only were there tons of comments suggesting that I do this or that, I also got emails and various other messages on other platforms being like, "Hey, have you heard of this? Have you tried that?" By following your advice I solved my problem, and today I'm going to show you how that happened.

The key is a little something called NVIDIA Aftermath. I hadn't actually heard of this before, and again, this is thanks to you guys. This is why it's so great to share your development with the world: you get all these helpful resources and feedback from the community. But basically, NVIDIA Aftermath (NVIDIA Nsight Aftermath, I guess they're calling it) is a simple library that you integrate into a DirectX 12 or Vulkan game's crash reporter to generate GPU mini-dumps when a crash occurs. So what you can do is download this library. Anyone can do this; you have to sign up for an NVIDIA developer account, but that's totally free and pretty straightforward. The library is completely free and the license is pretty permissive as well, so in other words you can actually ship this with your game or with your app if you want to, so
that you can get feedback from client crashes in a distribution environment as well, not just for debugging, which is great. But basically, whenever you get a crash in your GPU, Aftermath will dump all of the details of that crash to a file, and then you can open it with NVIDIA Nsight, which is a graphics debugging tool, and actually see what's up. So that's exactly what I did.

(Okay, just a little side note as I'm editing this: I didn't mention this in the video, but Aftermath, NVIDIA Aftermath, only works on NVIDIA GPUs, just in case that's a surprise.)

So a few days ago, basically two days after I put up that video and read all of your suggestions, I had two issues with Vulkan that I knew would cause a device lost error and thus crash the driver. Actually, no, I need to show you guys what I actually did. There's this GitHub repository, Nsight Aftermath Samples, which has a DirectX 12 and a Vulkan example, and it's pretty cool. There are all these files, but the main one is this one, VkHelloNsightAftermath. This has the usage of a little library they've written on top of the Aftermath library, the Aftermath GPU crash tracker. In other words, this enables Aftermath and hooks up their little GPU crash tracker, which is over here, and you hook it up before you create the Vulkan device.

So I looked at this example, and it actually features a Vulkan device lost crash. You can see it happening over here: VK_ERROR_DEVICE_LOST. If device lost happens in their example, they just sleep for three seconds to make sure that the asynchronous file writing has time to finish before you exit. But whatever, the
point is that they've got a really nice sample, and that's one of the things that I love about NVIDIA's libraries: they generally have pretty good samples. So this is what I based it off of. I grabbed these files, chucked them into Hazel, and hooked up the GPU crash tracker in much the same way. Then when I got a device lost, it dumped out two files to disk. I'm going to be using this one as an example; I've got a bunch in this repository. It also dumped out a shader file. One of them is just a GPU dump file, so we'll open up Nsight here, which you can also download for free, and we can just drag this in, as long as you've got your shader paths hooked up correctly.

What this shader hash file is, is a debug file for the shader itself. Aftermath can do a whole bunch; this isn't really a video about Aftermath, it's more of a postmortem crash situation. It's a devlog, right? If you guys want me to put together a proper Nsight Aftermath tutorial or whatever, let me know, but this is more just my experience from a high-level view. My understanding is that this shader file contains the debug symbols of the actual shader. I don't think it has the source code; I think you have to hook that up as well, and I'll show you how in a minute. But basically this has information about what the shader was doing: all of the invocations of the fragment shader, for example, and what was happening at the time of the crash, which is really cool, and I'll show you that here.

If you go to Tools, Options (it's a bit weird that this is global), under Search Paths you can set this up, because you can see that at the moment I've got it in a few different situations
here. Yeah, so it says, for example, "shader object file not found", and the reason that happened is because I haven't hooked up my search paths correctly. So for the driver shader output paths, instead of this, which is a different branch of Hazel, let's change it to editor workflow, which is the branch the dump we opened came from. If I select that and hit OK, you can see it's now saying assets/shaders/Skybox.glsl. If I double-click on it I actually get source code, and it tells me the exact line that was being executed when this happened, which is really cool.

I will say, though, that if you're looking at this and thinking, "How is the open bracket crashing this?", you have to remember that the source code in Hazel, and potentially in your engine as well, might not align correctly. This is clearly a fragment shader, and we know that it's line 18 that caused the issue. But this is actually a Hazel shader, which has type vertex and type fragment, so both shaders are in one file, and there are comments before that. The text that is actually compiled to SPIR-V is this for the vertex shader and this for the fragment shader. So really it's saying line 18 of this text, so clearly it's not that open bracket. If I open up VS Code or something and paste this in (and there might be a space or something, so I always give or take a line), this is what we're left with, and you can see that line 18 is that. It's probably not that; it's probably this. So the line that seems to have crashed is this texture access.

I kind of went straight to the shader, but if we actually read the crash and look at what happened in the dump info, we can see that it was a page
fault, which is a memory access problem. In the crash info you can see there's a page fault going on, and here's the address of the resource that it's trying to access, which is what caused the issue. The size we can see here is 512 megabytes, with width and height 2048 by 2048, depth 6, mip levels 12. So we have information about the texture that it's trying to read, and we know that some kind of resource access was the problem. If you look at the GPU state, you can even see that it was in fact texture, and that the texture access faulted. Everything else seems good, busy or idle, that's fine, but something went wrong with texture operations, and this is the actual texture.

So this by itself, before you even hook up the shader source code, has so much interesting information that I can immediately identify that texture. It's clearly going to be a cube map: the depth is six, so it's got six faces, six layers. Mip levels 12 corresponds with the width and height, so I know it's a 2048 by 2048 cube map. There's only one of those in the engine, and that's the environment map. So clearly accessing the environment map is problematic, and it looks like in the skybox shader, when it tries to draw the background, it's failing because it can't read the texture.

That immediately showed me one of the problems that I had. As I said, I had two issues with Vulkan, both VK_ERROR_DEVICE_LOST. One was somewhat weird and happening on random devices, and we'll get to that in a minute, but then there was this one, which was 100% reproducible. What this ended up being was that I was deleting my old environment map. So if I played with my little Preetham sky model, or if I tried
to load another cube map or something like that, I was deleting the old one, but I wasn't submitting the deletion of that memory, that image view, and all the other things into the render command queue. I was just doing it immediately. So as soon as the reference count hit zero and we were left with a texture or a resource with no references that should be deleted, I was immediately calling those Vulkan delete functions. I guess what happened was that this fragment shader was either in flight or about to be rendered, and it had already passed validation, so it was like, "Everything's fine, now we're going to render it. Oh wait, it's been deleted, I can't access that memory anymore." And it was just giving me a device lost with absolutely no validation message or anything like that. So what I did was submit the deletion to the render command queue, like it should have been done, which means that it happens after the render operation takes place and everything's done, and that fixed it. So that was the first issue.

The second issue was the most serious one in my opinion, because the first one was somewhat new and it was reproducible. Reproducible is always good; not reproducible is always bad. The second one happened on some devices, and that should have been a hint in retrospect, because now that I know why, it makes sense. So it probably means that I need to pay more attention and be a better detective or something. But basically, very rarely on my computer, it would happen when I loaded a cube map, like an environment map. I described how that happens in Hazel a little bit in the last devlog, but basically we load an equirectangular image, we filter it, and then we generate an irradiance map from that as well. I mean, we generate all the mip levels and
everything, and so basically we run three different compute shaders to make all that happen. And I had been really safe with that. I knew that synchronization issues could potentially cause that. Usually, in my experience, with synchronization issues (in other words, if your compute shader is writing to a texture and you don't finish that operation before you start reading that texture in a fragment shader), that doesn't really cause a crash or a device lost. It just means your data is missing, or you might see flickering, stuff like that. Usually it's not a hard crash, at least in my experience. But anyway, just to be super safe, I put fences everywhere.

Taking a look at it is probably going to make more sense, but basically, when I submitted my compute shader to the compute command queue in Vulkan, I made sure that there's a fence that we wait for after the submission. So nothing else happens: the CPU literally just waits here for however long that compute shader takes, and then it continues. That means nothing else could possibly happen while we wait, unless some other work had previously been submitted to the queue. But it wouldn't have been, because in C++ land, when we submit things, we do the cube map first, obviously, because you've just decided you want to load it; there can't possibly already be a submission to render the new one. Anyway.

What that also means, conveniently, is that we can time this really easily. We can just create a CPU timer and see how long this takes, because once you submit something to a queue, if you wait for the fence, for everything to finish, then you also know how long it takes, which is nice.
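A minimal sketch of the "submit, wait on a fence, time it" idea described above. To keep it runnable without a GPU, the fence wait is modeled by a callable that blocks until the work is done; in a real Vulkan renderer that callable would presumably wrap `vkQueueSubmit` with a fence followed by `vkWaitForFences`. The names `timeBlockingSubmit` and `submitAndWait` are hypothetical and are not taken from Hazel.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `submitAndWait`, which is assumed to block until the GPU work has
// finished (in Vulkan: vkQueueSubmit with a fence, then vkWaitForFences),
// and returns the elapsed wall-clock time in milliseconds.
// Hypothetical helper for illustration only; not Hazel's actual code.
inline double timeBlockingSubmit(const std::function<void()>& submitAndWait)
{
    assert(submitAndWait && "a blocking submit callable is required");

    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    submitAndWait(); // the CPU stalls here until the work completes
    const auto end = Clock::now();

    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Because the CPU is stalled for the whole call, the result is wall-clock time that includes submission overhead, not pure GPU execution time. For a one-off loading step like cube map filtering, that is probably close enough to spot a dispatch that runs for seconds.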
So I was also able to see how long this execution took, and when I looked at it, it didn't seem too bad. For one of the shaders it was a few milliseconds, for another one it was half a second, and then the irradiance one took about two seconds, 2.2 seconds, something like that. And I was like, okay, sure, 2.2 seconds is a little bit of time, but we're filtering this cube map, we're trying to make it pretty. This is a loading operation, not a per-frame operation, so whatever.

But then one thing that I noticed was that, as I mentioned, on this computer, which has a 1080 Ti (not the best card you can get these days, but a pretty strong card), it was the fastest, and I couldn't get the crash to happen very often. My laptop, which is hooked up to that monitor in the background, has an RTX 2060, but it's a mobile chip because it's a laptop, and it was happening way more frequently there. Not every time, but when I hit F5 to run Hazelnut and it was loading a cube map, which it has to do on startup, it would happen once every five times, once every ten times, something like that. But then for other people on the Hazel team, like Peter notably, he was getting this every single time, which was crazy. I was like, "What? You can't load any cube maps? What's going on?" And he has the weakest card, a 750 Ti, on which I think he said it took four or five seconds, I don't know, it was way slower. And I was like, okay, that's interesting.

But I didn't realize it at the time. Maybe I should have noticed that on lower-tier hardware it seems to be crashing more often, up to the point where it just crashes every single time. I didn't notice that until NVIDIA Aftermath came to the rescue again. I set it up on that laptop, because on here it happened so infrequently that
it was like, whatever, and I got some pretty interesting results. So this is the dump from the laptop, which I just copied over; let's take a look at this one. I'm not going to bother hooking up the shader, because the problem is actually pretty visible here. And here's the thing: when I first saw this, it took me a little bit of time. I was like, there is so little information here. But it's almost a less-is-more situation. Not really, I wouldn't be opposed to more information, but this amount of information was enough for me to figure out what was going on. In both cases it's a deceptively small amount of information. It looks like not much, but there's actually a lot of stuff here, especially once you hook up the shaders as well.

But basically, you can see (and this is from Thursday last week) what happened: device state is hung. To contrast this with the other one, there the device state was a page fault. Okay, memory access bad, we can take a look at that. But this state was literally hung, and in the compute channel, it seems. That's all we really have here. In the crash info, if we were to hook up the shader, we would see all the different invocations of the compute shader, which are of course running in parallel, and what each one is executing. I hooked that up on the laptop and it wasn't that useful; it was just random points in the compute shader. And you can see over here that we have all of these pipelines being busy, whatever. If you do hook this up, as I mentioned, you would see where you were in the shader, but none of that is really that interesting.

We know which shader it is. I didn't copy the shader object file, but if I open it on that computer I can see that
it's the EnvironmentIrradiance.glsl shader. So it literally says, like what we had here with the shader location being an actual file, it also had that there. So I knew which compute shader was causing the problem, which was coincidentally the one that takes the longest: as I mentioned, over two seconds on this computer, four or five seconds for Peter, or whatever it was. And the device state was hung. So what seems to be happening is that the compute shader is taking so much time that the driver is like, "Something's not right."

I would not have thought that two, three, or even five seconds for a compute shader would be too much. Maybe 30 seconds, a minute, whatever; I expect people are doing some kind of compute operations on their GPUs that take a long time. So why is it suddenly like, "Oh, two, three seconds? You did not just say three seconds. I'm cutting, abort mission, crash the driver"? I don't know why. But what I do know is that I lowered the quality of the environment irradiance map compute shader. So basically I lowered the number of samples. As I showed, we were doing 2048 by 2048 for the rest of the cube maps; I lowered that to 1024, because it's usually good enough. Most of this stuff is just being used for reflections, and 2048 is pretty overkill. I just like to turn stuff up to 11 and see what's up. 2048 is admittedly quite a lot. You saw it was taking 512 megabytes of VRAM. I mean, it's uncompressed data and everything, but 512 megabytes of VRAM for a single cube map is quite a bit. So whatever, let's turn that down. We turned Peter's down significantly; I think we were doing a quarter of the irradiance samples, which is basically how much you sample. So if you lower the samples, obviously the compute shader is going to be way faster, and we lowered it to
1024 for the size, and he never got a crash again. So, long story short: if you take too long in your compute shader, you will lose the device. I did not know this, and I didn't really find it out from my research. I even asked friends of mine from EA, "Have you seen this before?" and the answer was no. So I don't know how common this is. I think someone on my Discord mentioned that you can change this in the registry or something, so that it will maybe wait 10 seconds or whatever before being like, "Okay, I'm aborting the whole thing." But that was my experience. Let me know if you guys have faced similar problems before.

A huge thank you to you guys, honestly, for helping me and putting me onto the right track for solving this, because I feel like if I had not seen this... You don't expect something to be like, "Oh, it's taking too long, crash." That's not usually how it goes; it just takes long, that's what I'm used to. But we can see here that the device state is literally hung, and by reducing the complexity, or how much time the compute shader takes, that problem went away. I don't have it anymore on my laptop, Peter doesn't have it anymore, which is great.

This is a bit of a Hazel devlog, so I should probably show Hazel, right? Basically, I've merged everything back into the master branch. I still have this up; I can probably get rid of it. Inside the settings over here there are some renderer settings, and you can change the environment map size and the number of samples (it's actually 64 times this) that run on the compute shader, if you want to beef it up or whatever. So we have some drop-downs
there so that people can configure this based on the power of their graphics card. But apart from that, stuff works pretty well. This is Vulkan, as I mentioned, and this is a release build. Let's just load something for fun. We have this old game, which is missing its environment map, but we can quickly generate a nice-looking environment. I love this little dynamic sky thing, it's pretty cool. And we can just hit play, and there you go. That's hooked up to C# scripting, and we're using Box2D for physics in this example. It's changing the material settings from C# dynamically as well; that's why this goes red if there's a collision. And everything is basically working in Hazel as expected.

So that's where we're at, I guess. This is all inside dev master (oh sorry, this isn't dev master, I've just got the folder named that); it's all inside the master branch in Hazel-dev. So if you guys do have access to that, feel free to check it out, and let me know if it's stable on your computer, because my goal obviously is to get the master branch as stable as possible. If you don't have access to Hazel-dev and you want access to everything you saw here, and to help support the development of Hazel, you can go to patreon.com/thecherno and you'll get access to all the latest source code that we work on. There are a few active branches, because there are some great people from the community helping out as well, and we've even got plans for some demo games to show off Hazel and what it's capable of, because we're moving through this pretty quickly.

So anyway, thank you as always for your support. I hope you enjoyed this devlog, and I will see you later. Goodbye!
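The first fix described in the transcript, deferring the destruction of a resource until the frame that might still be using it has finished, can be sketched without any graphics API. This is an illustration under assumptions, using a per-frame deletion queue rather than Hazel's render command queue: `DeferredDeletionQueue`, `framesInFlight`, `submit`, and `beginFrame` are hypothetical names, and in a Vulkan renderer the callbacks would call functions such as `vkDestroyImageView`, `vkDestroyImage`, and `vkFreeMemory`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Collects destruction callbacks per frame slot and runs them only when
// that slot comes around again, i.e. `framesInFlight` frames later.
// Assumption: the caller has already waited on the slot's fence before
// calling beginFrame(), so the GPU is done with that slot's resources.
class DeferredDeletionQueue
{
public:
    explicit DeferredDeletionQueue(std::size_t framesInFlight)
        : m_Slots(framesInFlight)
    {
        assert(framesInFlight > 0);
    }

    // Call this instead of destroying immediately when a ref count hits zero.
    void submit(std::function<void()> destroy)
    {
        m_Slots[m_Current].push_back(std::move(destroy));
    }

    // Call at the start of each frame, after waiting on that frame's fence.
    void beginFrame()
    {
        m_Current = (m_Current + 1) % m_Slots.size();
        for (auto& destroy : m_Slots[m_Current])
            destroy();
        m_Slots[m_Current].clear();
    }

private:
    std::vector<std::vector<std::function<void()>>> m_Slots;
    std::size_t m_Current = 0;
};
```

With two frames in flight, a callback submitted during one frame runs on the second subsequent `beginFrame()` call rather than immediately, which is the property that would keep an in-flight skybox draw from reading a freed cube map.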
Info
Channel: The Cherno
Views: 54,685
Keywords: thecherno, thechernoproject, cherno, c++, programming, gamedev, game development, game engine, how to make a game engine, vulkan, rendering, gpu, graphics card, vulkan crash, vulkan device lost, how to fix vulkan, graphics programming
Id: EXgXMa5kapI
Length: 23min 31sec (1411 seconds)
Published: Mon Apr 26 2021