Simple Code, High Performance

Reddit Comments

In my experience, people write complicated code because writing simple code is hard. With many problems, it's often much easier to increase complexity and extend an existing structure to cover your use case than it is to reconsider whether the extended scope allows for reducing (and simplifying) code by reframing the problem entirely.

👍 297 · u/Chousuke · Oct 13 2021

As much as I like listening to Casey and seeing what he achieves, he really needs to work on being more concise in his explanations of things. This video could likely be half an hour and lose very little of the actual detail and value he is providing.

👍 113 · u/Iamonreddit · Oct 13 2021

Every line of code I write is perfectly elegant and beautiful, until two months later when I have to look at it again.

👍 35 · u/hebozhe · Oct 13 2021

wait, is he writing on the board in reverse?

👍 34 · u/NDk48P · Oct 13 2021

Most of the time complicated code is because there are complicated business rules.

Poorly run businesses don't understand that every time they say "Do this, except in that case do this other thing", they're making their code more complicated and less maintainable and slower.

👍 15 · u/Edward_Morbius · Oct 13 2021

Is this guy writing in reverse?

👍 3 · u/CrashingTheApex · Oct 13 2021

When I first came across this guy and Jonathan Blow, I noticed that they were very dedicated to performance and are obviously skilled.

But for whatever reason, something just feels off about them. I don't know what, to this day.

👍 29 · u/nitrohigito · Oct 13 2021

This was interesting, but I don't know if I would consider the final product to be simple code. There is a lot of knowledge baked into the final 10 lines of SIMD processing (or however you would say this). Aside from learning how to interact closely with the processor, the main idea of the 2.5 hours is "Look at the unnecessary steps in your code and try to remove duplicative work" which probably doesn't need to be that long of a lecture.

👍 10 · u/IlliterateJedi · Oct 13 2021

tl;dw?

👍 23 · u/jhjhjh333 · Oct 13 2021
Captions
hey everybody casey muratori here i am just continuing down the list of suggestions i got on twitter when i asked what videos should i post to try and help get the word out that our kickstarter from yeah the infinite is now live this is a really cool graphic novel series that we do here at mollyrocket that's all ages and it's a really neat story and the art is fantastic please check it out if you get a chance the link is in the description anyway one of the suggestions i got was to post the lecture i did at the university of tuente in the netherlands this was a lecture that i was invited to do during covid so it's actually just via video conference which means i have a really good recording of it because it's what was streamed you'll be seeing it exactly as they saw it when it was presented the lecture is about optimization and it's a case study in how to take something that was very slow before it took like a minute to do or something like that and how to get that down to basically instant and you know you can't always do that but sometimes you can and this is a case study of one of the times where that was possible so i hope you enjoy it and again check out the link in the description below if you get a chance without further ado here's my lecture from the university of tuente okay so this is a kind of unusual lecture i got asked to give it and it wasn't quite enough time for me to do a lecture in the traditional style which would be kind of like a one-hour thing where everything is set up exactly right and it's all fine-tuned and i said well i don't really have time for that so i don't know if i can really do it and they said well you can go ahead and take a two hour time block because that's how we normally do things anyway and if you want to do something informal that's totally cool so i was like awesome if i can do something informal and i can talk about whatever i want to then we can do something probably pretty cool it's just gonna be more like a class kind of a thing than a traditional lecture where there's like slides for everything and you know so we're gonna be drawing stuff out on the board and that sort of stuff but you know other than that it should be pretty smooth sailing so this is a stream directly to the university of tuenta i think i pronounced that roughly correct for an english accent i'm not sure i asked but you know my pronunciation is not great and this is something that is a lecture that's sort of general audience but to a technical you know a more technical audience so it's you know it's not going to be the most technical lecture but it's also not going to be completely non-technical and what i wanted to talk about because since i was allowed to talk about anything i wanted uh what i wanted to talk about was a actual concrete example a case study uh that kind of shows why i think software is slow and i do think software today is very very slow i don't know how controversial of a statement that is because you know a lot of times if i say software is slow maybe get some pushback but i feel like a lot of times now the pushback is is not as strong i think maybe people are coming around to the fact that software is a lot slower than it should be and so you know this lecture is kind of trying to take the next step which is to say well if software is slow you know why is it slow and maybe what are the things that people are not taking into account excuse me that people are not taking into account that you know if we started to think about we would actually be able to 
make software that was better right because at the present moment a lot of people i mean this is a lecture to university a lot of people actually study for a while to be computer programmers in theory one of the things that computer programmers would learn to do is to write software that's very good that has a high quality bar that is performant on the machines that they are using you would assume that some of these things would be true and yet even professionals in the field who have been working for quite some time so not only have they had the education but they've also had experience in the industry you can take people who've been working for a very long time and they are still not actually writing software that performs particularly well on the computers that they're using and so this is kind of a this is something we're going to have to fix because at some point there's going to have to be some awareness of the fact that we're at the very least spending a ton of computer time doing effectively nothing like we're burning a lot of power we're burning a lot of money uh especially when you think about paying for data centers and that sort of thing on software being slow and so it's just a question of how do we start to like push people towards learning how not to make software so slow how to make it faster by default and what are the mindsets involved in doing that so this case study is about the grass planting algorithm in the witness and the performance characteristics of that and the reason i picked that was because some it was something that was pretty slow but it was written by people who know how to program relatively performance-oriented code so it wasn't like a complete disaster it wasn't written in like an interpreted language on top of you know all these things where there was just tons and tons of stuff underneath it that i would have to piece through it was written you know in c plus plus it was written custom for that thing for the most part uh and so i can take it apart and show why it was slow and of course uh i can also show the one that i rewrote on that project to be fast and i can also show direct comparisons about how fast we're talking about when we're talking about fast so before i dive into that let me just try to motivate things a little bit here as i drop the pen let me try to motivate things a little bit here in terms of understanding why you might care if software is slow so some people are still under the impression that doesn't matter if software is slow it just let the software be slow and it's someone else's problem like yeah the user has to sit there and wait for this thing to do whatever it's doing but that's okay because the important thing is getting the code done or something right or or that you don't have uh crashes or something right there's a there's a mentality that says it doesn't matter if the code is slow so we shouldn't care about the fact that it's slow so i wanted to list just three brief things uh as an example of why you might care even if that's your attitude so the first one i want to say is windows ce if you've ever heard of this and a lot of you probably haven't to iphone um this to me is the number one best most obvious example of why performance matters so window ce was actually the thing that you used to use on a handheld device so prior to the iphone coming out and android which of course was after iphone prior to that shift windows ce was actually a handheld operating system with a series of handheld devices they were you know i 
don't want to say widely used but businesses a lot of business users used them and they effectively provided everything an iphone provided and more so you know the original iphone when it first came out had almost nothing on it it had like a few little apps that apple had written there was no app store when the iphone originally launched and it couldn't do a lot of things that windows c could do like windows c had little like office app stuff on there and all these things that you could do on there the difference between these two things was literally just the latency like it was literally just the fact that window ce if you go back and look at these devices they are slow they're clunky the touch is just kind of you know you click on it and it's slow and it's very very it it just feels horrible to use because everything you're doing right is very latent the touchscreen was lousy the lag was extremely large and it wasn't really very animated it was just a poor experience using the device and going to iphone they literally just introduced something with a nice touch screen that was like full frame rate you know 60 frames a second or at least 30 frames a second on everything you did when you dragged it would follow you directly right this shift was so powerful that literally just introducing a low latency device with less features than another device completely changed the industry that was it because now it was fun and interactive and it felt good to use your phone and it felt crappy to use this when things feel good people want to use them and performance specifically latency the amount of time between when you try to do something and when you see the results is critical for getting users excited and invested in something that you do and this to me was the best example because if you look at an iphone you cannot tell me that it had more features than windows cd ce it didn't it had less features but people liked it better and they liked it better because it felt better now this continues to this day what does an ipad pro advertise well have you ever seen their videos they're like artists you will love this ipad pro because our pen has a 30 millisecond response time they literally say this a 30 millisecond response time right there's a thousand milliseconds in a second a 30 millisecond response time is around 30 frames per second right 16 milliseconds would be 60 frames a second 30 milliseconds is around 30 mil 30 frames a second so this is basically advertising 30 frames a second on your drawing and they say this is much better than everything else you can see how responsive it is they show video and slow motion video of how far behind the pen the actual stroke lags this is advertising for apple and the only thing they're advertising is low latency okay number two and maybe i've gone on a bit too long so maybe we'll just do two maybe i'll skip the third hopefully this sells it a bit everyone knows everyone understands this intuitively but it's performance that is what's being sold here one other one is mapquest to google maps this one's very interesting to me because not only is it the exact same kind of shift mapquest was like a static map and you clicked on it and it would like refresh this was the mapping service most people used google maps came out you could drag the map around it would page in in real time you could drag it around you could zoom in and out right this is gone this is now the predominant mapping program they don't really do anything different i mean later not much later 
google maps actually added some features obviously that mapquest didn't have but by and large when it came out it couldn't do anything mapquest couldn't do it did the exact same things it was just low latency you can drag the map you can zoom it in that's engaging it's fun low latency equals success in a lot of cases even if you do less and the reason that i put this up here especially is to note that google maps is still extremely slow it's a very low performing piece of software even back then and definitely now but because their competition was even slower all they had to do was be slightly better you know if you can go to some from something that's static to something that's a hundred milliseconds or 500 milliseconds of latency that's still a pretty big deal so not only does getting it exactly right help not only just getting to 30 frames a second or 60 frames a second matter but getting from like a frame every couple seconds to several frames a second is also great you don't even have to get all the way before you start being very competitive to people because most people's software is extremely slow so just being slightly less than extremely slow is already a huge bonus right okay so i won't belabor this point anymore i was going to give a third one but you know what it doesn't matter so let's go ahead and start talking about our specific case hopefully some of that resonated and you're at least somewhat interested to know how software gets slow or why it might be slow and to start thinking about what is the performance that's realistic to achieve you know are there cases like this you know how much software in the world is slow right now and how fast could it be those sorts of things and that's exactly what we're going to be covering okay so let me go ahead and show you the grass planting stuff in the witness so the witness was a video game uh that was done it was the follow-up to braid uh in terms of the the lineage so jonathan blow who if you don't know him he's an absolutely amazing game designer and he did a game called braid uh which was a big success and one of the very early very important games in sort of the the transition of from of indie games becoming big sellers uh lots and lots of people played it and it was very popular and uh his next game after that was called the witness and it was uh brade was a 2d kind of platformer the witness was like a full 3d you know if you looked at it you might think it was a triple a game um it it you you could walk around this this huge island with tons of geometry and different things in it um you could stand up on top of of cliffs and look out over the whole island and you know it all rendered it was it's just a it looks extremely professional it doesn't look like a small project when you play it so it's a really it was a really big undertaking even though the team was quite small programming wise i think there were maybe four total programmers on it most of the time or something like this not very many but it was a very very good looking game now one of the things that uh happens on these types of games is that when you have huge worlds like this and they're supposed to look really rich and detailed is that artists need tools for putting things into the game usually over and above what you can buy so most of the time when we work on a video game we also work on editors that are associated with the video game and you can kind of think of this very analogous to something like when you build you know a building it's not uncommon to see 
like scaffolding right so scaffolding goes up around a building and that scaffolding is something that you built that has no purpose other than to help you build the building you will take it down after right and best case scenario you use that scaffolding on some other thing and that's exactly the same as what happens on video games typically we have editors built specifically for a project sometimes those can then be used on subsequent projects but they do not ship with the game most of the time they're for internal use only one of the tools in the editor for the witness in that series of things that were made for creating the world was a thing that allowed you to distribute meshes over a surface and the point of this is that a lot of times artists you know if you imagine something like trying to put grass or rocks or things like small pebbles throughout the world it would be incredibly tedious if you were going to actually pay artists to hand place every blade or every tuft of grass every pebble on a beach really what the artist wants is just control over the visual look of these things how clustered are they in certain areas how distributed they are do they change their size or their rotation these sorts of things they want some parameters they can control but they don't actually want to have to place each individual one because it's a huge waste of their time they don't care exactly where everything is they just care that it roughly obeys some nice distribution characteristics so that is what the system is we're going to talk about today and let me see if i can show you here uh what it looked like when you actually uh yeah well let me show you this first so here's an example of grass you can actually see um it's sort of like uh scattered around on the ground there are these sort of little tufts right and as you move around you can see that it just adds some nice texture to the ground if the ground was just sort of like flat polygons there it wouldn't look nearly as good it wouldn't look like grass it just looks like you're walking on like a sheet you know with a pattern on it and so the idea here is this is showing uh you know this this kind of shows depth right it kind of gives you this more this feeling of lush vegetation and so placing those things on the ground you can see how many there are it just be extremely tedious so this was the tool that that someone had made for doing that you can watch it run you can see it placing those little pink like dots in there right those pink placements are like grass that it's been placing and you can see that it takes a long time if you drag it it has to restart right and it starts planting them again i don't know if you can kind of see how that's happening right um and so it's very very slow uh you have to wait potentially 10 20 30 seconds for this to update and of course if you're an artist trying to fine tune the look of something it's a long time to wait every time you move or change a parameter you're waiting again for this entire process to occur and this is a video i just recorded just for this lecture using an old build of the witness but on a modern machine so this is actually probably you know 30 40 faster than it was at the time um and so it's important to remember that like that's actually better than it was it was actually slower than that in practice because the actual uh machines at that time were slower and this is running on a machine that's several years more modern so it actually isn't even representative it was worse than 
that now again if you are growing up with software today and you're used to using things on the web that probably didn't seem particularly slow i mean just the fact that you could drag the region in real time is something that most the time you can't even do on web things on web things often times you're just click and you sit there and you wait but if you are talking about what can a computer actually do and one thing i hope to convince you of today that should look very very very slow right that should look like something that we can improve and before i go into what algorithm that was using and how it works so you can understand it a little bit more i'd like to take our first question pause and ask just looking at what that was doing because i always want to gauge your intuition how fast do you think that should have been running so it took maybe and and that processor just for some reference that was a intel uh core i7 processor from about two or three years ago i think it's a 7700k if you want the specific one but it's a skylake core processor if you're the kind of person who knows processors um and it runs entirely on the cpu so it's not running gpu accelerated it's just a cpu algorithm and assuming that we're not going to put it on the gpu or anything like that because the gpu is already very taxed rendering the world and so on uh my question is stop for a second and think about how fast do you think it should run so if it's running at about 30 seconds to plant that region of grass right now if we were to push on it if we were to go and try to do a good job rewriting that to be fast how fast do you think it would actually go like the 30 seconds do you think it could get down to five seconds you know or like could we double this week we get it to 15 seconds um could we you know could we quadruple the speed could get to seven ish seconds right what do you think the speed would be of something like that based solely on your intuition and what you saw on the screen so talk amongst yourselves come up with an answer and the uh i've been told that the answer will then the the answer you sort of agree upon will get sent to my phone and i can see what it is okay so the students averaged out when they guessed they guessed around five seconds so we're gonna start out with a guess of five seconds is what we should be able to get that down to and of course this was just purely ballpark estimate for them they don't even know what the algorithm is or what's actually going on so we just a quick read to say what's our first intuitive guess okay let me now explain the algorithm and more specifically first i want to explain what the algorithm is trying to do so if you imagine how planting works well we're talking about something that we want to be randomized meaning the computer is going to randomly scatter things and of course anything that gets randomly scattered is just going to be using some kind of a pseudorandom number that the computer is generating to kind of place things around the world but when we are planting things we have certain criteria in mind that make it not as simple as just generating a bunch of random numbers as you might imagine by the fact that that took 30 seconds if all of it was doing was just generating a few thousand random points you wouldn't probably expect it to take that long or maybe you would i don't know again it depends on your intuition but essentially what happens is that when we talk about noise which is effectively what we're talking about here we talk about 
random things just noise a lot of times when we talk about random numbers what we're actually talking about is something called white noise and this is effectively a purely random kind of noise it's the kind of noise that if you do it in one dimension sounds like that white noise like on a tuner that can't get a signal you know that kind of thing white noise and what this is is it's when you just pick numbers that are designed to be as random as possible and by random in that case what we mean is not predictable there's not supposed to be correlations or patterns in these numbers if you know a few of the numbers or even a lot of the numbers you're not supposed to have any better idea of what the next number is going to be it's supposed to be truly random and the problem with patterns like this for when we're doing things like art is most things in the real world aren't purely random they don't obey a white noise or a pure random distribution they aren't simply just randomly picking things in the world and going places if you think about what happens with something like vegetation when something grows in a region it takes resources it takes water it takes light that means that there's always a bias for new plants that grow to grow slightly further away from that current plant than there is for it to grow right next to it so you won't see tons of grass all right next to each other and then like a big gap and some other grass right which would happen in a purely random distribution because the whole point is it's not supposed to be predictable it shouldn't look like any kind of uniform density when you look at it because it's supposed to be purely random so there actually is another way of looking at random numbers a different kind of noise and these are sometimes called like poisson distributions or poisson disk distributions you can look up these if you're interested in the very technical nature of them i'm just going to go over their properties and kind of how they work blue noise is a different kind of noise it's noise where it is still based on randomness so it's still based on picking random points but there is a requirement that no two points in the distribution are so close together that they violate some minimum separation so for example with white noise i can just pick random points and they go everywhere but with blue noise i'm saying when i pick random points i can even think of it as doing it just directly picking random points if the point that i would put down is too close to another point i don't place it so this means it still looks random and chaotic there's not a pattern to these points but they don't get very close together so they're kind of more uniformly filling the space now i have a diagram here hopefully that i can show you uh and i don't actually actually you know what i should do there we go sorry there's still text behind it but if you take a look at this diagram here um hopefully what you can see is at the top on the left and right which are really the only ones we need from this diagram the bottom just shows the actual repulsion region of the points but on the top you can see a white noise pattern on the left and a blue noise pattern on the right and now these patterns were just generated i have to admit i made these a long time ago because like i said this was a lecture you know just a pickup lecture so i had to grab things i already had um this was slides i made or satisfied i made diagrams i made for my some of my posts on the witness grass planting which 
doesn't really cover the performance at all but i do cover the noise stuff and when i made these i was not as hardcore about random numbers as i am today so what i can say about these is you know the random number generator used in these probably wasn't as good as i would want it to be today um it wasn't horrible but it probably wasn't the very best thing so i you know take it with a grain of salt but roughly speaking the top diagram on the left is what happens when you just randomly place points into a plane and so that's what the grass would look like now there's many problems with these like i said they're clumped together that's not really how plants or rocks even normally are placed right they tend to be unifor more uniformly distributed than that although random and if you look at what happens or something like that there's another reason in games why we wouldn't want that so besides the fact that it tends to look unnatural and weird it also has a very negative aspect which is when you're placing something that's supposed to cover a surface so don't think rocks think actual grass you know not bushes not rocks not ferns um but actual grass in addition to wanting it to look natural which all those other things wanted we also wanted to kind of cover the the field nicely because we have real budgets we have real uh hard constraints on how much we can render on any given frame you know how much the gpu can actually render and so if we exceed those budgets we will be in trouble that means we're trying to minimize the number of individual plants we place and still get a good looking coverage if you clump a lot of them together then even if you don't care that that looks a little unnatural you also have the problem of you get big empty spaces right so you end up in a situation where you see splotchiness where you want it to feel more uniformly covered you know so that is the basic idea behind blue noise and that's the kind of distribution we were trying to get and it's pretty much what you want anytime you use an art tool you almost never want white noise it's very rare to want a purely random distribution you almost always want a blue noise pattern or plus on disk pattern um and the you will notice the difference right away if you start using it for things it's just better almost across the board but um and here's the part that uh will surprise no one it's harder to produce so poisson disc distributions or blue noise are usually what you want for almost anything that you're doing it's true even for things like sampling like when you're doing a ray tracer you would prefer a blue noise pattern most of the time there's all sorts of stuff that we could talk about that are why you care about what kind of noise that you use but the most important thing to remember for this is just again that natural look and good even coverage are things that we really care about so when it comes to stuff for video games this is something that we really want to do okay so how did the original version that you watched that i played the video for on the screen actually work well let me turn that off for a second and let me talk here about how that algorithm was working so what happens is you probably saw there was sort of like a border region uh that happened uh sort of like up on the uh there was like a pink border right around the region where the grass was going to be planted and what the original version did is it just drew random numbers right it just picked random numbers and it just you know placed things 
down onto the surface right and when it placed these things down it just checked it did exactly the thing i said before it just checked it said okay whatever the the distance is that's supposed to be obeyed here just make sure there isn't anything else in in that sort of radius around the point i'm placing and if it's you know if there isn't i'll just place it so you know it starts placing these points and so on right and as it places the points it eventually gets to some level of time when it starts you know when this gets kind of filled up it starts getting to a point where everywhere that it tries to touch right like everywhere that it tries to place a point is already covered and at that point it can stop but it's kind of a little bit of a of a tough problem for an algorithm like this one because you never really know since you're drawing from random numbers uh you never really know if you actually ran out of places to put you know points or if it just seems like you did because your random numbers are not able to find the places that are still there and to give you an example of that let's suppose we took like you know these two points here right and there is um a place you know maybe this and uh maybe if we imagine one of these here there is a place that you could put another point right there well there's so little space for actually doing that at this point that having the computer randomly guess a number right in that one little place you could put it is incredibly unlikely so there are two things that happen with the algorithm the way it was originally written in the witness and this was written by someone else i had nothing to do with it it's just how it was at the time when you do an algorithm like this you end up with two problems one is that your the time it takes you to find a point right you when you're guessing it keeps going up and up and up because it keeps taking longer and longer to find where to put points because as you get more coverage every point you pick has a higher and higher percentage chance of falling within an occupied region and similarly if you take a look at the result here you're going to have to stop sometime and chances are you stopped when at least one of these wasn't found so it also has another negative uh aspect to it which is that it tends to leave gaps there tend to be splotches in it that it just never found but you totally could have put something there so it's really not a very good way to generate blue noise and when i was first asked to take a look at the grass planting stuff on the witness because i wasn't really on the witness team i just volunteered to help out occasionally on it when they asked me if i could take a look at it that's actually what they originally asked me to look at was can you try to make it so the grass planting algorithm doesn't have this gap problem and so let me walk you through that first because i think it'll give you some insight into the algorithmic side of the blue noise generation and then we can start talking about what the performance issues are on it okay so because right now i haven't really talked about speed much i sort of said okay it takes longer to find things but we're really not talking about speed much at this point right um so give me a second here to to wipe off the uh the board because i'm gonna need some more room uh to talk about that there we go um and uh i think let me see if i have i do so i actually have uh some some graphs here that i made uh when i was looking at this which show how many 
times it had to try to find points as it added little pieces of grass right so across the bottom on this graph you're just looking at the number of points that have been placed so you're going through and those are like number of grass plate you know number of little grass or rocks or whatever it was that you're placing number of those you've successfully placed and then the number the vertical so the actual bar that goes up is how many times you had to check right so how many times do you have to check to see uh if a point was occupied so you can think of it as uh the bottom going across the bottom is like one blade you know one one whatever they are one clump of grass two clumps of grass three four five six going all the way over to the right where you've got like a thousand planted or something right and then the vertical is just how many points that i have to check so you know at first i'm picking a random number any place i pick will probably be okay right so it should take you know one or two max as you start uh but then as you get more and more points placed and that probability goes up and up you end up really really ramping up your time and as you can see it actually clips off the top of this graph if i go show you the actual scaled out graph that's what it actually looks like so you can see how crazy super linear that growth is so you can see that this is an algorithm that spends most of its time at the end right in terms of if you assume these things are uniform cost the entire time through most of the cost comes at the end of the algorithm when it's desperately looking to place those final few points and it just can't seem to do it so that's the basic idea of the original algorithm and let me kind of explain how i went about modifying this now it's important to understand when you're doing optimization work and uh this in this case wasn't really optimization work it was just modification uh of the algorithm to be to be better coverage in in the first step but whenever you're doing work like this obviously context is important because time is limited especially on a project like this so when you're going in to improve something you have to scope that improvement to be achievable you have to say how much is it worth my time to go do this so we know we have to improve this algorithm but there's a question how much do we want to improve it since at this point i wasn't really thinking about optimization particularly much i was really just thinking about how do i fix the distribution i wasn't too concerned about generating the blue noise or the poisson pattern i wasn't too concerned about that up front in terms of like getting the very best generation algorithm that i possibly could what i was more concerned about is what is a fairly straightforward thing i can do not complicated easy to maintain you know a very simple thing anyone could drop in and mess with that will produce these patterns more reliably without those holes in them right so that was my goal and i mentioned this only because blue noise generation is actually i almost want to call an active area of research every so often there are new papers published about better ways to generate blue noise and the reason for that is like i said it's incredibly important pretty much everywhere it's important for sampling it's important for art like this generation tools it's important in a lot of ways and in fact i want to say that if you look even in nature like i said blue noise is everywhere and it's actually even in your eye 
your eye which is optimized for sampling actually has a blue noise pattern of cells in it the receptors that see what's coming in it's actually arranged in a blue noise pattern this makes perfect sense because cells grow and they pack right and they have a distance between them so they are kind of blue noise by nature but also that's why when you look at things fun fact it's why you don't see as many artifacts as you would see if you just had a purely grid arrangement of sensors in your eye your eye has to do a lot less anti-aliasing than it would normally because it happens to have a blue noise pattern on the actual retina um not my area of expertise but that's at least something that i remember mitchell so don mitchell one of the people who did uh some very very formative work in sampling uh i believe starts his famous sampling paper talking about that fact so it's actually kind of built into us the blue noise pattern okay so what can we do first to fix that distribution and the answer in my case was i was like well we don't really have to use the very latest in you know poisson disk kind of algorithms because they can get pretty expensive and i want to say since that time there's actually i want to say there's been at least one paper written that was not so crazy and that you probably could have considered doing in a day but not only do i not think that was out yet but i was aware of several poisson disc uh sort of creation algorithms that make blue noise and i knew that i could do a fairly simple one that would get us what we wanted and it wouldn't be the most optimal in terms of the complexity i mean in terms of the runtime of the algorithm but it would still be very good at keeping the cost uniform and i thought that was probably mostly all we would really need so let me show you what that looks like and this is what i implemented to fix the gap and that sort of long uh expensive end of the algorithm problem okay so what i did was i said well all right when we place a point in the world so you know we have this sort of grass region and we're going to place a point there we know that actually the region we're going to sample we already know that if we wanted to look for places where we might put a point we know right off the bat that they have to be outside this circle right so i put a point down it's got to be outside there because if it's not then i already know that it's going to be rejected so if i just write something that restricts random number generation to a disk right which is pretty easy to do i can just say generate a random point in this disk and test that now if we just keep doing this that one you know for example well you know i kind of drew this disc a little bit wrong so we actually know that it can't be that close so we would have to add that radius right so it's really more like that right so we know it has to be like you know somewhere in this you know this disc here um and i think technically these are called an annulus because it's a disc with a hole in it right so i know that i can just pick points outside here and then when i pick one i know that i can then pick points in the same way in the annulus around that and what i do is i just say well when i step through these when i generate points i just keep generating points around the first point until i can't anymore so i just locally sample that one pretty densely after i'm done i have a new set of points and i just go around those sampling those densely right and it's
just a wave front that expands outward so you just remember because you already you're gonna have to remember what points you placed anyway because that's how you check to see if they're too close so all you need to do is just sweep through those points checking each of their sort of analyst regions checking those for for new places to place and it just kind of grows outward from a point this has the benefit of always checking that region that's just far enough away which means that you never really have that problem of placing two points i mean you can occasionally but much less frequently of placing two points and never checking the region right inside them densely enough to find it because you're only ever checking those regions so if you place points like this and you just let them grow over time you end up in a situation where your diagram instead of looking like this one it changes now i'm going to leave the blue one up there but if you look at uh at what this distribution ends up doing and you know let me jump ahead real quick just to show you so here's what it looks like in red overlaid on top of the blue graph here's what it looks like in red to do this algorithm you spend more time during the expansion so you have a fairly constant test of like you know however many you want 30 40 50 or so you're typically going to test just to make sure you fill things in after the first few you are still rejecting a lot of points because this algorithm is still kind of dumb it doesn't try to not sample around where it already placed points there are poisson algorithms that do and that's why i said you can do more complicated things with by adding more complicated data structures okay so anyway as we go about placing these uh points you can see that we spend a pretty typical amount of time throughout the entire algorithm but it doesn't get worse right because we're always sort of growing just on the outskirts of the thing the time that we sample we tend to always reject the same number of points and so we really don't get into that situation where we get that explosion at the end so we can sample the whole space fairly consistently and we don't have to worry and if i zoom out again to show the sort of zoomed out diagram you can see how just how crazy that blue one is i'm still drawing the red one you just can't see it because the blue one gets so ridiculously expensive at the end that the you know the actual number of samples taken by the red one over time are just gone you can't even see them at that scale now i wanted to show you this diagram for a quick second this is what the actual drawing of these points looks like so on the top left you can see the original algorithm and the blue marks so the the blue uh dots are where like the random points it tried the orange dots are what succeeded and on the right you can see neighborhood sampling lot less blue and again it's not really because it wasn't still rejecting lots of blue points it's just because at the end the old algorithm just went nuts and had to sample like you know thousands and thousands tens of thousands of points all the time just to find one new one and so it massively over samples the space whereas the neighborhood version does not and on the bottom you can see a zoom in and what i've got there is orange the the solid orange in the center is basically where something has been planted and the exclusion region uh around it you can see is where you know nothing can touch that basically and then the points that are placed they would 
have to be be able to fit the entire thing in there so the you can see that there's really no place to put on either those diagrams there's no place in there to put more points because the the the larger sort of ring would not fit if you tried to place it down and if you look at the little blue dots you can see all the places that were tried in both algorithms that were rejected and there's just far fewer of them on the right and that you know right there is also kind of a speed improvement that we got for free just by trying to improve the distribution okay so let me move forward here and just say okay that is just the algorithmic improvement that was done a very basic algorithmic improvement that was done to improve the distribution it also happened to improve the speed of the algorithm because less points were tested and now you can kind of see what we're talking about in terms of the numbers of points that had to be tested and we haven't really talked about what that test entails but now that you can see the number of points that thing i showed on the screen i think was placing like a couple hundred grass things you could see that the the cost per plant in the old one went up to like tens of thousands or whatever right but in the new one you could see that it kind of hovered maybe around you know the 50 area 30 50 maybe up to 70 something like that so if you assume you're placing like you know a thousand of these things and you have to sample 50 points each time that's like you know what 50 000 tests you have to do so i roll that thing out because the only goal for the thing was to improve the distribution and it did that and hey as a bonus it was faster but it still you could watch it right you could still watch it kind of go out so it was a little bit faster maybe it was like twice as fast maybe it got to like you know that thing that took maybe 30 seconds maybe ended up taking something like 15 seconds to to do the thing maybe it even got down to 10. 
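For readers who want to see the two placement strategies in code, here is a minimal sketch of the original rejection-sampling ("dart throwing") approach described above. It is not the actual Witness tool code: it works in 2D only, uses a brute-force distance check instead of a raycast onto the world geometry, and names like PlaceByRejection and MaxFailures are invented for illustration.

```cpp
// Minimal sketch of "throw random darts until you give up" placement.
// Not the actual Witness code: 2D only, no raycast onto world geometry,
// and a brute-force O(n) distance check instead of a spatial grid.
#include <stdlib.h>
#include <vector>

struct v2 { float x, y; };

static float RandomUnilateral(void)   // uniform-ish in [0, 1)
{
    return (float)rand() / ((float)RAND_MAX + 1.0f);
}

static float DistSq(v2 A, v2 B)
{
    float dx = A.x - B.x;
    float dy = A.y - B.y;
    return dx*dx + dy*dy;
}

// Keep picking uniformly random points in a W x H region, accepting a point
// only if it is at least MinDist from everything placed so far.  Stop after
// MaxFailures consecutive rejections, which is exactly why this version gets
// slower and slower near the end and still tends to leave unfilled gaps.
std::vector<v2> PlaceByRejection(float W, float H, float MinDist, int MaxFailures)
{
    std::vector<v2> Placed;
    float MinDistSq = MinDist*MinDist;

    int Failures = 0;
    while (Failures < MaxFailures)
    {
        v2 P = { RandomUnilateral()*W, RandomUnilateral()*H };

        bool TooClose = false;
        for (v2 Q : Placed)
        {
            if (DistSq(P, Q) < MinDistSq) { TooClose = true; break; }
        }

        if (TooClose) { ++Failures; }
        else          { Placed.push_back(P); Failures = 0; }
    }

    return Placed;
}
```

The MaxFailures cutoff is the "give up" condition the lecture describes: as coverage grows, almost every candidate lands inside an occupied radius, so the loop spends most of its time rejecting points and still tends to quit with gaps it never happened to find.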
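And a matching sketch, under the same caveats, of the neighborhood ("annulus") sampling that replaced it: candidates are only ever proposed in the ring between the minimum separation and twice that distance around points that are already placed, and a point is retired from the frontier once its ring looks full. PlaceByNeighborhood and TriesPerPoint are again illustrative names, and a real version would use a spatial grid rather than the linear FarEnough scan.

```cpp
// Minimal sketch of the neighborhood ("annulus") sampling described above.
// Again illustrative only: 2D, no raycast or exclusion regions, no spatial
// grid, and the annulus sample is not perfectly area-uniform.
#include <math.h>
#include <stdlib.h>
#include <vector>

struct v2 { float x, y; };

static float RandomUnilateral(void)
{
    return (float)rand() / ((float)RAND_MAX + 1.0f);
}

static bool FarEnough(const std::vector<v2> &Placed, v2 P, float MinDistSq)
{
    for (v2 Q : Placed)
    {
        float dx = P.x - Q.x;
        float dy = P.y - Q.y;
        if (dx*dx + dy*dy < MinDistSq) return false;
    }
    return true;
}

std::vector<v2> PlaceByNeighborhood(float W, float H, float MinDist, int TriesPerPoint)
{
    std::vector<v2> Placed;
    std::vector<int> Frontier;   // points whose surrounding ring we still sample

    v2 Seed = { 0.5f*W, 0.5f*H };
    Placed.push_back(Seed);
    Frontier.push_back(0);

    float MinDistSq = MinDist*MinDist;

    while (!Frontier.empty())
    {
        v2 Center = Placed[Frontier.back()];

        bool Found = false;
        for (int Try = 0; Try < TriesPerPoint; ++Try)
        {
            // Candidate in the annulus between MinDist and 2*MinDist around Center.
            float Angle  = 6.2831853f * RandomUnilateral();
            float Radius = MinDist * (1.0f + RandomUnilateral());
            v2 P = { Center.x + Radius*cosf(Angle), Center.y + Radius*sinf(Angle) };

            bool InRegion = (P.x >= 0 && P.x < W && P.y >= 0 && P.y < H);
            if (InRegion && FarEnough(Placed, P, MinDistSq))
            {
                Frontier.push_back((int)Placed.size());
                Placed.push_back(P);
                Found = true;
                break;
            }
        }

        // The ring around this point looks full; retire it from the frontier
        // so the wavefront keeps expanding outward from newer points.
        if (!Found) Frontier.pop_back();
    }

    return Placed;
}
```

Because every candidate already respects the inner radius of some existing point, the number of tries per successful placement stays roughly constant instead of exploding at the end, which is the flat red curve in the graphs versus the super-linear blue one.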
it's hard to say because we didn't take timings up at this point because the goal was still not performance but that's roughly what happened so now we get to the second time i'm going to ask you know that these things are sampling around 50 000 points let's say something like that you know that changing the algorithm to not have this horrible long tail meant that it got quite a bit faster down maybe to about 10 seconds so now the question is how fast do you think it should run originally you said five seconds so do you think again before getting all the information just intuitively do you think that now that we're here if we assume we're going to stick with this algorithm and not try to do a fancier algorithm how fast should it go do you still think we can get a 2x improvement from the 10 seconds down to the 5 seconds that you suggested or do you think that we won't be able to get that many gains when we start to look at the actual performance which is what i'll talk about next so let me know what your guess is is it still five seconds or did it change okay so the students answers dropped dramatically we went from five seconds to 50 milliseconds something like that they you know i mean obviously these are ballparks so you know something in the order of 50 milliseconds now the interesting thing about that is if they think 50 milliseconds then that means to me that this wave of students should make extremely performant software when they get out into the industry so like whatever i don't know how long it's going to take the students from this university to get to the point where they are actually making software but but i i want it all to be great so in like 10 years when i'm using like you know when i when i when i use it the average piece of software i actually can be super performant if they think 50 milliseconds because hey that's like pretty interactive right okay so let's take a look at the performance situation for this and we'll see if you stick with 50 milliseconds because again i'm about to give you a bunch more information about what has to happen so assuming we take a look at this here we know that what we're doing is we're placing these points and we know that we are randomly picking some points on you know to see whether or not they can be used and you know i mentioned the fact and maybe this is where the 50 milliseconds estimate comes from i mentioned the fact that you have to check to make sure they're not too close to each other um so we know that you know something like that is going to have to go on but here's where the actual cost of the algorithm comes in so when you look at a scene in the witness right and like you had that scene there you sort of have like some kind of a underlying set of geometry and it's pretty reasonably triangulated and then you know you've got some other things on here maybe there's like a wall or something on there and so on right um and you know you've got uh all kinds of stuff maybe you've got like a little like another piece of hill coming in here that overlaps with it that kind of joins together so you got these like gridded things and so this is a top-down view but when we actually go to plant grass what we have to do is shoot array downward into the world test for intersection with all this geometry find where we actually intersected and then we can go ahead and say all right now is that actually within some distance of other points that we've placed or isn't it and there's some other stuff that i'm glossing over here such as there are 
exclusion regions so you have to do some checks to see if you're in a region that the artist said don't place it in or things like that but for the most part what we're doing is we're going to be looking at the intersection of the sort of rays through the world that intersect the geometry of the world now in terms of how expensive something like that is i can actually show you if you go to wikipedia so nothing particularly fancy here if you go to the wikipedia entry for something like how fast is it to intersect a ray which is what this is with a triangle which is like what the meshes of the world would be built out of this is an example of a typical triangle intersection routine not necessarily the most highly optimized one but it was a very common one that you would have seen 2000s era for intersecting triangles with rays and you can see that the routine is not particularly complicated if you can read c plus plus if you can't i can walk you through it pretty simply it's not a very complicated algorithm effectively all it has to do is take a few cross products and dot products and really all it's doing is it's trying to figure out whether or not you're inside the triangle like sort of semi-barycentrically and so what it's doing is it's saying well all right i'm going to go ahead and take the ray and i'm going to look and see whether or not from the perspective of the direction of the ray it falls inside the triangle from that direction because then i would know that it must hit right and it produces the hit by seeing what the intersection is of that ray with like the plane equation of the triangle it's pretty basic stuff right but you can see that looking at it there's not much math in there and the only times it ever like calls out to something you might not recognize it's just cross product and dot product and if you've never seen those before they look you know like this kind of thing a dot product in 3d is as simple as this it's like three multiplies and two adds a cross product is a little more complicated but it basically looks like this you know a times b minus c times d you know e times f minus it's exactly this but three times right so it's two multiplies and a subtract three times and that's really it like there's not a lot more to it than that so you can see that if you look at that routine you know you're talking about a handful of multiplies and a handful of uh you know adds effectively so if you were to sum it up you know i didn't sum up that particular thing because like i said this was informal so i just grabbed it but if you look at that routine you're not talking about a lot of math ops fundamentally for doing one triangle you're talking about i don't know i mean maybe there's 30 in there maybe or something i don't know how many there are it probably says maybe 30 maybe 40 math ops in there of the type that are floating point multiply or floating point addition and then there's some branching in there which looks a little ugly especially because you know branches have a particular problem if they're unpredictable which we'll talk about in a second but if you look at that routine you're thinking about well okay you know you're going to have between you know 30 and 40 math ops or something like that in there and then you're going to have some control flow logic that has to be evaluated and when you look at the world of the witness i'll go ahead and pull that down so we can go back to the board here when you look at the world of the witness you know you're maybe
talking about one of those regions i don't know it might have up to 10 000 triangles that are actually in play there depending on how wide the region is it could get higher i mean it could go up to something more like 20 000 it's hard to say but overall if you look at how many math ops you're doing versus how many triangles you're doing you know maybe you end up with some number that's something like you know 500 000 math ops or something like that i don't know and i'm purely making these numbers up but you can see that the numbers themselves are not particularly high however this number here if we assumed you know it was you know numbers of this magnitude 500 000 ops or something if you were going to test uh all of the geometry you know all the triangles and that sort of stuff uh in a particular region that you were planting grass in so even if you assumed you already narrowed it down just to the region you're planting grass in you would end up doing this many ops for every time you wanted to check a point so if we're still talking about something on the order of 10 000 points or something like that this number does start to get quite a bit larger you know you can imagine that if you were actually talking about 10 000 points you're talking about something more like adding those zeros to the back and then you've got the zeros that we already had and so something like five billion mathops right uh to test all of those triangles uh in the world and so you know i don't know it's hard to say uh if you were to sit down and look at this and say how many math ops would it really take to test all these triangles you know and you weren't going to try and do anything smart you're literally testing them all all the time you would look and you would say well five billion math ops i guess right maybe is is a ballpark even double it to be safe 10 billion mathops to test all these triangles or something so that number of course is not really very high is the problem no matter how sloppy we want to get with this you know 10 billion it's just not that high if you think nowadays about the kinds of performance that you can get out of modern cpus not even talking about gpus which are more optimized for floating point math than cpus are but just talking about the basic cpu that's in one of these machines it's not uncommon to say look the main cpu of a desktop machine like an artist used you know it might be something like three gigahertz and so when you're looking at performance numbers you can say well three gigahertz that means that the cpu will issue something like three billion clocks every you know second and on every one of those clocks it's going to be doing some work i mean assuming that you're not doing anything stupid it could be doing some work obviously in slow software it isn't but when we think about three gigahertz we're thinking fundamentally about a number that's like three billion and depending on what we want to say our response time is if our response time was one second we're talking about three billion of those units of work right and if we want to know um for any particular frame rate let's say we wanted to go at 30 frames a second well we know that we can just divide this and that's how many clocks we would have inside one frame right how many times the cpu is actually going to click over so if you imagine dividing this by 30 you would still end up with something very large right so we would still expect to see even at a very interactive frame rate like 30 frames a second 30 frames a 
second would be even faster than the guess that the students had of 50 milliseconds or so and they said it should be able to be faster than that i think was what they said so we're talking about 100 million of those clock cycles now 100 million clock cycles doesn't tell us very much about how many math ops we could do so for a purely back of the envelope calculation as we call it if we want to know how many math ops we can do we need to start multiplying this 100 million which is already quite large right it's already up towards the 5 billion we might need to do the naive version of this thing in one second right or sorry to do it in a 30th of a second we have to start multiplying this number because we know that actually the cpu can do more than this so if you look at the way modern cpus are designed they are designed to be single instruction multiple data machines what that means is that when you issue something on a clock cycle it's not just going to do one math op it can do up to eight math ops on that cycle on just one of its math units right so because of that which we'll talk about in a second as well because of that single instruction multiple data a clock is not one math op it actually starts by being multiplied by nowadays maybe eight with avx registers on x64 the kind of machines that we're talking about they can do up to eight at the time of the witness it may have been like four but on the machine i showed the video of it would be like eight so you would expect eight hundred million math ops in one thirtieth of a second but that's just on one of the math units of the actual cpu's math units there are two that can handle things like floating point multiplies so this number would actually get multiplied by two again because on every clock the cpu can actually start two multiplies because there are two things in it that can do those multiplies so that's up to 1.6 billion now in a 30th of a second right and then finally if we really want to go for broke it's got multiple cores so a typical cpu nowadays usually has at least four cores so if i multiply this number by four now well i don't actually want to have to do the math but all right i will what is it something like this this is off the cuff so all this may be wrong but that's 6.4 billion all of this stuff could be wrong i'm just purely doing it off the top of my head as i often do when i'm just wondering how slow something is and why but already we see that if we just do some basic thinking about how fast this thing should be even if i'm assuming ridiculous numbers like we're testing 10 000 triangles on every point for the raycast i already come out with a number that says that if i was using the cores of this machine and the multiply unit maximally i actually match that number at a 30th of a second so why was that thing that we looked at taking so long why was it taking 30 seconds originally and then it only got down to maybe like 10 seconds or something after removing the really long tail part of it so given this hopefully everyone understands kind of where those numbers are coming from.
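(just to put that arithmetic in one place, here's the same back-of-the-envelope estimate as a tiny program; the numbers are the rough ballpark figures used above, not measured specs:)

    // rough back-of-the-envelope throughput estimate using the ballpark numbers above
    // (guesses for illustration, not hardware specs)
    #include <stdio.h>

    int main(void)
    {
        double clocks_per_second = 3.0e9;                       // ~3 GHz desktop cpu
        double clocks_per_frame  = clocks_per_second / 30.0;    // ~100 million clocks at 30 fps
        double lanes             = 8;                           // 8-wide avx floats
        double multiply_ports    = 2;                           // two units that can start a multiply each clock
        double cores             = 4;                           // typical core count

        double mults_per_frame = clocks_per_frame * lanes * multiply_ports * cores;
        printf("rough peak multiplies per 1/30s frame: %.1f billion\n",
               mults_per_frame / 1.0e9);                        // prints roughly 6.4 billion
        return 0;
    }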
and in fact you know what i can probably show you i think i might have grabbed some slides yeah so you can actually see some things here and i'll kind of walk you through what this is so one of the things that you can do and i'll talk about this a little bit later on when we actually look at the performance stuff is there's a great site called uops.info this is like one of my favorite sites of all time uops.info is a site that actually has performance timings for almost every instruction of almost every x64 cpu you might care about and if you go look there this for example is the page that you will get if you look at the performance of the multiply on x64 the floating point multiply on the x64 skylake architecture which is what most machines were or are at the time right when the witness shipped or so that would have been the common machine it's very common now it's getting less common now people are switching to the amd zens or whatever but skylake was the architecture that was intel's architecture for about five years probably longer than five years actually so yes it'll be six years now because they're still shipping so five years i guess coffee lake yeah it's five years let's just say five years if you look at what these numbers tell you you can see the latency on there and you can see the throughput on there you can also see some stuff like number of micro ops and these other sorts of things we can talk about later and the port usage but the important number here is the number for throughput and i know that i talked about latency a lot and the problem is when we talk about whole systems and software we care most about latency usually in servers and stuff you might care more about throughput latency and throughput are obviously related latency is how long it takes you from the start of doing a particular thing to when you get the result of that thing back throughput is like latency but instead of just measuring the total amount of time it takes to get something back it asks how much does each additional thing cost after that and the reason that this matters is because it accounts for overlap so your throughput can be much faster than your latency because if it takes you one minute to do something that's a pretty long latency but if it takes one minute to do it and every second you can start doing another one and stack up as many as you want then it actually only takes one second to do every additional thing right so it can take a very small amount of time to do more of something than it does to just do the one thing because you can overlap the computation cpus nowadays are very highly overlapped and so that's why you see the latency of one of these instructions can be four or eleven clocks as you can see up at the top and we can talk about why there are different ones but the throughput is 0.5 right 0.5 throughput means that you can do two of them every clock it takes 0.5 of a clock to do it so if you start a multiply the minimum latency is four cycles you won't get the multiply result back for four cycles but since you can start two every clock and on the next clock you start another two and they all overlap you actually end up with a throughput of 0.5 because you can always do two every clock sustained actually you can see that the measured loop speed was 0.53 so you actually can get very close to that in practice you can get that many multiplies done so anyway you can see that we can get this information very accurately now.
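(a quick sketch of why the same multiply can look like four cycles or half a cycle; the cycle counts are the skylake numbers quoted above and the loops are purely illustrative, ignoring loop overhead and memory, not a benchmark harness:)

    // dependent chain: each multiply needs the previous result, so the cpu
    // waits out the full ~4 cycle latency every iteration (~4 cycles/multiply)
    float dependent_chain(const float *x, int n)
    {
        float acc = 1.0f;
        for (int i = 0; i < n; ++i) {
            acc = acc * x[i];
        }
        return acc;
    }

    // independent multiplies: nothing waits on anything else, so the cpu can
    // keep two in flight per clock and overlap them (~0.5 cycles/multiply sustained)
    void independent_muls(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; ++i) {
            out[i] = a[i] * b[i];
        }
    }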
we aren't privy to the actual details of intel's architecture in a lot of ways they don't publish everything the same is true of amd you're not going to get a complete schematic you're not going to get like here's a complete simulator of the thing and exactly how it works but you can get pretty good performance numbers so you can do these kinds of calculations and what's more is we can also now talk about well okay this math op stuff that i drew out on the board it assumes that we actually have the data we need to work with but if you've done any kind of work with performance before you know that getting the data into the chip is as important as being able to actually do math on the chip so when we look at those numbers that say hey we can do two multiplies every clock or those sorts of things that assumes that we actually have the data in the chip to do it and one of the big problems with modern chips is that the time it takes to get memory from the main memory of the machine into the cpu to actually perform computations can oftentimes be the thing that actually slows us down we can't do more multiplies if we haven't fetched the numbers we need to multiply so you can also get numbers like these which tell you about the throughput you could expect like how much memory you can actually expect to get from various stages of the memory architecture of the machine cpus obviously have different levels of caches l1 cache l2 cache l3 cache sometimes even l4 cache and then main memory these are buckets of memory of increasingly larger size that are increasingly further from the cpu so the l1 cache is the smallest piece it's what the cpu can work with directly it only takes about four cycles of latency to get something from there and then there's the l2 cache which takes like 12 cycles of latency to get something from there and this is again on skylake and then there's the l3 the l4 and main memory when you get up to main memory you're talking about potentially hundreds of cycles to get stuff from there so you can also see that if we're able to do this kind of ridiculous level of math ops we have to also take into account the fact that we may not be able to actually feed this thing enough stuff to actually run at this rate and so we have to ask the question well if we're talking about something like 10 000 triangles or something like this how much space do those actually take up and will they fit in one of these caches and we know from looking at that diagram that we have 32k of l1 we have 256k usually of l2 and then l3 might be up to 2 megabytes it depends and then usually there's main memory out there which is slow right and so looking at these we would start to think well okay if we have 256k of l2 cache you can see out here we've got numbers that say okay what's the line size what's the peak bandwidth and we may say well let's look at sustained bandwidth because sustained bandwidth might be what we need if we're just running this thing full out and you can see that there you would expect to get around 81 bytes per cycle on the l1 cache and about 29 bytes per cycle out of l2 and the question is well all right if we're only getting 29 bytes per cycle out of the l2 well how many floating point numbers are going to fit in 29 bytes per cycle i mean call that 32 bytes per cycle or something figure that means we're going to take 32 and divide that by four or something like this and we can figure out how many floats per cycle we're going to be getting from one of these caches if you assume optimal.
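(restating that feed-rate arithmetic as a snippet; the bandwidth figures are the rough skylake numbers quoted above, used only for illustration:)

    // rough cache feed-rate estimate, using the ballpark bandwidth numbers above
    double l2_bytes_per_cycle = 29.0;   // sustained l2 bandwidth, call it roughly 32
    double bytes_per_float    = 4.0;    // 32-bit float
    double floats_per_cycle   = l2_bytes_per_cycle / bytes_per_float;   // ~7-8 floats per cycle
    // an 8-wide multiply wants 16 fresh floats per clock if every operand is new,
    // so we can only sustain full rate if a lot of the math reuses data already loaded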
so i don't know let's say 8 times 4 is 32 right so let's say we're getting eight floating point numbers out of the l2 cache every cycle to feed into our multiplier well that's kind of not enough probably because if you think about this if we're saying we're going to do eight wide multiplies we need eight floats and eight floats to multiply together so it depends you know we wouldn't be able to be just strictly streaming from memory now the good news is when you look at that triangle routine if we went back there it's not super important but not all the multiplies load new data some of them load new data some of them don't so we may actually be able to do this because if we can load eight floating point numbers per cycle let's say we may end up getting to the point where well we do multiplies with some things we loaded and then we do some adds to add them together and then we do some checks we may luck out and find that that's plenty because a lot of our work happens with stuff we've already loaded so we know we couldn't stream this directly we couldn't be streaming all of the numbers in because we would hit a cache bandwidth limit probably pretty quickly because these caches are not that large 32k and 256k but at the end of the day assuming we're doing a fair number of math ops with the data we load we probably could actually sustain that out of l2 at least maybe not l1 depending on how many triangles we're talking about okay so hopefully all that makes some loose sense and now what i'd like to do is just say given all of this information that i've given you so you've seen the triangle routine we talked a little bit about the fact that there's caches and math ops we know we've got to do all these triangle intersections there's stuff we can do you know we can structure algorithms there's culling all these things before we talk about the actual performance of this routine two things one is your performance estimate basically the same as it was or has it moved and the second one is do you have any guesses as to why it was so slow compared to what this number would have suggested because i did the back of the envelope calculation and it sort of seemed that at 30 frames a second at least the raw math kind of checks out and there's a bunch of stuff we didn't talk about like there's cleanup that has to happen and other things that have to go on so we're going to have to factor in a bunch of overhead so maybe assume two three four times overhead something like that we would at least expect this thing to run in like one second at this point right so factoring that in do you have any guesses as to where the 10 seconds comes from why was it so much more than our estimate what do you think was causing that problem and do you have any updates to your estimate now okay so the students changed their estimate based on what i've said so far back up to 500 milliseconds which is about half a second still about 10 times slower than they said before but still massively faster than the 10 seconds that it was running at and they said that they think the causes of the slowness in the 10 second version were things like well you know what let me read it directly i want to make sure i don't misquote them bad allocation and misses in lookups so let me go ahead and erase the board
here and let me just say a few quick things so i threw a bunch of stuff out there obviously i just threw out a bunch of stuff here i was like simd ops this and cache that and i don't expect everyone to know these things at the moment right because people might be watching this who haven't had very much experience and they may not know too much about these things we're going to kind of look at these in a little more detail later assuming that we have time but the important part to take away from that was not so much what i did it's that i did it the important part of most of that was that i actually took some time to say gosh how much work would this thing be and what do i expect the chip could probably do given the amount of time that we have you know should it be this slow that sort of thing now there was a number in there that you may not want to bother with which was the core count because if you just want to know why something's slow you can usually assume that it's probably single threaded i mean sometimes it's not which means it will only run on one core which means all of those numbers would have been divided by something like four or potentially eight however many cores you were multiplying it out by but the point is simply that even if you don't know any of the stuff i just said if you're like what was that voodoo that he did on the board sorry i'm gonna try and get this a little cleaner here even if you have no idea what that stuff was hopefully you could see that i wasn't doing anything particularly strenuous right you did not see me doing things like going well and then we take the double integral of blah using the riemann hypothesis or something like this right i wasn't doing math that you didn't already know how to do like in grade school right so hopefully you can see that even if you don't know what any of those numbers were and you've never heard of uops.info and everything else hopefully you would agree that it doesn't look like something that you couldn't learn it doesn't look like something that you're going to have to be some kind of amazing super knowledgeable programmer to do because all i did fundamentally was take a few numbers you can look up and multiply them together to get sort of a ballpark estimate of how many math operations this thing could probably do furthermore sometimes people have already done them for you you can oftentimes look these up this is what the number of flops is you can look up what that is you can see how many floating point operations per second a chip is rated at doing you don't even have to multiply right so that thing i did where i just kind of said all right well maybe we've got this many cycles and maybe we've got this many math ops and so on really that's just a simple basic understanding you can gain about any chip and once you know it you can just kind of use that as a ballpark like okay what am i asking this thing to do if fundamentally i'm asking it to do 10 100 1 000 times more operations than that number we can expect things to take 10 seconds a minute whatever because you are actually asking for that similarly on those cache numbers again it might be confusing but you can kind of see they're rated at particular speeds right they're rated to go at particular speeds and you can go how much data is this thing it's just counting it's just bookkeeping
right so i don't think it's actually particularly difficult to do what i just did it's more the case that developers just don't do it and they're not used to it in class you know in a university setting since that's what you folks are in i would expect by now that for many of the things you do in a class you would stop and do that first you would always have in your mind what is the task we're doing how much computing resources does this thing take of each kind and therefore how fast would the optimal version be and you don't do this because you think you're going to hit optimal you're never going to hit optimal for reasons that aren't even necessarily about time even if you spent all your time working on just this one routine for like a year you wouldn't necessarily hit optimal anyway because optimal may not be actually possible hardware has all kinds of weird intricate limitations that you only discover when you actually try to push something to the limit and so you're never going to get the theoretical maximum of things anyway or at least almost never so you're not doing this because you think you'll ever hit it you're just doing this to go what is the order of magnitude performance i can expect out of this thing should this be 60 frames a second should it be 30 frames a second should it be every half second every one second every 10 seconds every minute every hour and you should be able to determine that from an algorithm if you say here's the algorithm we want to do it on this much data you should be able to get that order of magnitude estimate and i would say nobody who comes out of a university and goes into programming or nobody who's been programming for a few years professionally should be unaware right you should be able to do this thing and currently i would say the opposite is true most people either can't do that they don't know how or they don't do it so there are probably a fair number of people who do know how to do it but they just don't do it that's just not what they do so they have no idea when they're writing code usually they're just like i don't know yeah it just does that that was how fast it ran and if it seems too slow i'll speed it up a bit but i never bothered to see how fast it should run and it's critical to be able to do that which is why i started with that okay so again the important thing is not to be able to do what i just did right now the important thing is to recognize its importance and go that is something i should learn to do and that it's not that hard at the end of the day to learn to do it if you can program because if you can program you know what operations the computer has to do if you know what operations you have to do you can go learn about how those operations happen on the hardware you're talking about and you can get the performance numbers for it usually sometimes it's harder if you don't have any performance metrics but luckily these days you usually do because someone's usually done it before you and put it up on the internet thank you internet okay so now we get to the question of why is software slow i would argue the feedback loop reason why software is slow is that people don't do that estimate i just described so they have no idea so the ballpark the meta reason why software is slow is because nobody knows that it is they've never checked to see how fast it should be so they don't know that it's slow so they don't fix it right and like i showed before your competitors might come in and fix it for you the iphone may come
and kindly inform all of the windows ce vendors that their software was slow because poof they're gone right but if you're wondering how you can be the software company or the programmer who is the iphone coming along to move out the windows ce people as opposed to being the windows ce people who are sitting there for a few years thinking everything's dandy until someone just wipes you off the face of the software universe well back of the envelope computations like that tell you what your competitors can manage because they're telling you what the optimal thing is now sometimes there's algorithmic optimizations that no one may have thought of and those can be much harder to predict because you may be able to make a dramatic improvement in the speed of something by changing the way it's done i'm not even talking about that that requires a certain level of creativity and finesse that is really hard to teach and is partially based on luck and inspiration so it's hard to say everyone should learn to create brilliant new algorithms i mean no like it's really hard but everyone absolutely can learn to do back of the envelope computations and know that their code is extremely slow that's not that hard okay so more specifically why is the code slow in the witness the original code well here is what the code looked like for doing one raycast and this is a small sampling of the code i'm just scrolling through it right now this is only some of the code involved in testing one point so just testing one point this is not a looping video this is all me just scrolling through code involved in that okay so hopefully what we can see from that is that there was a lot of code there now my fear is that you just watched that scroll by and by the way that was only like 10 percent of the code that i scrolled through my fear is that you looked at that and didn't think that was a lot of code because sometimes i get the sense that people don't think that like 10 000 lines of code or 20 000 lines of code is a lot of lines of code but it's a massive number of lines of code right it's a huge huge huge number of lines of code you can write an entire game in that number of lines of code so like that's a lot of code meta reason one level down why code is slow because people write way too much code they do this in two ways number one is they reuse a lot of code reusing code is something that people have been taught to do and there are some reasons why you may want to reuse code but the problem with it is when you reuse code you usually don't think about all the stuff that that code is doing that you didn't actually need to do so in this case that was at play we'll see that in a second the second one is when they write code they tend to focus on goals ancillary to just doing the task i don't know why this is i'm not quite sure how we got to this place in software but what you will see overwhelmingly if you look for software programming advice if you listen to what people say in programming courses overwhelmingly nobody talks about how to actually write code writing code that actually does something is almost never discussed and not only is it never discussed but it's never even acknowledged that knowing how to write code that actually does something is useful instead you will see lecture upon lecture upon presentation upon conference devoted to things that have nothing to do with actually writing code that does anything it'll be like let's talk about how this class hierarchy should be organized this way
because it's cleaner you will literally find entire conferences filled with hour-long lectures by many different people where not a single one of them has anything to do with writing anything that a computer can actually do it will all be about fictional constructs that we as programmers have piled on top of actually having the computer do things and the problem is when people's heads are in that space when the thing they're thinking about is an abstract notion of clean code that people concocted out of thin air that no one's ever even come up with a metric for like there's no number right it's just this weird abstract loose thing where someone's like oh that code is ugly or messy and this code is clean when that's what they're thinking about in their head you end up with the scroll you just saw because you just reuse code if it's there because you've been taught that reusing code is good you just type in things like classes or you use templates that are there to do things for whatever you want to do and at no point during that process were you going hey wait what does the computer actually have to do and why am i not just writing that i know that sounds simple but as far as i can tell it's true people just generally don't sit down and program stuff that the computer does they tend to sit down and program things that have nothing to do with what the computer does and the actual act of doing the thing you wanted to do is like a vague afterthought so let's take a look at this in practice when the old algorithm wanted to do one raycast so remember we are going to be raycasting tens if not hundreds of thousands of times to test points in the world in fact we know that it will be at least order fifty thousand but possibly a hundred thousand possibly more we're going to be doing this operation a hundred thousand times here is an example of me just stepping through some of the code this code here you can see in update planting it's calling place grass get ground hit is for one ground hit that called something called energy manager find by segment which called find segment in array i which called auto array expand which called operator new which called malloc base and the two things you can't see above it because all i did was get the function names are actually in ntdll like the windows heap management so they're in ntdll okay remember this is something that can happen for one raycast here's an example of actually starting to test some triangles so you can see that after get ground hit has done whatever it was doing in that last stack trace here we're doing raycast against entities we have raycast against entities single entity test ray test ray i test ray test ray iterative test ray test face this is a stack trace for one raycast you can see there's a kd tree involved here in the raycast right as well as some things like the nv library which is like some kind of open source nvidia library that's doing those raycasts right so this is an example of a stack trace for one raycast and this is a release build at this point okay so let's talk a little bit about what's going on in this code and maybe i'll turn off the actual code for a second and just talk about what the process was that happened in the old algorithm and when i went to go optimize it i literally sort of did what you just saw me do on the slide which is i stepped through the code in the debugger to see what it does now this is a very good thing to do if you step through code in the debugger not with the f10 key not
with step over but with the f11 key step into that shows you every single thing that it does even in a release build right in a fully optimized build if you hit that f11 key you will see every line of code that has to happen now those back of the envelope computations that i talked about those back of the envelope computations assume that every raycast that we're going to do we basically just do the math ops that were required right and we're not going to do other stuff as soon as we start to do other stuff all of that back of the envelope falls apart so when we start to look and say how much work are we doing to cast one ray against the batch of triangles that we need to test for this part of the world you have to ask yourself if our back of the envelope computation was based on just the math necessary for ray versus triangle if we start seeing that much code ray versus triangle is not even relevant anymore if that much code is involved in one raycast we don't really need to ask about ray versus triangle we've already spent the amount of time we probably had just in one of those stack traces let alone the fact that there were multiple it was going in and out of those stack traces okay so let's talk about what the original routine was doing so the original routine when it said i want to test one of those points what it would do is it would say okay step one i want to gather entities so the world is made up of these entities they're basically things you place like a tree a brick wall they also include things like the ground so the ground is like patches of entities that people put together it's how level designers create levels if you've ever used unity or unreal you know placing entities into the world is how people create game worlds so it's going to gather entities and what this routine did is it said okay usually i think it was a quad tree which you don't need to know what that is but maybe you do i think it was a quad tree but it doesn't matter the point is it went through the world and it said here is a line segment and this was again code reuse at this step there was a function that said for a line segment give me back an array well technically it passed in an array handle and it said fill this array which was like an auto array template thing in a header file somewhere fill this automatic growing array thing with all of the entities that overlap this segment filtered by whether or not they have things that the grass could hit right so what it did was it looped through this quad tree i think it used to be a quad tree i don't really remember because again as you'll see none of this matters for figuring out why this was slow it goes through this and it says okay there's all these entities in the quad tree we narrow it down to roughly where the segment is we start looking to see if their bounding spheres overlap this segment and for those that do we remember them so we end up with an array that maybe has like if this was a b c d e f whatever it has like a and d so we end up here with an array yeah i don't know what that horrible symbol was so we end up with an array that has a and d in it right so we have a and d in an array and that comes back from this gather entities thing we then say okay we want to loop over these entities and see if they have meshes in them that are for colliding right because we're going to have to probe those for colliding and when we did this quadtree gather we had some
flags set it was testing these entities' flags because they have these fields in them to see whether or not these entities could have had collision meshes so we kind of know that they must have some kind of collision meshes but they may have different kinds so the next thing we do is we step through here and we go okay find the meshes so we loop through these things' meshes and for every collision mesh they have we then call raycast on the mesh finally raycast on the mesh is actually a kd tree so there's kd trees for each of these things and the kd tree is basically a spatial partition designed for ray casting that filters the number of triangles you test against so that you only test against ones in an order that makes it likely you'll find the hit before having to iterate through all of them so these are the steps that happened for a raycast so you could say on every raycast we would do this so every time we want to check a point we first gather all the entities by walking the quadtree finding out who overlaps the line segment that is the thing we're testing we put those into an array we loop over the array looking for meshes that are contained within the entities that are in that array we raycast against those which eventually is a kd tree traversal and the kd tree traversal will return us the closest hit and we'll merge those hits together this is every raycast right now why was this happening right like why is it being done this way because before we talk about maybe how it should be done or what's wrong with this we have to ask why does this happen and remember this is not the worst case of development this is actually a game studio where people are used to performance-oriented programming so you can't say that these are people who don't know that there are performance implications to things they do they did know but why does this happen well the reason why this happens in particular again is code reuse because when somebody goes oh i need to make some grass-planting thing for the editor they're just like well the easiest way to do that would just be to call the raycast that's in the game and of course the raycast that's in the game this is what it does what else could it do right so the source of a lot of slowness is exactly what you would expect it's calling something that does what you want it to do but not in the way that you would want it to do it the more code we reuse and these days code reuse is at an all-time high usually the slower our program gets because the more ill-fit the things are for what the actual task was in a case like the witness it's really not much of a cardinal sin it's a small team they didn't have a lot of time until somebody complains you might do code reuse in the editor the player is not going to experience that but at some point when someone complains you've got to start looking at it and that's of course when i started looking at it so this is what i saw okay so probably the last question for the students as we move into the final part of the lecture looking at what this is doing what do you think is wrong what are some things perhaps in order of badness that you think are wrong about this way of doing raycasting okay so i was glad to see that the students kind of guessed or at least one of the things that they said was actually the most fundamental thing to recognize about an algorithm that looks like this.
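(to make that per-raycast flow concrete, here's a rough pseudocode-style sketch of the steps just described; the types and names are made up for illustration and are not the actual witness functions:)

    // rough sketch of the OLD per-raycast flow described above (illustrative names only)
    Hit old_raycast_down(World *world, float x, float y)
    {
        // 1. walk the spatial partition and gather overlapping entities
        //    into a growing array, which allocates memory on every raycast
        AutoArray<Entity *> entities;
        gather_entities_overlapping_segment(world, x, y, &entities);

        // 2. loop over the gathered entities looking for collision meshes
        Hit closest = no_hit();
        for (Entity *e : entities) {
            for (Mesh *mesh : e->collision_meshes) {
                // 3. traverse that mesh's kd-tree and test its triangles
                Hit h = kd_tree_raycast(mesh, x, y);
                closest = closer_of(closest, h);   // 4. merge hits, keep the nearest
            }
        }
        return closest;   // all of this repeats for every single point tested
    }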
there's a lot of things we'll talk about in a second here but the biggest one is just the fact that there's a lot of duplicated work going on here for a number of important reasons so let's take a look at how this works if on every raycast we gather up all the entities that are going to be involved in this particular raycast well the world has 60 000 i don't even remember how many entities there were tens of thousands if not a hundred thousand entities in this world and they are in a tree of some kind so there is a spatial partition that will pull them out so we're not going to have to look at all of them but if on every raycast we're at least considering a subset of them we're doing a tremendous amount of work on every raycast just to narrow the number of entities down but what we know is that we're always doing all of our ray casting in a very small area of the world there's no such thing as grass that's planted across the entire world at once because of course the grass has to be localized and customized for every particular region and so we never have that case so this first part of the routine here is really bad because it means we're doing a tremendous amount of work to gather geometry and we have to do it every single time we raycast so the biggest thing we can do to start to simplify things here is do this ahead of time do this as a pre-pass right so if we do a pre-pass where we say let's just narrow the entities down to the ones we actually care about then we never have to do any kind of walks in a quad tree or anything like that that are complex they may have mispredicted branches in them they may have all sorts of performance penalties in them instead of that we do a pre-pass first and also we don't blow a lot of extra cache we're not touching a bunch of data we don't care about we just narrow it down to the data that we actually want and that logic can continue downward right we could just find the meshes why stop at entities we find the meshes we pull the meshes out and we just do a pre-pass and say here are the only meshes that we care about right and we can go one further and say well when we get down to a kd tree a kd tree is actually kind of terrible for this kind of operation because the kd tree is designed to handle raycasting from arbitrary directions it's expecting the rays to just be completely uncorrelated it's designed for general ray tracing but we know we only ever cast rays in one direction they just go straight down our rays don't go in any direction they just go straight down so a kd tree which is optimized for arbitrary ray traversal is also wasting a ton of our time because we don't need the flexibility that it provides so almost everything almost the entire set of things that was actually involved in the code we saw none of it needed to happen per ray more or less the final part of the kd tree thing where it actually loops over a small chunk of faces and tests is the only part we really needed to do per ray and everything else could actually just be done in a pre-pass this was obvious to me when i just started stepping through it it was very clear that that was going to be the case because this is the kind of code that i started seeing just when i looked at those stack traces and i started looking and stepping through the code and saying where are we i see stuff like this now if you look at this kind of code it just looks like normal c plus plus code and normal c plus plus code or you know modern c plus plus code whatever you want to call it the kind of c plus plus code that
has templates everywhere and stuff like this is always slow just pretty much by design and the reason for that is that templates don't really know much they're a pretty weak programming paradigm they can't really do very much they're not like full metaprogramming they can't do things like analysis they just do what they do an auto array allocates memory it has to because it's a growing array and so what happens when you see code like this where you've just got auto arrays happening and people calling things and then you've got multiple layers of call stacks that much work per ray is completely impossible to make performant if you're calling a memory allocator in the middle of a raycast it's game over right that's going to be an extremely low performance raycast no matter what you do so when you look at stuff like that right off the bat i'm just like okay the problem here before i even have to get into it is just that this is written like a regular i-just-want-it-to-do-the-thing-correctly program that had no concern for how fast it would run that's not entirely true because for some reason when i actually went and looked at the code that gathers the entities it was multi-threaded so somebody thought it needed to be faster but i don't understand why because it's a very slow thing kind of by design so i'm not sure why threading it happened maybe because threading it was easy at that particular time i'm not sure but point being it's clearly code that's not written for performance it's code that's just written to do the thing and then move to the side and so what i want to talk about now is well you know what let's not quite use that slide yet what i want to talk about now is how do we move from something like this to something that's appropriately built for the problem so rather than just calling a raycast that exists in the game that's for doing other things how do we actually build something that's going to be fast and does what we need it to do okay well the first thing at least for me when i looked at that problem is i was like well we really don't need ray casting per se because if you think about what ray casting usually implies it's tracing your way through the world to see what it hits but we don't really need to trace a ray through the world to see what it hits because we know that we're really just picking points in space in a 2d pattern and seeing what the height of the world is underneath them right so if you think about this problem raycasting in 3d is like a whole 3d thing that's why there's a kd tree and entity overlap testing all that stuff was happening it's an arbitrary directional problem but we know that we are always looking at the world top down when we're doing this casting and we know that when we ask for a particular point and we say where is that in the world we know that we're just talking about doing a raycast straight down so we also just know implicit in all of our routines the x and the y like if you thought about the ray the directional x and y are always zero right they're always zero you're just going straight down it's z one or z negative one right so we know we're always casting straight down if we know we're always casting straight down to me that just says this should be organized like a grid because if i know i'm always casting straight down i can just take the region that i'm trying to do my grass planting in put a grid over it and just say look for some
density of grid that's appropriate for how big the grass planting area was or even not we could just say it's always 16 by 16 or something let's just make a series of grid cells we'll run through all of the entities once that are involved in this we'll take out their meshes we'll take the triangles out of the meshes and we'll put every triangle that overlaps a grid square into its own bucket right with all the other triangles that overlap that grid square now we have gone from a giant process that walks the quad tree allocates array space puts things into it loops over those things finds the meshes takes the meshes looks at kd trees walks the kd trees finds the triangles and tests triangles to something that literally in one lookup based solely on the input x and y that we are testing immediately gets just an array of triangles to test so we can literally replace the entire thousands of lines of code that were running before every raycast before we ever tested a triangle like literally thousands of lines of code probably it might even get into the tens of thousands i don't know because you'd have to include the nt memory manager and malloc like all of that is on the table right so thousands and thousands of lines of code that had to execute before we even tested a triangle we can replace all of that with one lookup back comes our pointer we're ready to test our triangles so we can literally eliminate almost a hundred percent of the code that was being used in the old routine we can erase it it's not like we did it more efficiently it's not like we optimized it it's just gone goodbye and so that is just moving code to the appropriate time to do it so rather than putting it inside a loop it's outside the loop so if you think about this from maybe some of you have had complexity theory right imagine that n is the number of raycasts that we're going to do and then we have whatever this entity traversal thing here is which is predicated on the number of entities and the number of meshes but we'll just call it meshes or something and this was very costly by putting the gather step inside the raycast we were effectively multiplying its cost by the number of raycasts by moving it outside of the raycast we are replacing that number with one because we only did it once so we went from something that was not squared because it's not the number of raycasts times itself but it was multiplicative which scales quite poorly because as either of these numbers goes up you are sad we replaced that with something that is linear just in the number of raycasts and that's exactly what we wanted right just from the standpoint of complexity it's pretty easy to see why moving something out of a very high count loop matters and furthermore when we're talking about practical performance we also care which of these numbers is high this is a number that is going to be very high so anything that we put in here times n is really bad right any kind of thing that has to happen in here even if it's just the constant that multiplies is really bad so not only were we putting something in that has a high constant we were putting something in that scales with the number of entities in the region.
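(a rough sketch of that pre-pass plus the per-ray lookup, assuming a fixed grid over the planting region; the types and the cells_overlapping_xy helper are made up for illustration, this is not the shipped code:)

    #include <vector>

    // one-time pre-pass: bucket every triangle by the grid cells it overlaps
    struct GridCell { std::vector<Triangle> triangles; };

    struct Grid {
        float min_x, min_y, cell_size;
        int width, height;
        std::vector<GridCell> cells;    // width * height buckets
    };

    void build_grid(Grid *grid, const std::vector<Triangle> &region_triangles)
    {
        for (const Triangle &tri : region_triangles) {
            // add the triangle to every cell its 2d footprint overlaps
            // (cells_overlapping_xy is an assumed helper that returns cell indexes)
            for (int cell_index : cells_overlapping_xy(grid, tri)) {
                grid->cells[cell_index].triangles.push_back(tri);
            }
        }
    }

    // per raycast: one lookup, then test only the triangles in that bucket
    const GridCell *lookup_cell(const Grid *grid, float x, float y)
    {
        int gx = (int)((x - grid->min_x) / grid->cell_size);
        int gy = (int)((y - grid->min_y) / grid->cell_size);
        return &grid->cells[gy * grid->width + gx];   // no tree walk, no allocation
    }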
it goes up with the number of raycasts all of a sudden for no reason we completely eliminate that it's gone that alone is enough to get us much faster than we were but we can still do better than that because the original algorithm for testing the triangle versus the ray actually looked quite a bit like that general algorithm that i showed before right but again what do we know about the problem this is not a general raycast this is a raycast where we know the ray is pointed directly downward always so after we pull out from our grid here are the triangles that we need to test we can also say well how do we optimize that test to be as fast as possible so let's go ahead and look at that and i'm going to have to wipe off the board here to have some room the one thing i have not solved with light boards is how do you get them erased quickly i have no idea if somebody could make a light board someday where you'd push a button and somehow it would electrostatically discharge all of the stuff on it like these were just little particle kind of things that are stuck to it and then you discharge it and they just fall to the floor or something you just put a tarp under it and off you go that will be the light board innovation maybe some of you there at the university of twente i don't know if i'm really saying that right doesn't matter my pronunciation will be bad no matter what my accent will be bad no matter what maybe you've got some engineering folks there maybe you can design that and then everyone can use it because that's like the only problem with these things is that you spend a tremendous amount of time trying to clean them and in the middle of the lecture you can't even really get it that clean you just kind of have to leave it half clean right so how are we going to do this well now is the part where i talk a little bit more in depth about simd i'm going to avoid talking about multi-threading even though this problem would be very multi-threadable because i never multi-threaded the routine it was so fast after i did this stuff that nobody cared so i never even multi-threaded it it was single threaded so we could talk about that after but i never even bothered multi-threading it so single threaded though what's the most important thing for us to do now well there's two things we want to keep in mind when we're doing a routine like this one the first thing that we want to do is minimize the actual ops right we want to make sure that we're not doing math we don't have to do we've already done the work of making sure that we're just pulling out the triangles we might actually hit so we've done a good job of getting rid of all of that cruft but now we need to know how do we actually test the block of triangles quickly so number one is don't do math we don't have to do and number two is we want to take advantage of width so we want to basically design this thing so that it is going to be really good for simd so what is simd it seems from the answers i've been getting back that you folks know stuff right it sounds like i'm not talking to people who are totally unaware of the stuff i'm talking about so a lot of you probably already know what this is but effectively what happens inside a cpu and this has been true for a long time is that the cpu has to actually do real work everything we want a computer to do
the cpu has to actually do it it's not theoretical and so there's different parts of the process involved in how it does that work one part of the work is figuring out what the work actually is that's like a big part of what a cpu has to do we call this the front end of the cpu and it involves things like taking an instruction and figuring out which instruction it is because it takes work and space you've got to store the code and it takes work and space to figure out what you're supposed to do over time chip makers hardware designers have found that it's more efficient if you can issue one instruction and do work on lots of values at the same time because then you save a bunch of that work right so all the work involved in figuring out what the instruction actually is sending that somewhere having something schedule that instruction actually executing the instruction gathering the things for the instruction all of that stuff you have to do whether the instruction operates on one thing or many things so by having that entire pipeline even into the back end right it's not just about the front end but the back end as well it has to gather operands and such right having all of that operate on more than one value saves a lot of time on the chip because you had to decode the whole instruction anyway you might as well if it decodes to a multiply you might as well get eight multiplies instead of one so that's where we get the idea of single instruction multiple data and what single instruction multiple data says is well normally we would think of having a thing inside the computer when the cpu is talking about a particular register it's going to do like an operation we would think about it as saying well okay it's going to do something where we have a floating point number and the floating point number is like 2.3 right and then we have another floating point number and that number is like 7.9 and we're going to multiply these two things together and we're going to get a result right and i don't want to have to multiply 2.3 and 7.9 together so someone else can do that but we get the result then it comes here right maybe i should have said 2 times 7 so i could have written 14.
right so inside the cpu this would be a floating point multiply it's multiplying one thing by another thing and getting the answer right and so what chip designers say and i don't know anything about chip architecture so far be it from me to say that this is true but what they say is well if we could have just done more of those right if we could have just said let's take eight different numbers totally independent and do these multiplies all together so in one step we get all of them that's way more efficient if you can find a way to do the same set of operations but to many numbers we can go much faster because now we're not spending all this time decoding and preparing instructions because where we used to have to do it eight times to do eight multiplies now we can just do it once and do eight multiplies now this comes in flavors sse avx avx-512 different chips have different ones of these sse is four wide avx is eight wide avx-512 is 16 wide floats but it's all the same principle it's just decode one instruction apply it to many things so we want to be able to design our architecture for this routine architecture is a strong word we want to be able to design the way it works to maximize the potential for using these why because if we do it one float at a time we are like one eighth the speed not half as fast not four times slower but eight times slower we're one-eighth the speed if we just do it one float at a time so being able to go wide gets us this 8x speed up if you have an avx-512 chip you don't really get 16x because the clocks are different that's a long story if avx-512 becomes actually performant at some point then you would get 16x on a gpu you do often get the full 16 they have full speed 16 wide but their clock rates are different blah blah blah none of that matters the point is on a normal machine today eight wide is not that hard eight wide is a thing that many cpus can now do okay so in this case we happen to luck out we know we want to do things by eight and it can be kind of challenging sometimes to do that for some routines but in our case we're just bulk processing triangles that's all we're doing when we bulk process triangles all we have to do is make sure that we take eight at a time so we'll just grab eight at a time do the test on them all at once and then we don't have to even really think that hard to parallelize our routine to eight wide this is much harder when you deal with things that aren't quite so clearly bulk so the first step is we're going to just say process eight at a time problem solved for simd second step what do we actually solve so remember we said we want to minimize the number of math ops well what are the things we actually need we only need to compute two things we need to compute did we hit the triangle and what was the z of it that's what we're doing right we're casting a ray down into some kind of mesh we're going to say oh we hit a triangle and here's how high it was right it's the z value why do we need the z value well we already know the x and y so we want to know what point it hit we know the x and y because whatever the x and y of the ray is it can only go straight down so we know the x and y it's the starting point it's what we got so the only thing we need to compute is did we hit anything and if we did where right you could even bake this down to just one thing if you wanted to.
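(going back to the eight-wide idea for a moment, here's a tiny sketch of what that looks like with avx intrinsics; this is a generic illustration, not the witness routine:)

    #include <immintrin.h>

    // one scalar multiply: one instruction, one result
    float mul_one(float a, float b)
    {
        return a * b;
    }

    // one avx multiply: one instruction, eight results
    void mul_eight(const float *a, const float *b, float *out)
    {
        __m256 va = _mm256_loadu_ps(a);        // load 8 floats
        __m256 vb = _mm256_loadu_ps(b);        // load 8 floats
        __m256 vr = _mm256_mul_ps(va, vb);     // 8 multiplies in one operation
        _mm256_storeu_ps(out, vr);             // store 8 results
    }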
now the actual routine is slightly more complicated but only slightly in that we also want to get an index back because we actually need to look up some properties of this triangle such as what material it was and things like that because these things can affect the grass planting in the full version of the algorithm we don't really need to talk about that much but it's not quite as simple it's almost as simple but not quite so we need the index and the z and you can think of those two as already telling you whether we hit because i could just pre-load the index with some weird value like 0xffffffff it's all ones so if you see that you know it wasn't a hit right so we don't really need a separate hit flag we can overload these by sticking bogus values in them and if you don't get clean values out you know that there wasn't a hit so i really don't need that once i'm saying i'm returning these other two things but i need these two the index and the z so that's what we need to compute how do we compute the index and the z value well if we think about this we're already in a pre-pass we're already running through these triangles and we're putting them into these buckets for testing based on the grid so when i do that pre-pass why not also transform the triangle into something more convenient than a triangle right a triangle is like three points testing for a ray intersection against that is actually somewhat difficult you saw the routine before but i know a couple things one i actually don't care about the z of the triangle at all i just need to be able to reconstruct where i hit it and for testing it it's purely top down right so if i look at this triangle i can say well what if i just pre-compute the vectors that i would dot product against and this probably gets too technical for some people out there so i apologize i'm not going to go into detail on how that works since this is already very long but why don't i just pre-compute two basis vectors such that if i dot product with them i will get the u and v of the barycentric coordinates of this triangle right so i want to be able to get a u and v coordinate these are very simple things to think about u and v coordinates just say look if i walk out this far along this vector and then this far along this vector here right i get a point in the triangle furthermore if i go all the way to here that's u equals one if i go over here that's v equals one right and anywhere in between here and here u plus v would equal one.
barycentric coordinates if this was going to be an 8 hour lecture i'd explain them i can pre-compute exactly the transform in just a few variables exactly the transform that transforms from an xy point right in our ray that we're sampling directly to uv coordinates using just a couple multiply-adds and then i can just test to see whether or not the uvs are inside the triangle and i'm done furthermore to compute the z all i have to do is also pre-transform to say if i know the u and v which i know i'm going to solve for to find if i'm in there just how do i multiply out the u and v to get me back my z here's what that looks like so if you look there __m256 is just the eight wide float so the reason for that you can think of that as just saying float but we're storing eight of these at a time and all i'm storing there is one two three four five six seven eight nine floats the top ones x y z those are just the base vertex of the triangle right so you can probably still see this underneath it's that vertex right the ux uy vx vy are just two vectors you could think of them as two different vectors that you dot product against to get your uvs out and then finally the zu zv is just what you use to convert a uv coordinate which we know we're going to get when we test into the z value at that point in the triangle that's how we pull out our z hit that's it all we have to do is store that it's about 4.5 cache lines worth of data right 4.5 cache lines to store eight triangles and we can load these test against them as eight in one go and we get out what we need so let's take a look at that routine here's what the routine looks like now remember this routine here which is maybe a little bit hard to read if you're not used to it because it's actually written with the intrinsics each one of those intrinsic calls is just one operation on the cpu so _mm256_sub_ps just means subtract these two things eight wide so if you look there there's only a handful of things 15 instructions i don't even know 20 instructions probably 20 instructions or so to test eight triangles right so we went from that huge monstrous piece of code that was doing all these things per raycast to just on every raycast we do one index into an array to pull out a pointer and then we loop over that pointer running this and that's it now if we take a look at the actual assembly language it fits on a slide that's all we need to do that's it right there and i've got notated on the side like what each step does i just threw those in there last night so that people could read it if they wanted to at their leisure but if you look at it it's like trivial there's so few things it's like those mnemonics on the side may be foreign to you but like vsubps it just means subtract two things vmulps means multiply two things so it's like two subtracts a few multiplies a few adds then we just do a few comparisons and then we you know leave we repeat the loop right that's it now if we actually look at what happens here um it's kind of like probably a little bit too far down the rabbit hole to talk about what this schedule is like so i'm going to skip that but just pulling back out a level of detail remember all that code i scrolled through that happened on every raycast here is that code in the new system it computes two grid indexes to see where in the grid it looks up and then it calls that triangle routine you just saw to loop over the triangles all that code i scrolled through calling malloc using auto arrays all that stuff it's just it boils
down to that now okay so this versus this right and that code there was like thousands of lines of function call stuff down the rabbit hole this one has no function calls it just computes the grid right i inlined it i think that grid call actually is a inline function normally but i cut and pasted it in there for you so you could see it written out all the instructions so this had to do all sorts of stuff to get to the actual triangle testing thousands and thousands of lines of code and function calls and everything else to get to the triangle testing every raycast this has to do one simple calculation and a lookup okay so here's the actual routine running and what you can see here is that you can just actually drag it around your guess of 500 milliseconds is actually not bad um it's extremely simple to move now you can you can almost get it it's not quite 30 frames a second but then again the editor is kind of slow at updating the meshes so actually the bottleneck here is not that triangle stuff anymore if that makes sense i can run that for you one more time so you can see the final speed uh or maybe i can't i don't really know i've never tried to do a slide presentation this way so i don't actually know it looks like it can't wrap around or something so again going from something that takes 10 seconds more really on that machine to something that you can just kind of move around and it plants pretty much instantly as far as you're concerned and again once we have this we could have gone way faster right i could have then multi-threaded this i could have cleaned up the gather code because the gather code which is taking a ton of the time is actually just the same as the old gather i just only ran it the one time and then i processed the triangles so that really didn't get any faster but you can see that the difference is staggering so rather than something that you sit there and watch and you wait it's actually something that's just instant now um i'd be happy to take questions we can obviously go into more detail on any of this but that was already a ton of information but i just want to close the non-q a part now by just saying in general software on today's hardware can pretty much always be very fast the hardware is kind of amazing um and even the networks are amazing so there really isn't ever any reason anymore for things to be slow the reason they're slow is because we don't spend time making them fast sometimes there's a very good reason for that i feel like on the case of the witness there was a very good reason that routine was slow and that's just because the person who was making it the first time was just trying to get it done so they get back to what they were doing and it was for internal use right the same people who made that routine probably spent a lot more time optimizing stuff that was actually going to run in the game because they knew that had to be fast in general though that step that we then took which is somebody came after the fact because it was too slow figured out why it was slow wrote a faster version and put it in place so that people could actually get good results fast results from the software they were using that part pretty much never happens nowadays it seems to me that most of the time people only do the first part and nobody comes along to do the second part there may have been a good reason why something was slow the first time you did it one of the reasons is you had to do it to the side like that where you didn't have time to actually go focus 
on it another reason might be that you're exploring possibilities you don't know that this is something you're actually going to ship or use so you just hack something together to see whether it does the things you want it to do and that's all fine what's not fine is that increasingly we are never taking that second step no one else on the team or the or the person who originally wrote it is never going back and going we need to make this actually be good now because people are actually using it instead they're just letting everyone waste all their time they're burning up you know i don't even want to know how much heat is being generated right now because nobody does these extra steps so they're just wasting tons of data center time tons of money they're wasting so many user hours on these things and at some point i feel like we got to start focusing on performance and to me it seems like the incentives are there everything that i've seen from computer history suggests that if you can identify an area where people are shipping slow software that's unresponsive and annoying to use which is almost all software currently in existence if you come out with something that is reliable and fast that has that low latency really nice interactivity people switch they switch it's an avenue of attack that competitors can exploit so i think as we see more and more competitions start to build up because remember when you start something like the internet or whatever there's an initial rush of people just doing things because no one's ever been able to do them before right no one was ever able to make you know twitter or something because the world wasn't all connected by a network so yeah the first things of those there's not much competition most of them are lousy right most of them are slow most of them are buggy whatever but they'll still succeed as that ratchet tightens and there's not so much low hanging fruit anymore the benefits for competition on performance start to go up because what else are you gonna do you can't come along and say check out this new messaging service that i made it's exactly the same as the old one in just a slow who cares right you need competitive advantages and i think what we've seen is that when someone enters a market and demonstrates an awesome version of something that used to be crappy people notice it's not actually the case that people don't care they do it's just right now they don't have any options so of course they use something like twitter with all of its problems and it being super slow and unreliable and all these other things because what else they gonna use but i don't think that means performance doesn't matter i think it means performance doesn't matter when there isn't anyone else doing performance i think once you have a competitor enter with performance there's a lot of evidence that suggests that that is a powerful market position so i would say i don't think this is just for show i don't think it's just for fun i don't think it's just for being honorable i don't think it's just about doing a good job i think there actually are tangible reasons why as software developers people should start to care for real how fast their code runs but we'll see maybe i'm wrong about that but it certainly seems like that would be true i'm happy to take questions now if people are still sticking around i will check the discord and i'll also pop in on the twitch chat but thanks for watching okay so i've got some questions from folks both students and the twitch chat 
and i'll try to answer these uh as best i can so the first question was actually asked during the lecture and was queued up and that was asking effectively what about fractals so going all the way back to the original when we're talking about blue noise patterns for distributing plants throughout the world or grass or rocks uh how about fractals do we consider using fractals are they good is there a problem with fractals like you know etcetera etcetera and what i would say about that is twofold one no nobody really considered using fractals for this as far as i know i i could be wrong about that but i don't think so two is yes i do think there's a reason why you want blue noise instead of fractals fractals are like self-similar and they tend to have a certain regularity about them you can do things to fractals to make them appear more random but they generally don't look like those kinds of distributions that you see in forests like they don't look like a blue noise distribution if there is a fractal that produces something close to a blue noise distribution that might be a very interesting idea i just haven't seen that fractal so when you're talking about using fractals to simulate things in nature at least all of the fractals i've seen they're never talking about this right this particular thing where we want things randomly placed but so that they have a specific factor there of not encroaching on each other's sort of uh space i haven't seen fractals that reproduce something of the quality of blue noise if you know of one i would love to hear about it so do please do uh send me a note because i don't follow fractals particularly closely and i you know this is not my job every day right like i don't go to work and think about how to do nature placement algorithms this was just me optimizing a routine that was used in the witness editor so i could easily have missed some of the state of the art on fractals since a stop it's not something i normally do okay uh so another question was how did the original version know when to terminate so that original version that randomly picked points how did it know when to terminate and this is that you know one of the problems with that way of doing blue noise is you don't you don't know how to terminate so it's usually ad hoc usually it's like if i test this many points and i can't find anything i'll stop that's sort of similar to the way you would do it in the other algorithm only it's very unreliable because you're trying to guess something totally random whereas the other one that's a wavefront kind of approach that that was the one i replaced it with you can tune that number to a very specific number that will always work whereas the the one that's about guessing how many times you should stop on the last iteration is kind of arbitrary and hard to test um so it's not great you can do that the other thing you can do is you can say well i'm only going to place x number of these things and the user places that number in and at that point the user is kind of responsible for finding the right number that which is not great you usually don't want that but it again it doesn't have a very good termination criteria whereas the other one does which is nice um so now moving to some other ones uh two questions uh so the first one is do i think ides cause some of the bad habits i was talking about um and two things one i would say i don't know that this is necessarily i wouldn't necessarily use the term bad habits i would use the term lack of good habits and 
i know that's a pretty odd distinction to make but reusing code or cutting and pasting things or calling a library they're not bad habits per se because it's not like you don't want to reuse things sometimes right i mean even take for example something like a math library um it would be crazy if you hand wrote in simd all of your routines all the time because most routines don't need to be that fast so you're probably going to want a dot product and a cross product and you're probably going to use them i certainly do and they don't really cause much of a problem so it's not so much bad habits as it is lack of the habit of actually looking at the things that you do and go is this slower than it should be and when it is rectifying it that it's more like getting a good habit and it's not about bad habits because most of the habits that cause this are not in and of themselves bad it's the fact that you're never checking them right using a library isn't bad if the performance you get out of the library is roughly in line with what the machine is capable of sometimes a library will do like you know let's say you have a library for loading and decompressing a png well that's like a one shot thing i just say decompress this png it might be able to be as fast as it can possibly be right it doesn't need to really be specialized for what i'm doing i just want this rgb image out it's roughly right there's not much modification i really make to it that library may just be peak performance someone may have already optimized it and it's really good right so it's not a bad habit to reuse code it's a lack of a good habit it's a lack of saying when you reuse code or when you write code to do something are you actually checking to see whether it's like ten a hundred a thousand times slower sometimes than it should be that check is the good habit you need so to the extent that there's a bad habit the bad habit is just not checking but i would call that lack of a good habit as for whether i think ides are implicated there not really an ide is fine um there's nothing particular about an ide that that discourages you from doing back of the envelope calculations and to be honest uh godbolt for example if you ever seen godbolt it's not an ide but it's sort of like an ide right it's actually very good for learning about performance integrating things in can reduce the amount of time it takes you to ask questions about the performance of your code so to the extent that ides are a problem it would only be that they don't have as many features as you would like for performance but that's not inherent in the fact that it's an ide so i don't think ides are really implicated at all here other than to say there's things they could be doing to help that they're not doing and godbolt is a good uh example godbolt is an example of of having an integrated system that really helps with performance investigation so you know i think there's things we can do with ides that would help far from being a hindrance i think they could actually be a help just right now they're not really except for godbolt um so uh another question why simd first instead of multi-threading um so the answer to that question i can give two separate answers the first answer is that you pretty much always do simd first because you have to do the simd before you do the multi-threading anyway because the multi-threading is just going to run the simd code right so i usually simd something first because you're going to want to run if you're doing performance you're
going to want to do both and you're going to want to simd optimize the thing and figure out what actually has to happen because multi-threading something assumes you've already figured out what the algorithm is and since the algorithm is going to be dependent on the simd at least the base part the inner inner loop kind of has to be known before i start thinking about how i'm going to multi-thread it so usually the answer is that but in this particular case it's pretty simple the core count at that time was pretty low two or four it wasn't 16 or 32 right so you know you can get an 8x speed win or 4x at least but an 8x speed win for going wide you ain't gonna get an 8x speed win for multi-threading so kind of right off the bat you know you get more from simd than multi-threading nowadays if everyone on the team had a threadripper on their desk and it's 64 wide well that's not true anymore 64 thread or 64 core 128 thread um sure that number is a lot more compelling now so you're going to be thinking heavily about the multi-threading as well but even so i'd probably do simd first because i have to know how that core throughput works before i can design the algorithm that actually does the work distribution right so just something to think about there um now finally this one's a little bit more in depth i'll go check for more questions in a second here so so this doesn't have to be the end but the last question that i saw when i was grabbing stuff off camera there is how do you check for the uv in a triangle test case they were just curious about that so let me just say really quickly how that works it's not that hard so if you think about barycentric coordinates uh let me just kind of kind of show since since it was asked and we're now just in overtime so we can do whatever we want we don't have to be careful or we don't have to be um we can be the opposite of what we're talking about we can be inefficient and and sprawling right um so in terms of what we uh what we're talking about with barycentric coordinates i just kind of threw that word out but i didn't really talk about it inside the the meat of this presentation um but i can kind of talk about it now barycentric coordinates are actually very simple to understand they just take a little while to get used to using because they're not probably if you haven't my recollection from my programming career anyway is when you first start thinking about what i call multi-way blends n-way blending i usually think in my head i think n-way blending um when you first start thinking about these things it's very unintuitive and then it becomes intuitive later now part of that was because it was not explained well back in my day um nowadays like you've got explanations out there i mean i've put explanations out there that i think are much clearer and so it's easier to kind of learn than than it was um so it's it's unintuitive at first if someone explains it well though you could probably get it faster but it does take a little while of working with to really understand what's going on but it's a pretty simple concept at its base um if you think about 3d graphics at all or even just 2d graphics um or even not even think about graphics just math we're very used to the concept of basis vectors whether anyone's ever even mentioned the word basis vector to us or not and the reason for that is we're always sitting there um when we're doing math even in grade school even when we're little um we're always talking about things like
x and y right and when we talk about a point you know it it may seem second nature to you now but there was no reason why you had to say that the way we specify where a point is is that you walk out along the x-axis a certain distance and then you walk up along the y axis a certain distance and you arrive at the point hence the x y notation for 2d points because that's what we're doing right we're saying go this far along x go that far along y x and y are the basis vectors for the standard cartesian plane kind of stuff that you see in math class and you may never have thought about it that way but when they first taught it to you that's what they taught you to do right go out whatever the x coordinate is go out this far whatever the y coordinate is go out that far plot the point so you always thought of it as basis vectors they're vectors you go along to get places and at no time are people using like polar coordinates and stuff like this usually nobody's talking about well go start pointing in this direction and lift that far up and then go that far right we all are now just mostly talking whenever we talk about graphics we talk about two basis vectors x and y three basis vectors x y z and how far you go along them so it's it's what you intuitively do you don't intuitively think usually about like polar coordinates or any of the many other quantities we could use so the idea behind the idea behind barycentric coordinates especially for triangles is that we're just going to reproduce that same thing but just for the triangle so instead of thinking of this as an arbitrary space that we're just using to describe points what we're going to do is take a very specific two vectors u and v which are the same as x and y you could call them x and y if you wanted but we usually reserve x and y for talking about like the world right the main space but we're going to take two vectors u and v and we're just going to say that our triangle is this beautiful right triangle right and the u this point is one that is one right along the axis so our triangle literally is this perfect nice right triangle centered at the origin well not centered based at the origin right angle at the origin its v coordinate at the highest point is one so this is this point right here would be zero one right 0u 1v this point the other point on the triangle is 1 0 right 1u 0v barycentric coordinates just say it would be really great if we could always just talk about it this way because how simple would all of our math be and so for example in this case i want to know if a point is inside the triangle well first of all i can trivially test to make sure that that both values are greater or equal to zero right so i can say if u is greater than or equal to zero and v is greater than or equal to zero i cut out all of the space right outside of the triangle that way and then the only thing i have to do is get rid of the triangle get rid of this part right i've cut out everything on this side i've cut out everything on that side just with two tests and all i need to do is find out if it's past there that turned out to be extremely easy as well why because we know that the line between these two right smoothly goes from 0 1 to 1 0.
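The three comparisons being built up here, plus the closest-hit compare and the mask/blend select that comes up a little further down, are essentially the whole eight-wide inner loop; here is a hedged sketch of what that could look like with AVX intrinsics (the names, data layout, and the direction of the depth compare are my assumptions, not the routine from the slides):

```c
#include <immintrin.h>

// Hypothetical eight-wide triangle group: the same nine values as the scalar
// sketch earlier, but stored with one lane per triangle.
typedef struct
{
    __m256 BaseX, BaseY, BaseZ;
    __m256 Ux, Uy, Vx, Vy;
    __m256 Zu, Zv;
} triangle8;

// Branchless test of one downward ray against eight triangles, keeping the
// best hit per lane. BestZ starts very low and BestIndex at a sentinel so a
// miss is detectable afterwards (indices kept as floats just to keep this short).
static void TestEight(const triangle8 *G, float RayX, float RayY,
                      __m256 Index, __m256 *BestZ, __m256 *BestIndex)
{
    __m256 Zero = _mm256_set1_ps(0.0f);
    __m256 One = _mm256_set1_ps(1.0f);

    __m256 dX = _mm256_sub_ps(_mm256_set1_ps(RayX), G->BaseX);
    __m256 dY = _mm256_sub_ps(_mm256_set1_ps(RayY), G->BaseY);

    // (u, v) and the candidate z for all eight triangles at once
    __m256 u = _mm256_add_ps(_mm256_mul_ps(dX, G->Ux), _mm256_mul_ps(dY, G->Uy));
    __m256 v = _mm256_add_ps(_mm256_mul_ps(dX, G->Vx), _mm256_mul_ps(dY, G->Vy));
    __m256 z = _mm256_add_ps(G->BaseZ, _mm256_add_ps(_mm256_mul_ps(u, G->Zu),
                                                     _mm256_mul_ps(v, G->Zv)));

    // the three inside-the-triangle checks plus "better than the best so far",
    // all producing masks instead of branches (this assumes the ray comes from
    // above, so a larger z is the closer hit -- flip the compare otherwise)
    __m256 Mask = _mm256_and_ps(
        _mm256_and_ps(_mm256_cmp_ps(u, Zero, _CMP_GE_OQ),
                      _mm256_cmp_ps(v, Zero, _CMP_GE_OQ)),
        _mm256_and_ps(_mm256_cmp_ps(_mm256_add_ps(u, v), One, _CMP_LE_OQ),
                      _mm256_cmp_ps(z, *BestZ, _CMP_GT_OQ)));

    // blendv does the (mask & new) | (~mask & old) select in one instruction
    *BestZ = _mm256_blendv_ps(*BestZ, z, Mask);
    *BestIndex = _mm256_blendv_ps(*BestIndex, Index, Mask);
}
```

After all the buckets of eight have gone through, a small tail step collapses the eight lanes down to the single best hit, exactly as described below.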
so as i move between them right and what would this be right here that would be 0.5 0.5 right it's a smooth line on a right triangle perfect 45 degrees i know that everywhere on here my u plus my v is going to be equal to one right because i know that i went all the way to one to get here and if i start walking backwards i've got to go all the way to one to get here so i trade off my two values right as i go between them and any point that i find on here it's always going to have u plus v equal to 1. so if i just demand that it's less than or equal to 1 right then this is sufficient for saying i'm inside the triangle three tests that's it so it's really really convenient if you knew you were in this form the problem is you're not in that form pretty much ever right all the time you're given things in the actual world the actual triangles are not these perfect triangles set up uh at the origin in exactly this form but like i said we have pre-processing so what i did for this routine is i just said look let me just pre-compute the transform that maps each individual triangle in the world right i've got some arbitrary triangle you know let's just compute the thing that maps it so that you know the base is here this is here and this is here and i can always do that right i can always come up with a spatial warp that puts the world of the actual triangle into this perfect right triangle it's just a straight affine transform right and that's all my routine was doing it affine transforms not the triangle because we already know that you know we took the triangle we have the affine map it just takes the point so if the point that we're testing is here it just maps the point with the affine transform right and it gets like there and so now we know the uv coordinates they're just whatever the result of the transform is so thinking about it this way is kind of handy this is the kind of thing you do when you are heavy on pre-pass these triangles aren't moving right because we we've fixed them all we're going to do tons of raycasts against the same geometry they may have been moving in the real world but we're fixing them right now because we're taking a snapshot in time and planting the grass on it right so we're gathering all the triangles we can compute these we're doing a pre-pass we just compute the transform even though that's kind of expensive we compute the transform once and then on every raycast it's just a simple mapping and the mapping it's a regular transform it's like a matrix this 2d matrix transform right um so it's literally just you know a subtract of this base vertex here and then two dot products to warp the space into a right triangle and then this is the exact question you asked is what's the check it's that check now if you want to see what that looks like um i can grab that what well or maybe i can't hold on a second yeah so here is that actual routine and what you can see uh if i if i kind of try and point directly to it do you see these cmps here there's like a vcmp there's there's a couple of vcmps one two three four right those four vcmps are what's doing the check three of them are these although they're phrased differently i like to do things in simd with the same kind of check because if you want to port to multiple platforms you want to just be able to write the code for the the mnemonic for that check and not have to support them all so i believe the way i check it um in fact i think i wrote it on there how it was checked right it's less than they're all less than or
equal to yeah so we flip these right and we do like zero less than or equal to u or like zero less than or equal to v or whatever right um so you you might flip them around to get them all going the same way but three of the tests are that are this and then there's one more test you may be like what's that test for that test is there because remember we're planting the grass by shooting a ray downward we want the closest point that we hit so the fourth test is are you closer than the closest point we've already found so we do four tests three are are we inside the triangle and one is are you closer than the previously closest hit and you can see those in the assembly it's just four comparison operators now we didn't really talk about this but in simd one of the other things you'll notice about this routine actually if i go back it's branchless there's no if statements in this routine so how does that work well the answer is in simd if i know that i'm going to batch process a bunch of triangles and furthermore in this case i not only know that i'm about to batch across a bunch of triangles i know i can't early out i got to test them all because there always could be i've already culled down to a small set that's right around me and i have no idea what order they go in right i'm not gonna jump around i'm just gonna blow right through them as fast as possible to avoid doing any kind of weird unpredictable patterns or anything like that so i'm just going to blow through like 10 20 30 triangles as fast as i can um maybe maybe a few more i don't know 100 200 whatever the optimal size ends up being you'd shrink your grid right till you got to a good size but if you look here it doesn't do any branches so how does it get the answer and what happens is you just do it eight wide so you test eight you test the ray against eight triangles at a time and you just leave the closest result in a a sort of state register you've got one register that just sits there and it keeps the closest hits eight closest hits it's the eight lanes of the computation you still just have the best hit in the eight lanes of the thing right so we're actually not testing for the closest hit we're testing for eight closest hits in what amount to eight separate test phases because we're getting buckets of eight in we're testing them and we don't want to have to collapse we want to just leave it that way so we just keep overwriting using masking test and mask right we just keep using test and mask to gather in the eight lanes the eight best hits for those lanes you know one for each lane and then only when we're done testing all the triangles do we then collapse that so we have a little thing it's not on this routine there's a little thing at the end of this routine that just looks at which is the best one of the eight and picks that one um so that's that's how that part of the routine works and i know that's a little bit hand wavy but you can imagine i can i can make it a little more concrete for you if that's confusing let's suppose um i have two floating point values right so um i've got you know uh i've got a equals 5.3 and b equals 7.9 so just in a regular programming language i want to know which of these numbers is less right so if i were to write it out in like if form so like i was going to actually do a predicate um i would say something like you know if a is less than b um then you know whatever i want the result to be result equals a else result equals b this is how you would normally
you know do a test so this doesn't work if you're doing things eight at a time because this answer there's eight answers if i have eight floating point numbers i compared to eight other floating point numbers there's eight answers to whether it was less or not so the way these sorts of things work is when you ask for the test it actually just produces a mask so when you do that test you get a mask back with ones everywhere the test passed and zeros everywhere it didn't so this will just be like all ones this will be all zeros not in floating point in binary these are the bits right so then what you can do is you can change anything that you were writing this way so instead of ever writing a less than b like that right instead of writing that way you just write the result and we assume we've got this mask right the result equals the mask remember it's ones where it was true right so we want a anywhere it was true so the mask anded with a produces only a's values where the mask was one we can or that with not the mask and b that produces b's values only where the mask was false where it was zeros and this branchlessly right there's no if in here just using binary combines the values properly and this even works with just one value right if we imagined instead we've computed a mask of all ones you know if if a was less than b this would work but so everything looks like this and that's why if you look at the routine you see all those ands and an or right that i assume are in there let me look the vcmp the blend oh so there's no or because it uses a vblend so this construction this construction here is actually a built-in operator um a built-in opcode on x64 it's called blendv and it literally does this for you in in one instruction so it's usually faster than doing it manually depending on what you're doing okay hopefully that explains how that that works all right i think we've handled everything sorry but i forgot that the slide was up there it's hard because i can't see very easily i i this first time i've done this so i don't actually have a good way i i need to like figure out some way to place something i can actually see so i can see what what's up there for slides because i don't normally do that um normally i just do lightboard stuff on the lightboard i don't have slides so i don't have to worry about it but it would be nice if i could see it the problem is if you put something in front of the light board it'll reflect so you kind of have to figure out a way to put it somewhere where it won't reflect it's it's dicey all right anyway that's it for questions so thanks everyone for joining me and uh yeah hopefully there was at least a little bit of a glimpse into like why software tends to be slow and generally speaking how you can go about combating that it's really not that hard it does take learning to learn how to optimize stuff but it actually doesn't take that much learning to know how to tell it needs to be optimized and so that's actually a good thing to learn even if you don't know how to optimize yet because you could always ask somebody right you can always try to grab someone more senior or something and say look this thing's really slow what can we do about this right like like i know this could go a lot faster so like what's going on here or whatever it's just just knowing is a huge part of it and the techniques and learning how to optimize will come the more practice that you have so that's it thanks for joining me and uh for those of you who are regular viewers on the
handmade hero channel i'll see you this saturday for those of you who aren't thanks for stopping by peace everybody that's the end of the lecture i hope you enjoyed it and there was some interesting stuff in there for you again the kickstarter is now live so please check it out the links in the description below and if you have any ideas of stuff that you'd like to see on this channel while we're doing the kickstarter promotion please don't hesitate to hit me up on twitter it's at c-m-u-r-a-t-o-r-i at cmuratory on twitter and i will put it in the rotation thanks everyone for watching and i'll see you in the next video
Info
Channel: Molly Rocket
Views: 94,581
Keywords: Handmade Hero
Id: Ge3aKEmZcqY
Length: 170min 14sec (10214 seconds)
Published: Wed Oct 13 2021