3D Gaussian Splatting

Captions
testing testing testing, testing testing testing, I don't know if this is the right microphone, check here, okay, try this, testing testing, okay I think this works. Alrighty, how's it going Luke, how's it going Bob, hope you guys are having a good Monday. We're gonna start off this week with a little bit of a computer vision stream today, so we're going to be reading this paper called 3D Gaussian Splatting for Real-Time Radiance Field Rendering. This paper has been making the rounds around the internet because it is a different way of doing kind of 3D rendering, kind of like a NeRF if you want to think of it that way, but in a different way, and it's basically using these Gaussians, so a different kind of prior than NeRFs, which is resulting in a kind of completely different look, and apparently it's faster, it's better, and I don't know, that's what we're going to figure out: we're going to read this paper and see if they're actually correct in their claims that it's better and faster, or whether they just cherry-picked a bunch of stuff and that's why it looks better. So, 8th August 2023, relatively recent work coming out of Inria, a university I guess, I don't know where that is, there we go, it's actually down south of France, all right, it's probably where the cool kids go, it's probably very nice weather. The Max Planck Institute is quite famous though, so this is like a secondary advisor from the Max Planck Institute. All right, so here in the picture they have a picture of a bike and a couple different variants that they're comparing to. Of course ground truth is going to be the actual image, you have some neural radiance field, you have Plenoxels, this is uh Instant Neural Graphics Primitives I think, Instant NGP, and then this one here, and peak signal-to-noise ratio is about the same for all of them to be honest, and the train time, I think that's the money here, is 48 hours for this NeRF here versus 51 minutes for this 3D Gaussian splatting. John Egan, how's it going. Uh, okay, so let's just dive right into this abstract. Radiance field methods have recently revolutionized novel view synthesis of scenes captured with multiple photos or videos. So the whole point of something like a NeRF or something like this technique is creating novel views, which basically means new images from a different point of view of a scene. So a scene is just any kind of 3D environment with some objects in it maybe; it can be object-centric, in which case there's a central thing, or maybe it's not object-centric, which means maybe it's a picture of the inside of a room, right, most of the stuff is on the walls and on the outside. But you have multiple pictures of this scene or this object, and you want to be able to create some kind of abstract representation of that scene or that thing such that you can create novel views of it, right, so 2D images of that thing but from a different angle of which you didn't have a picture or photo, right. So that's what novel view synthesis of scenes captured from multiple photos and videos is. Okay, however, achieving high visual quality still requires neural networks that are costly to train and render. So this is kind of a dig on neural radiance fields, where you are using a neural net to learn a specific radiance field that is encoding the image, or the color and the opacity, of different points in space, right, that's where the whole vector field comes from, and unfortunately for NeRFs you have to basically train a neural net for every single
scene every single object and even those scenes and objects are at a specific point in time so if you were to change the lighting if you were to move objects within that scene you would have to train a new Nerf so it's kind of annoying to have to do that for unbounded and complete scenes rather than isolated objects so this is kind of their way of describing what I'm describing as object-centric so uh Nerf computer vision if I could spell right but this would be the example of something that's object Centric right there's a single object here in this scene it's just this little Tonka truck and you have it from a variety of different angles this is also object-centric you see this man's face but this scene here is I think some of the scenes that they're considering are not necessarily going to be object Centric so the stuff is all kind of more splayed out and it's usually harder to do those scenes rather than these object-centric scenes so this would be a an example of a not object-centric scene you see how just how much harder it is to do that okay 1080p resolution rendering no current method can achieve real-time display rates so yeah you can't really do that with Nerfs that's why they're not uh super wide it's wide super mainstream at this point right video games still use your texture and meshes they're not using Nerfs and part of that is because the real time isn't quite there yet we introduce three key elements to allow us to achieve state-of-the-art visual quality while maintaining competitive training times and importantly allow high quality real-time novel view synthesis so this basically means that once you've created this representation of the scene can I get these novel views in real time uh how's it going Christopher testing testing first starting from sparse Point clouds so sparse Point clouds is just a point Cloud that is produced or that doesn't have a lot of points right it's sparse produce dream camera calibration we represent the scene with 3D gaussians I think this is going to be the Cornerstone of this entire paper is they're basically going to rather than representing the scene as a Radiance field which is what a Nerf does they're going to represent the scene as a set of 3D gaussians so a different prior or different starting point different assumption that they're going to use for this entire thing that preserve desirable properties of continuous volumetric Radiance fields for scene optimization while avoiding unnecessary computation in empty space there is a bunch of tricks that people have come up with for Nerfs that also avoid unnecessary computation in empty space and the continuous here means that you'll be able to uh evaluate at every single point in this space or every single Pixel once you actually decide on the novel view that you want to render we perform interleaved optimization and density control of the 3D gaussians okay so multi-step optimization process notably optimizing anisotropic covariance to achieve an accurate representation on this of the scene okay so some kind of tricky optimization objective maybe a combination of multiple losses and objectives and possibly multiple steps that are alternating back and forth generally that's not necessarily a good thing if you have a complicated optimization process with different loss functions and different kind of back and forth steps that usually means it's going to be slower it's going to be more annoying it's going to be harder to implement it's going to be harder to understand so we'll see how bad it is 
for this paper. Third, we develop a fast visibility-aware rendering algorithm that supports anisotropic splatting and both accelerates training and allows real-time rendering. So I think the splatting, which is here at the beginning, I'm not exactly sure if there's a formal definition for that, but scrolling through this paper, I think what they're referring to is that any object, so let's say that this little black blob thing here is the actual object you want to create or represent with these Gaussians, and these Gaussians here, represented as 2D Gaussians, as these green splatty splats I guess, but you can combine them and then create kind of a representation of that object with these Gaussians. So you can see how here they took this Gaussian here and they stretched it out in this dimension, and then they stretched it out in that dimension, they rotated it a little bit, and then you can create this kind of, I guess, representation of it. Is there a more formal definition of splatting? What does splatting mean in the context of Gaussians? Let's see what ChatGPT tells us here. Okay, let's go back up to the top. So that was the abstract, we got some keywords here, I've got some addresses, some emails if you guys want to spam these people, and let's look at the first figure here. Our method achieves real-time rendering of radiance fields with quality that equals the previous method with the best quality, while requiring optimization times competitive with the fastest previous methods. Okay, so it's fast and high quality, that's basically, I guess, the metrics that you care about. Key to this performance is a novel 3D Gaussian scene representation, that's what we're going to figure out here, with a real-time differentiable renderer, so a differentiable renderer is a renderer that you can take the derivative of, which offers significant speedup of both scene optimization and novel view synthesis. For comparable training times you have similar quality, while at this maximum quality, training for 51 minutes, we achieve state-of-the-art quality. Okay, we'll see what they use to determine that their quality has reached state of the art. Unfortunately, quality, especially for images, is very subjective, and there's a variety of different quantitative metrics that people can use to determine the quality of things, but at the end of the day those quantitative metrics are never as good as a subjective quality measure. So state-of-the-art in quality is not a super strong claim, because you can just find some random quantitative quality metric that makes the claim for you, but it could be the case that humans looking at your thing don't agree that it's actually high quality. Okay, uh, "just saw the sample output on the mentioned link, so cool they are able to achieve such high FPS with low training time", yeah, it should be interesting, but sometimes with these papers there's so much cherry-picking, there's so much kind of concealing and misdirection that you gotta actually go through them to see if they're actually doing what they say they're doing. Okay, meshes and points are the most common 3D scene representations. So meshes are what you think of when you think of something in a video game, right, it's basically a collection of points that are connected, and then you basically have these little triangles, and those create the rough 3D shape of an object, but points are part of a mesh, and then usually with meshes you have what are called textures, which are basically an image that wraps onto a
specific compassion that's 99 of video games and CGI is done using those types of representations because they are explicit and a good fit for fast GPU and Cuda based rasterization so this is the original purpose of gpus right gpus were originally for graphics processing that's what Graphics Processing Unit means right and gpus were designed to basically render quickly these uh texture and mesh representations of objects and scenes because they're very good at basically once you have these meshes right these little triangles mesh triangles right and you have a light source you can say okay well if the light is coming in from this angle and I know that the surface has this particular angle because I know the exact angle based on the fact that it's a triangle and it has a normal Vector then I can calculate the way that the light is going to bounce and I have some properties of this surface and I have the specific texture which is telling me the color of that surface and therefore I can tell you what this is going to look like from this exact point of view at this exact angle and that kind of physics based uh rendering is What all video games use but the key uh reason I'm bringing that up is that that type of physics-based rendering is just a bunch of calculating uh simple vectors and and the path of light and all of that and those simple operations someone at some point said hey why don't we create a specific uh process here that we can attach to your normal computer and this processor will be able to basically parallelize these simple computations so that we can get you the right uh pixel colors based on these textures and meshes and that company was Nvidia and eventually they made these gpus and then someone was like wait a second these gpus are also very good for training neural Nets so kind of it all feeds into each other that's a big problem in super resolution papers psnr and other metrics are often not well correlated with perceived fidelity yeah agreed Luke I think psnr uh for Shay Inception distance even structural similarity index ssim that's another one like none of those are true uh quality like end-all be-all numbers okay in contrast recent neural Radiance field methods built on continuous scene representations so it continuous scene representation is a representation of the scene that is continuous you can evaluate it at every single point in space right a Nerf defines a little 3D Volume you could write a little cube of 3D space and you can say at any point in that space you can say what is the color and the opacity which is just a fancy way of saying how see-through is it at this point in space typically optimizing a multi-layer perceptron the multi-layer perceptron that they're referring to here is this one here in the nerve so in the Nerf you're training this little uh neural network here F of theta and each of these you could think of the little layer and there's three layers here in this figure so a multi-layer neural network which is also called a multi-layer perceptron the perceptron is just kind of a historical Legacy term that's what the original guy called it but it's just a neural network uh using volumetric Ray marching this is the actual way that they uh render the Nerf right so Ray marching is this what's going on here this red line is array and then you're marching along that Ray and you can see how you're sampling at different points along your uh continuous scene representation okay similarly the most efficient Radiance field Solutions today built on continuous 
representations by interpolating values stored in voxels or hashes or grids or points okay while the continuous nature of these methods helps optimization the stochastic sampling required for rendering is costly and can be result in noise so the stochastic sampling required for rendering they're talking about how you have to basically sample little points inside this volume and you can see how for every pixel in this novel view that you want to generate you need to create array and then for that Ray you need to sample a bunch of points so you can imagine how if you want a really high resolution uh image here you're going to have a ton of pixels and each of those pixels is going to have a ton of a ton of points along the ray so it can kind of get out of hand very quickly we introduced a new approach that combines The Best of Both Worlds our 3D gaussian representation allows optimization with state-of-the-art visual quality here they are again putting state of the art in their research paper 500 times so that they can get approved and not rejected by the paper reviewers and competitive training times while our tile based splatting solution ensures real-time rendering and state-of-the-art quality so tile based here kind of indicative that they're probably gonna do it in such a way that you don't have to do it per pixel like a Nerf you can maybe maybe they're gonna break up the image into these little patches or tiles and then get all of the pixels in the tile at once or something like that uh blah blah blah our goal is to allow real-time rendering for scenes captured with multiple photos okay so I take a bunch of pictures of some scene or some object and then I render in real time new images of that scene or that object and create the representations with optimization times as fast as the most efficient previous methods do you need to know the exact camera position like a Nerf probably they haven't mentioned it yet but pretty much always with these things if you can come up with a different a new an alternative to Nerf and to these 3D gaussian splatting techniques that do not require camera positions and works just as well that would be huge that would be absolutely massive because all of these techniques right now require a camera position for every single picture that you write the multiple photos or videos need to have precise camera positions usually that's done with an actual and older technique actually right here a camera calibrated with structure from motion right but the structure from motion is itself a not perfect right that's an actual older algorithm that says okay given these cameras I'm going to find these little points and then I'm going to kind of use rough triangulation to give you an idea of It kind of where the camera image is but I don't actually know exactly where the camera was I'm just giving you a guess so all of these uh techniques whether it's a Nerf or this gaussian splatting are all based on a guess so if you can figure out a technique that doesn't require you to start from that guess which has some noise in it then that could be huge uh okay so they're comparing here to Nerf which requires 48 Hours of training time but this is kind of like the worst possible Nerf like I think I've seen Nerfs we've even read some papers on this channel where there's there's Nerfs now that are much faster they have all kinds of fancy like kind of ways of uh uh voxelizing and hash functions in such a way so that it's not actually doesn't actually take that much time to train anymore 
but it still takes time to train, which is annoying. Uh, okay, the solution builds on three main components. So the first component is going to be these 3D Gaussians, which are the prior that they're going to use to represent this scene. We start with the same input as previous NeRF-like methods: cameras calibrated with structure from motion, which is going to introduce noise unfortunately, and initialize the set of 3D Gaussians with the sparse point cloud produced for free as part of the SfM process. Okay, so this structure-from-motion process is gonna basically say: here are the pictures that you gave me, here are some points in those pictures that might correspond with each other, and because I can basically do triangulation, basic camera geometry, I can tell you, here are the points that I think are real 3D points in space; and those are the sparse points, and it sounds like they're basically going to initialize the set of 3D Gaussians to those sparse points produced by the SfM. In contrast to most point-based solutions that require multi-view stereo data, we achieve high quality results with only SfM points as input. Okay, so I guess multi-view stereo data, stereo refers to two cameras, multi-view even more cameras, right, so that's just going to increase the accuracy of these points. Note that for the NeRF synthetic dataset our method achieves high quality even with random initialization. So a NeRF synthetic dataset, what they're talking about is that in a lot of the NeRF papers they're not actually using real scenes, they're basically using 3D objects, and the reason they do that is so that they can basically take pictures from every single angle and they know the exact camera position, right; if you're doing this synthetically, which means that you're doing this in a video game, you can put the camera wherever you want, right, so you can put the camera on this side and you know that it's perfectly 90 degrees, and you can put the camera on this side and you know that it's perfectly negative 90 degrees. So a lot of these NeRF papers use synthetic datasets, because it allows you to basically have a perfect picture and camera position for everything: this ship, this Lego, this microphone, these materials, these are all perfect synthetic datasets. So they're saying that if we actually use the same synthetic dataset that most of these NeRF papers use, we can actually achieve the same high quality look with random initialization. So random initialization refers to where you choose to put these 3D Gaussians.
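Before moving on, here is a rough sketch of what that first component, one 3D Gaussian per SfM point, could look like in code. The function name, the small isotropic starting covariance, and the 0.5 opacity are all illustrative assumptions on my part, not the paper's exact choices:

```python
# Hypothetical sketch: build one 3D Gaussian per sparse SfM point.
import numpy as np

def init_gaussians_from_sfm(points: np.ndarray, colors: np.ndarray, scale: float = 0.01) -> dict:
    """points: (N, 3) SfM point positions, colors: (N, 3) RGB values in [0, 1]."""
    n = points.shape[0]
    return {
        "means": points.copy(),                                   # Gaussian centers at the SfM points
        "covariances": np.tile(np.eye(3) * scale**2, (n, 1, 1)),  # start small and isotropic, optimized later
        "opacities": np.full(n, 0.5),                             # per-Gaussian alpha, optimized later
        "colors": colors.copy(),                                  # later replaced by spherical harmonic coefficients
    }
```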
Okay, we show that 3D Gaussians are an excellent choice, since they are a differentiable volumetric representation; differentiable means that you can take the derivative, and they're continuous, there's no discontinuities there, it's like smooth, and there you go, smooth, everybody wants smoothness. But they can also be rasterized very efficiently by projecting them to 2D. Okay, so that's kind of interesting; the projecting to 2D, what they're referring to is that they're going to have a 3D representation, kind of like you have in this NeRF, where's the 3D representation, but then you have this 2D projection, which is the image, so any 2D image is a 2D projection of a 3D scene. So it sounds like they're going to be using these 3D Gaussians but then projecting them into 2D to get the actual final image, which is the novel view that the quality is going to be determined by, by applying standard alpha blending. Alpha is basically the see-throughness of something; alpha blending is, uh, if you guys have ever used like Adobe Photoshop or something like that, it's the same kind of idea, like certain things are more see-through than other things, and it allows you to put a bunch of things together on top of each other at different layers and then have one final image, using an equivalent image formation model as a NeRF. Okay, I don't know what they mean by the image formation model in NeRF. The second component in our method is optimization of the properties of the 3D Gaussians. So these 3D Gaussians are going to have a 3D position, an opacity, okay, that's actually similar to the NeRF, and an anisotropic covariance. So the covariance is basically the variance along each of the three spatial dimensions, because this is a 3D Gaussian, and the variance is basically the spread of a Gaussian, right, how wide it is. Variance of a 2D Gaussian, let's see if there's a good pic for this, yeah, here we go. So if you have a 1D Gaussian there's only one variance, right, so here you have a mean and then you have the variance for a 1D Gaussian. If you have a 2D Gaussian you have two possible variances, right, because you have the variance on x and the variance on y, and that's what this term here is, right, so this is the variance in x and the variance in y, that's what the diagonal is. But the covariance, and specifically the covariance matrix, which is this matrix here, if you have a 2D Gaussian it's going to be two by two, and what the diagonals are are the variances for each dimension, but then there's this term here, the covariance, and what that is is there's some amount of the variance that is a function of both x and y, right. So this diagonal here is just the variance only in x and the variance only in y, the independent x and y, but here there might be a non-zero value; if there's zero values here it basically means that the variance in x is independent of the variance in y, but that's generally not going to be the case, like most of the time these covariances are going to have some value, which means that there's some relationship, some kind of interaction happening between the x and the y dimensions. So here these are going to be 3D Gaussians, so this covariance matrix here is actually going to be three by three, so you're going to have x, y and z on the diagonal, and then y and x, y and z, x and z and so on, all the different covariances between these, but you can just kind of think of the 2D example of this and then project it into 3D with your brain, if you're able to do that. Okay, and spherical harmonic coefficients. Okay, this is spherical harmonics, these kind of pop up occasionally and I'm just going to be real, I don't even fully understand what they are, but they're going to be using them, interleaved with adaptive density control steps where we add and occasionally remove 3D Gaussians during optimization. "The term splatting isn't a standard term; splatting has been used to refer to a technique where you splat or spread out data onto an image." Okay, so I think splatting doesn't have a specific formal definition, it's just literally the same mental image I have of just splatting something onto a 2D surface, like you would splat some paint on a 2D surface, and I guess here they're splatting a 3D Gaussian onto a 2D surface; maybe one of you guys knows what splatting is. Uh, the optimization procedure produces a reasonably compact, unstructured and precise representation of the scene, one to five million Gaussians. That is... holy, one million Gaussians for every scene, that seems like a lot. Okay, the third and final element is our real-time
rendering that uses fast GPU sorting algorithms and is inspired by tile-based rasterization so maybe they have their own custom Cuda kernel here and with any kind of deep learning algorithm or any kind of uh any kind of deep learning or computer vision algorithm in general there's the speed that it that you're going to get just from a theory standpoint of like okay how many times what is the optimization process how many iterations are you gonna have to run and so on but then there's the actual speed and what I mean by that is that there's certain algorithms that you can basically compile them into these Cuda kernels that uh are the code that actually runs on the GPU and sometimes you can compile things in a way that makes them way faster right so uh sparse softmax let's see if I can find this uh sparse attention something like that there's this bear with me for a second not sparse attention uh fast attention yeah here we go so this is a little bit of a tangent here but I think it proves my point so you have the attention computation in a Transformer right and the attention computation of a transformer if you were to actually do uh the actual math which is basically you're taking these value matrices these query matrices and these key matrices if you were to actually uh do all of those steps in a GPU this is how much time it would take right it would take about 20 milliseconds you do a matrix multiplication then you get rid of some things then you put them through the softmax then you do another mask and then you do a matrix multiplication but somebody figured out a way called flash attention that they were like hey wait a second every time I'm doing these Matrix multiplies I'm taking things out of the memory I'm putting them into the memory I'm taking them out of this uh memory that's closer to the Matrix multiply processor and taking it out and putting it in and they're like wait a second there's actually a much faster way of doing that where I can put things in intermediate memories and stuff like that and look at this this fused kernel right this Cuda kernel that implements the same thing as this attention mechanism here is significantly faster right and this is a huge speed up right so ultimately what I'm trying to get you guys to understand here is that there's a difference between kind of the theoretical like here's how complicated my algorithm is and like sometimes some of those algorithms if you actually write them and put them into a Cuda kernel and you're clever about that you can have them run significantly faster and with these type of uh 3D scene reconstruction uh papers that's actually a huge part of whether or not you're going to achieve real-time rendering so real-time rendering it's not going to be about the assumptions that you make and like the optimization process it'll feel like at the end of the day like 80 of it is going to come down to the fact of like is your algorithm capable of being put into a Cuda kernel that makes it drastically faster for some weird reason because of the way the memory works and the way that the The Matrix multiplies are done so if they figured out a fast GPU implementation of their thing that could be huge okay thanks to our 3D gaussian representation we can perform in isotropic splattering that respects visibility ordering thanks to the Sorting in Alpha blending and enable fast accurate backward pass by tracking the traversal of as many sorted Splats as required so it kind of sounds like they probably have something like that figured out 
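Going back to the fused-kernel tangent for a second, here is a small hedged illustration of the point, using PyTorch rather than anything from this paper: both snippets compute the same scaled dot-product attention, but the second call can dispatch to a fused FlashAttention-style kernel in PyTorch 2.x, avoiding the memory round-trips the naive version pays for. The tensor sizes are arbitrary and any speed difference depends entirely on the hardware.

```python
# Same attention math, naive vs. fused kernel (illustrative, not from the paper).
import math
import torch
import torch.nn.functional as F

q = torch.randn(2, 4, 256, 64)   # (batch, heads, sequence length, head dim)
k = torch.randn(2, 4, 256, 64)
v = torch.randn(2, 4, 256, 64)

# Naive attention: materializes the full (sequence x sequence) score matrix in memory.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
out_naive = torch.softmax(scores, dim=-1) @ v

# Fused attention: same result up to numerics, far fewer memory round-trips on GPU.
out_fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(out_naive, out_fused, atol=1e-4))
```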
the introduction of anisotropic 3D gaussians as a high quality unstructured representation of Radiance fields optimization method of 3D gaussian properties interleaved with adaptive density control that creates high quality representations for captured scenes okay so you have optimization method you have the original uh 3D gaussian prior and then you have the fast differentiable rendering approach for the GPU so those three together are why this method works right if they would have just stopped here this paper would have just taken forever it would have produced something that is sure it's high quality but it takes like 10 hours to run on GPU no one cares about it and it gets shuffled away and forgotten about but they went that extra step of like hey let's actually try to make this fast on the GPU by implementing this in a really optimized Cuda kernel and that's how they're able to get this state-of-the-art speed and state-of-the-art quality all right so that's what this paper is going to be about so these are the three key contributions of this paper multi-view captures achieve equal or better quality than the previous uh quality best quality previous implicit Radiance field approaches we can also achieve training speeds and quality similar to the fastest method and provide the first real-time rendering with high quality for novel view synthesis that could be huge we'll see if Nvidia has a paper for this technique in uh six months okay related work unfortunately I don't like these related work sections because they're just a citation fast but let's see overview traditional reconstruction then discuss point-based rendering and Radiance field discussing their similarity ratings fields are a vast area so we focus only on directly related work for a complete coverage please see this excellent survey work okay so you got some surveys survey papers are different from novel uh algorithm paper so some papers are what's called survey papers and those papers they're not introducing any kind of new algorithm or new approach or new data set or new Benchmark or anything like that they're basically just saying hey there's a thousand papers on this thing we're just going to basically pick the best ones and tell you all the different techniques these survey papers are generally like extremely long and can be very dry so that's why I haven't really read a ton of them on stream but I don't know maybe we should start reading some survey papers traditional scene reconstruction and rendering the first novel view synthesis approaches were based on light Fields first densely sampled so you've got some 1996 papers and unstructured capture 2001 structure from motion to 20 2006 a collection of photos could be used to synthesize novel views sfm uses a sparse Point Cloud during camera calibration initially used for simple visualization multi-view stereo produced impressive full 3d reconstruction algorithms 2 on these 2007 2013 2008 2018 2021 all these methods reproject and blend the input images into the novel View Camera and then use the geometry to guide the reprojection so all of sfm and multi-view stereo are basically based around this idea of geometry right they're just looking at triangles and the position of things and the relative distance between things and then using these triangles to get the camera positions and where other things in the scene would be these methods produce excellent results in many cases but typically cannot completely recover from unreconstructed regions or from over reconstruction 
when multi-view stereo generates inexistent geometry recent neural rendering algorithms vastly reduce such artifacts and avoid the overwhelming cost of storing all input images on the GPU outperforming these methods on most fronts okay so that's where we're at with scene reconstruction and rendering traditional scene reconstruction and rendering neural rendering and Radiance fields so you got the Deep learning techniques for novel view synthesis started around 2016. I think alexnet was 2014. Alex net was one of the first kind of computer vision deep learning successes no it's not even 2014 it's 2012 so it was four years after that 2016 is when you started seeing these Nerfs used to estimate blending weights volumetric rendering representations for novel view synthesis were initiated by Soft 3D deep learning techniques coupled with volumetric Ray marching were subsequently proposed I guess the first NERF is maybe 2019 so this 2016 paper is more just deep learning but CNN's used estimate blending weight so they were using it for different things building a continuously differentiable density field to represent geometry this is the base uh this is the prior that you use for a Nerf right so in a Nerf you're saying okay there's some volume of space and that volume of space has a density or an opacity and that opacity is defined at every single point in space and that's what a field of density is and it's differentiable because it's continuous rendering using volumetric Ray marching has significant cost due to the large number of samples required to query the volume large number of samples is each of these little dots here is a sample so you're going to need a lot of them uh introduced important sampling and positional encoding to improve quality but used a large multi-layer perceptron negatively affecting speed the success of Nerf has resulted in explosion of follow-up methods we've read a bunch of them on this channel if you're interested but maybe you shouldn't actually watch those videos because if this technique is better than maybe Nerfs just disappear and just become a footnote in uh the history Uh current state of the art in image quality for novelty synthesis is MIP Nerf 360. 
While the rendering quality is outstanding, training and rendering times remain extremely high; we are able to equal or in some cases surpass this quality while providing fast training and real-time rendering. Most recent methods have focused on faster training and/or rendering by exploiting three design choices: the use of spatial data structures to store neural features, different encodings, and MLP capacity, so different variants of space discretization, codebooks, encodings, hash tables. So people have been basically trying all kinds of different things on top of NeRFs to make them faster and/or more high quality. Most notable of these methods is Instant NGP, this one actually has a very good GitHub repo, like 13,000 stars on this GitHub repo, but it's basically a NeRF that is very fast, very high quality, look at that, look at this little Lego truck, which uses a hash grid and an occupancy grid to accelerate computation and a smaller multi-layer perceptron to represent density and appearance, and then Plenoxels uses a sparse voxel grid to interpolate the continuous density field. So you can see how, a hash grid and an occupancy grid, and then a sparse voxel grid, so both Instant NGP and Plenoxels are based on ideas that are looking at the actual voxel itself, the actual way that space is being represented, and saying hey, is there a way that we could represent this space that's faster. So this paper is kind of similar in that way, right, it's just even a step further than that, but going back to the base assumption I think is the way to make progress here. Damn, my brain is kind of fried, I don't know about you guys, but I feel like I'm not making sense. Uh, interpolate a continuous density field, and you're able to forego neural networks altogether. Both rely on spherical harmonics, the former to represent directional effects directly, and the latter to encode its inputs to the color network, so the color network is the network that produces the color at a specific point. Both provide outstanding results, but these methods can still struggle to represent empty space effectively, depending in part on the scene and capture type. Image quality is limited in large part by the choice of the structured grids, and rendering speed is hindered by the need to query many samples for a given ray marching step. The unstructured, explicit, GPU-friendly 3D Gaussians, that's what they are, GPU-friendly, achieve faster rendering speed and better quality. All right, so what are we going to go into here: point-based rendering and radiance fields. We're still in the related work section, so this is probably just going to be an explanation of what a NeRF is, we've seen that a million times at this point, but whatever, let's just read their explanation of NeRFs. Point-based methods efficiently render disconnected and unstructured geometry samples; disconnected and unstructured geometry samples is a very fancy way of saying a point cloud. In its simplest form, point sample rendering rasterizes an unstructured set of points with a fixed size, for which it may exploit natively supported point types of graphics APIs or parallel software rasterization on the GPU. Huh. While true to the underlying data, point sample rendering suffers from holes, causes aliasing, and is strictly discontinuous. So with any technique that is based on these point clouds, you're going to get these holes, right, because there are going to be parts of
your scene where you can't get the correspondences, right, therefore there are no points there, and therefore your reconstruction is not going to know what's actually in that point in space; it causes aliasing and is strictly discontinuous. Seminal work on high quality point-based rendering addresses these issues by splatting point primitives with an extent larger than a pixel. Okay, so here they're finally introducing the word splatting, and this is the idea of, okay, well if you have a point cloud, right, which is a set of 3D points that represents something, here you have a point cloud for a house, right, maybe this was taken with a drone or something, each of these points is infinitesimally small, right, so what if you were to take all of those points and kind of just make them a little bit wider, right, say that okay, those points don't just represent the color of a very very tiny, infinitesimally small point in 3D space, they represent the color and properties of like a little area of 3D space: circular or elliptic discs, ellipsoids, or surfels. Surfels are kind of a cool look if you guys have ever seen that. Surfels, yeah, it's basically a point cloud but you kind of spread each point out, so now it has a radius, and it also has a normal direction. So here's a cube represented with surfels, and you could compare that to a cube represented with meshes, here's a mesh for a cube, right; in meshes you have points and then the normal vector kind of just happens because of the way that the points are connected together, but in a surfel each point itself has a normal vector. There has been recent interest in differentiable point-based rendering techniques. Points have been augmented with neural features and rendered using a CNN, resulting in fast or even real-time rendering; however, they still depend on multi-view stereo for the initial geometry, and as such inherit its artifacts, AKA whatever noise you're introducing with this multi-view stereo process, which is how you're creating this point cloud, you're not going to be able to get rid of that noise, because the techniques are assuming that that multi-view stereo point cloud is ground truth, and really it's not: over- or under-reconstruction in hard cases such as featureless shiny areas or thin structures. So what does it mean by featureless shiny areas or thin structures? So if we were trying to multi-view stereo a... let's go over here, so we have this house, right, maybe that's not a good example, maybe this is a good example, but let's say you were to take a picture with a cell phone of a white featureless wall, right, it would be very difficult to look at those different pictures and know where your camera was, because there's just nothing in that white featureless wall to compare against, right. So the more kind of textured, and the more edges and little corners there are in a picture and a series of pictures, the easier it's going to be to use multi-view stereo to reconstruct that 3D scene, right; if you have something that has almost no features, it's going to be very difficult to reconstruct that. I'm trying to find like a good picture of this, but I really can't here. So normally that's not a huge problem, because a lot of the things in the real world have a ton of texture, but that's kind of cool. Point-based alpha blending and NeRF-style volumetric rendering share essentially the same image formation model; specifically, the color C is given by volumetric rendering along a ray. So the final color is
you add up basically all the colors at every single point along this ray, accounting for the fact that some of them have high opacity. So you see here how the opacity, here they have it with the sigma here, and you can see how at some point the opacity peaks, which means that the color shouldn't account for what's behind it, so kind of as soon as you have one point that has a very high opacity you kind of don't care about the samples after that, and actually there's a bunch of NeRF papers that use this technique to make it faster, right, so like once you kind of know that you're seeing the surface of something, then you kind of don't need to sample further points along that ray. But here, the final color is the sum over all the points from 1 to N, where N is the number of points along that ray; each point has some opacity, each point has some color, that's the c_i, and this is the actual ray marching here, but you can basically just think of this image. Samples of density, transmittance and color are taken along the ray with intervals δ_i; this δ_i is the distance between these points here, and it can be rewritten as this, with α_i = 1 − exp(−σ_i δ_i), so α_i is basically how opaque that little segment is, and it's a function of the distance δ_i and also the density σ_i. A typical neural point-based approach computes the color C of a pixel by blending N ordered points overlapping the pixel. Okay, so this is how you would do it in a NeRF, this is ray marching, volumetric rendering with ray marching as you see it in a NeRF, so that's equations one and two, and then equation three, this is going to be point-based rendering, where you're basically saying okay, there's some pixel, and that pixel is going to have a bunch of points that are close to the pixel or overlapping the pixel, so think about a point cloud like this, and then you're saying okay, well what's going to be the color at this pixel here, so let me actually go and see which points are in some neighborhood of that pixel, and then I'll basically just do the average of that, and that's what you're seeing here, they're just blending points within some distance around the pixel; c_i is the color of each point, and α_i is given by evaluating the 2D Gaussian with its covariance, here you go, here's a 2D Gaussian, multiplied with a learned per-point opacity. Okay, so there's an opacity for each point. We can clearly see that the image formation model is the same; however, the rendering algorithm is very different. Okay, I don't know exactly what they mean by the image formation model being the same, I guess they're saying that for every pixel you're basically kind of averaging a bunch of guesses of what that pixel would be, maybe that's kind of what they mean by that. The rendering algorithm is very different: NeRFs are a continuous representation implicitly representing empty and occupied space, so what they mean by that is that you can evaluate this NeRF at any point in space, and that opacity is basically telling you whether it's empty space or it's occupied space, whether or not there's an object there; expensive random sampling is required to find the samples, so you need to sample many points in order to get a guess for one pixel, I guess that's the image formation being the same in both this point-based approach and the NeRF-based approach, with consequent noise and computational expense.
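To make equations one through three a little more concrete, here is a small sketch of both image formation models just described, in plain numpy; this is my own simplification for illustration, with made-up names, not the paper's implementation.

```python
# Sketch of the two image-formation models: NeRF-style ray compositing and point/splat blending.
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Ray marching: sigmas (N,) densities, colors (N, 3), deltas (N,) spacing between samples."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # alpha_i = 1 - exp(-sigma_i * delta_i)
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))   # transmittance: how much light survives to sample i
    weights = trans * alphas                                         # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                   # final pixel color

def blend_points(depths, colors, alphas):
    """Point/splat blending: front-to-back over the N points overlapping one pixel."""
    order = np.argsort(depths)                                       # nearest first, i.e. visibility order
    c, t = np.zeros(3), 1.0
    for i in order:
        c += t * alphas[i] * colors[i]                               # accumulate color weighted by remaining transmittance
        t *= 1.0 - alphas[i]
    return c
```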
In contrast, points are an unstructured, discrete representation: unstructured in that all the points are independent, you can take a point cloud, which is a set of points, they have some order to them, and you can shuffle them and they have a new order and it's the same exact point cloud, so there is no structure to the order of those points, each point is just its own little individual point; and it's discrete because there's some limited amount of points. That is flexible enough to allow creation, destruction and displacement of geometry. Similar to NeRF, this is achieved by optimizing opacity and positions, as shown in previous works, while avoiding the shortcomings of a full volumetric representation. So, okay, a little comparison between NeRFs and these point-based approaches. Pulsar achieves fast sphere rasterization, which inspired our tile-based and sorting renderer. Okay, so their tile-based and sorting renderer, which we still don't know exactly what that is, but it's inspired actually by this 2021 work called Pulsar, so that's kind of cool. Given the analysis above, we want to maintain approximate conventional alpha blending on sorted splats, to have the advantages of volumetric representations. What is approximate conventional alpha blending? Our rasterization respects visibility order, in contrast to their order-independent method. Okay, so visibility order is, things that are closer to you are going to be more visible than things that are far away, so if you have something ahead of you that is semi-transparent, you're gonna pay attention to that before you pay attention to the thing in the back, right. And the NeRF accounts for that as well, right, there's not just the notion of the opacity, but there's also the notion of the order, so you're going to see through the points here, and even if this last point has a very high opacity, because it's the last point it's probably not going to have much of a contribution, so it's not just the individual see-throughness, it's also the order, which you're getting because you know where you're trying to render it from. Uh, in addition, we backpropagate gradients on all splats in a pixel and rasterize anisotropic splats. I'm gonna look up anisotropic here: anisotropic, what does it mean? Okay, so it comes from the Greek, "an" meaning not and then "iso" meaning the same, so not the same: a property being directionally dependent. Okay, so basically, in a NeRF this little multi-layer perceptron here knows what the direction of the ray is, so it's direction dependent; if you're looking at something from this angle, it's different than if you're looking at it from that angle, so the NeRF has to account for that, right, the representation has to account for the fact of, okay, where are you viewing this from. These elements all contribute to the high visual quality of our results. In addition, previous methods mentioned above also use CNNs for rendering, which results in temporal instability. Temporal? There's no time here. Nonetheless, the rendering speed of Pulsar and ADOP serves as a motivation to develop our fast rendering solution. While focusing on specular effects, the diffuse point-based rendering track of Neural Point Catacaustics, damn, overcomes this temporal instability by using a multi-layer perceptron, but still requires MVS geometry as an input. The most recent method in this category does not require MVS and also uses SH for directions; however, it can only handle
scenes of one object and needs masks for initializations so yeah you definitely don't want that you definitely don't want a technique where you have to say here's a picture of my scene and then here's a mask of everything that's background everything that's foreground while fast for small resolutions and low Point clouds it is unclear how it can scale to scenes of typical data sets we use 3D gaussians for a more flexible scene representation avoiding the need for MBS geometry and achieving real-time rendering thanks to our tile-based rendering algorithm so it does seem like they're tile-based rendering algorithm is really just coming from these two previous works here pulsar and adopt which had some kind of fast rendering solution which is also tile based so this is maybe a little bit less original than they're making it seemed it seems like it's more just a inspired by a previous approach uh catacoustics is an awesome word yeah caustics I think is the interaction of Lights with like glass I'm pretty sure yeah so this type of stuff so uh you know who actually is really good at this if you guys have ever heard of two minute papers this is I heavily recommend this YouTube channel but this is like a guy who uh he basically does what I do he like reviews papers but he does it in little tiny he basically does these like super short summaries but uh the guy who owns this channel he actually comes from a uh computer Graphics background so occasionally some of the papers that he talks about here are these kind of papers where they're trying to solve these caustic problems and I remember he has a couple I'm not gonna be able to find them just right off the bat but there's a lot of super fancy kind of uh work that people have done for decades about old school uh kind of physics-based rendering right where you're actually like looking at the way that light actually interacts with a piece of glass in order to create this kind of caustic effects that you would get maybe at the bottom of a pool here or like something like this when you see it through a glass and like this this is like very complicated right because the way that light is interacting with a glass object it's like you could even have to account for like Quantum effects in there right because the way the light bounces and it's just incredibly complicated so hats off to the people that actually understand that garbage uh okay let's get let's get back to here they employed Point pruning and declass densification technique so Point pruning is just when you have your point cloud and then you basically get rid of points that are maybe just spurious or weird uh densification adding more points uh during optimization but use a volumetric ray monitoring and cannot achieve real-time display rates the domain of human performance capture 3D gaussians have been used to represent captured human bodies okay so 3D gaussians might not even be original to this paper obviously makes sense that 3D gaussian is a very kind of basic primitive so people have probably come up with 3D gaussians as a prior for decades I'm sure there's even earlier papers from like the 70s or 80s that have something that's using 3D gaussians more recently they have been used with volumetric Marching for vision tasks neural volumetric Primitives have been proposed in a similar context while these methods Inspire the choice of 3D gaussians they focus on the specific case of reconstructing and rendering single isolated objects resulting in scenes with small depth complexity in contrast 
our optimization of anisotropic covariance are interleaved optimization and density control and efficient depth sorting for rendering allow us to handle complete complex scenes including background both indoor and outdoor with large depth complexity so there isn't kind of a formal definition for large depth complexity or complete scenes like those are all just kind of a little like subjective words but I think largely what they're trying to say is that we're not just going to have some results like these Nerf papers where there's just one thing in the middle of the thing and everything else is background and it's just this magical object hovering in 3D space they're going to be doing more complicated scenes such as this I would call this a complicated scene because there's a lot of different things here you have the background itself changes but then you have very complicated interactions such as the the lighting on this TV right like that that is some Next Level and not only to have but actually if you look here there's the reflection of the TV contains the table which itself is reflecting something so I would call that a complete complex scene overview okay so now we've finished the uh related work section where they kind of showed us previous papers previous research that people have done that is similar to this and now we're going to go into the actual thing that they're doing in this paper so the input to our method is a set of images of a static scene and right off the bat that makes you kind of sad right because one of the things that none of these papers including this paper and every single Nerf paper address is the fact that these are static scenes right ideally we want to get to the point where these are time varying scenes right it's not just a single static thing it's like a like a video right I want to get to a point where we have Nerf videos you have like a Nerf video that you could put on a VR headset and then explore that video itself right so the fact that we're still just dealing with these static scenes a little bit annoying combined with corresponding cameras calibrated by sfm so that says also unfortunate because it means that you're using the sfm to give you the camera positions which means that you're only as good as the sfm which produces a sparse Point Cloud as a side effect from these points we create a set of 3D gaussians these gaussians are going to be defined by a position which is going to be three numbers a covariance matrix which is going to be a three by three Matrix and then an opacity Alpha that allows for a very flexible optimization regime this results in a reasonably compact representation of the 3D scene reasonably compact representation of the 3D scene and it's also you're going to be able to kind of trade off the uh quality and the uh speed right because you're going to have to pick how many of these 3D gaussians you have so if you pick 10 gaussians can be very fast but probably complete but if you pick 10 million gaussians can be very very good but uh very slow so having a variable like that a hyper parameter that you can choose that allows you to kind of trade off between fast and cheap and then uh high quality and slow is generally very useful in part because highly anisotropic volumetric Splats can be used to represent fine structures compactly representing fine structures so fine structure might be like the the a post in some kind of bed or something like that something that's very thin and long or something that's very small the directional 
appearance component, the color, of the radiance field is represented via spherical harmonics. Okay, so here you can see that each of these 3D Gaussians has an opacity and it has a position and a covariance, which is just kind of the spread, but it doesn't have a color, so where is the color going to be stored? The color is going to be stored in these spherical harmonics, following standard practice. Our algorithm proceeds to create a radiance field representation via a sequence of optimization steps of 3D Gaussian parameters, interleaved with operations for adaptive control of the Gaussian density. The key to our efficiency is our tile-based rendering, which allows alpha blending of anisotropic splats respecting visibility order, thanks to fast sorting. Okay, so this fast sorting, which they're doing on the GPU, is going to tell you which Gaussian is in front of the other Gaussian, which is going to be important, right, because for the NeRF you start from saying, I have this pixel, this pixel means that I have this ray, and that ray means that I have samples along that ray, so you kind of start from the pixel, then the ray, then the points, and you have to then sample for each of those points. But here they're going to have a bunch of Gaussians, and you're not going to know the order of those Gaussians from a specific view that you haven't looked at before, so you're gonna have to sort these Gaussians, and it seems like they can do that sorting on the GPU quickly. "Out fast renderer", this is probably a typo, this probably should say "our fast rasterizer": our fast rasterizer also includes a fast backward pass by tracking accumulated alpha values, without a limit on the number of Gaussians that can receive gradients. Okay, interesting, so in a NeRF there's some hyperparameter which is how many samples am I taking along this ray, but it sounds like with this technique you're not limited to that specific number of sample points, you can basically, if you have a million Gaussians and all one million of those Gaussians are in this specific order for this one specific novel view, you could push gradients into every single one of those Gaussians. Of course that's probably going to be not ideal, right, generally if you're going to do some kind of training process like this, knowing the exact number of Gaussians so that you can basically fit it perfectly in your GPU memory, there's an advantage to that, so I wonder if they actually have a hard-coded limit on the number of Gaussians per backward pass. Differentiable 3D Gaussians: our goal is to optimize a scene representation that allows high quality novel view synthesis, starting from a sparse set of SfM points. We need a primitive that inherits the properties of differentiable volumetric representations, while at the same time being unstructured and explicit to allow very fast rendering. We choose 3D Gaussians, which are differentiable and can be easily projected to 2D splats. So you can take a 3D Gaussian and then splat it, kind of similar to how here we have a 2D Gaussian and we're splatting it into 1D, so you can see here, this is a one-dimensional splat of a 2D Gaussian, so they're going to be doing the same thing, except they're going to be doing a two-dimensional splat of a 3D Gaussian, which looks like this, right, these little Gaussians.
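Since these 2D splats of 3D Gaussians keep coming up, here is a tiny sketch of what evaluating one at a pixel could look like: an anisotropic 2D Gaussian defined by a mean and a 2x2 covariance, whose value gets multiplied by a learned opacity for blending. The function name and the example numbers are made up for illustration; this conveys the general idea, not the paper's actual rasterizer.

```python
# Evaluate an anisotropic 2D Gaussian "splat" at a pixel coordinate (illustrative sketch).
import numpy as np

def splat_alpha(pixel_xy, mean_xy, cov_2d, opacity):
    """pixel_xy, mean_xy: (2,) coordinates, cov_2d: (2, 2) projected covariance, opacity: learned alpha."""
    d = pixel_xy - mean_xy
    g = np.exp(-0.5 * d @ np.linalg.inv(cov_2d) @ d)   # G(x) = exp(-1/2 (x - mu)^T Sigma^-1 (x - mu))
    return opacity * g                                  # this is the contribution used during alpha blending

# Example: a splat stretched along x and rotated slightly by the off-diagonal term.
cov = np.array([[4.0, 0.5],
                [0.5, 1.0]])
print(splat_alpha(np.array([1.0, 0.0]), np.zeros(2), cov, opacity=0.8))
```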
we were actually just looking at that — here are the surfels, little circles approximating a surface. Okay: with the extreme sparsity of SfM points it is very hard to estimate normals — yeah, you can't just connect the points and assume the resulting triangle gives you an estimate of the normal — and similarly, optimizing very noisy normals from such an estimation would be very challenging. Instead, we model the geometry as a set of 3D Gaussians that do not require normals. Our Gaussians are defined by a full 3D covariance matrix Σ defined in world space — world space just means there's some global origin, a (0, 0, 0) point, and every Gaussian's position and covariance is expressed relative to that origin — centered at a point (mean) μ. So the mean μ is a three-dimensional vector, and the 3D covariance has six independent numbers. Here's the Gaussian itself, G(x) = exp(-½ xᵀ Σ⁻¹ x), where x is just any point in space, and this Gaussian gets multiplied by α in their blending process — α is basically the opacity equivalent. We need to project our Gaussians to 2D for rendering, and they show how to do this projection to image space: given a viewing transformation W, the covariance matrix Σ′ in camera coordinates is given by Σ′ = J W Σ Wᵀ Jᵀ, where J is the Jacobian of the affine approximation of the projective transformation. Whenever you think of a Jacobian, think of a matrix of partial derivatives: entry (i, j) is ∂f_i/∂x_j, the derivative of the i-th output with respect to the j-th input — so here there's a term for x, a term for y, a term for z, and so on. And an affine transformation is just a matrix-plus-translation warp — in the classic computer vision picture it has a rotation matrix R and a translation b. So this viewing transformation takes the 3D Gaussian into the camera's coordinate frame for whatever view direction you're rendering from — like in the NeRF example, the NeRF is the 3D volume and you want an actual 2D image, so there's some transformation onto that flat image plane. They also show that if we skip the third row and column of Σ′ we obtain a 2×2 covariance matrix with the same structure and properties.
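Just to make that projection step concrete, here is a minimal numpy sketch of how I read it — the function name, the simple pinhole-style Jacobian, and treating W as a bare 3×3 rotation are my own simplifications for illustration, not the paper's actual code:

```python
import numpy as np

def project_covariance(Sigma, mean_world, W, focal_x, focal_y):
    """Sketch: project a 3D Gaussian covariance to a 2D image-space
    covariance via Sigma' = J W Sigma W^T J^T."""
    # Transform the mean into camera space with the viewing transformation W
    # (here W is just a 3x3 rotation; a full version also applies a translation).
    tx, ty, tz = W @ mean_world
    # Jacobian of the affine approximation of the perspective projection,
    # evaluated at the camera-space mean (tx, ty, tz).
    J = np.array([
        [focal_x / tz, 0.0,          -focal_x * tx / tz**2],
        [0.0,          focal_y / tz, -focal_y * ty / tz**2],
        [0.0,          0.0,           0.0],
    ])
    Sigma_prime = J @ W @ Sigma @ W.T @ J.T
    # Keep only the upper-left 2x2 block: the screen-space footprint.
    return Sigma_prime[:2, :2]
```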
So you drop the z row and column and keep a 2×2, which at first feels a little sketchy to me — aren't you losing information? You're throwing away the spread of the Gaussian along the depth direction; I guess for a 2D footprint on the image plane that's all you need, but it still feels a bit hand-wavy. Anyway: an obvious approach would be to directly optimize the covariance matrix Σ to obtain 3D Gaussians that represent the radiance field; however, covariance matrices have physical meaning only when they are positive semi-definite. Yeah — you can't have a negative variance, so at minimum the diagonal entries have to be non-negative, and there's a joint requirement on the off-diagonal covariance terms too. Formally, a matrix M is positive semi-definite if xᵀ M x ≥ 0 for every vector x (positive definite is the strict > 0 version for nonzero x), and the reason a covariance matrix must satisfy this is exactly that it represents the variances of a Gaussian, which can't be negative. For optimization of all the parameters we use gradient descent — basically the same technique everyone uses — but gradient descent cannot be easily constrained to produce such valid matrices. So they're saying: at the end of the day we want to find this magical set of 3D Gaussians — their positions, their covariance matrices, their opacities, and the spherical harmonic coefficients that end up being the color — and learn all of that with gradient descent. But if we take little incremental steps directly on the entries of the covariance matrix, we might step to something that isn't a valid covariance matrix at all, because of that positive semi-definite constraint. So how do we use gradient descent and still always get valid covariance matrices? We opted for a more intuitive yet equivalently expressive representation: the covariance matrix of a 3D Gaussian is analogous to describing the configuration of an ellipsoid. Okay, so a 3D Gaussian is an ellipsoid — that kind of makes sense; in all their pictures they draw the Gaussians as little ellipsoids (2D ellipses in the figures, but in 3D they're ellipsoids). Given a scaling matrix S and a rotation matrix R, you can find the corresponding covariance matrix Σ = R S Sᵀ Rᵀ.
A Gaussian has some spread — a variance — and another way to think of that spread is to pretend it's a little ellipsoid: instead of a variance in x, y, and z, you have a radius along x, a radius along y, and a radius along z. And every ellipsoid is just the unit sphere scaled and then rotated. So instead of directly searching for a covariance matrix, which is hard because of the positive semi-definite requirement, they find a scaling — the stretching — and a rotation of that ellipsoid and build the covariance from those, which is valid by construction. Someone in chat says this reminds them of SVD, singular value decomposition — yeah, I see what you're saying, it's the same flavor of factorization. To allow independent optimization of both factors, we store them separately: a 3D vector s for scaling and a quaternion q for rotation. Dude, I love quaternions. You've got Euler angles, rotation matrices, and quaternions, and I feel like quaternions are the best way to represent rotations — and you can convert between quaternions and rotation matrices in both directions. These can be trivially converted to their respective matrices and combined, making sure to normalize q to obtain a valid unit quaternion. Obviously s is a 3-vector and you can't multiply a 3-vector by a quaternion's four numbers directly, so whenever they build a covariance they convert the quaternion into a 3×3 rotation matrix so the dimensions match the scaling, and you end up with the final 3×3 covariance matrix. To avoid significant overhead due to automatic differentiation during training, we derive the gradients for all parameters explicitly (details in Appendix A). Recall that Σ and Σ′ live in world space and view space respectively — Σ′ is the projected 2D covariance in image space, Σ is the 3D one in world space — s is the scaling, q is the quaternion, W is the viewing transformation, and J is the Jacobian of the affine approximation of the projective transformation. We apply the chain rule to find the derivatives: for example ∂Σ′/∂s = (∂Σ′/∂Σ)(∂Σ/∂s), using U = J W and the fact that Σ′ is the symmetric upper-left 2×2 block of U Σ Uᵀ; denoting matrix elements with subscripts, you can write out all the partial derivatives.
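Here is a tiny numpy sketch of that scale-plus-quaternion parameterization, i.e. Σ = R S Sᵀ Rᵀ with q normalized first — the function names are mine, and the point is just to show why a gradient step on s and q can never produce an invalid covariance:

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) into a 3x3 rotation matrix."""
    q = q / np.linalg.norm(q)          # normalize to a valid unit quaternion
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def build_covariance(scale, q):
    """Sigma = R S S^T R^T -- positive semi-definite by construction."""
    R = quat_to_rotmat(q)
    S = np.diag(scale)                 # per-axis scales -> diagonal matrix
    M = R @ S
    return M @ M.T
```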
So why the chain rule here? Because you usually can't get the derivative you actually want directly, so you break it into pieces you can compute: if we can get ∂Σ′/∂Σ and we can get ∂Σ/∂s, then we can get the thing we really want, ∂Σ′/∂s. And since Σ = R S Sᵀ Rᵀ, we can compute M = R S and rewrite Σ = M Mᵀ — they fold the scaling and rotation into one matrix M, which lets them break ∂Σ/∂s down further into (∂Σ/∂M)(∂M/∂s), another chain rule. Since the covariance matrix and its gradient are symmetric, the shared first part has a compact closed form, and for scaling there's a simple expression too. To derive gradients for rotation, we recall the conversion from a unit quaternion — a real part plus the three imaginary components i, j, k — to the rotation matrix R: the standard formula giving each of the nine entries of the 3×3 rotation matrix from the four quaternion numbers. You chain-rule once more and you get the partial derivative with respect to each quaternion component. Pretty cool story. This representation of anisotropic covariance, suitable for optimization, allows us to optimize 3D Gaussians to adapt to the geometry of different shapes in captured scenes, resulting in a fairly compact representation. They keep saying compact, but it's only compact if the number of Gaussians stays reasonable — ten million Gaussians would not be compact; the compactness is really about how many you end up using. Next: optimization with adaptive density control, i.e. adapting the number of Gaussians depending on where you are in 3D space. That makes sense — some parts of a scene need many Gaussians and some need few, and it's scene-dependent too; maybe your scene has stuff in the corners and you need Gaussians there to represent it. The core of the approach is the optimization step, which creates a dense set of 3D Gaussians accurately representing the scene for free-view synthesis. In addition to the positions p, α, and covariance Σ, we also optimize the spherical harmonic coefficients representing the color of each Gaussian, to correctly capture the view-dependent appearance of the scene. Since each Gaussian carries a single set of color coefficients and a single α, these Gaussians can't be big — they're going to have to be tiny little blobs, even though the figures make them look large (boo-boo wants to be on camera). The optimization of these parameters is interleaved with steps that control the density of the Gaussians to better represent the scene.
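They hand-derive these gradients to avoid autodiff overhead, but if you wanted to sanity-check a hand-derived version, autograd will happily produce the same numbers — this is just a reference sketch under that assumption, not their implementation:

```python
import torch

def covariance_from_params(scale, quat):
    """Differentiable Sigma = (R S)(R S)^T built from a scale vector and quaternion."""
    q = quat / quat.norm()
    w, x, y, z = q
    R = torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])
    M = R @ torch.diag(scale)           # M = R S
    return M @ M.T                      # Sigma = M M^T

scale = torch.tensor([0.3, 0.1, 0.05], requires_grad=True)
quat = torch.tensor([1.0, 0.0, 0.0, 0.0], requires_grad=True)
Sigma = covariance_from_params(scale, quat)
Sigma.sum().backward()                  # stand-in for a real loss
print(scale.grad, quat.grad)            # compare against the explicit formulas
```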
Let's read the optimization text first and then look at the figure, since the figure mostly just illustrates what the text describes. The optimization is based on successive iterations of rendering and comparing the resulting image to the training views in the captured dataset — so you render, compare to the ground-truth photo, get a loss, push that loss back, and re-optimize, over and over. Inevitably, geometry may be incorrectly placed due to the ambiguities of 3D-to-2D projection, so the optimization needs to be able to create geometry and also destroy or move geometry that has been incorrectly positioned. Hmm, interesting — I would have thought you'd keep a fixed set of Gaussians and just move them around, but "create" and "destroy" means they add and remove Gaussians over the course of the optimization. That seems annoying: generally, if you know the exact number of things you'll have the whole time, everything is faster and you can size it to fit your GPU memory, so a changing count feels like it could be a bad idea. The quality of these parameters is critical for the compactness of the representation, since large homogeneous areas can be captured with a small number of large anisotropic Gaussians — one big white surface can basically be one Gaussian with huge variance along some directions and tiny variance along another. We use SGD for optimization, taking full advantage of standard GPU-accelerated frameworks and the ability to add custom CUDA kernels for some operations, following recent best practice. Our fast rasterization is critical to the efficiency of the optimization, since it is the main computational bottleneck — which is what we said at the start: if you can frame the problem so the bottleneck lives in extremely fast code, it makes a huge difference. We use a sigmoid activation for α — of course, your opacity for alpha blending needs to live in the zero-to-one range, and the sigmoid gives you that plus smooth gradients — and an exponential activation for the scale of the covariance, for similar reasons (it keeps the scale positive). We estimate the initial covariance matrix as an isotropic Gaussian with axes equal to the mean of the distances to the closest three points. Those are the structure-from-motion points: the Gaussians are initialized from the sparse SfM point cloud, so the initial guess isn't just the means of the Gaussians but also their variances along x, y, and z, and the spacing of the nearby SfM points gives you that starting variance. We use a standard exponential decay learning-rate schedule, similar to Plenoxels, but only for the positions.
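Here is roughly what that initialization could look like — a hedged sketch with my own field names and an illustrative starting opacity, not the paper's exact values:

```python
import numpy as np

def init_gaussians_from_sfm(points_xyz):
    """One Gaussian per SfM point, isotropic, with scale set from the
    mean distance to the 3 nearest points (sketch of the scheme above)."""
    n = len(points_xyz)
    # pairwise distances (fine for a sketch; a real version would use a KD-tree)
    d = np.linalg.norm(points_xyz[:, None, :] - points_xyz[None, :, :], axis=-1)
    d[np.arange(n), np.arange(n)] = np.inf
    mean_nn = np.sort(d, axis=1)[:, :3].mean(axis=1)       # mean of 3 nearest neighbours
    return {
        "position": points_xyz.copy(),
        # store log-scale / pre-sigmoid opacity so that exp() and sigmoid()
        # activations keep the actual scale positive and alpha in (0, 1)
        "log_scale": np.log(np.stack([mean_nn] * 3, axis=1)),
        "rotation": np.tile([1.0, 0.0, 0.0, 0.0], (n, 1)),  # identity quaternions
        "opacity_logit": np.full((n, 1), -2.0),             # illustrative starting alpha
        "sh_dc": np.zeros((n, 3)),                          # zeroth-order SH (base color)
    }
```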
The loss function is an L1 term combined with a D-SSIM term. SSIM is the structural similarity index, a way of measuring whether two images are similar — I don't really love it, and it's an older metric at this point. It's a hand-designed combination of luminance, contrast, and structure comparisons between two images, so there's nothing sacred about it. The classic illustration is the Einstein image: several distorted variants that all have the same PSNR but very different SSIM — the closest one around 0.98 and the worst around 0.69 — so SSIM captures a notion of similarity that PSNR misses. Their loss uses the SSIM term because they want the rendered image — all the little Gaussians projected to 2D and alpha-blended together — to match the training views in the captured dataset as closely as possible, and they also have the L1 term, which is just the per-pixel absolute difference: pixel in the reconstructed image versus pixel in the ground-truth image, summed over all pixels. They use λ = 0.2 in all their tests; that's just a hyperparameter trading off the two terms. Honestly these are pretty simple, primitive losses. Okay, now the figure: optimization starts with the sparse structure-from-motion point cloud and creates a set of 3D Gaussians — you can see the sparse points, and it looks like you start with one Gaussian per point. We then optimize and adaptively control the density of the set of Gaussians. So first you project them into 2D — that viewing transformation plus the projective transformation — then you differentiably tile-rasterize them (that's the part they do quickly on the GPU), which gives you the image; you compare that image to the ground truth with the loss from above, the L1 term plus the D-SSIM term, and you chain-rule that loss all the way back to get the gradients.
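As a sketch, that loss is easy to write down — I'm assuming the common D-SSIM = (1 − SSIM)/2 convention and leaving the SSIM implementation pluggable, since those details aren't spelled out here:

```python
import torch
import torch.nn.functional as F

def combined_loss(rendered, ground_truth, lam=0.2, ssim_fn=None):
    """L = (1 - lambda) * L1 + lambda * D-SSIM, with lambda = 0.2.
    `ssim_fn` is a stand-in for any SSIM implementation (e.g. torchmetrics)."""
    l1 = F.l1_loss(rendered, ground_truth)
    if ssim_fn is None:
        return l1                       # fall back to pure L1 if no SSIM is available
    d_ssim = (1.0 - ssim_fn(rendered, ground_truth)) / 2.0
    return (1.0 - lam) * l1 + lam * d_ssim
```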
Those gradients flow back through the differentiable tile rasterizer, and the "differentiable" part is key: if the rasterizer weren't differentiable, you couldn't push gradients through it. Because it is, you chain-rule back through the rasterizer, back through the adaptive density control, back through the projection, and into the 3D Gaussians themselves — their positions, their covariances, their color, and their α. And that's basically the whole loop: based on the reconstruction loss of the rendered image, you keep nudging the values of the 3D Gaussians. We use our fast tile-based renderer, allowing competitive training times compared to state-of-the-art fast radiance field methods; once trained, our renderer allows real-time navigation for a wide variety of scenes. So if rendering a novel view is fast — if I can hand the representation a camera pose, project the Gaussians, tile-rasterize them, and get an image at, say, 30 frames per second — then I can fly around inside the scene in real time. Okay, Section 5.2, still describing the method: we start with an initial sparse set of points and apply our method to adaptively control the number of Gaussians and their density over unit volume, allowing us to go from an initial sparse set to a denser set that better represents the scene, with correct parameters. After an optimization warm-up, we densify every 100 iterations — so every 100 iterations a batch of new Gaussians gets added — and we remove any Gaussians that are essentially transparent. That makes sense: the gradient signal from the image reconstruction loss keeps adjusting each Gaussian's α, its opacity, and if a Gaussian has been pushed to be almost completely see-through, it isn't doing much anymore, so prune it. Our adaptive control of Gaussians also needs to populate empty areas: it focuses on regions with missing geometric features, but also on regions where Gaussians cover large areas. So removal is driven by opacity, and Gaussians get added in areas that don't have enough of them, or that are covered by one huge, spread-out Gaussian trying to do everything.
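Putting the pieces so far together, the outer loop presumably looks something like this — every function name here is a placeholder for things the paper implements in CUDA, the thresholds are illustrative, and `combined_loss` is the sketch from above:

```python
# Hedged skeleton of the training loop as I read it; names are illustrative.
def train(gaussians, cameras, images, iters=30_000, warmup=500):
    opt = make_optimizer(gaussians)                 # SGD/Adam over all Gaussian params
    for it in range(iters):
        cam, gt = sample_training_view(cameras, images)
        rendered = tile_rasterize(gaussians, cam)   # differentiable tile rasterizer
        loss = combined_loss(rendered, gt, lam=0.2)
        loss.backward()                             # gradients reach every Gaussian
        opt.step(); opt.zero_grad()

        if it > warmup and it % 100 == 0:
            prune_transparent(gaussians, alpha_min=0.005)   # drop near-invisible ones
            densify(gaussians)                              # clone small / split large
    return gaussians
```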
We observe that both cases — under-reconstructed regions and regions covered by one big, spread-out Gaussian — have large view-space positional gradients; intuitively, that's because they correspond to regions that are not yet well reconstructed, and the optimization keeps trying to move the Gaussians there to fix it. Right: if the gradients keep telling a Gaussian "move fifty feet that way," something is missing over there, so it makes sense to add another Gaussian in that area. Since both cases are good candidates, we densify the Gaussians whose average magnitude of view-space positional gradient is above a threshold τ_pos. This is where I get a bit uneasy, because now you have hard-coded thresholds — this position threshold, plus an α threshold for the pruning. For small Gaussians that sit in under-reconstructed regions, new geometry needs to be covered, and they prefer to clone the Gaussian: simply create a copy of the same size and move it in the direction of the positional gradient. So they're not adding brand-new Gaussians from scratch the way they did at initialization — if a Gaussian is being told to move a lot, they duplicate it and move the duplicate. On the other hand, large Gaussians in regions with high variance need to be split into smaller Gaussians: we replace such Gaussians by two new ones and divide their scale by a factor of 1.6, another hard-coded number, which they determined experimentally. So the optimization has three-ish baked-in hyperparameters — the position threshold, the transparency threshold, and the split scale factor — and the more hyperparameters you have, the larger the space you must search to find the magical combination that produces the best results. In the first case we detect and treat the need to increase both the total volume of the system and the number of Gaussians, while in the second case we conserve total volume but increase the number of Gaussians.
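A rough sketch of that clone-or-split rule, with made-up helpers (`add_gaussian`, `remove_gaussian`, `covariance_of`), a hypothetical stored `position_grad` field, and illustrative thresholds — the real small-vs-large criterion is tied to the scene extent, which I'm glossing over:

```python
import numpy as np

def densify(g, tau_pos=0.0002, scale_split_threshold=0.01, split_factor=1.6):
    """Clone small under-reconstructed Gaussians, split large ones (sketch)."""
    grad_mag = np.linalg.norm(g["position_grad"], axis=1)   # avg view-space pos. gradient
    for i in np.where(grad_mag > tau_pos)[0]:
        scale = np.exp(g["log_scale"][i])
        if scale.max() < scale_split_threshold:
            # under-reconstruction: clone -- same size, nudged along the gradient direction
            add_gaussian(g, position=g["position"][i] - g["position_grad"][i],
                         log_scale=g["log_scale"][i])
        else:
            # over-reconstruction: replace by two smaller Gaussians (scale / 1.6)
            for _ in range(2):
                sample = np.random.multivariate_normal(
                    g["position"][i], covariance_of(g, i))  # sample inside the old one
                add_gaussian(g, position=sample,
                             log_scale=g["log_scale"][i] - np.log(split_factor))
            remove_gaussian(g, i)   # a real implementation would batch these edits
```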
Similar to other volumetric representations, the optimization can get stuck with floaters close to the input cameras — interesting, so you can end up with a Gaussian sitting right in front of the camera that tries to explain everything, and you need something to stop Gaussians from drifting toward the camera like that. An effective way to moderate the increase in the number of Gaussians is to set the α values close to zero every N = 3000 iterations — more hard-coded hyperparameters. It's interesting because of the interplay here: each Gaussian has a size, an opacity, a color, and a position, and to match, say, a dark patch in the image, the gradient could make a Gaussian bigger, or less transparent, or closer to the camera (which effectively makes it bigger on screen) — several different gradient directions all reduce the loss, so it's a tricky optimization. After the reset, the optimization increases α again for the Gaussians that actually need it, while the culling step removes the ones whose α stays below the threshold. Gaussians may also shrink or grow and considerably overlap with others, but we periodically remove Gaussians that are very large in world space and those with a big footprint in view space. This strategy results in overall good control over the number of Gaussians. So these are effectively regularizers on the Gaussians: don't let them get too big, too close, too numerous. The Gaussians in our model remain primitives in Euclidean space at all times; unlike other methods, we do not require space compaction, warping, or projection strategies for distant or large Gaussians. And the densification figure summarizes the scheme: under-reconstruction — a small Gaussian that keeps being told to move gets cloned, and the clone is moved; over-reconstruction — a huge Gaussian gets split in two, with the scales divided by that 1.6 factor.
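And the pruning plus periodic opacity reset might look roughly like this — again a hedged sketch with illustrative thresholds and a made-up `last_screen_radius` field standing in for the view-space footprint:

```python
import numpy as np

def prune(g, alpha_min=0.005, world_size_max=None, screen_size_max=None):
    """Drop near-transparent Gaussians, plus very large ones in world or view space."""
    alpha = 1.0 / (1.0 + np.exp(-g["opacity_logit"][:, 0]))          # sigmoid
    too_transparent = alpha < alpha_min
    too_big_world = (np.exp(g["log_scale"]).max(axis=1) > world_size_max
                     if world_size_max is not None else False)
    too_big_screen = (g["last_screen_radius"] > screen_size_max
                      if screen_size_max is not None else False)
    keep = ~(too_transparent | too_big_world | too_big_screen)
    for k in g:
        g[k] = g[k][keep]

def reset_opacity(g, target_alpha=0.01):
    """Every ~3000 iterations: clamp alpha near zero; the optimization raises it back
    only where a Gaussian is actually needed, and the rest get pruned later."""
    logit = np.log(target_alpha / (1.0 - target_alpha))
    g["opacity_logit"] = np.minimum(g["opacity_logit"], logit)
```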
Okay, fast differentiable rasterizer. So that's how they optimize the set of Gaussians over time to approximate the scene, using pretty simple image reconstruction losses pushed all the way back into the Gaussians. The second magical piece of the paper is that this is only useful if you can turn those 3D Gaussians into a 2D image very fast, and that's what they show next. Their goals: fast overall rendering and fast sorting, to allow approximate alpha blending — including for anisotropic splats — and to avoid hard limits on the number of splats that can receive gradients. The sorting matters because with partially transparent splats, which one sits in front of which determines the blend, so order matters. We designed a tile-based rasterizer for Gaussian splats, inspired by recent software rasterization approaches, that pre-sorts the primitives for an entire image at a time, avoiding the expense of per-pixel sorting — if you had to sort the Gaussians separately for every single pixel that would be very slow, so they pre-sort once for the whole image and then just alpha-blend at each pixel. The fast rasterizer allows efficient backpropagation over an arbitrary number of blended Gaussians with low additional memory consumption, requiring only constant overhead per pixel; the rasterization pipeline is fully differentiable — key, since the gradients have to flow back through it — and, given the projection to 2D, it can rasterize anisotropic splats similar to previous 2D splatting methods. The method starts by splitting the screen into 16×16 tiles — interesting, that's exactly the patch size Vision Transformers usually use; everybody loves 2, 4, 8, 16, 32, 64 — and then proceeds to cull the 3D Gaussians against the view frustum (the frustum being the pyramid of space the camera actually sees, from the pinhole camera geometry). We only keep Gaussians whose 99% confidence interval intersects the view frustum, and we additionally use a guard band to trivially reject Gaussians at extreme positions — basically, if a Gaussian is way off to the side, it's not going to appear in my image, so don't even include it in the sorting — since computing their projected 2D covariance would be unstable. We then instantiate each Gaussian once for each tile it overlaps and assign each instance a key that combines view-space depth and tile ID, then sort all the instances by these keys using a single fast GPU radix sort. That sort is basically the whole reason this is fast. Quick ELI5 on radix sort: numbers have digit positions — ones, tens, hundreds — and radix sort sorts by one digit position at a time; because each pass is stable (ties keep their order), the result ends up fully sorted, and each pass parallelizes very well, which is why GPUs chew through it. A viewer points out that 16×16 is also the macroblock size for MPEG-4 and H.264 — it seems to be a magic number for image processing; I think it just comes down to 32 being too big and 8 too small. There is no additional per-pixel ordering of points, and blending is performed based on this initial sorting, so the alpha blending can be approximate in some configurations; however, these approximations become negligible as splats approach the size of individual pixels. So again, these splats are tiny — that keeps coming up.
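The duplicate-and-key step is easy to mock up — this sketch leans on the fact that, for positive floats, the IEEE-754 bit pattern sorts in the same order as the value, and it uses a stable argsort as a stand-in for the GPU radix sort:

```python
import numpy as np

def build_sorted_tile_list(gaussian_ids, tile_ids, depths):
    """One entry per (Gaussian, overlapped tile); a single key sort groups by tile
    and orders by depth within each tile. Inputs are parallel numpy arrays."""
    # reinterpret positive float32 depths as uint32: bit order == value order
    depth_bits = depths.astype(np.float32).view(np.uint32).astype(np.uint64)
    # tile ID in the high bits, depth in the low 32 bits
    keys = (tile_ids.astype(np.uint64) << np.uint64(32)) | depth_bits
    order = np.argsort(keys, kind="stable")      # stand-in for the GPU radix sort
    return gaussian_ids[order], tile_ids[order]
```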
They note that this choice greatly enhances training. We produce a list for each tile by identifying the first and last depth-sorted entry that splats to that tile, and we launch one thread block per tile; each block first collaboratively loads packets of Gaussians into shared memory and then, for a given pixel, accumulates color and α values, maximizing the gain in parallelism for both data loading and sharing. At first I read this as blending from both ends and meeting in the middle, but it's a front-to-back accumulation — the back-to-front part only shows up later, in the backward pass. When we reach a target saturation in a pixel, the corresponding thread stops: keep blending Gaussians until the accumulated opacity is high enough that whatever is behind wouldn't be visible anyway. That's very similar to early ray termination in a NeRF — once a ray has accumulated enough opacity, you stop sampling points along it because you can't see past that point anyway. Processing of the entire tile terminates when all of its pixels have saturated. Then Appendix C, sorting: our design is based on the assumption of a high load of small splats, and we optimize for it by sorting the splats once per frame using a radix sort. We split the screen into 16×16-pixel tiles and create a list of splats per tile; this results in a moderate increase in the number of splat instances, but it's amortized by the high parallelism of an optimized GPU radix sort — they cite the folks who originally implemented a blazing-fast radix sort on the GPU. We assign each splat instance a key of up to 64 bits, where the lower 32 bits encode its projected depth and the higher bits encode the index of the overlapped tile. So it's like a big lookup: every Gaussian-tile instance gets a key, depth in the low bits, tile index in the high bits — the exact number of bits needed for the tile index depends on how many tiles fit at the current resolution — and each per-tile depth ordering is thus resolved for all splats in parallel with a single radix sort.
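Given that sorted instance list, the per-tile ranges and the per-pixel front-to-back blend with early termination could look like this — `alpha_of` and `color_of` are placeholders for evaluating a splat's 2D Gaussian falloff and its view-dependent color at a pixel:

```python
import numpy as np

def tile_ranges(sorted_tile_ids, num_tiles):
    """Start/end index of each tile's run in the depth-sorted instance list."""
    starts = np.searchsorted(sorted_tile_ids, np.arange(num_tiles), side="left")
    ends = np.searchsorted(sorted_tile_ids, np.arange(num_tiles), side="right")
    return starts, ends

def blend_pixel(sorted_gaussians, pixel_xy, alpha_of, color_of, T_min=1e-4):
    """Front-to-back alpha blending with early termination when the pixel saturates."""
    color = np.zeros(3)
    T = 1.0                                   # accumulated transmittance
    for g in sorted_gaussians:                # nearest splat first
        a = alpha_of(g, pixel_xy)             # opacity * 2D Gaussian falloff at pixel
        color += T * a * color_of(g, pixel_xy)
        T *= (1.0 - a)
        if T < T_min:                         # pixel saturated: stop early
            break
    return color, T
```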
We can then efficiently produce the per-tile lists of Gaussians by identifying the start and end of the ranges in the sorted array that share the same tile ID; this is done in parallel, launching one thread per 64-bit key to compare it with its neighbor's. And I suspect this is part of why they create and delete Gaussians rather than just moving a fixed set around, which I wondered about at the start of the paper: a Gaussian drifting from one tile to another is awkward for this tile-indexed sorting, whereas cloning into the new region plays nicer with it. Anyway, this completely eliminates sequential primitive processing and produces compact per-tile lists to traverse, and then each tile can be rendered independently, which is the whole point. The rasterization pseudocode is literally two nested loops — for every tile T, for every pixel i in T, blend that tile's sorted Gaussians — and the tile loop parallelizes trivially. The blending itself is also much cheaper than a NeRF: there, every pixel means sampling a ray and running a multi-layer perceptron at every sample, which is slow, whereas here each contribution is just evaluating a Gaussian. Okay, coming into the end: details of sorting during rasterization. The saturation of α is the only stopping criterion; in contrast to previous work, we do not limit the number of blended primitives that receive gradient updates, and we enforce this so the approach can handle scenes with arbitrary, varying depth complexity and learn them accurately without having to resort to scene-specific hyperparameter tuning. Although — the thresholds they defined earlier are arguably specific to the datasets they evaluate on; I'm sure you could find scenes weird enough that those position and α thresholds no longer work, so I'm not convinced the hyperparameters are entirely scene-agnostic. During the backward pass we must therefore recover the full sequence of blended points per pixel from the forward pass. One solution would be to store arbitrarily long per-pixel lists of blended points in global memory; to avoid the implied dynamic memory management overhead, we instead choose to traverse the per-tile lists again, reusing the sorted arrays of Gaussians and the tile ranges from the forward pass.
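The trick they describe next — storing only the final accumulated opacity per pixel and recovering the intermediate values during a back-to-front walk — can be sketched like this; as I read it, the recovery amounts to dividing out the (1 − α) factors one at a time, and `accumulate_gradients` is a placeholder for the actual chain-rule math, so treat this as an illustration of the idea rather than the paper's exact bookkeeping:

```python
def backward_pixel(sorted_gaussians, pixel_xy, alpha_of, T_final, grad_pixel_color):
    """Recover each splat's forward-pass blending weight from the stored
    final transmittance, walking the same sorted per-tile list back to front."""
    T = T_final
    for g in reversed(sorted_gaussians):      # farthest splat first
        a = alpha_of(g, pixel_xy)
        if a <= 0.0:
            continue                          # never contributed to this pixel
        T = T / (1.0 - a)                     # transmittance this splat saw going forward
        weight = T * a                        # its forward-pass blending weight
        accumulate_gradients(g, weight, grad_pixel_color)   # placeholder backprop step
    # a real implementation clamps alpha below 1 so the division stays stable,
    # and skips splats behind the last one that contributed in the forward pass
```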
(Quick terminology note: "global memory" here is CUDA's name for the GPU's main DRAM — the VRAM — as opposed to the small, fast on-chip shared memory each thread block gets; the point is to avoid lots of traffic to that slower memory, not to go out to system RAM.) So rather than writing big per-pixel lists out and reading them back, they re-walk the per-tile lists, and we now traverse them back to front — front to back in the forward pass, back to front in the backward pass. Little speed-up tricks like this are where the money is. Traversal starts from the last point that affected any pixel in the tile, and loading points into shared memory again happens collaboratively; each pixel only starts overlap testing and processing of points once their depth is less than or equal to the depth of the last point that contributed to its color during the forward pass. The gradient computation described in Section 4 requires the accumulated opacity values from the original blending process; rather than traversing an explicit list of progressively shrinking opacities in the backward pass, we can recover these intermediate values by storing only the total accumulated opacity at the end of the forward pass: each point stores the final accumulated opacity from the forward process, and we divide by each point's α in the back-to-front traversal to obtain the coefficients needed for the gradient computation. This is a bit dense, but it boils down to a collection of small tricks in how the forward and backward passes are ordered so they never have to stash long lists in slow memory or recompute things they already computed. Intuitively, α roughly tells you how much a given Gaussian contributed to a pixel's final color, so when the reconstruction loss says "this pixel should have been green and it came out light green," that gradient mostly flows into the high-α Gaussians responsible for that pixel — α effectively decides how much gradient each Gaussian receives. We next discuss some implementation details and present results, an evaluation of the algorithm against previous work, and ablation studies. Actually, now that I think about it, there's room for weird dynamics here — you could imagine mode-collapse-ish situations where one very opaque Gaussian keeps getting its color adjusted to match the image instead of the color being built up from several more transparent Gaussians. There are a lot of potential pitfalls in this optimization that seem to just magically work out; I assume the regularizations — splitting, pruning, killing the near-transparent ones — are what prevent those failure modes, but it's still kind of remarkable that it reliably finds sets of Gaussians that collectively create the final image rather than over-optimizing a single Gaussian to make one pixel look as good as
possible okay uh implementation so that's that's the technique and now we're gonna basically compare so this next section here they're just going to talk about how it compares to other approaches so let's see comparisons of ours to previous methods and corresponding ground truth images from held out test views so held out test views just basically means that these are not part of the training set these are separate uh uh scenes that have never been seen before from the top down bicycle Garden stump kind of room from the MIP Nerf 360. so this is the data set playroom blah blah so we have the ground truth instant NGP ours it's kind of hard to tell here but I mean they kind of like mostly look good you can see here how here they're pointing out the fact that this wheel has a bunch of Spokes and then here the MIP nerve kind of loses track of those spokes right here you also lose track of the spokes those are the fine features that they talk about here they're showing you how the the window in the background instant NGP kind of becomes this fuzz of Nerf fuzz smear whatever you want to call it here it's much more accurate but like even this like how the are they doing that right like every single one of these bricks is a gaussian like how do you the gaussians only have a single color so like how are you getting a kind of a tile look where you have a red white red white red white red white how are you getting that it's like we're talking like I think it's just because they're using millions of of gaussians that's the that's the reality same kind of thing fine structure fine structure you can get that kind of like patented or like that same kind of gaussian or not Goshen but that same kind of look weird blurring Misty kind of effect that you get Nerfs here you don't get it gaussians are all you need yeah that's that's why I think this paper is pretty awesome it's just the gaussians are just such a good aesthetically pleasing uh prior to use c7k iterations so 7 000 iterations of their optimization process how many gaussians are there that's kind of more what I'm interested in how many total gaussians are there quantitative evaluation compared to previous work okay so you have R's 30k R's 7K total training time 41 minutes versus a Nerf is 48 hours persists instant NGP is about seven minutes or a lot faster it's also a lot smaller yeah so that's another thing to think about here is that this technique is or this gaussian technique is storing a huge amount of gaussians more gaussians than there are pixels probably so every single one of those right think about an image an image like a 256 by 256 image has 256 by 256 pixels which means that there's 65 000 total pixels each of those pixels is 3 RGB and then if you're encoding that with a uint 8 there's the megabytes right but let's say you have 1 million gaussians for each gaussian you're storing a three by three covariance you're also storing a it's bigger you're storing a three by three covariance you're storing a mean which is three numbers you're storing an opacity which is one number and you're also storing the spherical harmonic coefficients for the color which I think is probably something like three numbers you know you can get a huge amount of and let's say you're storing each of those numbers as uint 8 which is probably even they're probably using a higher data type but yeah your memory usage is going to be really high compare that to something like a Nerf right which is basically all you're storing in the Nerf is a tiny little multi-layer 
perceptron so that little multi-layer perceptron there's not a lot of Weights in that so it's obviously a lot smaller so that is one negative I guess is that they never talked about the total size of the actual final scene representation it does seem that because you have to store all these gaussians you might end up having a much larger size than the Nerf right look at these nerves eight megabytes versus basically a hundred times that interestingly planoxyls which is also kind of this point-based approach also huge memory cost so basically all these point-based approaches have a huge memory like the final scene representation has a ton uh of parameters or they're not really parameters because they're they're kind of points or gaussians whatever you want but you have to store all of those so it's like it's a more explicit scene representation so it takes more memory passing that around if you had to store that like you know that could be a negative maybe you do want a tiny ass little Nerf right 13 megabytes eight megabytes so it's interesting how like yeah you're getting better higher quality than the Nerfs but the Nerfs are tiny hmm so the level quality of billions and billions of gaussians finally use that compute you could so I mean like think about it this way though like what if I took uh the Nerf approach and I took the little multi-layer perceptron in this Nerf and rather than being like three layers of 256 parameters each or 256 neurons what have I used this giant like 10 billion parameter Transformer instead of this little multi-layer perceptron you know like then have a much higher memory but I bet you the quality of that would be absolutely insane you know so there is kind of a fundamental kind of are you the more explicit you're seeing representation is maybe the more quality the reconstructed views are but you're spending a lot more memory to store that more explicit representation uh okay and then here you have uh Peak signal to noise for synthetic Nerf data sets I'm not even gonna look at this this is basically garbage like synthetic data set and it's a psnr which is kind of a garbage metric so we implemented our method using a pytorch framework so Pi torch obviously very popular nowadays I think it's pretty much the go-to if you're still using tensorflow you should uh complain to your boss that you guys should be using pytorch or at least Jax if you're somehow dead set on Google products use the Nvidia Cub sorting routines for the fast Radix sort also Builds an interactive viewer used for interactive viewing I haven't even looked at this GitHub repo yet maybe honestly I'm a little bit uh intrigued so maybe we do go ahead and look at this and we'll see what the quality does source code available there we use the warm up the computation in lower resolution we start the optimization using four times smaller image resolution we upsampled twice okay so it's a little bit more complicated than we even thought there they're running this optimization process using a smaller image and then up sampling it that's annoying you don't want to have these type of like extra little tricky details like that uh spherical harmonic coefficient is sensitive to the lack of angular information for example nerflight captures where a central object is observed taking the entire hemisphere the optimization works well however the capture has large has angular regions missing uh completely incorrect values of the sh can be produced to overcome this problem we start optimizing only after the zero order 
component, and then introduce one additional band of the spherical harmonics every 1000 iterations. So they're also rolling the SH coefficients out gradually — that's two "one weird trick" schedules now (the resolution warm-up and the SH warm-up) that you need to get the optimization to a good place, which is not ideal. We tested our algorithm on 13 real scenes, the synthetic Blender dataset, and the Tanks and Temples dataset; all results are reported on an A6000 GPU. An A6000 runs around six thousand dollars — and how much memory do we have on this bad boy? 48 gigabytes. A 48 GB GPU, that's not too bad. They show rendered video paths. In terms of quality, the current state of the art is Mip-NeRF 360; we compare against it and the most recent fast methods, taking every eighth photo as a held-out test image for consistent and meaningful comparisons, and we use the standard error metrics most frequently used in the literature. I don't think that's a great excuse — ideally you'd run a human study — but it's an academic group, and academic groups are generally a bit poor, so maybe we shouldn't judge them for not having a double-blind human evaluation of quality. They copy the numbers from the original publications to avoid confusion, and report average training time, rendering speed, and memory. On the "slightly larger networks" note: Instant-NGP comes in two sizes, Base and Big — the same algorithm with a smaller and a larger MLP — and the bigger MLP gets a slightly higher SSIM. And honestly they're crushing it on these metrics: SSIM going from around 0.69 for the baselines up to around 0.81 for theirs, with similar jumps elsewhere — they beat the other techniques on all of these standard metrics. Hmm, maybe I judged them too harshly. We manually downsampled high-resolution versions of each scene's input images to the chosen rendering resolution for our experiments; doing so reduces random artifacts — which makes me wonder about compression: whenever you rescale and re-save JPEGs you get compression artifacts, so why not use PNGs? PNG is lossless, JPEG is lossy. On the same hardware, Mip-NeRF 360's average training time was 48 hours compared to our 35 to 45 minutes, and its rendering time is around 10 seconds per frame; we achieve comparable quality to it after only five to ten minutes of training. They show visual results for held-out test views against previous rendering methods: even Mip-NeRF 360 has remaining artifacts that our method avoids — the blurriness, and fine details such as the bicycle spokes. And here we finally get a number for the Gaussian count: for the random-initialization ablation they start from 100K uniformly random Gaussians — that is so many Gaussians — and our approach quickly and automatically prunes them down to about six to ten thousand, but then the final counts end up in the hundreds of thousands of Gaussians per scene. Dude. It creates way more Gaussians than it prunes: you start from 100K, prune down to roughly 10K, and end up at several hundred thousand per scene, Jesus. (The synthetic scenes use a white background.) Compactness: in comparison to previous explicit scene representations, the anisotropic Gaussians used in our optimization are capable of modeling complex shapes with a lower number of parameters — though I would say this is quite a bit more of an explicit scene representation
than they are making it sound like you know 500 000 gaussians how is that not explicit like there's a gaussian for almost every single little micro volume in the entire like scene so that is explicit scene representation capable of modeling complex shape of the lower number of parameters we showcase this by valuing an approach against the highly compact point-based models obtained by this so yeah I mean they're saying that hey look at us we can do this in 734 megabytes as opposed to the planoxyls which is about two gigabytes but 734 megabytes is still way more than 13 megabytes so this is definitely more of an explicit scene representation than something like a Nerf which is definitely much more implicit and much smaller uh obtained by space carving optimize until we break even usually happens within two to four minutes surpass the supported metrics in one-fourth of the point Cloud resulting in an average model size of 3.8 megabytes as opposed to their nine megabytes ablations okay so now this is going to be a more interesting section here they're going to talk about which things end up mattering the most we isolated the different contributions and choices we made and constructed a set of experiments to measure their effect so very nice uh definition of an ablation study uh blah blah initialization from sfm so how important is these sfm Point Cloud initialization uniformly sample a cube with equal size three times the extent of the input cameras bounding box okay so rather than starting from the points that are coming out of this sfm let's instead start by just uniformly sampling the 3D space with uh gaussians okay so this is uniformly sampling 3D space and this is starting from the sfm so I mean it's actually pretty good like this actually kind of gets most of the parts correct but you can see how this uh kind of Vine in the background is definitely way cleaner in the sfm points okay uh area is not well covered the random initialization method appears to have more floaters than cannot be removed synthetic nerve does not have this Behavior densification okay so now let's see two densification methods the Clone and split which one has the best results so we have no split no clone and then clone and split so the split in the Clone is how they're adding and removing uh the gaussians I guess not removing just adding gaussians over time and we know here that they're going up to 500 000 gaussians per scene from a hundred thousand so there's a lot of splitting and cloning happening it's not like this is like a rare occurrence in the optimization process the splitting and cloning is happening constantly so if you don't split and you only clone you get like these weird like missed art like kind of like the foggy Nerf Misty effect if you clone or if you split but don't clone then you kind of start losing out on the fine details like you get here in the spokes and cloning and splitting gives you the best result okay it's the next thing they compare so the next thing they look at is unlimited depth complexity of Splats with gradients we evaluate if skipping the gradient computation after the end frontmost points will give a speed without sacrificing quality so you have these gaussians and you've ranked them according to which one is closer to the uh actual view plane so technically could you just clip out the last uh gaussians in that ranked list right they're going to be way in the back they're probably not going to be very visible why don't you just get rid of them right why don't you just only care 
The next thing they look at is "unlimited depth complexity of splats with gradients": "we evaluate whether skipping the gradient computation after the N front-most points gives us speed without sacrificing quality." You have these Gaussians, and you've ranked them by how close they are to the image plane, so technically, could you just clip out the last Gaussians in that ranked list? They're way in the back, they're probably not very visible, so why not get rid of them? Why not only push gradients into the Gaussians at the very front of the ranked list? Even if they have a tiny alpha, they're still contributing all of that alpha to your final image, whereas the Gaussians all the way in the back probably aren't visible even if their alpha is high. So can we get better speed by just not calculating the gradients for the ones in the back? "This led to unstable optimization: if we limit the number of points that receive gradients, the effect on visual quality is significant." On the left they limit it to only the first 10 points, and on the right is the full method, and it does seem to matter, which is kind of interesting. What this means to me is that most Gaussians probably have a medium alpha rather than a high one; it could even be that what appear to be hard surfaces are actually multiple, more transparent Gaussians. That's just my guess, but tl;dr: it's not better to clip.

Anisotropic covariance. "An important algorithmic choice of our optimization is the full covariance matrix for the 3D Gaussians." Subrancy with a question: "there's a new paper, Diffusion with Forward Models, can you please do that next?" I can definitely try, man, but there are a lot of people who want papers. Tomorrow I actually have a completely unrelated thing called MetaGPT; it's the top GitHub trending repo, so I figured I should do a video on it. Okay, back to this. "We perform an ablation where we remove the anisotropy by optimizing a single scalar value that controls the radius of the 3D Gaussian on all three axes," so the Gaussian looks the same from every angle. In other words, what if these 3D Gaussians, rather than having three separate covariance axes, i.e. three separate radii for that little ellipsoid, just had one, so all you could control is whether it's a big ball or a little ball? "The optimization results are presented in Figure 10. We observe that the anisotropy significantly improves the quality of the 3D Gaussians' ability to align with surfaces." It obviously makes sense that you get a much better result if you can have these weird ellipsoids that can be very long and thin, or very flat, versus trying to build everything out of Gaussians that are all forced to be little spheres, little balls.

And finally, spherical harmonics: "the use of spherical harmonics improves our overall PSNR since they compensate for view-dependent effects." Where is that... Table 3. Okay, they don't have a qualitative picture for the spherical harmonics ablation.
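To picture what "full covariance versus a single scalar" means in code, here's a minimal sketch of the usual way an anisotropic Gaussian is parameterized, a per-axis scale plus a rotation quaternion giving Sigma = R S S^T R^T, next to the isotropic single-radius version from the ablation; the function names are mine, not the paper's implementation.

```python
import torch

def quat_to_rotmat(q):
    # q: unit quaternion (w, x, y, z) as plain floats
    w, x, y, z = q
    return torch.tensor([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def anisotropic_covariance(scale, quat):
    # Sigma = R S S^T R^T: an arbitrarily oriented, arbitrarily stretched ellipsoid,
    # positive semi-definite by construction.
    R = quat_to_rotmat(quat)
    S = torch.diag(scale)
    return R @ S @ S.T @ R.T

def isotropic_covariance(radius):
    # The ablated variant: one scalar radius, so the Gaussian is always a sphere.
    return (radius ** 2) * torch.eye(3)

# e.g. a long, thin splat aligned with the x-axis:
sigma = anisotropic_covariance(torch.tensor([0.5, 0.02, 0.02]), (1.0, 0.0, 0.0, 0.0))
```

Going through a scale and a rotation, rather than optimizing the matrix entries directly, presumably also keeps the covariance valid (positive semi-definite) throughout training.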
Now, limitations. "Our method is not without limitations. In regions where the scene is not well observed, we have artifacts." That's going to happen with anything that uses SfM as a starting point. "Even though the anisotropic Gaussians have many advantages as described above, our method can create elongated artifacts or 'splotchy' Gaussians." I love that they use the word splotchy. "We occasionally have popping artifacts when our optimization creates large Gaussians; this tends to happen in regions with view-dependent appearance." Popping is when you're moving through camera positions, rendering from this view, then this view, then this view, and suddenly a big Gaussian appears, and then you move just a little bit further and it disappears. "One reason for these popping artifacts is the trivial rejection of Gaussians via a guard band." Right, we saw that when the rasterizer is choosing which Gaussians to sort for each tile, it throws away Gaussians that are too far from the tile. "Another factor is our simple visibility algorithm, which can lead to Gaussians suddenly switching depth and blending order." It's kind of weird that they would suddenly switch. "This could be addressed by anti-aliasing, which we leave as future work."

"We currently do not apply any regularization to our optimization." I don't know about that; I think the adding and removing of Gaussians based on those thresholds is a form of regularization. What I think they're referring to is that they don't have an extra regularization term on the actual reconstruction loss: the reconstruction loss is just L1 plus structural similarity, and they could add extra terms on top of that.

"We use the same hyperparameters for our full evaluation; early experiments showed that reducing the position learning rate can be necessary for convergence on very large scenes." Yeah, so you probably want different learning rates, and different gradient clipping, for the different parts of these Gaussians. I would assume you want to take different gradient steps for the alpha than for the position, which is kind of what they're talking about here, or maybe for the position you want to take bigger steps and move the Gaussians around a lot more.

"Even though we are very compact compared to previous point-based approaches, our memory consumption is significantly higher." That's no good. "During training of large scenes, peak GPU memory consumption can exceed 20 gigabytes," because you have to hold all these Gaussians in memory, whereas a NeRF is a lot smaller. "However, this figure could be significantly reduced by a careful low-level implementation of the optimization logic." There you go, there's your next paper: a low-level, optimized implementation of 3D Gaussian splatting. "Rendering requires significant GPU memory to store the full model, plus an additional 30 to 500 megabytes for the rasterizer, depending on scene size and image resolution. There are many opportunities to further reduce the memory consumption of our method; compression techniques for point clouds are a well-studied field, and it would be interesting to see how such approaches could be adapted to our representation." There's a lot of potential here. The same way that when NeRFs came out there was a huge explosion of possible research directions, I feel like it's the same kind of thing with this: there are so many different things you could try and see if they work better or not. By the way, this scene down here is called the "Dr Johnson" scene; I have no idea what that is, probably the house of someone called Dr Johnson.
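On that learning-rate point from a moment ago: here's a minimal PyTorch sketch of giving each Gaussian attribute its own step size with optimizer parameter groups, which is the kind of per-attribute tuning the paper alludes to; the tensor shapes and learning-rate values below are placeholders I picked for illustration, not the paper's settings.

```python
import torch

num_g = 100_000  # pretend we have 100k Gaussians

# Each attribute lives in its own leaf tensor...
xyz      = torch.zeros(num_g, 3, requires_grad=True)   # positions
scales   = torch.zeros(num_g, 3, requires_grad=True)   # per-axis extents
rots     = torch.zeros(num_g, 4, requires_grad=True)   # rotation quaternions
opacity  = torch.zeros(num_g, 1, requires_grad=True)   # alpha (before sigmoid)
sh_color = torch.zeros(num_g, 48, requires_grad=True)  # view-dependent color

# ...so one Adam optimizer can give each attribute its own step size.
optimizer = torch.optim.Adam([
    {"params": [xyz],      "lr": 1.6e-4},  # positions: small, careful steps
    {"params": [scales],   "lr": 5e-3},
    {"params": [rots],     "lr": 1e-3},
    {"params": [opacity],  "lr": 5e-2},    # alpha is allowed to move faster
    {"params": [sh_color], "lr": 2.5e-3},
], eps=1e-15)

# And the position learning rate can be tuned on its own, e.g. for a huge scene:
optimizer.param_groups[0]["lr"] *= 0.5
```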
Alright, finally we made it to the end, guys: discussion and conclusions. Hype. "We have presented the first approach that truly allows real-time, high-quality radiance field rendering, in a wide variety of scenes and capture styles, while requiring training times competitive with the fastest previous methods." Seems a little bit grandiose. None of this is explicitly wrong, it's just worded and phrased in a way that makes it sound better than it is, but pretty much every paper does that, so that's not unique. "Our choice of a 3D Gaussian primitive preserves properties of volumetric rendering for optimization while directly allowing fast splat-based rasterization. Our work demonstrates that, contrary to widely accepted opinion, a continuous representation is not strictly necessary to allow fast and high-quality radiance field training." Their representation is not continuous because it's a bunch of discrete Gaussians. "The majority of our training time is spent in Python code, since we built our solution in PyTorch," so it's going to be slow compared to, say, writing it in Rust, "and only the rasterization is implemented as optimized CUDA kernels. We expect that porting the remaining optimization entirely to CUDA, as is done in Instant NGP, could enable significant further speedups." Unfortunately, everyone who knows enough CUDA to write optimized CUDA kernels is getting paid millions of dollars to work on large language models, so there's a bit of a drought of people with that kind of low-level programming skill, and they're probably not going to spend their time as post-docs in a French university lab, which is the issue. "We also demonstrated the importance of building on real-time rendering principles, exploiting the power of the GPU and the speed of the software rasterization pipeline architecture. These design choices are key to providing a competitive edge in performance."

"It would be interesting to see if our Gaussians can be used to perform mesh reconstruction of the captured scenes." You might ask why you'd care about the mesh. I think meshes can eventually go away, but a lot of people use NeRFs to create a mesh, and that mesh is what they use in their actual video game, because most games and most engines need a mesh. You can't actually use a NeRF in a video game, so you turn it into a mesh first, and even though that step seems kind of nonsensical (if you have the NeRF, just use the NeRF), people still want to do it. You could do something similar here: we have this representation of a scene as a bunch of Gaussians, can you turn it into a mesh so I can bring it into my game? "Aside from the practical implications, given the widespread use of meshes, this would allow us to better understand where our method stands exactly in the continuum between volumetric and surface representations." That's kind of interesting. One thing they could do to answer that question is show us the distribution of these Gaussians: are they all on the surface of the object, or spread through the entire volume? If I had a 3D Gaussian splatting representation of this black bottle, are all the Gaussians on the surface of the bottle with the inside just empty space, or are they more evenly distributed? And what do the alpha values look like: do you have one basically perfectly opaque Gaussian with nothing behind it, or a layer of more see-through Gaussians that cumulatively become the hard surface? I would have liked to see a plot of that: show me the 3D positions of the Gaussians for an object, show me their alpha values, and so on.
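Whether a surface ends up as one opaque splat or a stack of translucent ones is all down to how the alphas composite, so here's a tiny NumPy sketch of the standard front-to-back alpha blending that splatting (and NeRF-style volume rendering) uses per pixel; the example values are made up.

```python
import numpy as np

def composite_front_to_back(colors, alphas):
    """colors: (N, 3) RGB of the depth-sorted splats covering one pixel,
    alphas: (N,) their opacities after the 2D Gaussian falloff is applied.
    Pixel color = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    pixel = np.zeros(3)
    transmittance = 1.0               # how much light still gets through
    for c, a in zip(colors, alphas):
        pixel += transmittance * a * c
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:      # early exit once the pixel is saturated
            break
    return pixel

# A "hard" red surface can be one nearly opaque splat...
print(composite_front_to_back(np.array([[1.0, 0.0, 0.0]]), np.array([0.99])))
# ...or a stack of translucent splats that add up to almost the same thing:
stack = np.array([[1.0, 0.0, 0.0]] * 4)
print(composite_front_to_back(stack, np.array([0.7, 0.7, 0.7, 0.7])))
```

Both calls land very close to pure red, which is exactly why you can't tell from the rendered image alone which of the two situations the optimizer actually converged to.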
"In conclusion, we have presented the first real-time rendering solution for radiance fields, with rendering quality that matches the best expensive previous methods, and with training times competitive with the fastest existing solutions." And they thank "Adobe for generous donations." Interesting, so this is an Adobe-funded lab. Cool. That's it, that's it guys. Optimization, densification, the per-scene error metrics: we already looked at those. Okay, let me take a big sip of this mate here and then let's do a paper summary. If you guys have any questions, now is the time.

A question about NLP: I'm trying to think of what a Gaussian would represent in NLP. I don't know, I can't think of anything off the top of my head. A Gaussian is just... it's a Gaussian. Maybe embeddings? Maybe the embedding of the word itself, whenever you're using a tokenizer, is like a Gaussian in some multi-dimensional space, maybe something like that. I don't know, that would be really weird, but hey, who knows, maybe it works. The whole point of this paper is that you're using 3D Gaussians as a prior for 3D objects in 3D space, but you could envision using n-dimensional Gaussians as a prior for n-dimensional embeddings of tokens for natural language. I don't know; it's on you, Christopher, you've got to go out and do that. Do it for the internet points. The problem with NLP is that even if you discover an approach that's better, you're never going to be able to show that it's better, because you're comparing against people who are spending millions of dollars on their methods. It could be the case that there are all kinds of better ways of implementing a Transformer, such as RWKV, but they're never going to get the same training budget, the same dataset, and the same amount of engineering effort as the current techniques, so it looks like the current techniques always win, even though in theory, with the same data, training budget, and everything else, there could be way better algorithms that some random person invented. We're just not using them because we're not using them. There are so many things like that.

Okay, let me summarize this paper, though. Today we read "3D Gaussian Splatting for Real-Time Radiance Field Rendering." This is basically a new type of NeRF. In a NeRF, what you're trying to do is represent a 3D scene; here the scene is a bicycle on a path in front of, whatever that thing is.
You're trying to be able to say: hey, I have a bunch of pictures of this bicycle, and I want to be able to create new pictures of it. Those novel views are basically new 2D images of a 3D scene, so we need a way to represent that 3D scene, and there are a lot of different ways of doing that. In a NeRF you represent it as a radiance field: a little volume represented by a neural network that says, give me any point in this volume and I'll tell you the color and the density at that point. That's what a NeRF is. But NeRFs are very slow, and they're slow because not only do you have to train one for every individual scene, but for every pixel you march a ray, and for every sample point along that ray you have to run inference on a little neural net, so it's potentially thousands or millions of inferences just to get the final 2D view you're interested in.

What this paper does is go all the way back to the beginning and say: let's not assume we have some continuous radiance field we can evaluate at every single point. Let's instead have a discrete set of points that are 3D Gaussians, and we'll move those points around, make them a little wider, a little shorter, a little fuzzier, a little more see-through, and over time this discrete set of Gaussians comes to represent the scene. In the related work they talk about how this is actually more akin to what are called point-based techniques. Point-based techniques, like Plenoxels, are representations of 3D scenes that are more explicit. A mesh is point-based: you store all the little points and the connectivity between them. Surfels are point-based: you store all the little points, and each surfel has a specific radius and a normal vector. It's a 3D representation of a scene that is much more explicit, and these 3D Gaussians are much more similar to that than they are to a NeRF. You can see that later on when they compare against other methods: Plenoxels, a point-based method, needs two gigabytes of memory just to store every one of those little points and their properties. Same thing here: they say they have 500,000 Gaussians, and you need to store the XYZ position of every single one of those Gaussians, the covariance of every single one, the color of every single one, and the alpha of every single one, so at the end of the day it takes 734 megabytes just to store a single scene. Compare that to a NeRF: Instant NGP is 13 megabytes. They're tiny, like a hundred times smaller, and the reason is that the only thing they have to store is the weights of a little neural net, and that little neural net implicitly stores the entire scene. So this is kind of a point-based method. They're presenting it as the better NeRF, but really there's a whole rich history of people coming up with all kinds of other point-based methods, so it's not as novel and unique as I maybe thought it was at the beginning. But it's still pretty cool and impressive.
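As a sanity check on those sizes, here's a quick back-of-the-envelope script for what explicitly storing N Gaussians costs, assuming each one keeps a position, a per-axis scale, a rotation quaternion, an opacity, and 48 degree-3 spherical-harmonic color coefficients in 32-bit floats; the exact layout is my assumption, not the paper's file format.

```python
# Back-of-the-envelope storage for an explicit set of Gaussians (32-bit floats):
#   3 position + 3 scale + 4 rotation (quaternion) + 1 opacity
#   + 48 spherical-harmonic color coefficients (degree 3: 16 per RGB channel)
floats_per_gaussian = 3 + 3 + 4 + 1 + 48        # = 59
bytes_per_gaussian = floats_per_gaussian * 4    # = 236 bytes

for n in (100_000, 500_000, 3_000_000):
    print(f"{n:>9,} Gaussians ~ {n * bytes_per_gaussian / 1e6:7.1f} MB")

# 500k Gaussians lands around 120 MB with this layout; scene files in the
# hundreds of megabytes would correspond to a few million Gaussians, either
# way far more than the ~13 MB of network weights an implicit method like
# Instant NGP stores.
```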
What else... the optimization and densification. They have this general optimization process where they start from structure-from-motion points. Structure from motion gives you, for every picture, an estimate of where the camera was, plus a sparse point cloud, and unfortunately that process of going from a bunch of pictures to camera poses and SfM points is noisy and not perfect. NeRFs also start from these SfM camera estimates, so they have that same problem; one thing that both NeRFs and this approach suffer from is sensitivity to the SfM initialization. In this paper, what they do is start from the SfM points, put a little 3D Gaussian at each of those points, and then run this reconstruction loss: they keep shifting the Gaussians around, cloning them, splitting them (occasionally splitting a Gaussian into two, or merging or deleting some), and they keep rendering the image with the differentiable tile rasterizer and comparing the rendered image to the ground-truth image. The way they compare the two images is with an L1 loss plus an SSIM (structural similarity) loss, which is basically asking, are these two images the same? And if they aren't the same, why not: they're not the same because of this particular pixel; which Gaussian does that pixel correspond to; okay, that Gaussian needs its color shifted slightly this way, so we shift the color slightly that way. Over time you keep pushing these gradients, and eventually the 3D Gaussians start taking on the rough shape, color, and appearance of your 3D scene.

That's the optimization process. Unfortunately it has a lot of hyperparameters; that's one thing I didn't like. There's a specific threshold where every 100 iterations you go through and delete any Gaussians whose transparency is below some amount; every so many steps you look at the positions and do another check; all these little thresholds and one-weird-tricks. And there are even more that they only mention later, like first training on a slightly smaller version of the image and then gradually increasing the resolution. So there are a lot of little tricks in this optimization process, which isn't great; I think it's much better to have a clean, simple optimization process that doesn't have ten thousand different little threshold tricks and steps. But this is far from the worst I've seen; it's pretty standard.
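Here's a minimal sketch of that reconstruction loss, an L1 term blended with a structural-similarity term; I'm assuming an off-the-shelf SSIM such as the one in the pytorch-msssim package, and the 0.2 weighting is what I believe the paper reports, so treat both as assumptions.

```python
import torch
from pytorch_msssim import ssim  # any differentiable SSIM implementation works here

def reconstruction_loss(rendered, target, lam=0.2):
    """rendered, target: (B, 3, H, W) images in [0, 1].
    L = (1 - lam) * L1 + lam * (1 - SSIM), so both terms pull toward the target."""
    l1 = torch.abs(rendered - target).mean()
    d_ssim = 1.0 - ssim(rendered, target, data_range=1.0)
    return (1.0 - lam) * l1 + lam * d_ssim
```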
And then I think the other big thing they do here is this tile rasterizer. First of all, it's differentiable, which means you can backpropagate, you can chain-rule backwards from the reconstruction loss on the image all the way back to the actual properties of the 3D Gaussians. But the reason this tile rasterizer is great is that it breaks the image up into tiles and then uses a very fast sorting algorithm that runs well on a GPU to sort the Gaussians: which Gaussian am I going to see first, which second, which third, which fourth, and so on. This sorting is fast enough that the whole process runs quickly, and that's compared to ray marching, which is the rendering algorithm used in NeRFs, and which is slower when you actually implement it in the real world than this tile rasterization. So a huge component of why this is fast, and why you can get real-time rendering, is that the way the tile rasterizer works out, you can do it very quickly on a GPU.

And then they compare to NeRFs and to Plenoxels, which is another kind of point-based technique, and it does seem to work quite well. The results speak for themselves: you get much better reconstruction of fine details, a lot less of the blurring you usually see with NeRFs, that fuzzy, misty effect, and it's quite clean. And that's basically it. They do some ablations and point out a bunch of possible future research directions, but overall I thought this was a pretty interesting paper, well written, very clean. They do try to lead you into thinking they've done something crazy, and then you realize this is just a point-based method that works because it uses a fast GPU radix sort, but it's pretty cool stuff.
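Here's a small NumPy sketch of the sorting idea: give every Gaussian-tile overlap a key that packs the tile ID in the high bits and the depth in the low bits, so one global sort both groups splats by tile and orders them front-to-back within each tile. The key layout here is my illustration of the idea, not the actual CUDA implementation.

```python
import numpy as np

def build_sort_keys(tile_ids, depths, depth_bits=32):
    """tile_ids: (M,) tile index for each Gaussian-tile overlap,
    depths: (M,) view-space depth of that Gaussian.
    Tile id goes in the high bits, quantized depth in the low bits, so a single
    sort groups splats by tile and orders them front-to-back within each tile."""
    d = depths - depths.min()
    d = (d / (d.max() + 1e-9) * (2**depth_bits - 1)).astype(np.uint64)
    return (tile_ids.astype(np.uint64) << np.uint64(depth_bits)) | d

# Toy example: two tiles, three splats each.
tile_ids = np.array([0, 1, 0, 1, 0, 1])
depths   = np.array([5.0, 2.0, 1.0, 9.0, 3.0, 4.0])
order = np.argsort(build_sort_keys(tile_ids, depths))  # a GPU radix sort in practice
print(order)  # tile 0's splats come first, nearest first, then tile 1's
```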
From chat: "Mojo is probably the destination to help with speedups on this." Yeah, so Mojo is basically Python but fast; I think it's being led by Chris Lattner, the guy who created Swift at Apple. So I don't know, we'll see. Mojo so far is mostly promises and hype; I haven't actually seen them do much in practice with it yet, but who knows. This is a good example of something that could use Mojo: they say they implemented everything in Python, so if they could take that same Python code and run it with Mojo instead, maybe it does end up a lot faster. But I feel like the better way to do this would be to implement it directly in CUDA kernels, directly in a more performant language than Python, and it would probably be super fast. And who knows, maybe this is the 3D scene representation we'll see in video games. Maybe in 2030, when you're sitting in your VR headset moving around some 3D world, the underlying representation of that world won't be a bunch of meshes with textures and lights bouncing off the objects; maybe it will just be a giant set of little Gaussians.

One extra point before I go (sorry, I'm flying through here): one thing that none of the techniques in this paper address is that all of these are for static scenes. Dynamic scenes, scenes with a time component, where you pick up the bike and do something with it, don't exist yet. Right now we're just saying that, compared to NeRFs, 3D Gaussians give higher quality and render a bit faster, but at some point we're going to want dynamic scenes, like movies, 3D movies. So is it the case that 3D Gaussians are a better prior once you start putting time in, or is a NeRF, a little neural net, a better prior when you add time? Once you add the dimension of time and you want a 3D scene over time, it could be that 3D Gaussians make certain things easier: maybe you just add an extra time component and you can still use this tile-based rendering and it's just as fast, whereas if you try to add time to a NeRF it just blows up and it's terrible. I don't know; it could be that for some other reason we don't know yet, the 3D Gaussians end up way better than the NeRFs, but so far they're basically a little more memory, a little faster, and a little higher quality.

And that's that, guys. I need to go, but I appreciate the time. Thank you John, thank you Christopher, thank you Luke, thank you Subrahu, Subraan, Sue, thank you Navjot, thank you Scott, thank you Bob, thank you everybody who came, everybody who commented, thank you Ed. Thank you for the questions, thank you for the comments, thank you for your interest in science. I don't know whatever Hebron says, but thanks guys. Have a great Monday and I'll see you later. Peace.
Info
Channel: hu-po
Views: 19,762
Id: xgwvU7S0K-k
Length: 166min 20sec (9980 seconds)
Published: Mon Aug 14 2023