Flow Matching for Generative Modeling (Paper Explained)

Captions
Hello there. Today we're going to look at Flow Matching for Generative Modeling, by people from Meta AI and the Weizmann Institute of Science. This is a bit more technical, but it's the fundamental thing to understand if you want to follow how the world that started with diffusion models has progressed. Notably, Stable Diffusion 3 moves away from the classic diffusion-based approaches and transitions to this more general flow matching approach. We've discussed this paper and the Stable Diffusion 3 paper in our paper discussions on Saturday nights on Discord, so if you want to join those, we have them almost every Saturday and they're always super interesting. A lot of the insight here is not my own but comes from the community, who provided generous input into what I now know about these things, so a big thank you, and let's dive in.

So what is this about? Traditionally, diffusion models have been employed to do image generation, notably text-to-image generation. Think of something like DALL·E: you have some prompt, say "a dog in a hat", that goes into a pipeline, and out comes an image with some sort of dog in a hat (I'm going to butcher the drawing, I'm very sorry; also it's International Hoodie Day, so what are you going to do about it). The process here is usually a diffusion process, or at least it has become one in the last few years. This is different from traditional GAN- or VAE-based approaches to image generation in that a diffusion process is a multi-step process, and that means you can invest more computation in generating a single image than a single forward pass allows.

A diffusion process starts out with an image that is complete noise: you sample random noise from a standard Gaussian distribution, and then you iteratively compute less and less noisy versions of that image until you finally arrive at your target image. The way you train these models is by taking an image from the data set (these images here are generated, but pretend they're your data set) and iteratively making it more noisy: you add noise, then more noise, then more noise, and so on. The classic diffusion papers say that if you do this enough times, then no matter which image you start from, the distribution you end up with, taken over all your data samples, is a standard normal distribution, provided you do it correctly. So there is a way to add noise repeatedly until you have completely destroyed the signal, after which the distribution of the final state is a known distribution. That makes it possible to sample from that distribution, and as long as you learn to revert each one of the intermediate steps, you can trace the steps back and arrive at something that looks like the original data.

So a diffusion process starts with a data set and continuously adds noise, a noising process, in order to construct a series of intermediate data points at t = 1, t = 2, and so on until t = ∞.
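As a side note, this forward noising step can be written in closed form for the standard DDPM-style variance-preserving process. The sketch below is not from the flow matching paper itself, and the `betas` schedule values are just typical assumptions; it only illustrates the "add noise until the signal is destroyed" idea.

```python
import torch

def forward_noise(x0: torch.Tensor, t: int, betas: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I) for a
    DDPM-style variance-preserving process; for large t this is almost pure noise."""
    alphas = 1.0 - betas
    abar_t = torch.prod(alphas[: t + 1])        # cumulative product up to step t
    eps = torch.randn_like(x0)                  # fresh standard-normal noise
    return torch.sqrt(abar_t) * x0 + torch.sqrt(1.0 - abar_t) * eps

# illustrative use: a tiny flat "image" and a 1000-step linear beta schedule
betas = torch.linspace(1e-4, 0.02, 1000)
x0 = torch.tensor([0.2, -1.3, 0.7, 0.0])
x_T = forward_noise(x0, t=999, betas=betas)     # essentially a standard-normal sample
```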
You then learn a neural network to revert one of those steps. So this here could be a neural network that, at t = 2, reverts one noising step: it looks at a noisy image and produces a slightly less noisy image. That's what you learn, and once you have that neural network you can use it to traverse the denoising steps backwards. Now, there are a lot of advances and tricks in diffusion models so that you don't actually have to go step by step; you can skip steps, and in fact modern diffusion approaches will directly try to predict the final state, then noise again to a less noisy version, predict the final state again, and so on. So there's a lot going on in that sense, but the core is always that we define a noising process and try to reverse it.

Flow matching generalizes this. Flow matching says: why even define a fixed noising process, with all the problems that come with it? Can't we just say, okay, we start out with a distribution; let's do this in one dimension, a standard Gaussian (I apologize for my drawing skills). This is p0, the distribution we want to start with, and then there is some data distribution, p1. Why don't we just learn to morph the distribution we start with, which we can efficiently sample from, into the distribution of the data, without explicitly defining the process that turns one into the other? As long as we can learn to reshape that distribution over time, we're good.

Now obviously there are a few problems. First of all, we don't know the data distribution; we have no clue what it is. But what we can do is define an approximate distribution, because we have samples. Let's assume we have three data samples, one here, one here and one here. We can construct a kind of pseudo-distribution with a Gaussian mixture model: place a Gaussian at each data point (one here, one there, one here), sum them up, and normalize, and that gives you a distribution like this. We call this distribution q, and it is defined purely by the data: we just place Gaussians around each of the data points and put them all together, which gives us an approximate target distribution. So what if we just learn how to morph p0 into q? Then we're done.

Flow matching is about how to do that: how to construct and learn a flow from one distribution to another while only having samples available from that second distribution. One key component is this: if we can figure out how to morph one sample into another sample, working only on single samples, we call those conditional flows. If we can figure out how to do that efficiently, then through a series of proofs we can show that this is in fact enough to characterize the entire flow between the two distributions. That's the more mathematical contribution: showing that we only have to worry about single paths, single samples, and that aggregating them will automatically give us the correct thing for the whole probability distribution.
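To make the pseudo-target concrete, here is a small sketch of the Gaussian mixture q built around the data samples; the mixture width `sigma` is an assumption, and the paper itself only ever needs samples from q, never its density.

```python
import math
import torch

def gmm_log_density(x: torch.Tensor, data: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Log-density of q: an equal-weight mixture with one isotropic Gaussian of
    standard deviation `sigma` centered at every data sample."""
    n, d = data.shape
    sq_dist = ((data - x) ** 2).sum(dim=1)                       # (n,) squared distances to each sample
    log_comp = -0.5 * sq_dist / sigma ** 2 - 0.5 * d * math.log(2 * math.pi * sigma ** 2)
    return torch.logsumexp(log_comp, dim=0) - math.log(n)        # log of the averaged mixture

data = torch.tensor([[1.0, 2.0], [-0.5, 0.3], [2.0, -1.0]])      # three toy data points
print(gmm_log_density(torch.tensor([1.0, 2.0]), data))           # high density near a sample
print(gmm_log_density(torch.tensor([5.0, 5.0]), data))           # very low density far away
```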
The other aspects of this paper are the more practical ones: how can we actually do that, and make it happen in a tractable way so we can compute it? But I hope this gives you a bit of an introduction to the difference between diffusion at the top, where we explicitly define a noising process, and flow matching at the bottom. In fact, as we'll see, flow matching is the more general framework: if you define these flows in a very particular way, you actually recover the diffusion process. You can parameterize flow matching so that you end up with exactly the flow that would be induced if you regarded the diffusion objective as a flow. But the paper then argues that there are better ways to characterize the flow than the diffusion objective, ways that lead to more robust sampling, where you need fewer function evaluations, and so on.

The central objects are the following. First, they define a probability density path: a function p : [0, 1] × R^d → R_{>0}, where R^d is our data space and [0, 1] is a time variable. So it's a time-dependent probability density function, which they call a probability density path: you give it a time and a place in data space, and it tells you the probability density of that place at that time. To make this more tangible, they have examples where they morph densities into one another. Imagine a 2-D data space; we start at t = 0 with a distribution like a Gaussian around zero, and we want to flow that into a different distribution at t = 1, in their example a Gaussian centered around some data point with a much smaller standard deviation; time runs from 0 to 1 with fractional times in between. So p at time 0 of a particular point might be something like 0.1, because there is a bump around the origin that falls off as you move away; but the same point at time t = 1, p1 of that point, is going to be different, say 0.003, because that point is much further away from the new center and the standard deviation is smaller. So a time-dependent probability density path is one core object.
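As a toy illustration of that 2-D example, here is a sketch that evaluates such a time-dependent Gaussian density for a path whose mean moves linearly from 0 to the data point and whose standard deviation shrinks linearly to a small value; the linear schedule and `sigma_min` are assumptions (it happens to be the straight-line choice discussed later).

```python
import math
import torch

def gaussian_path_log_density(x: torch.Tensor, x1: torch.Tensor, t: float,
                              sigma_min: float = 0.1) -> float:
    """log p_t(x | x1) for the path N(t * x1, (1 - (1 - sigma_min) t)^2 I):
    the mean slides from 0 to x1 and the std shrinks from 1 to sigma_min as t goes 0 -> 1."""
    mu_t = t * x1
    sigma_t = 1.0 - (1.0 - sigma_min) * t
    d = x.numel()
    sq = float(((x - mu_t) ** 2).sum()) / sigma_t ** 2
    return -0.5 * sq - d * math.log(sigma_t) - 0.5 * d * math.log(2 * math.pi)

x1 = torch.tensor([3.0, 1.0])                  # the target data point
for t in (0.0, 0.5, 1.0):                      # density *at the data point* rises over time
    print(t, round(gaussian_path_log_density(x1, x1, t), 3))
```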
The second core object is a time-dependent vector field, which they call v. It has almost the same signature, except it takes a time and a place and maps to R^d: v : [0, 1] × R^d → R^d. Together with this comes the next object, a time-dependent diffeomorphic map called a flow, φ, which is connected to the vector field as follows: the flow at time zero of any point is just that point itself, φ_0(x) = x, and then the change of the flow over time is given by the vector field, d/dt φ_t(x) = v_t(φ_t(x)).

What that ultimately means is that v determines, for each point at that particular time, how it is supposed to flow, as an instantaneous rate of change; it's almost like the speed and direction at every point. All of these points go there, go there, go there; if a point were here it would go there, because it needs to end up in this distribution over here. So the vector field, at every point, defines in which direction and at which speed a point should move, such that if we start at any point and follow v over time, that point ends up at a corresponding point in the resulting distribution.

We haven't connected this to the probability density paths yet; for now the vector field just defines some way of moving, and if we move along it we end up somewhere at t = 1. Obviously our goal is to construct the vector field such that, if we apply it to the distribution we started with, this Gaussian right here, and every point of that distribution flows along the vector field, we end up at the target probability distribution we want.

One additional thing: the vector field itself is also time dependent, meaning it doesn't have to remain constant. You start out at t = 0 and maybe the vector field tells you to go this way, but a time step later the vector field could be different and tell you to go some other way. So you have to follow a changing vector field as you move through time, and the path you take is called the flow: the flow is the path of every single point moving along the trajectory given by the vector field at each time. I hope that's somewhat imaginable. The cool thing is that in practice, for the optimal transport objective they propose, the thing they end up with is actually a constant vector field, which makes the final result one step easier; but the framework is laid out so that the vector field can also be time dependent, and integrating across that requires some serious math.

Now you can understand this equation: the flow has the property that it starts at the point itself, so for every point x you start at x, and in each time step you go in the direction of the vector field at the point you happen to be. You're here, the vector field says go here, then you're there, you look at the vector field at that point and time, which maybe now says go there, and so on. How do you use this, ultimately? You sample an original point from your source distribution, your Gaussian, and then you use an ODE solver to forward-solve this equation in continuous time until t = 1. That's how you would use it once we have everything.
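Here is what that sampling loop could look like with the simplest possible integrator; it is a sketch under the assumption that `v_theta(t, x)` is some trained vector-field network, and a real implementation would typically hand this to a proper ODE solver instead of plain Euler steps.

```python
import torch

@torch.no_grad()
def sample_with_euler(v_theta, dim: int = 2, n_steps: int = 100) -> torch.Tensor:
    """Push one source sample forward along the learned time-dependent vector field
    v_theta(t, x) from t=0 to t=1 with fixed-step Euler integration."""
    x = torch.randn(dim)                       # x_0 ~ N(0, I), the source distribution
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = torch.tensor(k * dt)
        x = x + dt * v_theta(t, x)             # x_{t+dt} = x_t + dt * v_t(x_t)
    return x                                   # approximately a sample from the target
```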
So the goal, obviously, is: how do we get this vector field? That's the crucial point, because that's what defines the flow, and that time-dependent vector field is what we're going to learn. Here is how the paper connects the two: the vector field is said to generate the probability density path if its flow satisfies equation 3. This is ultimately how you connect a flow with a probability density path, and it essentially means you take the original probability density and push it through the flow. The flow is just a way of moving, and if you take the density and push it through the flow, you end up with the probability density path: every point in the density moves along the flow, so you can imagine taking each point with its corresponding density value, moving those through the flow, and ending up with some other density, which is then hopefully the target density.

All right, flow matching: what do we do? We regress that vector field. This looks deceptively simple. It essentially means: take a neural network v, take this probability density path and the vector field that generates it, and regress on it, meaning learn to match that vector field for each given position and time. That sounds easy enough, but obviously the question is how you get the probability density paths in the first place, and how you get the corresponding vector fields, and that is where I think this paper's contribution comes in.

First, as we said before, we don't know the probability density of the target distribution, but we can approximate it via samples, by placing a Gaussian mixture model on the target samples, and that's good enough for now. The second recognition is that we can define all of these things, the probability density path and the vector field, in terms of individual samples, and that's what this section is about. They say: a simple way to construct a target probability path is via a mixture of simpler probability paths. Given a particular sample x1, we denote by p_t(x | x1) a conditional probability path, conditioned on one particular data sample; the subscript 1 means that sample lives at time t = 1, that is, it is part of the actual data distribution at the end of the process. So we ask: if we condition on that particular sample, what does our probability path look like? We can impose boundary conditions: at time zero, the probability density should just be the density of our original source distribution, no matter what the target data point is, because that plain, data-independent source distribution is what we ultimately want to sample from; and at time t = 1, they design p1 to be a distribution concentrated around the data point, for example a Gaussian with its mean at the data point and a small enough standard deviation, so it's just a small Gaussian centered at that particular data point.
They then say: marginalizing the conditional probability paths over the target data gives rise to the marginal probability path. So now we ask: if we have these conditional paths for individual data points, how can we aggregate them? The answer is just to marginalize: we consider all of the target data we have and aggregate across it, p_t(x) = ∫ p_t(x | x1) q(x1) dx1, and that gives us a total probability path. Note that we still have not said how we're going to get from a particular sample in the source distribution to a particular sample in the data distribution; the answer is going to be a straight line, but we don't know that yet. We just say: instead of defining a path between two entire distributions, let's take a single sample of the data and define how to go from one sample to the other. That could still be any one of an infinite number of paths, but no matter which one we choose, we can aggregate the paths like this, weighted by the target distribution, by marginalizing over the target data, and we get an aggregate probability density path. So from the individual sample paths we can construct a whole probability path (not a flow, sorry, a probability path). They note that in particular, at time t = 1, the marginal probability is a mixture distribution that closely approximates the data distribution; so if you do this, then at t = 1 this is approximately the data distribution, provided the mixture of Gaussians represents it well.

Interestingly, they say, we can also define a marginal vector field by marginalizing over the conditional vector fields, in the following sense. Again assume we have some way of figuring out the vector field that pushes a particle along over time for an individual sample: we pick one source sample, we pick one target sample, we define some path, and imagine we can figure out the vector field of that one path. The question is: if we have all of that, can we aggregate those vector fields across the different samples and come up with a total vector field into which we can plug the entire source distribution and have it moved to the entire target distribution? The answer is yes: we again marginalize over the conditional vector fields, as long as we reweight them correctly, u_t(x) = ∫ u_t(x | x1) p_t(x | x1) q(x1) / p_t(x) dx1. Again we are aggregating across the entire data set, if you will, and reweighting by that factor: p_t(x | x1) is the conditional path we assumed we can construct, q(x1) is the Gaussian-mixture distribution of the target data, and p_t(x) is the marginal path we just figured out above. As long as we reweight by that, we can build the total vector field from the individual conditional vector fields; it's just a matter of weighting the different vector fields over time in the correct way.

The key observation is this: the marginal vector field, the thing we just looked at, generates the marginal probability path from above. They prove this as well: if we define the vector field as above, it will generate this marginal probability path.
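Purely for intuition, here is how that reweighted aggregation would look as a Monte Carlo sum over the data set; `cond_u` and `cond_log_p` are hypothetical helpers standing in for whichever conditional vector field and conditional density you chose, and the whole point of the paper is that you never have to evaluate this sum during training.

```python
import torch

def marginal_vector_field(x, t, data, cond_u, cond_log_p):
    """u_t(x) approximated over an empirical, equal-weight q(x1):
        u_t(x) = sum_i u_t(x | x1_i) * p_t(x | x1_i) / sum_j p_t(x | x1_j)."""
    log_w = torch.tensor([float(cond_log_p(x, x1, t)) for x1 in data])  # log p_t(x | x1_i)
    w = torch.softmax(log_w, dim=0)                                     # normalized weights
    u = torch.stack([cond_u(x, x1, t) for x1 in data])                  # conditional fields
    return (w.unsqueeze(-1) * u).sum(dim=0)
```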
That means that if we take the source distribution, that one standard Gaussian, and push it along this vector field over time, it will in fact end up at the target distribution, and it will do so by following the marginal probability path from above, the aggregate of the individual sample paths. This might not seem like much, but it connects how to move individual samples between source and target with how to move the entire distribution between source and target. That's one of the main recognitions, and it's what will make this tractable, because we can only ever operate on samples; we cannot do big integrations across entire distributions.

Unfortunately, they say, due to intractable integrals in the definitions of the marginal probability path and the marginal vector field, it is still intractable to compute u, and consequently intractable to naively compute an unbiased estimator of the original flow matching objective. Instead, they propose a simpler objective which, surprisingly, results in the same optima as the original one: the conditional flow matching objective. So what do we do? This is again a step in the same direction. Instead of computing the whole vector field, which we could now in principle aggregate from the individual samples, and then regressing a neural network on it ("hey neural network, here is the vector field we get from the data, please learn to predict it, without us doing the whole integration"), which is still intractable, we can pull the same trick as before: conditional flow matching, meaning we do flow matching on individual samples. We sample a target data point, we sample a source data point, and in doing so, since at this point we still don't know how these paths are constructed, technically we sample a probability path from one point in the source distribution going somehow to the target distribution. As long as we can regress on the vector field given rise to by this one sample along this one path, we're good; that's all we need to do, because the conditional flow matching loss and the original flow matching loss have the same gradients. Up to a constant independent of the parameters, the flow matching and conditional flow matching losses are equal, hence their gradients with respect to the neural network parameters are equal. So instead of aggregating first and learning a neural network to predict the entire vector field, we can just take a neural network and have it predict the conditional vector field based on an individual data point, and if we learn to optimality that gives us the same parameters as the original objective. That's another theorem of this paper.

Lastly, the paper is structured in a slightly odd way here: it first says "assume we can do all these things, what is the loss, what do we even learn", and only later gets into how we actually obtain the probability paths in the first place. So far we've just said it's some probability path from one source sample to one target sample.
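A generic conditional flow matching step could therefore look like the sketch below; `sample_cond_path` and `cond_u` are hypothetical stand-ins for whichever conditional path you end up choosing, and they only become concrete once the Gaussian paths of the next section are fixed.

```python
import torch

def cfm_loss(v_theta, x1, sample_cond_path, cond_u):
    """One-sample conditional flow matching loss:
        || v_theta(t, x_t) - u_t(x_t | x1) ||^2   with t ~ U[0,1] and x_t ~ p_t(. | x1)."""
    t = torch.rand(())                        # random time in [0, 1]
    x_t = sample_cond_path(x1, t)             # a sample from the conditional path p_t(. | x1)
    target = cond_u(x_t, x1, t)               # the conditional vector field at (t, x_t)
    return ((v_theta(t, x_t) - target) ** 2).sum()
```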
Now we get to how those paths are constructed, and the reason the paper goes from the more general to the more specific is that from here on they start making particular choices, either to remain tractable or simply to remain practical. Note that from here on out you could do a lot of things, but they choose particular things. Notably, they say: why don't we construct these probability paths to be a series of normal distributions? Again, if we have our original distribution here and our target distribution here, then the intermediate distributions, and I'm going to draw this as a kind of diagram, are Gaussians that somehow interpolate between the two. This is a valid path, but, and I've already jumped ahead, the path could also wander around, as long as at every point in time it is a Gaussian distribution. So we define time-dependent functions μ_t and σ_t that give the mean and standard deviation of that Gaussian for every point in time. This could still be all over the place, but now we make the conscious choice that it's a Gaussian, and an isotropic, spherical Gaussian at that.

Again, it will end up being a straight line, but not yet: this setup contains both the thing they ultimately want, the straight line, and, in a decidedly non-straight-line fashion, the diffusion objective. If you were to do diffusion you could also capture it like this; the μ_t and σ_t just wouldn't be straight lines, they would wiggle around. But they would still be Gaussians, which they don't even have to be for the general framework; they could be any distribution.

So now we make the conscious choice: let those be Gaussians, at t = 0 matching the original distribution, and at t = 1 having the mean at the data point and some small standard deviation, to center the Gaussian around that data point and ultimately get our target distribution q when we aggregate across all data points. Transforming a source sample along such a path is really easy: we just scale the original point by σ_t and shift by the mean, ψ_t(x) = σ_t(x1) x + μ_t(x1), and we can define a push-forward to describe how this behaves over time; but that's just a mathematical way of saying you follow the path of the mean and scale by the respective standard deviation.

The conditional flow matching loss then becomes relatively simple. Let p_t be a Gaussian probability path as defined in equation 10, and let ψ_t be its corresponding flow map, which again just moves along the trajectory defined by μ_t and σ_t while always remaining a Gaussian; then the unique vector field that defines that path has the following form, and it is quite simple: u_t(x | x1) = (σ'_t(x1) / σ_t(x1)) · (x − μ_t(x1)) + μ'_t(x1). We can figure out the vector field at any point in time simply by following the trajectory of the mean and scaling by the standard deviation; the vector field itself is a derivative, which is why the derivatives appear here, but you can figure out from the path, and from how the Gaussians behave, what the vector field at that particular point should be.
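In code, those two relations could look as follows; this is a sketch where the schedules `mu`, `sigma` and their time derivatives `dmu`, `dsigma` are supplied by the caller (analytically or via autograd), so it stays agnostic about which Gaussian path you pick.

```python
import torch

def conditional_flow(x0, x1, t, mu, sigma):
    """psi_t(x0) = sigma_t(x1) * x0 + mu_t(x1): push a source sample through the
    Gaussian conditional path defined by the schedules mu and sigma."""
    return sigma(t, x1) * x0 + mu(t, x1)

def conditional_vector_field(x, x1, t, mu, sigma, dmu, dsigma):
    """u_t(x | x1) = sigma'_t/sigma_t * (x - mu_t(x1)) + mu'_t(x1): the unique
    vector field whose flow realizes that Gaussian path."""
    return dsigma(t, x1) / sigma(t, x1) * (x - mu(t, x1)) + dmu(t, x1)
```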
So the vector field is defined purely in terms of the derivatives of the mean and standard deviation functions that you chose. This is obviously connected to Gaussian processes; well, I say obviously, I probably couldn't write down exactly how, but the object here is like a Gaussian process: you define a mean function and a standard deviation function, except you define them over time instead of over some other data domain.

Now they ask: what special instances of these Gaussian conditional probability paths exist? The first one is diffusion conditional vector fields. In this section they recover the diffusion objective, saying: if we pick very particular μ_t and σ_t functions, we can actually prove that this is equivalent to what the diffusion papers do, so this subsumes the diffusion papers. You can see that the formulas for the vector fields become fairly involved, with this T function being some sort of integral; the choices of the μ function and the σ function are defined in terms of this α_{1−t}, which in turn is defined via that T. It gets involved, but you can recover it.

Why is it so complicated? We'll see more later, but notably you can see that at t = 1 certain terms go to zero and one, and the expression is in fact not defined at the boundary. That's an interesting observation: in diffusion, as we said before, t = ∞ is when you end up at the noise distribution; you have to keep noising, and only at t = ∞ are you actually guaranteed to land at the true noise distribution. That is a crucial difference to flow matching: in flow matching you can define paths that actually arrive at the target distribution after finite time, and that is a genuinely notable difference. So with these particular choices for the Gaussian paths you cannot actually evaluate the endpoint at t = 1, because t = 1 in the flow matching framework corresponds to t = ∞ in the diffusion framework, or vice versa, depending on how you orient t = 0 and t = 1.
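For reference, this is roughly what the variance-preserving diffusion choice of μ_t and σ_t looks like, as I recall it from the paper; the schedule endpoints `beta0`, `beta1` are typical assumed values, and the comments note the boundary behaviour discussed above.

```python
import math

beta0, beta1 = 0.1, 20.0        # typical VP noise-schedule endpoints (assumed values)

def T(s):     return beta0 * s + 0.5 * (beta1 - beta0) * s ** 2   # integral of beta
def alpha(s): return math.exp(-0.5 * T(s))

def vp_mu_sigma(t, x1):
    """Mean/std of the VP-diffusion conditional path p_t(x | x1) = N(mu, sigma^2 I):
       mu_t(x1) = alpha_{1-t} * x1,  sigma_t = sqrt(1 - alpha_{1-t}^2).
    At t=0 sigma sits just below 1 for any finite schedule (the pure-noise source is
    only reached in the limit), and the induced vector field divides by
    1 - alpha_{1-t}^2, which vanishes at t=1, hence the boundary issue above."""
    a = alpha(1.0 - t)
    return a * x1, math.sqrt(1.0 - a * a)

print(vp_mu_sigma(0.0, 1.0))    # mean ~ 0,  std just under 1
print(vp_mu_sigma(1.0, 1.0))    # mean = x1, std = 0
```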
But they propose a simpler way: why not just go in a straight line? And you can see the difference right here. With what they call the optimal transport path, the direction of the vector field remains constant; you can see the strength varying in the colors, but the direction always points to the target. At every point in time this vector field just says: go towards the target. Again, these are two data points and this is a conditional vector field, so we're only considering one particular source and one particular target, and we still have to aggregate across everything to get the full vector field; but this essentially just says go towards the target, relatively straight. For the diffusion path you have to pay attention: what's plotted there is a score function, because diffusion doesn't regress the vector field, it regresses the derivative of the log probability, which is why the directions look different. But essentially you can see that it changes over time, and for a particular point it will first push points outward and only later curve back towards the target.

Why does it do these curvy things? Because of the way the data is generated: you start at a data point and add noise, so you spread out in all directions, just more towards where the source distribution is. Going from back to front, you add noise and first push away from yourself, and only then shape things into the other distribution, which means any point first goes outward and then comes back in; so the reverse process does these weird curves, and, as noted, you never exactly end up at the true end distribution. They do have a picture in the appendix of the conditional vector field that corresponds to this score function, and I think it's worth looking at to see the difference; I believe it's figure 19, if I recall correctly. There you can see the vector field induced by the diffusion path: the source distribution is pushed along this direction, and two thirds of the way in it is almost pulled around the target point, and only at the end is it pulled in to the actual target. So that's one difference.

And if you look at how the sampling procedure works, which is more in the experimental section, you can see that with the optimal transport objective (these are not sampling trajectories, they are the actual trajectories you get if you push the original distribution forward to a particular t along the flow or diffusion path) the distribution is shaped like the target distribution, this checkerboard pattern, much earlier, whereas with the diffusion objective, whether you do score matching or flow matching, which provably optimize the same objective, the target shape only appears much later. Notably, the optimal transport objective also needs fewer function evaluations: recall that the way you ultimately use this, once you've learned the neural network, is to use an ODE solver, start with a source point, and push it forward through the learned vector field, and you can do that a lot more efficiently if you trained with the optimal transport objective than with the diffusion objective. Here you can see the diffusion trajectories and there the optimal transport trajectories.

Stable Diffusion 3 uses this and investigates in much more detail how exactly to sample these time steps during training, and which exact paths are best, in order to scale these things up; but this right here is really the basis. So again, and with this we're essentially done with the gist of the paper: diffusion takes a data set and explicitly defines a noise process that it then needs to reverse, locking itself into one very particular way of doing the denoising and morphing the probability distribution.
Flow matching, in contrast, takes a step back and says: if we can just figure out how to define, from data, these vector fields that give rise to flows from source to target, we can learn a neural network that takes a time step and a point in space and tells us how we should move in order to reach the target, independent of whatever process gave rise to it. If we can produce that from data somehow, we can learn a neural network to predict it for unknown data, data that's not in the data set, and by doing so we have a general procedure: take some sample from the source distribution and move it along the vector field the neural network keeps giving us, how to move, how to move, how to move, and if we've done everything correctly, that gives us a point distributed according to the target distribution. And the easiest, and theoretically very sound, way is to simply define straight lines in data space between a source sample and a target sample and regress on that. You might think: well, okay, straight lines, but where do you still learn the target distribution from? You learn it because each one of those straight lines is informed by a data sample.

Hello, it's me from the future. I realized I didn't do a great job at explaining this last part, namely: once we have made the choice of using the optimal transport path, essentially the straight path between source and target sample, as the path of choice, how does the loss fall out of that? The flow for a given sample simply takes x1 and x, wherever x is at that particular time step, and moves in a straight line towards it; you can see an interpolation, roughly t · x1 plus (1 − t) · x0, with the x0 part scaled by the standard deviation schedule, that is ψ_t(x0) = (1 − (1 − σ_min) t) · x0 + t · x1. If I have x0 here and x1 here, it's a straight line between the two, and that's the conditional flow, conditional on that data point.

So what is the loss that comes out of this? It's equations 9 and 14, and if you go back there you can see that the loss just falls out of the equations, but I find that a bit unsatisfying on its own. Equation 9, the conditional flow matching loss, says: we take x0, push it forward through the flow to time step t, and the vector field, whose parameters are what we actually train, evaluated at that particular point, should match this quantity right here, which is the derivative of the flow at that particular point in time; notice that this piece here is the same as that piece there. So what that means is: we take x0, we push it forward through the flow, whatever that flow is, and that flow will ultimately head towards x1; we push it forward to a particular point, the derivative points in a certain direction, and we train the vector field predictor to say: at this particular point, the vector field should point in that direction, along the flow.
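Putting the straight-line choice and the conditional flow matching loss together, a training step could look like this sketch; the MLP architecture, `sigma_min`, the batch size and the way time is fed to the network are all assumptions for illustration, not prescriptions from the paper.

```python
import torch
import torch.nn as nn

sigma_min = 1e-3                                   # width of the Gaussian around each data point

def ot_cfm_step(v_theta: nn.Module, x1: torch.Tensor) -> torch.Tensor:
    """One conditional flow matching step with the straight-line (optimal transport) path.
    x1 is a batch of data samples with shape (B, D)."""
    x0 = torch.randn_like(x1)                      # source samples ~ N(0, I)
    t = torch.rand(x1.shape[0], 1)                 # one time per batch element
    # straight-line flow: psi_t(x0 | x1) = (1 - (1 - sigma_min) t) x0 + t x1
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # its time derivative, the regression target: u = x1 - (1 - sigma_min) x0 (no t left)
    target = x1 - (1.0 - sigma_min) * x0
    pred = v_theta(torch.cat([x_t, t], dim=1))     # network sees x_t and t concatenated
    return ((pred - target) ** 2).mean()

# minimal usage: a small MLP as vector-field predictor for 2-D toy data
v_theta = nn.Sequential(nn.Linear(3, 128), nn.SiLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(v_theta.parameters(), lr=1e-3)
x1 = torch.randn(64, 2)                            # stand-in for a real data batch
loss = ot_cfm_step(v_theta, x1)
loss.backward()
opt.step()
```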
So what the vector field predictor learns, once we aggregate this across all of the data points, is essentially a weighted directional signal. Say we have one data point here, one here, one here, one here, and we start out with a distribution down here. If I consider this particular point after time t, the field should point mostly towards that data point, a little bit towards that one, a bit towards that one, and so on, and the average of all that is a vector that looks somewhat like this; so the final prediction there is going to be something like this. If we do that at every single point in space, aggregating across all of the points in the data set, we get, at that particular time step, one vector field that, if you will, represents the data. By aggregating across the whole data set, the vector field predictor learns to map the entirety of the source, as a flow, onto the entirety of the data set. That's where the knowledge about the data set ultimately enters the vector field predictor: you teach it, "look, at this particular point you should predict going towards the data set", in the context of one particular sample, but once we aggregate that across all the data, the vector field will make a prediction that takes all of the data points into account.

If you just apply the derivative to the flow, this quantity here being the flow, you get exactly this target, so that part is fairly straightforward. But you can also picture it like this: we push x0 forward through this straight-line flow towards one particular data point; say it ends up here; well, in which direction should that point move? Obviously in the direction of x1, and that's exactly what we see here: x1 minus some scaling of x0, which is exactly that direction. And now you can also see why the vector field remains constant over time: any mention of t drops out of the target, and the only things it depends on are x1, the data point in question, and the random choice of x0. If we then marginalize all of this across the random choice of x0 and across the selection of the target data point, we do get a vector field predictor that flows the entire source distribution to the entire target distribution. The fact that this no longer depends on time simply falls out of the particular choice we made here for the type of flow and the type of path.

All right, back to me in the past. I have to say some things are still a mystery to me, but I hope this made at least a little bit of sense to you and is understandable. Thanks a lot; again, thanks to the community on Discord for helping me understand this as well, and I'll see you around. Bye-bye.
Info
Channel: Yannic Kilcher
Views: 40,881
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper
Id: 7NNxK3CqaDk
Length: 56min 15sec (3375 seconds)
Published: Mon Apr 08 2024