How I Understand Flow Matching

Video Statistics and Information

Captions
My kids love Play-Doh. Last time, they made a Play-Doh version of their stuffy. It's pretty amazing that we can create arbitrarily complex shapes by squeezing, pressing, stretching, rolling, and twisting a simple ball of Play-Doh. This is the core idea of flow-based generative models. In this video, we are going to talk about the main ideas behind normalizing flows, continuous normalizing flows, and a scalable training method called flow matching. We'll leave the techniques for scaling up the training to the next video.

Imagine we collect a dataset of images. It would be awesome if we could model the data distribution: we could create new images from this distribution or evaluate the likelihood of a sample. But we don't know what the true data distribution is; we only have samples. On the other hand, we have simple base distributions, like a Gaussian distribution, from which we can easily draw samples and evaluate the likelihood. The idea is to learn a generator that transforms a simple distribution into the data distribution. We can train this generator by maximum likelihood, which corresponds to minimizing the KL divergence between the two distributions. So how do we get a likelihood?

Here we have a noise sample z. We can transform this noise sample z into an image x. If our generator is invertible, we also know the corresponding noise z that generated a given image. Can we compute a likelihood like this? Well, not quite. Let's take a look at a one-dimensional example. Here our base distribution is a uniform distribution between 0 and 1. Suppose our generator just stretches the z value by a factor of two. We see that the density of the transformed distribution is now half of the original density. This is because the probability contained in these areas must stay the same. The same concept works for any one-dimensional density function: we can compute a likelihood by adding a scalar term that accounts for how much the density function is stretched or compressed in the local region. We take the absolute value here because a mapping with a negative slope produces the same change in density.

Okay, now let's check the two-dimensional case. We look at a specific location z′ and its local neighborhood. This vector specifies the change in the x1 and x2 directions caused by a small change in z1, and this vector specifies the change caused by a small change in z2. We compute the area spanned by the two vectors using a determinant. We can now write down the relation that the probabilities in these areas must stay the same; here is a visual example. Now we can move the Δz terms to the other side and pull them into the determinant, and we see that the entries are just partial derivatives. We transpose this matrix, as that won't affect the determinant. This matrix has a name: it's the Jacobian matrix. Let's simplify this a bit further. We can move the determinant of the Jacobian matrix to the right-hand side and express the reciprocal as the determinant of the inverse transform. This is known as the change-of-variables formula.

Now, back to maximum likelihood estimation. Using this formula, we can write the log-likelihood as two terms. How do we compute this? First, we need an invertible generator g. Second, we need a way to compute the determinant of the Jacobian matrix efficiently. It's hard to create a complex transformation with just one single generator, so in practice we compose a collection of generators to gradually transform a simple base distribution into a complicated data distribution. The likelihood computation for such a composed model is also simple: it just multiplies the individual determinants. Here is the log-likelihood.
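To make the change-of-variables formula concrete, here is a minimal numerical sketch in PyTorch (my own toy example, not from the video). It uses an element-wise affine generator, so the Jacobian is diagonal and its log-determinant is just a sum of log-scales; the names `g`, `g_inv`, and the specific scale values are illustrative assumptions.

```python
import torch
from torch.distributions import Normal

# Toy invertible generator: x = g(z) = a * z + b (element-wise), so the
# Jacobian is diagonal and log|det J| = sum(log|a|).
a = torch.tensor([2.0, 0.5])
b = torch.tensor([1.0, -1.0])

def g(z):        # forward map: noise -> data
    return a * z + b

def g_inv(x):    # inverse map: data -> noise
    return (x - b) / a

base = Normal(loc=torch.zeros(2), scale=torch.ones(2))  # simple base distribution p(z)

def log_likelihood(x):
    # Change of variables: log p_X(x) = log p_Z(g^{-1}(x)) + log|det J_{g^{-1}}(x)|
    z = g_inv(x)
    log_det_inv = -torch.log(a.abs()).sum()  # Jacobian of the inverse map
    return base.log_prob(z).sum(-1) + log_det_inv

x = g(base.sample())      # draw a sample by pushing noise through g
print(log_likelihood(x))  # exact log-density of x under the flow
```

Composing several such maps simply adds their log-determinant terms, which is the log of the product of the individual determinants mentioned above.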
Now let's see some examples of invertible generators for which the determinant of the Jacobian matrix is easy to compute. One popular design is called the coupling layer. The first step is to split the input into two disjoint sets. We do nothing to the first half. We train a neural network to predict scale and translation vectors and compute the second half by element-wise multiplication and addition. The two halves are then concatenated as the output of the coupling layer.

Let's ask two questions. First, is this generator invertible? Given x, we copy the first half, then we recompute the scaling and translation to invert the second half back to z. The neural network here can be very complex and does not need to be invertible. Second, can we compute the determinant of the Jacobian matrix efficiently? Here's the Jacobian matrix of a coupling layer. The top-left block is an identity matrix, since we copy the first half of the input directly to the output. The top-right block is all zeros, because that part of the output has nothing to do with that part of the input. The bottom-left block is tricky: it can be very complex since it involves a neural network, but we don't care, because it does not affect the value of the determinant. Finally, the bottom-right block is a diagonal matrix, because the layer only applies element-wise multiplication and addition. The determinant is just the product of all the predicted scaling values.

When we stack these layers together, we need to shuffle the splits around to ensure that all the dimensions get updated. The original paper used special checkerboard patterns and channel masking to create different splits. This type of permutation was later generalized by invertible 1x1 convolutions; training with 1x1 convolutions achieves a lower negative log-likelihood and can generate high-resolution samples.

Another example is autoregressive flows. To generate the value at the i-th position, we use the input values before the i-th position to compute the conditioner h_i and use it to transform z_i into x_i with an invertible function τ. Note that this "transformer" has nothing to do with the Transformer we use today in language modeling. We can generate all the outputs following this strategy. If the transformer τ is invertible, we can recover the corresponding input z, but that process is sequential. Fortunately, the forward sampling process can be easily parallelized. The Jacobian matrix has a lower-triangular structure because the model is autoregressive. This means we don't need the full Jacobian to compute the determinant; we just need to compute the derivatives on the diagonal and multiply them together.
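As a concrete reference for the coupling layer described above, here is a minimal sketch (not code from the video) of an affine coupling layer in PyTorch; the class name, hidden size, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Minimal affine coupling layer: copy the first half, scale/shift the second."""
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        # Any network works here; it never has to be inverted.
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * (dim - self.half)))

    def forward(self, z):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(z1).chunk(2, dim=-1)
        x2 = z2 * torch.exp(log_s) + t    # element-wise scale and shift
        log_det = log_s.sum(-1)           # log|det J| = sum of log-scales
        return torch.cat([z1, x2], dim=-1), log_det

    def inverse(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=-1)
        z2 = (x2 - t) * torch.exp(-log_s)  # invert using the same predictions
        return torch.cat([x1, z2], dim=-1)

layer = AffineCoupling(dim=4)
z = torch.randn(8, 4)
x, log_det = layer(z)
print(torch.allclose(layer.inverse(x), z, atol=1e-5))  # True: the layer is invertible
```

Stacking several such layers with permutations (or 1x1 convolutions) between them ensures that every dimension eventually gets transformed.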
So far we have seen coupling blocks and autoregressive flows. To make the computation tractable, we somewhat sacrifice the model's expressiveness. Is it possible to have a free-form Jacobian matrix? This is the idea behind residual flows. The layers in a residual flow are very simple: we process the input z with a neural network u and add the output u(z) back to produce the layer's output. First, is this invertible? Given x, can we find the corresponding z? In general this is not possible, but Stefan Banach's fixed-point theorem tells us it is invertible if the function u is a contractive mapping. This means that the distance between two points after the mapping is smaller than the distance before the mapping. If u is a contractive mapping, then there exists a unique fixed point z*. Let's use x − u(z) as our contractive map. Applying the fixed-point theorem, we get this expression, and by shuffling the equation a bit, we find that z* is exactly the z we want. Since z* is unique, we can invert this residual layer g. The theorem also gives us a bonus: it shows us an iterative algorithm to find the unique z*.

How about the determinant? With some math, we can expand the log-determinant into an infinite series of matrix traces. But this is scary: we need to compute traces of the Jacobian matrix, perform matrix multiplications up to the k-th power, and sum up infinitely many terms. How is that possible? Luckily, we can use some tricks to simplify the computation. To estimate the trace of a matrix A, we can pretend there is an identity matrix next to it. We can rewrite the identity matrix as the covariance matrix of a random Gaussian vector v with zero mean and unit variance. The linearity of expectation lets us shuffle things around to get this expression, and we can now estimate the trace efficiently using Monte Carlo sampling. But we still cannot evaluate infinitely many terms; the residual flow paper showcases a trick to compute an unbiased estimate by evaluating a randomly truncated, finite number of terms.

Next, we will see how to generalize the residual flow idea to continuous normalizing flows. Let's take a closer look at the residual flow method: it gradually transforms a simple base distribution into the data distribution via K residual layers. Moving these terms around, we get something that looks like a derivative. When we increase the number of layers K to infinity, we get an ordinary differential equation saying that the change in position of a sample follows a vector field. Our goal is to represent this time-varying vector field with a neural network with parameters θ. This is called a neural ordinary differential equation. Let's visualize what this looks like: here we see the vector field gradually transforming a simple base distribution, like a Gaussian, into a more complicated one. The arrows specify the time-varying vector field. Here are some 2D examples.

Now we understand how a vector field pushes samples around in space. How does the probability density change at a specific location? Let's use a 1D example to build some intuition. Here we plot the probability distribution over x at time t. At a specific position x′, we have a probability density p_t(x′). At time t + ε, let's say the probability density becomes lower at the same position. How can we explain this with our vector field? At this position, the vector field must have pushed samples away. We can use the spatial gradient of the vector field to quantify the local "outgoingness," in other words how much it diverges; when the vector field flows into this position, the spatial gradient will be negative. The sum of the change in probability density and the local outgoingness must remain zero. The same relation holds for higher-dimensional data, so we can replace the spatial derivative with a general gradient operator and denote it as a divergence. This is known as the continuity equation, or transport equation.

The continuity equation gives us a tool for training continuous normalizing flows using maximum likelihood. Unfortunately, computing the log-likelihood involves integrating the vector field over time with an ODE solver. This limits the scalability of training continuous normalizing flows on large datasets or high-resolution images. Next, we will use flow matching to enable scalable training of continuous normalizing flows.
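To make the "samples follow the vector field" picture concrete, here is a minimal sketch (my own illustration, not the video's code) that pushes base samples through a time-varying vector field with simple Euler steps; `VectorField` and its architecture are hypothetical placeholders for whatever network represents u_θ.

```python
import torch
import torch.nn as nn

class VectorField(nn.Module):
    """Hypothetical network for the time-varying vector field u_theta(x, t)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(),
                                 nn.Linear(64, dim))

    def forward(self, x, t):
        # Append the scalar time to every sample so the field can vary with t.
        t = t.expand(x.shape[0], 1)
        return self.net(torch.cat([x, t], dim=-1))

@torch.no_grad()
def sample(field, n=1000, dim=2, steps=100):
    """Integrate dx/dt = u_theta(x, t) from t=0 to t=1 with Euler steps."""
    x = torch.randn(n, dim)           # start from the simple base distribution
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.tensor([[i * dt]])
        x = x + dt * field(x, t)      # move each sample along the vector field
    return x                          # approximate samples from the target

field = VectorField(dim=2)
print(sample(field).shape)            # torch.Size([1000, 2])
```

This is exactly the integration that makes likelihood-based training expensive, since maximum likelihood has to solve an ODE like this (plus a divergence term) inside the training loop.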
Let's look at the continuity equation again. Instead of focusing on learning the right probability density, which is computationally expensive, we can instead just learn to match the flow: the time-varying vector field u_t fully determines the probability path and the final target distribution. This insight leads to the flow matching objective. The goal is to train the neural network to match the vector field, where the probability path interpolates from the base distribution at time zero to the target distribution at time one. This looks great: the training objective is just a simple L2 regression loss, it's simple to implement, and it does not involve integrating the vector field during training. But something terrible happened: we don't know what the probability path or the vector field is. And if we knew the vector field already, why would we need the neural network?

The trick is to create training data for the probability path and the vector field using conditioning. Here we express the marginal probability path as a mixture of conditional probability paths that vary with some conditioning variable z. Using conditioning, we can design a valid conditional probability path and vector field for training. Let's say our condition is a single data point from our training dataset; we call it x1. Here are the equation and the visualization of the conditional probability path. The vector field is also very simple: intuitively, from any point x we move toward the data point x1, with a speed that depends on the time, ensuring that we land exactly on the data point x1 when the time equals 1. We can now define the conditional flow matching objective. The conditional probability path and the conditional vector field are all easy to compute. Surprisingly, the gradients of the conditional flow matching objective are the same as those of the unconditional one. This provides a scalable way of training continuous normalizing flows.

This is great; let's look at several other designs. Here, instead of conditioning only on the data point x1, we also sample a noise x0 from the base distribution. The conditional probability path can be a Gaussian distribution with a small variance that moves between x0 and x1, and the conditional vector field is constant over time along this path. This independent coupling underlies methods like rectified flow and stochastic interpolants. We can go beyond simple pairwise conditioning as well: for example, we can draw multiple samples from the base distribution and multiple data points from the training dataset, establish the correspondence between them using optimal transport, and create probability paths and vector fields for training. This helps us create straighter paths for more stable training and faster inference. Here's a summary of three examples of conditional flow matching designs: compared with independent coupling, having some coupling within each mini-batch leads to straighter probability paths.

Now let's visualize the training process. We independently sample a data point x1 and a noise x0 from the base distribution. Based on the probability path, we create a sample x_t. Using this noisy sample, we train the neural network to match the conditional vector field. In this example, the conditional vector field is a constant vector pointing from the noise to the data.
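Here is a minimal sketch of the training loop just described, assuming the simple linear path x_t = (1 − t)·x0 + t·x1, whose conditional vector field is the constant x1 − x0; the placeholder dataset, network architecture, and hyperparameters are illustrative assumptions, not the video's code.

```python
import torch
import torch.nn as nn

# A small MLP for u_theta(x, t); the time t is appended as an extra input feature.
field = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(field.parameters(), lr=1e-3)
data = torch.randn(10000, 2) * 0.5 + 3.0         # placeholder 2D "dataset"

for step in range(1000):
    x1 = data[torch.randint(len(data), (256,))]  # sample data points
    x0 = torch.randn_like(x1)                    # sample noise from the base distribution
    t = torch.rand(256, 1)                       # sample time uniformly in [0, 1]

    xt = (1 - t) * x0 + t * x1                   # a point on the conditional path
    target = x1 - x0                             # conditional vector field (constant in t)

    pred = field(torch.cat([xt, t], dim=-1))
    loss = ((pred - target) ** 2).mean()         # simple L2 regression loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

After training, samples can be drawn by integrating the learned field from t = 0 to t = 1, for example with the Euler loop sketched earlier; note that no ODE solver is needed during training itself.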
It's useful to put things in perspective by comparing this with diffusion models. When training a diffusion model, we also sample a data point from the dataset and sample noise from a Gaussian distribution with zero mean and unit variance. We encode the image using a forward diffusion process to get a noisy image, and we then train the neural network to predict the noise. By comparing these two, we can see how flow matching simplifies and generalizes diffusion models. In diffusion models, the conditional probability path comes from a fixed forward diffusion process, and it cannot reach pure Gaussian noise within a finite number of forward diffusion steps. The flow matching framework focuses directly on moving samples from a base distribution to a target distribution and regresses the flow in between. Flow matching keeps the essence of diffusion models but removes the unnecessary restrictions of the forward diffusion process.

In summary, we covered the basics of discrete-time normalizing flows, continuous normalizing flows, and flow matching as a scalable method for training continuous normalizing flows. I expect that we will see a lot more exciting developments and applications of flow matching. Thanks for listening, and I'll see you next time.
Info
Channel: Jia-Bin Huang
Views: 4,553
Keywords: flow matching, normalizing flow, generative models, stable diffusion, continuous normalizing flows, ordinary differential equation, neural ordinary differential equation, Neural ODE, Generative AI, Rectified Flows, Stable Diffusion 3, Text-to-image generation, AI art, Conditional flow matching
Id: DDq_pIfHqLs
Length: 16min 24sec (984 seconds)
Published: Sun Jun 02 2024