V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video (Explained)

Captions
Hello, today we're going to look at "Revisiting Feature Prediction for Learning Visual Representations from Video", also known as the paper that introduces V-JEPA. V-JEPA is a variant of the JEPA architecture originally proposed by Yann LeCun, built on the hypothesis that feature prediction is a very good tool for unsupervised learning from video. We're going to dive into what the model does and what it can do, but in short, this is an unsupervised technique to learn good features from video data. By good features we mean latent features that you can then use to make downstream tasks better: for example, if you want to do video classification, you could take a model pre-trained with V-JEPA and fine-tune it for some sort of topic classification or feature extraction. That is the base use case for this model — think of it a bit like BERT, but for video; similar in spirit, though different in the details of what you can do with it.

We'll first take a tiny look at the original JEPA paper, just to get you in the mood for the underlying core hypothesis, and then dive into this paper — actually, let's look at the abstract of this paper first and then go to the JEPA paper. They say humans possess a remarkable ability to map low-level signals originating from the retina into a semantic spatio-temporal understanding of the world: recognizing objects, recognizing global motion, and so on, all from very localized pixel movements. This ability to understand video — in this case the streaming image data flowing from your eyes into your brain — is incredibly difficult to match with machines. A long-standing goal of the machine learning community is to identify the principles or objectives that may guide such unsupervised learning in humans. So the question is: how do humans extract these high-level concepts from pixel movement, and how do they learn to do that? Certainly you can argue that some of this ability is built in, but you still need to learn all the objects that exist in the world — many of them haven't existed for most of evolution — so there must be a mechanism by which humans acquire this knowledge in an unsupervised way.

The hypothesis they go with is the so-called predictive feature principle, which posits that representations of temporally adjacent sensory stimuli should be predictive of each other. What does that mean? We're dealing with representations, not the signals themselves — in current language, latent-space embeddings. This is the first indication that we are not operating in pixel space: we always abstract away from pixels, and whatever we do, we do in latent space. There is not going to be a pixel reconstruction error here, or an autoregressive synthesis of data, or anything like that. Then, "temporally adjacent sensory stimuli" means video frames that follow each other — or the left side and the right side of a video, and things like that — and they should be predictive of each other, meaning that if I know one of them, I can predict not the other part of the video itself, but the representation of the other part of the video.
So if I see half a dog, I can predict that the other half of the dog is probably in the other half of the video. Or if I see a road at the beginning of a video, I might predict that there's going to be a car driving on that road later — that would be one possible future. So the hypothesis that humans may, in part, learn to extract meaningful features from video data in an unsupervised fashion is based on this principle: humans learn to associate representations of things that appear at the same time, or after each other, in video, and to predict those representations from each other.

They say: we revisit feature prediction as a standalone objective for unsupervised learning of visual representations from video. The goal here isn't to get the best model ever; the goal is to ask how far you can get with unsupervised representation learning based purely on the feature prediction principle we just discussed. They present the video joint-embedding predictive architecture, V-JEPA, which is based solely on feature prediction, without using pre-trained image encoders, text, negative examples, human annotations, or pixel-level reconstruction. That alleviates a number of things. Negative examples are very common in unsupervised representation learning — think of how you learn an embedding model without labels: you take sentences that are close to each other, plus a third one that's far away, and you push the first two together and the third apart. You always have to decide how far to push, how close is too close, and eventually you start mining hard negatives because the task quickly becomes too easy: you want negative samples that are kind of close, but not really, and push those apart. Lots of problems come with negative samples. Pre-trained encoders obviously also help, but they don't want to use those either; human annotations are very costly; and pixel-level reconstruction, as we will see, is the main point of comparison. The theme of this paper is that V-JEPA is much more efficient at reaching certain goals than pixel-based methods. It's not that pixel-based methods don't work; it's that they seem to waste a lot of computation and parameters on getting local, pixel-level detail exactly right. V-JEPA just doesn't care about the pixels: it operates purely in latent space and can therefore devote a much larger share of its budget to getting the latent features correct.

All right, I would be remiss — maybe you've seen it already, I talked about it in the last video — but there is a new course by Weights & Biases, who graciously sponsor this video, so thanks a lot. The course is on structured outputs of LLMs. Not exactly V-JEPA-like, but if you want to chain LLM calls together or build agent systems, it's really useful if the intermediate representation isn't just free text but is actually structured. Say the first call extracts some things from text and the second call does something useful with each thing you extracted; JSON is a very popular in-between format. The course teaches you how to get the first call to output solid JSON, and then how to use libraries built around Pydantic to parse out the necessary information and validate whether the LLM has actually given you something valid — not correct in the semantic sense, but correct in the JSON sense: "I actually expected a number here, between 100 and 1,000," and so on. This is a huge issue when chaining multiple LLMs together, and these techniques help greatly. The course takes you from the basics — how do I prompt LLMs to give me structured output — up to how to validate that output with existing libraries. If you're interested, the course is completely free, and that's a great price, as they say.
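As a rough illustration of that pattern — a minimal sketch of mine, not the course's material; the `VideoSummary` schema and the `llm_output` string are made up — this is roughly what Pydantic-v2-style validation of an LLM's JSON output looks like:

```python
from pydantic import BaseModel, ValidationError  # Pydantic v2


class VideoSummary(BaseModel):
    # Hypothetical schema for a structured LLM response.
    title: str
    view_count: int          # must actually be a number, not "about 40k"
    topics: list[str]


llm_output = '{"title": "V-JEPA explained", "view_count": 40071, "topics": ["jepa", "video"]}'

try:
    summary = VideoSummary.model_validate_json(llm_output)  # parse + validate in one step
    print(summary.view_count)
except ValidationError as err:
    # The JSON was malformed or a field had the wrong type: re-prompt or repair.
    print(err)
```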
So, let's go to the JEPA paper — the original paper that introduces this architecture. I don't want to spend too long here because I've already made a video about it, which you can look up, but the basic principle is the following. Can we design systems such that, given one part of a data point — the blue part in the diagram; a data point could be a video with many frames, a piece of text, an image, something multimodal — they can predict the other part? Or, in a different formulation: can we design systems that have what's called an energy function, which can determine whether or not two things go together? Imagine one piece is the beginning of a video and the other is a continuation: can we build systems that are good at recognizing whether y is actually a valid continuation of x? We formulate it this way because, for any beginning of a video, there are obviously infinitely many possible continuations. Most of them, if you enumerate them, are just random pixel noise, so they're obviously invalid; but there are also infinitely many natural-looking continuations, and only some of them actually make sense — only some are valid or probable continuations. If I see a car driving down a road, it's reasonable to assume the car keeps driving down the road, or turns left, or turns right; all of those would be valid continuations. However, if the car is driving down the road and in the next instant some cartoon character is jumping up and down, that does not continue the current video, and we would say it's not a valid continuation. So: can we build systems that recognize when two parts of a data point, or a collection of data points, go together well? We call the thing that detects this an energy function, and it's the basis for these systems. And we're not only talking about the beginning versus the end of a video.
We can do various things, but they all revolve around masking out some part of the data and then trying to predict that part from the other parts — or, respectively, building an energy function that tells us whether any given mask filling is compatible with the data we already know. This can then be formalized. As we've seen, for any given x there can be many different y that minimize the function (or achieve a low value, or a high value, depending on which way you define it). This is handled by introducing z, a kind of selector variable: a latent variable that encapsulates how x and y are related. In the earlier example, when you see a car driving, assume there are only three valid continuations — left turn, straight, right turn. Then z can encapsulate the choice of which turn to make, and once I know z, it is well defined which y minimizes the energy function. This lets me account for the fact that there are many different ways to continue a given x, because I can make the choice latently and embed it in z in order to determine which y is a good one.

Why do we care? Suppose I train a system that doesn't account for this. Say I have a billion videos, and two of them happen to start exactly the same way — but in one the car turns left and in the other it turns right. If I train with cross-entropy, or a pixel L2 loss, or even with a JEPA-style objective that strictly predicts the features, then the loss — which is ultimately the mean over all samples — will blend the possibilities into a mush. That's not desirable: we want the loss to be clean and crisp for the particular choice being made right now, and clean and crisp for the other choice, not the pixel mean or feature mean of the different samples. A lot of architectures actually suffer from something like this; notably, variational autoencoders are famous for being blurry, at least in their original formulation, and part of that may be because they don't account for this in their loss — I might be talking nonsense here, but I believe that was one of the stated reasons. So here we can already see the first ingredient that we're going to meet again later in these JEPA models.
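To make the latent-variable idea concrete, here is a tiny toy sketch (my own illustration, not code from either paper): z is a small discrete choice — think left turn / straight / right turn — and the overall energy of a pair (x, y) is the best prediction error in representation space that any choice of z can achieve.

```python
import torch
import torch.nn as nn

dim, n_z = 8, 3                       # toy latent size; z in {left, straight, right}
predictor = nn.Sequential(nn.Linear(dim + n_z, 32), nn.ReLU(), nn.Linear(32, dim))


def energy(s_x: torch.Tensor, s_y: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """E(x, y, z): how badly the predictor maps s(x), given the choice z, onto s(y)."""
    return (predictor(torch.cat([s_x, z])) - s_y).abs().sum()


def free_energy(s_x: torch.Tensor, s_y: torch.Tensor) -> torch.Tensor:
    """F(x, y) = min_z E(x, y, z): the latent z absorbs the ambiguity about which continuation we meant."""
    return min(energy(s_x, s_y, torch.eye(n_z)[k]) for k in range(n_z))


s_x, s_y = torch.randn(dim), torch.randn(dim)   # stand-ins for the representations s(x) and s(y)
print(free_energy(s_x, s_y))
```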
Next, let's talk about unsupervised feature learning. The classic example is learning text embedding models: you have a bunch of text and you want to learn an embedding that can then be used for text similarity. Or something like CLIP, where you have an image and a text that sometimes go together. The question is how to learn this robustly. We already discussed negatives: I take two sentences, one here and one that immediately follows it, and use those as a positive pair, then take a sentence from somewhere else entirely as a negative example. We also discussed that this has disadvantages, so I would like to do away with negative examples.

The problem is that with only positive examples, learning can collapse pretty quickly. Say I randomly initialize some BERT-like model, push one sentence through it, push the other sentence through it, and just say these two outputs should be similar — nothing else. The loss will then guide my model to always output a constant vector, for example all zeros: if this input produces a constant vector of zeros and that input produces a constant vector of zeros, they are always equal — but so is everything else. Since my loss only ever contains positive examples, nothing penalizes this. We call that collapse, and the question is what we can do to prevent collapse in this setting of unsupervised feature learning without negatives.
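Here is a tiny demonstration of that failure mode (a toy of mine, not from the paper): an encoder that ignores its input gets a perfect score on a positives-only objective.

```python
import torch

x1, x2 = torch.randn(16), torch.randn(16)   # two "adjacent" inputs that should embed similarly


def collapsed_encoder(x: torch.Tensor) -> torch.Tensor:
    return torch.zeros(4)                    # ignores the input entirely


# Positives-only objective: just pull the two embeddings together.
loss = (collapsed_encoder(x1) - collapsed_encoder(x2)).pow(2).sum()
print(loss.item())   # 0.0 -- zero loss, zero information about the data
```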
The paper lists a few architecture variants, and their propensity for collapsing, as stepping stones along the way — none of these is JEPA yet. In the first one we have x and y, two separate pieces of the same data point that we want to relate. We encode x to build a representation — and remember, our ultimate goal is always the encoder: after all this training, we want to extract the encoder and use it, so it's paramount to keep asking what a particular training setup means for the encoder. So we encode x, which gives us s(x), the latent representation, and then we try to predict y directly. For example: I give you the beginning of a video, you encode it, and from the latent representation you give me the pixels of the continuation, which I compare with the actual pixels to get a loss. That's essentially beginning-of-video-to-end-of-video prediction. It works, but it wastes a lot of time and effort on getting the pixels exactly right, and it does not account for one beginning continuing in multiple valid directions: the loss assumes there is one correct answer, and if I have two samples with the same beginning but different continuations, that's just noise to it, so it ends up predicting the mean, because the mean gives the least loss. On the other hand, it cannot collapse, which is a good thing: because I always actively predict the pixels, backpropagation always pushes the encoder towards valid, informative encodings.

Next, the generative latent-variable architecture. Here we do account for the fact that the same video can continue in different ways: there's a variable z that can incorporate that choice (how you train it is a separate question). The encoder gives us a representation, and the predictor — the part that tries to reconstruct y, the pixels, from the encoding of x — now also gets that choice: it's told which choice was actually made. We're still wasting a lot of effort on pixels, but can this collapse? Yes. The model can simply decide to put all the information into z and none into s(x): it learns to predict y from z alone and never really from s(x), so s(x) doesn't get a useful gradient, the encoder doesn't get a useful gradient, and we have collapse. There's also the autoencoder; as far as I know it can degenerate in a similar sense, for example by producing some kind of static output — in any case, crucial ingredients are still missing.

Then we have the joint embedding architecture, and this one is a really big candidate for collapsing. Why? Because we're not predicting pixels anymore: we say pixel prediction wastes a lot of energy on exactly the pixels, so let's predict the features of y from the features of x. (In the diagram there should probably be a predictor in between, or the two sides should just be s(x) and s(y) and we compare the two embeddings directly — but even with a predictor in there, you can see the danger.) As we said before: if we encode x, encode y, and both encoders always output a constant vector, the distance is always minimal, the loss is tiny, and yet we have not learned a useful representation. The thought of predicting latent features is really good, because if we can predict s(y) from s(x), that's exactly the hypothesis — representations of one part of the data should be predictive of representations of the other part, and that's how we know we have good representations. But those representations must also be informative about the data and not collapse to nothingness, because otherwise prediction is trivially easy and means nothing.

That's why the ultimate architecture looks something like this, incorporating all of these ingredients. There's an encoder that gives us a representation; there's the choice variable z that goes into the predictor; the predictor predicts the latent representation of y; and then there's a distance metric. Two modifications are made on top of that. First, the y-side encoder is not trained by backpropagation: it's a moving average of the x-side encoder. That moving-average trick has been very successful in other negative-free representation learning methods such as BYOL (Bootstrap Your Own Latent): you have two encoders, one for x and one for y, but you don't train the y one — its weights are just an exponential moving average of the x encoder's weights, so it always lags a little behind, which keeps it slightly different from the x encoder while still producing valid encodings.
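A minimal sketch of that moving-average target update in PyTorch (my own simplification, with made-up sizes and momentum — not the actual V-JEPA code):

```python
import copy
import torch

online_encoder = torch.nn.Linear(512, 384)      # trained by backpropagation
target_encoder = copy.deepcopy(online_encoder)  # the "y" encoder
for p in target_encoder.parameters():
    p.requires_grad_(False)                     # stop-gradient: never backpropagated into

momentum = 0.998                                # hypothetical EMA momentum


@torch.no_grad()
def ema_update() -> None:
    """target <- momentum * target + (1 - momentum) * online, called after every optimizer step."""
    for p_t, p_o in zip(target_encoder.parameters(), online_encoder.parameters()):
        p_t.mul_(momentum).add_(p_o, alpha=1.0 - momentum)
```

The momentum close to 1 is exactly what makes the target lag behind the online encoder: it changes slowly, so it stays "different enough" while remaining a valid encoder.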
You're trying to keep the two encodings similar enough that the predictor can make sense of the target and predict something sensible, but different enough that the system isn't encouraged to just output a constant value — because as soon as the left-hand side "realizes" (and by realizes I mean: recognizes through gradient flow) that the other side is essentially the same as itself, outputting a constant becomes the easy solution and prediction becomes trivial. Second, as I said, the y encoder is a moving average, so no gradient flows into it: there's a stop-gradient on that branch, and the gradient only flows through the left-hand side. The encoder being trained is the one on the left, and the moving average builds the encoder on the right. If you time it correctly and get the hyperparameters right, you can prevent collapse while still getting good representation learning. You can also slap various regularizers on top: you can regularize z to minimize its information content, regularize the representations of x and y, and so on. There are various derivatives of this, some of which are mentioned in the paper, like regularizing covariance matrices to keep things well conditioned.

Now let's actually dive into the V-JEPA paper. I already mentioned the outcome: feature prediction can indeed serve as an effective standalone objective for unsupervised learning from video, while using significantly shorter training schedules than pixel prediction methods. This gives a versatile visual representation, meaning we get a model that takes in a video and produces features we can use for downstream tasks, or fine-tune for downstream tasks. It's superior to pixel prediction approaches in the sense that it's either better on downstream-task numbers, or it matches them while being faster to train, more label-efficient, and so on — so if you do have some labels, it's also more label-efficient than pixel-based approaches.

Here's the base architecture. There are x and y, two different parts of the same video. What they often do is block out a large region of the video for the entire duration of the clip, and then, from the other parts, predict not the pixels but the representation of the masked region. The predictor gets the encoding of x plus the choice variable z. During training, this z needs to be constructed by hand: we need to specify how x and y are related. They denote it as Δy, which is simply the position of the thing to predict. If you take a video frame and black out a region (and maybe some other regions as well), the positions of those regions are Δy: running the visible parts through the encoder gives you some encoded signal, and with this positional information the predictor can more precisely predict what's in the hidden regions, because it knows where they sit relative to what it has seen of the video. So Δy acts as z in this case.
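Written out — my paraphrase of the objective as the paper and the discussion above describe it — with $E_\theta$ the encoder, $\bar E_\theta$ its exponential moving average, $P_\phi$ the predictor, $\Delta_y$ the positions of the masked region, and $\operatorname{sg}$ the stop-gradient:

$$\min_{\theta,\phi}\;\Big\lVert\, P_\phi\big(E_\theta(x),\,\Delta_y\big)\;-\;\operatorname{sg}\big(\bar E_\theta(y)\big) \Big\rVert_1$$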
It's not a perfect z in the JEPA sense, because it doesn't encapsulate what I said before — that the same video could continue in two different ways. A perfect z would account for that, but coming up with such a z at training time is very hard; otherwise you need some training- or inference-time optimization, or you need to learn it. So this is a substitute that at least gives the predictor some information about how to take the representations of the things it can see and predict the representations of the things it cannot see. The predictor outputs s(y) — not the pixels themselves but the latent features of the masked-out parts of the video — and the loss is an L1 loss.

So the objective is: encode x, predict the encoded features of y, and take an L1 loss between prediction and target, jointly training the predictor and the encoder. To prevent collapse, it's modified in two ways: a stop-gradient is added on the y branch, and the y encoder is constructed as an exponential moving average of the x encoder. The L1 regression is a choice made for stability. They also give some theoretical motivation for the moving-average scheme, adapted from another paper. What it boils down to is: assume the optimal predictor, then compute the gradient for the encoder — that's the thing to watch, because collapse would mean the encoder's gradient becomes independent of the data. Under some assumptions they can compute that gradient in closed form, and it does in fact contain the data, so it depends on the data, which means the encoder keeps learning from the data and doesn't collapse. They say that incorporating an exponential moving average to compute the representation of y ensures that the predictor evolves faster than the encoder and remains close to optimal, thereby preventing collapse. We'll take it as it is — you can dive deeper if you want, but it's more of a justification than a proof.

Now for the actual architecture of V-JEPA (sorry, I can't really zoom in much more here). Starting from the left-hand side, the video is cut into patches — "patch" and "token" are synonyms here. A patch is 16×16 pixels across two consecutive frames: the same 16×16 pixel region in two frames next to each other, a little volume of 2×16×16 = 512 pixel positions, and that's one token. (They already train with a frame skip, by the way.) So two consecutive frames give you one layer of patches, 16×16 by 2.
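Mechanically, the tubelet-to-token step looks roughly like this (a sketch of mine using the sizes quoted above; the real model additionally applies a learned patch-embedding layer on top of the raw tubelets):

```python
import torch

# Sizes as described in the paper/video: 16 frames at 224x224, RGB,
# tubelets of 2 frames x 16 x 16 pixels.
T, H, W, C = 16, 224, 224, 3
pt, ph, pw = 2, 16, 16

video = torch.randn(T, H, W, C)
tokens = (video
          .reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
          .permute(0, 2, 4, 1, 3, 5, 6)     # bring the three grid dimensions to the front
          .reshape(-1, pt * ph * pw * C))   # one row of raw pixels per tubelet/token

print(tokens.shape)   # torch.Size([1568, 1536]): 8*14*14 tokens, each 2*16*16 = 512 pixel positions x 3 channels
```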
You make these little groups everywhere, through the time dimension as well, always grouping two frames together, and then you unroll them into one long sequence of tokens. Once you have that, you can lay them out and essentially treat it like a language-modeling problem. Then comes the masking. They don't just mask out random tokens: along the time dimension, they make sure to mask out contiguous blocks — the same spatial regions across time — which makes the problem harder. If you masked random patches, you could easily just look two frames ahead where the content is visible, and because most videos don't have that much movement from frame to frame, predicting would be really easy. Blocking out contiguous pieces of the data makes the problem genuinely harder to solve.
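Here is a bare-bones sketch of that idea (my simplification — the paper's actual multi-block strategy samples several blocks with particular scales and aspect ratios): one spatial block is chosen and masked at every time step, so the hidden content never becomes visible in a neighbouring frame.

```python
import torch

t_tokens, h_tokens, w_tokens = 8, 14, 14           # token grid from the patching step above

spatial_mask = torch.zeros(h_tokens, w_tokens, dtype=torch.bool)
top, left, bh, bw = 2, 3, 8, 9                      # hypothetical block position and size
spatial_mask[top:top + bh, left:left + bw] = True

tube_mask = spatial_mask.expand(t_tokens, -1, -1)   # the SAME block is masked in every frame pair
print(tube_mask.float().mean().item())              # fraction of tokens hidden from the encoder
```

In the real setup several such blocks are combined until, on average, about 90% of the clip is masked, as mentioned below.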
Only the unmasked tokens are retained and pushed through the encoder, so we get an encoded sequence. (The little "D" in the figure, shown as the same everywhere, is probably just paper-writing sloppiness: here it should be 512, since each raw token has 512 pixel positions, and elsewhere they say the embedding dimension is 384 — pay no attention to that.) In essence, we come out with a latent encoding of the unmasked parts of the video. Then, as they say, mask tokens are concatenated: we fill in mask tokens for the regions we want to predict. Each of the encoded tokens has gone through an attention mechanism, so technically any of them can contain global information, but they still correspond to a given unmasked patch of the original video — just like every last-layer BERT token still corresponds to an input token, even though nothing forces that except the loss. The mask tokens we insert for the masked regions are initialized with a learned mask embedding plus positional encodings, and the predictor then predicts the embeddings of y. How does it know what to predict? Through the mask tokens: the Δy from before now lives inside the mask tokens — if I understand the paper and the code correctly — because the information about where the regions to predict sit, relative to the regions you know, is in the positional encodings of the mask tokens. So the predictor has access to the context tokens (whatever comes from x) and the mask tokens (which carry Δy), exactly as we saw before, and it tries to predict s(y).

On the other side, we take the same video and run it through the y encoder — remember, the y encoder is an exponential moving average of the x encoder — and we encode all the tokens. That gives us what they call contextualized targets: everything can attend to everything, so the thing to predict carries information about the whole clip. These are the embeddings you would get if you embedded the entire data point; that's the important part — masking happens after encoding. After encoding, you apply the anti-mask: you keep exactly the parts that were masked out on the other side, because those are the targets. The mask tokens specify which parts to predict, the predictor predicts those (going from L tokens down to M, with L = N + M, where N is what goes through the x side and M is what we take from the y side), and then there's an L1 loss. We backpropagate through the predictor into the encoder; the y encoder is the exponential moving average, with a stop-gradient on it. That's the whole architecture. (The figure caption, "predicting y from x", is peak machine-learning informativeness — thank you, very helpful.) One thing I found interesting: the average masking ratio is 90%, so x is usually only about 10% of the video.

Sorry, I was interrupted during the making of this video — it is now about an hour later, so I've lost track of exactly where we left off, but we've covered the big part: the architecture, the motivation, and so on. Looking a bit more at the network parameterization, there are two networks. The encoder is a Vision Transformer used as a video backbone, and as we saw, the video clip is split into a 3D grid of 16×16 pixel blocks spanning two consecutive frames — they refer to these spatio-temporal patches as tokens. They then describe the masking, and the predictor is a narrow Transformer implemented with 12 blocks and an embedding dimension of 384. By the way, contrary to most big companies, Meta actually releases big tables with all the hyperparameter numbers in the appendix — one giant compliment to the researchers here; this is a paper you could conceivably reproduce. They describe the data, the hyperparameters, the model architecture — pretty much everything, except perhaps the exact hidden dimension, though I suspect they just kept whatever ViT-L and ViT-H use. So it's essentially reproducible, apart of course from the hardware requirements, which they state as A100 80 GB — though I have no clue whether that means a single A100 80 GB or a whole lot of them and that's just the GPU type. Someone calculated they would need at least 16 of them or so, but you can always do some offloading and reduce that; it depends. In any case, they operate with fairly high batch sizes, and I suspect high batch sizes are one of the things that help with these negative-free unsupervised representation learning methods, because large batches smooth out the gradients — although maybe that's not even desired. Interestingly, each model takes as input a video clip of 16 frames sampled with a frame skip of four, which the paper says corresponds to roughly 3-second clips on average. Those 16 frames get split into 16×16×2 patches; the resolution, I believe, is 224 (or 384 for some settings), which gives you roughly 1,500 patches per data point, and they use 2,400 or 3,072 data points per batch depending on the model — not inconceivably large, but fairly chunky.
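Just to collect the numbers mentioned here in one place (values as reported in the video; the class and field names are my own):

```python
from dataclasses import dataclass


@dataclass
class VJEPAPretrainConfig:
    frames_per_clip: int = 16        # sampled with a frame skip of 4 (~3 s of video per clip)
    frame_skip: int = 4
    resolution: int = 224            # 384 for the larger setting
    tubelet: tuple = (2, 16, 16)     # frames x height x width per token
    tokens_per_clip: int = (16 // 2) * (224 // 16) ** 2   # = 1568, the "roughly 1,500 patches"
    predictor_depth: int = 12        # narrow Transformer
    predictor_dim: int = 384
    batch_size: int = 3072           # 2,400 for some model configurations
    mask_ratio: float = 0.9          # ~90% of the clip is masked on average


print(VJEPAPretrainConfig())
```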
I don't want to go too much into the exact experimental numbers, but the main conclusions are these. The results of the comparisons indicate that predicting in feature space provides a consistent performance improvement over pixel-space prediction, both in frozen evaluation of the video backbone and with end-to-end fine-tuning. So they compare against methods that predict pixels rather than hidden representations, and the representation-predicting models perform better when the trained encoders are used for downstream tasks. Again, you cannot evaluate these models in isolation — they are not next-token predictors, they don't reproduce the input data, they're not generative — so the way you evaluate them is to take the trained encoder and use it for downstream tasks, and when you do that, they tend to beat pixel-based methods. They also investigate data mixes and masking strategies — which masking strategies work better than others — which I'll skip over. You can also see that they achieve results similar to other models on various tasks while drastically reducing the number of samples seen: so either it's better than other methods, or it's more sample- and label-efficient at similar performance, which bodes really well for this model.

One interesting thing is how they qualitatively evaluate it. They have the numbers, but they also want to investigate what the encoder actually learned: what do these features, extracted with a purely feature-prediction objective, actually contain? To do that, they train a decoder that maps back to pixel space — but only for inspection, only for evaluation. The procedure: mask out a piece of data, pass the visible part through the encoder to get encoded tokens, pass those through the predictor, and the predictor predicts the latent features of the missing, masked part. All of that is V-JEPA pre-training, and it's kept completely frozen. Then they train the decoder to reconstruct the original pixels given only the predicted latent representation of the masked parts of the video — the decoder does not get access to any of the surrounding data.
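Schematically, the probe they train looks something like this (a heavily simplified sketch of mine — the module choices, the sizes, and the pooling that stands in for the real transformer predictor over mask tokens are all placeholders): the pretrained encoder and predictor stay frozen, and only a pixel decoder is optimized, fed nothing but the predicted latents of the masked region.

```python
import torch
import torch.nn as nn

embed_dim, token_pixels = 384, 2 * 16 * 16 * 3        # placeholder sizes

encoder = nn.Linear(token_pixels, embed_dim)           # stand-ins for the frozen V-JEPA modules
predictor = nn.Linear(embed_dim, embed_dim)
for module in (encoder, predictor):
    for p in module.parameters():
        p.requires_grad_(False)                        # frozen: only used for inspection

decoder = nn.Linear(embed_dim, token_pixels)           # the only thing being trained
opt = torch.optim.AdamW(decoder.parameters(), lr=1e-4)

visible = torch.randn(100, token_pixels)               # unmasked tubelets of one clip
masked_true = torch.randn(40, token_pixels)            # ground-truth pixels of the masked tubelets

with torch.no_grad():
    # Crude stand-in for "predict the latents of the 40 masked tokens from the visible context".
    pred_latents = predictor(encoder(visible).mean(0, keepdim=True).expand(40, -1))

recon = decoder(pred_latents)                          # no access to the surrounding pixels at all
loss = (recon - masked_true).pow(2).mean()
loss.backward()
opt.step()
```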
Usually, if you wanted a good mask-filling decoder, you would give it access to the surrounding pixels so it could match the boundaries nicely, but here they explicitly don't want that: otherwise they wouldn't know whether good in-filled pixels were predicted from the pixels around them or actually from the learned features. So they specifically feed in only the predicted latent representation of the masked region. Everything inside the blue square is decoded by this model, and you can see that it matches up fairly well with what could plausibly be in the video. The boundaries aren't perfect, but the arrangement of things, the objects present, and the general locations of stuff are valid. These are three different decodings of the same video, as far as I can tell, and they're all quite plausible in terms of what they contain — boundary artifacts are completely expected in this type of evaluation.

So that's it, that's V-JEPA — a tour through the paper. They have an extensive appendix with more details, tables, and hyperparameters, which is much appreciated, so everyone give one clap for Meta: good job. All right, that was it for me. I do think this is a really cool direction: unsupervised learning, and grasping the principles behind it, remains very important. Stay hydrated, see you around, bye-bye.
Info
Channel: Yannic Kilcher
Views: 40,071
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, meta, jepa, v-jepa, yann lecun, fair, video unsupervised representation learning
Id: 7UkJPwz_N_0
Length: 50min 3sec (3003 seconds)
Published: Mon Feb 19 2024