Yann LeCun - Self-Supervised Learning: The Dark Matter of Intelligence (FAIR Blog Post Explained)

Captions
Hello there! Today we're looking at "Self-Supervised Learning: The Dark Matter of Intelligence", written by Yann LeCun and Ishan Misra of Facebook AI Research. It's not a paper — it's more of a blog post shared on the Facebook AI blog — and it outlines the current state of self-supervised learning: what it is, what it can do, and why the authors think it is important. It goes over things like BERT, contrastive learning, energy-based models, GANs, and so on, and at the end it gives a bunch of recommendations for the way forward. On a high level, the main recommendation is that we should build latent-variable predictive models that are not trained contrastively, and we'll go through what all of that means. I'll switch over to a more legible format of the post. As always, if you like content like this, share it out and don't hesitate to tell a friend about it. Alright, let's do it.

They say: "In recent years, the AI field has made tremendous progress in developing AI systems that can learn from massive amounts of carefully labeled data." The keywords here are "massive amounts" — yes, we've got that — and "carefully labeled data". We all know that supervised learning works very well if you have enough labeled data, and that's exactly the problem: to push machine learning to higher abilities, it seems we need, first of all, bigger architectures, which we can get by building bigger computers, but we also need more data. The problem is that we need orders of magnitude more data, and labeling that data is going to be very, very expensive. Therefore we're looking for methods that can learn most of what they learn from unlabeled data and then use a little bit of labeled data to learn a task.

But the expensiveness of labeling is not the only thing they criticize. They say: "This paradigm of supervised learning has a proven track record for training specialist models that perform extremely well on the tasks they were trained to do." That's the second criticism: if we train something in a supervised fashion with labels, it might become very good, but only at that particular task — it won't be very good at other tasks, even tasks neighboring the field we're concerned with. They go on to say that supervised learning is a bottleneck for building more intelligent, generalist models that can do multiple tasks and acquire new skills without massive amounts of labeled data. This goes in the direction of François Chollet, who defines intelligence as the efficiency with which you transform new data into new skills, and that view is reflected in this article by Yann LeCun — and I'm sorry, Ishan, but Yann LeCun just has the big name, so unfortunately you're a bit in his shadow here. I'm fairly confident Yann LeCun isn't just on this for the name, though, because he has raised the arguments in this article in many talks I've seen of him over the past few years, so this really is a condensation of all of those talks. Back to the post: acquiring new skills without massive amounts of labeled data has to be our goal, they say, because it is impossible to label everything in the world, and for some tasks there simply isn't enough labeled data, such as translation
systems for low-resource languages. So they make two observations. First: "If we show just a few drawings of cows to small children, they'll eventually be able to recognize any cow they see. By contrast, AI systems trained with supervised learning require many examples of cow images and might still fail to classify cows in unusual situations, such as lying on a beach." What are you doing, silly cow? Don't lie on a beach. So that's the point: these AI systems need so much more data than humans to learn new skills. And they ask why. "The short answer is that humans rely on their previously acquired knowledge of how the world works." They make the argument that there is such a thing as common knowledge about the world, or common sense: it "forms the bulk of biological intelligence in both humans and animals" (humans are animals, but okay). This common sense ability is taken for granted, but has remained an open challenge in AI research. Common sense, they say, is the dark matter of artificial intelligence.

They point out that you acquire this common sense simply by interacting with the world: "as babies, we learn how the world works largely by observation." You form predictive models about the world, you learn concepts such as object permanence and gravity, and later in life you even act in the world. They don't go into acting in the world here; their point is that throughout your life you observe the world and build these predictive models, and that's how you learn how the world works. I'm not entirely sure that things like gravity are learned in this way — I think there's some evidence that at least part of it is biological, or at least that you are extremely biologically predisposed to learn about things like object permanence and gravity. But the point is taken that there is something built into you, whether from experience or from biology, that constitutes this common sense, and it allows you to acquire new tasks from extremely few additional samples, because you bring in this knowledge about the world. Their core claim: "We believe that self-supervised learning is one of the most promising ways to build such background knowledge and approximate a form of common sense in AI systems." So the way we're going to get AI systems to have this common-sense knowledge is by doing self-supervised learning.

They give some examples of self-supervised learning, and they also contrast it with unsupervised learning. They say "unsupervised learning" is a bit of a misnomer — learning is never really unsupervised. Self-supervised learning specifically means that you generate the label out of the data itself. What could that look like? For example, in a BERT-style language model you might have a sentence like "this is a cat", taken straight from the dataset. In self-supervised learning you need to come up with an input sample and a label for that input sample using only this text. In a supervised dataset you would have some label associated with the sentence, and depending on the task that could be anything: the labels could be part-of-speech annotations for the words, or whether the sentence is positive or negative. In self-supervised learning, though, you can do the following — and here's what BERT does: you cross out a word, for example the "a".
The corrupted sentence now becomes the input sample x, and the label y is whatever was missing — here, the word "a". The task of the machine learning system is: given x, figure out y; figure out that at this particular place in the sentence there should be the word "a". Now, BERT does somewhat more sophisticated things — it also replaces tokens and so on — but ultimately what you want is for the system to output the uncorrupted input for any corrupted input, and thereby the system learns. Maybe not about the world, but about language: if it wants to do this task correctly, it needs to learn that a "this is ..." construction is probably followed by some kind of specifier for what comes next, and that "cat" is some sort of object or animal. Given all of that evidence, only very few possibilities remain — "a", "my", "your": "this is a cat", "this is your cat" — while most other words in the language cannot fit.

So they formulate self-supervised learning as obtaining supervisory signals from the data itself. That's why it's not unsupervised: it is self-supervised, because you create the label from the data. The important part — and I think this is often neglected in discussions of self-supervision — is that the way you create the label from the data is human-specified. That step needs a human idea: how could we create a label and an input data point given a raw data point? We shift the burden on the human from labeling the data explicitly to constructing the method of obtaining labels from data. This still builds in substantial human bias, but it is much more scalable: if I have one method to create labels, I can apply it to an entire dataset, whereas if I create labels myself, I have to go through every single data point. But it's not unsupervised, because the supervision is in the process that creates the label.
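To make that "create the label from the data" step concrete, here is a minimal sketch of how such a training pair could be built from raw text. This is my own toy illustration, not BERT's actual preprocessing: the `[MASK]` token string, the whitespace tokenizer, and the masking rate are all illustrative stand-ins.

```python
import random

MASK = "[MASK]"  # illustrative mask token, not BERT's real vocabulary entry

def make_masked_pair(sentence, p=0.15, seed=0):
    """Generate a (corrupted input x, targets y) pair from raw text.

    The label is created from the data itself: we hide some tokens and
    ask the model to recover them. The only human contribution is this
    corruption rule, not per-example annotation.
    """
    rng = random.Random(seed)
    tokens = sentence.split()          # toy whitespace "tokenizer"
    x, y = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            x.append(MASK)
            y[i] = tok                 # y: the hidden word at position i
        else:
            x.append(tok)              # x: the observed context
    return x, y

x, y = make_masked_pair("this is a cat", p=0.5)
# e.g. x = ['this', 'is', '[MASK]', 'cat'], y = {2: 'a'}
```

Everything "self-supervised" about this lives in `make_masked_pair`: apply it to an unlabeled corpus and you get arbitrarily many (x, y) pairs for free.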
They then say: leverage the underlying structure of the data. "The general technique of self-supervised learning is to predict any unobserved or hidden part (or property) of the input from any observed or unhidden part of the input." So the general recipe — or rather, I would say one general recipe, because it's not the general recipe even though they claim it here — is this: if you have an input, you hide part of it, and then you have the model predict that hidden part. They give a bunch of examples in a quite cryptic drawing: three things you could do along an axis of time or space. I claim it's easiest to think of it as a video sequence, with the frames stacked one after another, up until the end.

Option one: you define a time point t, take the past as the observed part, and take the future — which you have in your dataset but don't show to the model — as the part to predict. The model is supposed to predict the future from the past. In video you can picture this directly, and it is also what, for example, the GPT models do: GPT-3 takes in the past words so far and predicts the next word or the next few words. Option two: you don't have to predict the future; you can also just leave away a bunch of frames in the middle, at different places. Now the model has to reason about a missing part given the surrounding evidence: it takes all the evidence into account and reasons about which frames could have been left out. In NLP land this would be something like BERT, which is trained with exactly this objective as a masked language model. The last one is quite specific to something like video — maybe to other modalities too, but it doesn't apply super well to NLP (maybe you could make it work). Here, not only do you leave away some frames entirely, you also leave away part of the frames you do observe: in those frames you only see, say, the bottom-right corner and nothing else. So not only do you have to reason about what goes into the missing slots, you also have to reason about the unobserved parts of the observed frames, and these can be different parts throughout the video. I think the figure just makes the point that this idea is quite general: you hide parts of your input and re-predict them with a model.

And that means that if the model can, for example, predict the future of a video from its past, it will necessarily have learned something about how the world works — or at least about how the world looks through a video lens. If it does this task well, it has captured a lot of properties of how the world looks on video, and that is much richer information than a single training label. The hope is that by learning all the different things necessary to predict the future well from the past, the model learns such a useful representation that adapting it to any labeled, supervised task is really quick, because it already has a very good representation of the data. The common theme is: to predict the future from the past, there are numerous features that are helpful. Now take any supervised task — say I have the past of a video and I want to determine, I don't know, whether it is a happy video or not. The core assumption is that since predicting the future from the past has the structure of the world built in, and since our supervised task is probably a function of a subset of that structure — whether it's a happy video probably depends on whether, in the future, someone falls off a cliff — a subset of those features, in combination, will be relevant for the task. Since the representation is already there, those features can be adapted rapidly, while the unimportant ones can be overwritten and relearned to extract any additional signal from the input that self-supervised training didn't capture. So the goal, again: by learning to predict the hidden inputs from the non-hidden inputs, you learn about the structure of the data; by learning about the structure of the data, you get useful representations; and with useful representations you can adapt very quickly to new tasks. That's the argument here.
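Here is a tiny sketch of the first two masking schemes — predict the future from the past, and fill in a missing middle — on a toy "video" represented as a list of frames. This is purely illustrative; the function names and the list-of-strings representation are mine.

```python
def split_past_future(frames, t):
    """Scheme one (GPT-style): observe everything up to time t,
    predict everything after it."""
    return frames[:t], frames[t:]

def split_fill_in_middle(frames, start, end):
    """Scheme two (BERT-style): hide a chunk in the middle,
    observe everything around it."""
    observed = frames[:start] + frames[end:]
    hidden = frames[start:end]
    return observed, hidden

video = [f"frame_{i}" for i in range(10)]
past, future = split_past_future(video, t=7)        # predict the future
context, gap = split_fill_in_middle(video, 3, 6)    # fill in the blanks
```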
So why don't we do this all the time, everywhere? They go into self-supervised learning for language versus vision. In language this has been super-duper successful, and I think it's fairly successful in vision too, but there is a challenge when you compare language to vision, specifically with this scheme of hiding parts of the input and reconstructing them. There are two things we need to consider: the first is dimensionality, and the second is uncertainty.

Dimensionality in NLP: think of the problem again — "this is [MASK] cat". How do we do it in BERT? We mask out the word, feed the sentence through a big neural network (BERT), and at the masked position we attach a classification head: a classifier over the whole vocabulary. So we have our whole vocabulary — the word "a", the word "is", the word "cat", "dog", "mom", all these words — and we can actually enumerate them. Because we can enumerate them, we can let the model output a distribution: maybe it says the word "a" is super likely, "is" not so likely, "cat" maybe a bit (it appears in the observed sentence), "dog" a bit, "mom" not really, and so on. What we get is a discrete probability distribution. Note that even though the dimensionality is sometimes large — it can be something like 30k — it's still countable; we can still do a classification into 30,000 classes, and if we use wordpieces we have no out-of-vocabulary problem and can choose our vocabulary size. Second, we can represent our uncertainty: not all the probability mass needs to be on the word "a". If "your" is also possible (though in this case not correct), the model can express that it thinks both words could fit into the slot; the top prediction might carry only 0.4 of the probability. The model represents uncertainty simply by not allocating all of the classification mass to a single word. So these two issues are handled pretty well in NLP: dimensionality is high but not too high, and uncertainty can be represented.

Now, what about computer vision? That's where they have this diagram that is supposed to illustrate what I just said: that NLP masked-prediction tasks are rather discrete, relatively low-dimensional, and have less uncertainty. I'm not really sure about the "less uncertainty" — I would rather say they have a better way of representing uncertainty, and the lower uncertainty simply follows from the fact that they are more discrete and low-dimensional than other problems.
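A toy PyTorch sketch of that classification head — the dimensions are made up but in the usual ballpark (a 768-dimensional hidden state, a 30k wordpiece vocabulary):

```python
import torch
import torch.nn.functional as F

vocab_size = 30_000   # wordpiece vocabulary: large, but countable
hidden_dim = 768      # size of the encoder's output at the masked position

# The classification head: one logit per vocabulary entry.
head = torch.nn.Linear(hidden_dim, vocab_size)

h = torch.randn(1, hidden_dim)        # stand-in for BERT's output at [MASK]
probs = F.softmax(head(h), dim=-1)    # a full discrete distribution

# Uncertainty comes for free: mass can be spread over several plausible
# words ("a", "your", "my", ...) and still sum to one.
top_probs, top_ids = probs.topk(5)
```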
So what do I mean by more discrete, lower-dimensional, and so on? Look at vision problems: think about what you would need to do to predict a video — or let's go even simpler and take a common task in self-supervised learning. I have an image of a cat (I know, you're surprised) — ears, eyes; that is a cruel drawing of a cat, okay, but it's a cat — and I mask away part of the image: I simply cut out a patch, and my model is supposed to reconstruct that part from the known parts. That is a self-supervised task, exactly in the category they suggest. Can we do the same thing as in the NLP case? Remember, there we made a model that outputs a classification over all the possible things that could go in the blank. No, we cannot. First of all, how many things could go in there? Infinitely many, because this is a continuous problem. If I give you the surrounding patch — here is part of the head, maybe the whiskers — then a continuation of the cat's body is technically plausible, but it could equally well be that the cat is holding a wine glass filled with wine right there. We don't know; there are infinitely many plausible continuations. That is somewhat like the NLP task, where multiple words could fill the slot, but there it was far fewer — and crucially, we will never be able to enumerate all the patches that could or could not go in there. We can't even enumerate the ones that could, and it's completely impossible to list all the ones that are both possible and impossible. So we cannot build a classifier on top; this is simply not possible in the vision case. It is too high-dimensional, and there is also no good way of representing uncertainty — and I think the dimensionality has a direct effect on the uncertainty.

So what can people do instead? They say: let's not build a classifier; let's just predict what is there directly. I can use a neural network — a CNN, layer after layer, like a U-Net with some skip connections — and train my model to reconstruct that missing part. How hard can it be? Well, like my terrible cat drawing, which only has one eye because the artist wasn't trained super well: I can train my model to reconstruct, but now all my model can do is output one thing, one single completion. If I don't have a classifier over which to spread a probability distribution, I can only output a single image, and since there are many plausible completions, I have no way of representing them all. I can't even output the mean of them, because the mean of the two pictures — cat body and cat with wine glass — is not a real picture; it's something like a half-transparent wine glass, which is certainly invalid. So the fact that we can't build an explicit classifier means we have to predict directly, and since we predict directly, we have no way of representing uncertainty. I wouldn't call this "more uncertainty"; I would say computer vision has less of a possibility to represent uncertainty directly — I think that's actually something they say in the text. So that is the problem with computer vision.
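You can see the "half-transparent wine glass" effect in one line of arithmetic: a deterministic predictor trained with a squared-error loss on two equally likely targets is pulled to their mean, which is a valid sample of neither mode. A toy numpy illustration (the one-pixel "images" are obviously mine):

```python
import numpy as np

# Two equally likely completions of the same masked patch, shrunk to
# one-pixel toys: a dark continuation and a bright one.
y1, y2 = 0.0, 1.0

# A deterministic predictor trained with squared error can emit only one
# value c; its expected loss is minimized at the mean of the two modes.
candidates = np.linspace(0.0, 1.0, 101)
expected_loss = 0.5 * (candidates - y1) ** 2 + 0.5 * (candidates - y2) ** 2
best = candidates[np.argmin(expected_loss)]
print(best)  # 0.5 -- a blurry average that belongs to neither mode
```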
Now, what do people do to tackle this? The answer is going to be contrastive learning, but they get there in a bit; first they make an excursion to energy-based models. They call this section "a unified view of self-supervised methods" — even though I thought the hiding-part-of-the-input story was already the unified view — but in any case, they say there is a way to think about self-supervised learning within the unified framework of an energy-based model. A short preamble from me: you'll see what an energy-based model is in a second, but I think the term just doesn't tell me anything — it can be applied to basically any problem. "Energy-based model" essentially just means "loss function". But let's see: "An energy-based model is a trainable system that, given two inputs x and y, tells us how incompatible they are with each other. For example, x could be a short video clip and y another proposed video clip. The machine would tell us to what extent y is a good continuation for x. To indicate the incompatibility between x and y, the machine produces a single number, called an energy. If the energy is low, x and y are deemed compatible; if it is high, they are deemed incompatible."

So this is kind of a physics take on the thing. If you again think of video, and you want to predict the future from the past, the main component of an energy-based model is this energy function, which tells you how well x and y fit together. And you can actually put both earlier frameworks into this. If your model predicts the continuation directly, your energy function could simply be something like the L2 loss between the true continuation in your data and the one you predicted. If you could take the classifier approach and actually list all possible video sequences, your energy function could be the classifier loss. But if you think about it this way, then anything is an energy-based model. A classification problem is an energy-based model: if I have an image of my trusty cat and the label "cat", my E(x, y) can simply be defined as the classification cross-entropy of "cat" against all the other labels. So I don't see why we need to frame this as an energy-based model when we could simply say "loss function" — beats me. I guess the physics framing is just another way of thinking about it, but I dare anyone to bring me a thing in machine learning that is not an energy-based model. I might have just summoned some demons here.
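For concreteness, here is the predictive variant of such an energy function in a few lines of PyTorch. This is one possible choice — the L2 one just mentioned — not a canonical definition, and the toy linear predictor is mine:

```python
import torch

def energy(x, y, predictor):
    """E(x, y): a single number, low when y is a compatible continuation
    of x. Here the energy is the squared distance between the predicted
    continuation and the proposed one -- one choice among many, which is
    exactly the point: any loss-like score fits this template."""
    y_hat = predictor(x)
    return ((y_hat - y) ** 2).mean(dim=-1)

predictor = torch.nn.Linear(16, 16)   # toy stand-in for a real model
x, y = torch.randn(4, 16), torch.randn(4, 16)
e = energy(x, y, predictor)           # shape (4,): one energy per pair
```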
Okay, so they go on and say that an early example of this are Siamese networks, which have recently become fashionable again. Here we switch away from predicting the hidden part from the unhidden part, and move more toward predicting a hidden property. You take two different crops of an image — this is the most popular self-supervised task for computer vision. You have an image of something, say the sun, and you crop it twice at different locations, and what your model needs to do is figure out that these two patches come from the same image. If it can do that, it will have learned some good representation, and if you regularize correctly, it learns an even better one. So here it needs to figure out that two chess-looking crops actually come from a similar picture.

What do they do? They feed each crop through the same encoder — the "w" in the middle of the figure means the weights of the encoder are shared — and obtain two hidden representations, h and h'. The energy could then simply be the inner product between h and h' — or the negative inner product, if you want low energy to mean compatible, or one over the inner product, however you formulate it. What this does is tell the model: if two crops come from the same image, you'd better produce representations that agree with each other, meaning they are close in inner-product space — they have a high inner product. If that's the case, the model has learned something useful about the world, because it can tell me when two crops are from the same image. And the hope is that to do this well, the model has to learn: aha, there are chess pieces in here. It can't simply compare pixels — comparing some overlapping pixels might work, but comparing this pixel to that pixel won't — so it needs to learn something more sophisticated. It actually needs to recognize that there are chess pieces in the image if it wants to do a good job and differentiate these representations from crops of different images: if we also have a crop from the sun picture, we want the inner product between the two chess crops to be high, but the inner product between a chess crop and a sun crop to be low.

So we train it like this, and this is exactly where contrastive learning comes in. These Siamese networks look fun, but without the part I just outlined — the contrastive part — they are in danger of collapse. If I only ever input two crops from the same image and say "please make the hidden representations such that the inner product is high", what I end up with is a model that simply collapses and always gives me the same hidden representation for every single image, because that trivially satisfies the constraint. That's what they point out: "the network could happily ignore its inputs and always produce identical output embeddings. This phenomenon is called a collapse. When a collapse occurs, the energy is not higher for non-matching x and y than it is for matching x and y." They say the easy part is this: when x and y are slightly different versions of the same image, train the system to produce a low energy. The difficult part is to train the model so that it produces a high energy for images that are different. And what counts as "different" and "not different" is, again, largely human supervision. The cropping task has built-in assumptions — for example, that one image mostly contains one object or one topic of interest. If the image is a map and we actually want to differentiate the places on it, cropping is a pretty bad task. People also do a lot of color jittering, color inversion, brightness modification — all of this is human intuition, human supervision: the color shouldn't matter, the brightness shouldn't matter, and so on. The more such things you give to the model, the more you bake in your assumptions.
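A minimal sketch of that joint-embedding branch in PyTorch — the toy encoder and the negative-inner-product energy are illustrative choices on my part, not the architecture from any particular paper:

```python
import torch
import torch.nn as nn

# Toy stand-in for the shared ConvNet; "shared" is the key property:
# both crops go through the exact same weights.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))

def joint_embedding_energy(crop_a, crop_b):
    """Negative inner product of the two embeddings: low energy means
    the model judges the crops to come from the same image."""
    h = encoder(crop_a)
    h_prime = encoder(crop_b)
    return -(h * h_prime).sum(dim=-1)

a, b = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
e = joint_embedding_energy(a, b)   # shape (4,): one energy per crop pair
# Trained on matching pairs alone, the encoder can collapse: a constant
# output embedding satisfies "high inner product" for every input.
```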
So again: we move from supervised learning, where we tell the model "here is the correct label", to self-supervised learning, where we instead tell the model which kinds of transformations should and shouldn't matter, and the model has to figure out by itself how to create representations such that these constraints hold.

Now they go into the solutions for collapse. They say there are two techniques to avoid it: contrastive methods and regularization methods. For contrastive methods they have this graphic. The point is: if we talk in terms of energy-based models, we want the energy to be low on x-y pairs that we as humans define as matching — because we cropped them from the same image, or because they are the same image distorted in slightly different ways, or because they are the uncorrupted and corrupted versions of the same sentence in BERT training. Those pairs are the blue points in the figure. We want the energy to go down on the blue points, but we want it to go up everywhere else: everywhere it doesn't match, the energy should be high. Pushing down isn't the problem — we can create lots of matching samples, because we don't need labels anymore; we can create lots and lots of matching image-crop pairs ourselves. The pushing up is the problem. Looking at the graphic you might say: why don't I just enumerate the green places and push up on all of them — up here, and here, and here? The problem is that the higher the dimensionality, the less possible that is, and here the graphic tricks you into thinking it's a good idea when it actually isn't. You will not be able to enumerate all the green dots, even just in a neighborhood around the blue dots. If you have a point in 512 dimensions — a vector with 512 entries — then even to explore around a single data point you would have to jiggle the first dimension to the left and to the right, and the second, and the third, and you would have to do it combinatorially: this one right, that one left, that one left, then this one right, and so on, at different magnitudes, sometimes holding some constant. It's just not possible.

So what do people do in these contrastive methods? They say: we can't push up on all the points, but we can sample — that's why you see the green points (epileptically) jumping around in the animation. Instead of enumerating the green points, we sample them, and that's where we push up. And that is a difficult task: it is hard to come up with meaningful negative examples. What people do is this: here are two crops that fit — a blue point; and here are two images that don't fit — a green point. But as we already saw, there are many, many more green points than blue points, and most green points are really far away from the blue points.
If I just take any random other image, the negative might be way too easy for the model. The best thing would be to give the model a curriculum, or at least what we call hard negatives — images that are close to, but still different from, the positives — but that is computationally very expensive, because we would have to go searching for those hard negatives. All we can really do is randomly sample crops from other images, because we have no labels — we have no clue whether two images show the same thing; we just scraped them from Instagram ("come on, they all look the same to me"). The problem is that with random sampling, most of the green points end up pretty far away from the blue points, which means we have to train for a long, long time.

So contrastive methods do work in computer vision right now. "However, coming up with incompatible pairs that will shape the energy in a suitable way is challenging and expensive computationally", at least in vision systems. And: "The method used to train NLP systems by masking or substituting some input words belongs to the category of contrastive methods, but they do not use a joint-embedding architecture; instead they use a predictive architecture." What this is saying: if you look at what BERT does — masking one thing out and then classifying directly — that is technically contrastive, because what you do in a classification model is push up on the correct class and push down on the classes that are not correct; that's what the cross-entropy loss does. So technically it is a contrastive method, but you do it in a predictive framework rather than via shared embeddings — and that's possible because in NLP you can actually enumerate all the things that could fill the blank. With contrastive methods in vision, we do the analogous thing: we cannot possibly enumerate all the pictures that could fill the gap, but we can enumerate a couple — the sampled green points and the blue points — and then simply classify which ones are good and which aren't, or equivalently make the inner products of the matching ones high. These are not very different objectives, whether or not it's literally a classification loss: the point is that first you obtain shared embeddings, and then you make the embeddings agree or not agree.

They also quickly go into what BERT is: BERT is usually called a denoising autoencoder. You start with an uncorrupted data point, you corrupt it — that's where you mask out some parts — and then the model predicts what should go in the blanks, with the loss simply being your classification cross-entropy. A masked language model is thus an instance of a denoising autoencoder, itself an instance of contrastive self-supervised learning.
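A common concrete instantiation of "sample some negatives and classify the positive against them" is the InfoNCE loss used by SimCLR-style methods. The post doesn't spell this out, so the following PyTorch sketch is my example, with a temperature value picked out of thin air:

```python
import torch
import torch.nn.functional as F

def info_nce(h, h_pos, h_negs, temperature=0.1):
    """Classify the matching embedding against sampled negatives.

    h:      (d,)   anchor embedding
    h_pos:  (d,)   embedding of the matching crop (push energy down)
    h_negs: (n, d) embeddings of sampled non-matching crops (push up)
    """
    h = F.normalize(h, dim=-1)
    pos = (h * F.normalize(h_pos, dim=-1)).sum() / temperature
    negs = F.normalize(h_negs, dim=-1) @ h / temperature
    logits = torch.cat([pos.view(1), negs])
    # Cross-entropy with the positive at index 0: pushing up on the
    # correct "class" pushes down on the sampled negatives, and vice versa.
    return -F.log_softmax(logits, dim=0)[0]

loss = info_nce(torch.randn(128), torch.randn(128), torch.randn(64, 128))
```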
However, there is another way. They said there are two categories in which we can combat collapse. Category one is contrastive methods, where we classify some examples against others — either all of them, or a sample of them. The other one is what they call predictive architectures: "a predictive architecture of this type can produce only a single prediction for a given input. Since the model must be able to predict multiple possible outcomes, the prediction is not a single set of words but a series of scores for every word in the vocabulary for each missing word location." That's still BERT, which can give you uncertainty by simply telling you how likely each word is. And here they say: "We cannot use this trick for images because we cannot enumerate all possible images. Is there a solution for this problem? The short answer is no. There are interesting ideas in this direction, but they have not yet led to results that are as good as joint-embedding architectures. One interesting avenue is latent-variable predictive architectures."

That's what you see at the bottom of the figure: "latent-variable predictive models contain an extra input variable z. It is called latent because its value is never observed. With a properly trained model, as the latent variable varies over a given set, the output prediction varies over the set of plausible predictions compatible with the input x." And they name generative adversarial networks as an example. This is a bit confusing, but: up at the top is the loss, and then you have this new variable z, which comes from a domain where it can move around, and by moving z around, you move the output y around. They represent this as this curvy shape — the manifold of plausible outputs. Maybe z is here, corresponding to one point on the manifold, and as you move z to the right, you move along the manifold. This is a way for a model to produce, for a given x, many different outputs: first you obtain a representation of x, then it is mixed with z, and by varying z you get many different predictions. If you sample a bunch of z's and compute, say, an average loss over them, or a loss per sample, you eventually train your model to handle not just one prediction but many different predictions.

You might know GANs: a GAN is simply what you get when you cut off the x part. GANs only have the z variable, from which they produce the set of outputs, and the discriminator decides between real images and produced images. The last thing here is the R, which is the regularization on z. I don't think they ever explicitly say what this regularization is; they talk about it further down, so I'm going to assume it refers to this R.
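Here is a schematic of such a latent-variable predictive model in PyTorch. The architecture (an MLP over the concatenation of x and z) and all the dimensions are invented for illustration; the point is only that one x plus many z's yields many compatible outputs:

```python
import torch
import torch.nn as nn

class LatentPredictor(nn.Module):
    """Deterministic in x, but with an extra latent input z: as z sweeps
    over its domain, the output sweeps over the set of plausible
    predictions compatible with the same x."""

    def __init__(self, x_dim=64, z_dim=8, y_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, 256), nn.ReLU(),
            nn.Linear(256, y_dim),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1))

model = LatentPredictor()
x = torch.randn(1, 64)
# One x, many z's -> many different compatible predictions.
ys = [model(x, torch.randn(1, 8)) for _ in range(10)]
```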
Now it gets a little bit confusing. They say: "non-contrastive methods applied to joint-embedding architectures is possibly the hottest topic in self-supervised learning for vision at the moment. The domain is still largely unexplored, but it seems very promising." Non-contrastive methods means you don't need negative samples, but you still do joint embedding: you take two different crops from the same image and jointly embed them, but without negative samples, like the original Siamese networks. But then you need to avoid collapse, and there are models for this — for example BYOL, which I have made a video about, you can check that out. I think the argument there is that batch norm, for some reason, avoids this collapse if you build it in. There are other architectures too, but they are all in their beginnings.
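Roughly, BYOL avoids negatives by adding an extra predictor on one branch and feeding the other branch through a slowly updated target network that receives no gradient. The following PyTorch sketch is my schematic of that idea, not the paper's exact recipe (BYOL also has projection heads, specific augmentations, and so on); the encoder sizes and the EMA rate are made up:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
predictor = nn.Linear(128, 128)      # extra head on the online branch only
target = copy.deepcopy(encoder)      # slow-moving copy of the encoder
for p in target.parameters():
    p.requires_grad_(False)          # the target branch gets no gradient

def byol_step(view_a, view_b, tau=0.99):
    # The online branch tries to predict the target branch's embedding.
    p = F.normalize(predictor(encoder(view_a)), dim=-1)
    with torch.no_grad():
        t = F.normalize(target(view_b), dim=-1)
    loss = (2 - 2 * (p * t).sum(dim=-1)).mean()   # cosine-style loss
    # The target is an exponential moving average of the online encoder;
    # this asymmetry is (part of) what keeps the embeddings from collapsing.
    for tp, op in zip(target.parameters(), encoder.parameters()):
        tp.data.mul_(tau).add_(op.data, alpha=1 - tau)
    return loss

loss = byol_step(torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32))
```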
Then they say that rather than doing non-contrastive joint embedding, maybe we should do essentially what BERT does, but for vision: "perhaps a better alternative in the long run will be to devise non-contrastive methods with latent-variable predictive models." Predictive means we predict the output directly, like BERT does; but in vision we can't, because we can't enumerate all the possibilities and therefore can't represent uncertainty. So what we should do is the latent-variable thing: deterministically predict the embedding, and then from the embedding construct the set of outputs fuzzily — by sampling z from its prior distribution — and that set represents our uncertainty, all the things that could fill the gap we're trying to predict. That, they say, may be the way forward.

And then they say something confusing: "the main obstacle is that they require a way to minimize the capacity of the latent variable. The volume of the set over which the latent variable can vary limits the volume of outputs that take a low energy. By minimizing this volume, one automatically shapes the energy in the right way." Which sort of means: yes, you have to limit the capacity of the latent variable, because otherwise the latent variable could contain all the information — as in a GAN, where the latent variable contains all the information and is only really limited by the generator's weights. Something like a StyleGAN could happily ignore the input and still produce pretty good images, and you have to do tricks to make such a model actually pay attention to the input and not only to the latent variable. So you regularize: you constrain the latent variable such that the model pays attention to the input. And why do we want the model to pay attention to the input? Because the entire point is that we want to use this embedding for future supervised learning — that is the goal of self-supervised learning. There you also see why GANs probably cannot give us very good embeddings: GANs only have the right-hand part of the diagram. But something like an InfoGAN, or a StyleGAN that takes an input, could technically already be a model of this kind.

So, we limit the capacity of the latent variable, and they go on: "a successful example of such a method is the variational autoencoder (VAE), in which the latent variable is made 'fuzzy', which limits its capacity." And here is where I was confused, because they continue: "but VAEs have not yet been shown to produce good representations for downstream visual tasks. Another successful example is sparse modeling, but its use has been limited to simple architectures. No perfect recipe seems to exist to limit the capacity of the latent variables." Now, I get limiting the capacity; however, in a variational autoencoder it is not exactly the latent variable that is made fuzzy — it is the embedding. Think of what you do in a VAE: you have your image, then your encoder, and in latent space you predict a Gaussian distribution — you predict its mean and its standard deviation. Then you sample from that Gaussian, and thanks to the reparameterization trick you can instead sample from a standard Gaussian — mean zero, standard deviation one — call that sample z, and compute z times sigma plus mu, which is equivalent to sampling from the predicted Gaussian. So in this view, the variable z is not made fuzzy at all; what is actually made fuzzy are mu and sigma, which come from the embedding h, because they are combined with a stochastic variable. So I'm a little confused by this paragraph: I don't think a VAE limits the capacity of the latent variable, or fuzzes the latent variable. But I might be wrong — or by "latent variable" they actually mean the embedding, in which case it might make sense, although then "limiting its capacity" still doesn't make super much sense to me as a phrase.
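The reparameterization trick just described, as a couple of lines of PyTorch (predicting log sigma rather than sigma is a common numerical convenience I've added; the dimensions are arbitrary):

```python
import torch

def reparameterize(mu, log_sigma):
    """The trick described above: sample from a standard Gaussian, then
    scale and shift it, which is equivalent to sampling from
    N(mu, sigma^2) while letting gradients flow through mu and sigma."""
    z = torch.randn_like(mu)          # the standard-normal sample
    return z * log_sigma.exp() + mu   # "z times sigma plus mu"

# mu and (log) sigma come from the embedding h; the injected noise is
# what makes the code fuzzy -- note it is the embedding's output that
# gets fuzzed, which is exactly the quibble in the text above.
mu, log_sigma = torch.zeros(2, 16), torch.zeros(2, 16)
z = reparameterize(mu, log_sigma)
```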
I've also looked at this sparse modeling, which simply seems to be sparse coding of images — a really old paper from '96 (not that old, actually). I'm simply going to interpret the paragraph as: in order to obtain a meaningful representation h, we need to limit the capacity of the latent variable, because otherwise the model will just ignore the input and not build a good representation of it. So they argue that an architecture like this — like a VAE, like an InfoGAN, or something similar — could potentially be the next step, if we can make it work: "the challenge of the next few years may be to devise non-contrastive methods for latent-variable energy-based models that successfully produce good representations of image, video, speech and other signals, and yield top performance in downstream supervised tasks without requiring large amounts of labeled data." In German we have a saying for what they want here: the "eierlegende Wollmilchsau" — the egg-laying wool-milk-sow. It can do anything and everything, and it costs nothing. Again, some of these terms, like "energy-based model": anything is an energy-based model, and I just don't find it very discriminating in its meaning. Lastly, they talk a bit about their new model, SEER, which is a self-supervised model, but it's basically a giant ConvNet trained on a billion images. But hey, they open-sourced it — thank you, you open-sourced the code, so I can totally train my own billion-parameter model on a billion random public Instagram images, because my Raspberry Pi technically has that capacity. Thanks! No, I'm joking a little; it's at least better than OpenAI. And at the end, they go into how they use other forms of self-supervised learning at Facebook.

Alright, that was my overview of this article. I hope you got at least something out of it as a high-level summary. They first say that self-supervised learning is maybe the way to get common sense into AI systems. Then they go into what self-supervised learning is: they define it first as predicting hidden parts from unhidden parts, and later say it can be viewed through the lens of energy-based models. They point out a crucial distinction between domains like language and vision, because vision is much higher-dimensional and gives you much less of a way to represent uncertainty. Then they say: contrastive methods handle part of that — plain Siamese networks are prone to collapse, and contrastive methods fix that — but because you have to sample negatives from such a high-dimensional space, it is really hard and takes a lot of data. Alternatively, we could use predictive models that directly predict the output — the missing frame, the missing word — but do it in a way where we predict not a single thing but an entire set, by means of these latent-variable predictive models. That, they say, is maybe the way forward, even though it doesn't work too well yet: things like VAEs work, but they don't yet produce good representations for supervised downstream tasks. Alright, that was it. If you liked it, leave a like, subscribe, share it out, tell me what you think in the comments, and bye bye!
Info
Channel: Yannic Kilcher
Views: 98,435
Keywords: deep learning, machine learning, explained, neural networks, artificial intelligence, deep learning tutorial, what is deep learning, introduction to deep learning, what is self supervised learning, self supervised learning, self-supervised learning, self-supervised learning yann lecun, yann lecun, yann lecun energy based models, energy based models, energy based machine learning, energy based models deep learning, byol, contrastive learning, bert, noise contrastive estimation
Id: Ag1bw8MfHGQ
Length: 58min 37sec (3517 seconds)
Published: Thu Mar 11 2021