DINO: Emerging Properties in Self-Supervised Vision Transformers (Facebook AI Research Explained)

Video Statistics and Information

Reddit Comments

Not sure this is AGI-related.

(Traditionally) convolutional NNs could perform semantic scene segmentation only after having been trained on datasets that were pre-segmented. In most cases, those training sets had to be manually segmented by actual humans drawing on images. That is cumbersome and expensive.

With DINO, object segmentation is automated, allowing the creation of terabytes of pre-segmented datasets for further training downstream. That's great news for computer vision and machine learning. However, I don't see any obvious connection to AGI.

👍︎︎ 1 👤︎︎ u/moschles 📅︎︎ May 16 2021 🗫︎ replies
Captions
Hello there. I hope you have all seen this: it is a new system by Facebook AI, and what you are seeing here is a visualization of the attention maps of that neural network. In the middle is a supervised baseline, and on the right is the new system, called DINO. It is not so much a system as a methodology for unsupervised pre-training of vision transformers. The network has neither been trained to learn what a dog is, nor has it been trained to do any sort of segmentation, yet if you look at the attention maps it clearly can track objects; it knows what to pay attention to in the images. It can do much more than that: it can track objects behind occlusions, so the ship goes behind the waves, the horse goes behind the grass, and this is well reflected in the attention maps. Even more, if you use the feature representation this model gives you for ImageNet, then as the model gets trained and you represent ImageNet in its feature space, it will cluster images of the same class together, which is already pretty cool because it has no labels at training time, but it will also cluster similar classes with each other. That speaks to the fact that this might be the next step in unsupervised representation learning for images. Specifically, it appears that the features that come out of a network trained with DINO are extremely valuable for the kinds of things we are interested in when working with natural images, such as image retrieval and classification.

Let's switch over to the paper. It is called "Emerging Properties in Self-Supervised Vision Transformers", it presents a system called DINO, and it is by Mathilde Caron, Hugo Touvron and colleagues from Facebook AI Research, Inria and Sorbonne University. You can see a bit more in these pictures, where again this is the self-attention, i.e. the attention map, from a vision transformer that was trained with DINO and no supervision. You can clearly see that in all the cases the attention falls on what you as a human would consider the relevant things in the image. I have my hypotheses as to why this is the case, completely without labels, and we will get to that. The representations that come out of the system are really useful: for example, you can fine-tune linear classifiers on top of these representations and get really good image classifiers (they do that with ImageNet); you can use them for image retrieval, because similar images are clustered together; you can even do zero-shot classification, simply by running a k-nearest-neighbor classifier in that feature space; and you can do a sort of proto image segmentation by looking at the attention maps. You don't even have to do anything special to visualize this, like you have to do in CNNs; the attention map directly gives you the segmentation map, or something pretty close to it.

As an overview, DINO pushes self-supervised learning, and the paper specifically makes the case that self-supervision and vision transformers go together really well. DINO stands for self-distillation with no labels, and it pushes various metrics for self-supervised systems, or rather for linear classifiers trained on top of them.
For example, they reach 80.1% top-1 accuracy on ImageNet in linear evaluation with a ViT-Base. A quick overview of the system: two things, they say, are important compared to the other self-supervised systems. First, they have a kind of student-teacher setup; that is the self-distillation part. The teacher is a momentum teacher, and it does centering and also sharpening in the softmax. Second, there is no contrastive learning and there are no negative samples; the sharpening and the centering take care of keeping the model from mode collapse. There is also no batch norm. If those things don't mean anything to you, stay tuned; we will discuss them in a bit more detail as we go through the paper. If you like paper summaries like this, and other content such as our cooking video, feel free to share this and tell your friends about it. By the way, the cooking video did terribly; I guess my YouTuber skills are just not on par.

Alright, let's dive in. Vision transformers are a new thing, and I have also made a video about them. They are the simple application of the transformer architecture, which became prevalent in natural language processing with the introduction of "Attention Is All You Need" and follow-up papers like BERT, to images. The concept is very simple: you have an image, you divide it into patches, and then you unroll that array, so you have patch, patch, patch and so on, and you simply treat this sequence of patches like a sentence, like "hello my name is ...". You consider the sequence of patches as token embeddings; I think there is one fully connected layer to actually get the token embedding from each patch. Then you put a transformer on top, as you would in NLP, and you do whatever you do with the transformer. Usually, people prepend a special token, the CLS token, which is also passed through the transformer. The transformer in its base configuration keeps the length of the sequence the same; it's not strictly necessary, but that's just how we do things. So for every input token you get a corresponding output token, or output embedding, or output signal, whatever you want to call it. Since every input token refers to some little patch in the image, and you don't want to prefer any one of them if you want to say something about the entire image, you use this special CLS token, which is associated with no location in the image, and that is ultimately what you use to classify the image, or, here, to do representation learning. The representation we are looking to get out is the final-layer embedding of the CLS token, which, through the transformer architecture, has aggregated, we hope, all the information from all the visual tokens in the image. So that is a vision transformer. Now, what do we do with it in this DINO architecture? I have already shown you the picture; let's go a little bit deeper into it.
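Since the patch-and-CLS-token mechanics come up again and again below, here is a minimal sketch of that embedding step in PyTorch. The sizes (224-pixel images, 16-pixel patches, 384-dimensional tokens) are illustrative choices of mine, not values from the video or the paper.

```python
import torch
import torch.nn as nn

class TinyViTEmbedding(nn.Module):
    """Minimal sketch: turn an image into a sequence of patch tokens plus a CLS token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=384):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # One linear projection per patch; a strided conv is the usual way to implement it.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                      # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)      # (B, num_patches, dim) -- the "unrolled" patches
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)        # prepend the CLS token
        return x + self.pos_embed             # ready for a standard transformer encoder

tokens = TinyViTEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 384]) -- 196 patch tokens + 1 CLS token
```

The output sequence would then go through a standard transformer encoder, and the final-layer CLS embedding is the representation discussed above.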
Self-supervised learning naturally means you have no labels, and in this case you don't even have a negative-sample mechanism or a contrastive learning mechanism. So you want to train a model that gives you sensible representations, and that is easier said than done if you have no labels. When you do contrastive learning, the setup is that you have an image and you take two patches from it, say, and you have another image and you take a patch from that. Now you have what's called your anchor, and then you have patch A from the same image and patch B from the other image. You present the model all three patches, tell it which one is the anchor, and it needs to decide whether patch A or patch B comes from the same image. You can see how this objective can give you a representation, because the model learns what kind of stuff is likely to be in the same image. That is not the case here: we don't do contrastive learning and we don't have negative samples. We take one image and then we augment that image in different ways.

Augmentations are a kind of science by themselves; I think they say they follow the BYOL paper in terms of augmentations, and I have also made a video on that. Essentially, you apply various random perturbations to the image: you might flip it, apply some color jitter, apply solarization, anything you can do to make the image different while being relatively sure that it still looks like the same image, that you would still recognize it as the same image. Part of these augmentations are crops; what I have shown you here are crops of the same image. They do something special here: given an image, they crop it in two different ways. One kind they call global crops, and these generally cover more than 50% of the image, whereas the other kind they call local crops, and these cover less than 50% of the image. This is going to be important in a moment, so keep the distinction between global and local crops of the same image in mind.

Now we have to understand what's up with this student and this teacher. What we ideally want is two different augmentations of the same image. So here we have an image, and we make two different versions of it: these could be two different crops, to which we apply two different color jitters, two different random rotations, and so on; we just want two different versions of the same image. Our goal, finally, as you can see in the loss, is that the representation we get out of the two is the same. We teach the network: look, these two things might look different, but they are in fact from the same image, just cropped and augmented differently. The easiest thing would be to pass the two through the same network, but that does not work. If you don't have negative samples, your main goal is to avoid what is called collapse: if the network just maps everything to the same representation, it always wins, because then, trivially, the two things are always the same. You don't want that. A trick is to have two different models, one you call the student and one you call the teacher, and those names come from distillation.
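Before going on to the student and teacher, here is a minimal sketch of the global/local cropping described above, using torchvision-style transforms. The scale ranges and jitter strengths are illustrative, not the paper's exact recipe (which, as mentioned, follows BYOL and also includes things like blurring and solarization).

```python
from torchvision import transforms
from PIL import Image

# Shared photometric augmentation applied to every crop.
flip_and_jitter = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

global_crop = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),   # covers a large part of the image
    flip_and_jitter,
])
local_crop = transforms.Compose([
    transforms.RandomResizedCrop(96, scale=(0.05, 0.5)),   # covers a small part of the image
    flip_and_jitter,
])

def multi_crop(img: Image.Image, n_local: int = 6):
    """Return two large 'global' views plus several small 'local' views of one image."""
    return [global_crop(img) for _ in range(2)] + [local_crop(img) for _ in range(n_local)]
```

All of these views come from the same underlying image, which is exactly what the consistency loss below will exploit.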
In classical distillation, you have a dataset and you train a big model, which is the teacher. Then you want to make that model smaller, so that it runs on a mobile phone for example, and that smaller model is the student. There is a procedure where you take the dataset and the teacher model and transfer the knowledge from the teacher to the student, using the dataset to do so, and that usually works better than training the student model from scratch. It is very interesting why that even works, but this process is called distillation, and that is why the two models here are called teacher and student. In this case, however, it is a kind of self-distillation: the teacher and the student are not big and small, they are the same architecture, and in fact we only train the student. The teacher is built from the student, so this is where the terms break down a bit: in the distillation sense, the teacher is still the teacher, but here the teacher is constructed from the student. We train the student to predict the same thing as the teacher, like learning from the teacher, but at the same time, after we have updated the student, we rebuild the teacher from the new student. The way we do this, as you can see right here, is with an exponential moving average: we keep the teacher model, and as we update the student, we simply move the teacher a little bit in the direction of the student. There is also a schedule associated with this exponential moving average, i.e. how strong the decay is, and so on. This all seems loaded with hyperparameters, but the results are really cool, and it remains to be seen how sensitive this whole setup is to those hyperparameters. They do ablations, but we will see how other people with other datasets fare.

So, we have the teacher, which is built from the student by an exponential moving average, and we want the two to predict the same output for different augmentations of the same image. In fact, it is even a bit more complicated; this is the pseudocode. We augment the image and get two different versions of it, and we push both versions through the student and through the teacher. Then, if you can track it, the output t1 of view x1 that went through the teacher needs to match the output of view x2 that went through the student, and the output of x2 going through the teacher should match that of x1 going through the student. So we augment the image differently two times, which gives us two different views of the same image, we run both views through both the teacher and the student, and we want everything to be consistent with everything else: one augmentation through one model should be consistent with the other augmentation through the other model.

There are two more things here. The first is this centering, which is something the teacher does. They also say in the text that the teacher only gets the global crops, whereas the student gets both the global and the local crops.
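Here is a minimal sketch of that update-and-match loop, with a throwaway network and a placeholder loss standing in for the DINO head and its cross-entropy (which are sketched further below). The momentum value 0.996 and all shapes are illustrative choices of mine.

```python
import copy
import torch
import torch.nn as nn

student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))  # stand-in for ViT + head
teacher = copy.deepcopy(student)            # same architecture, initialized from the student
for p in teacher.parameters():
    p.requires_grad = False                 # the teacher never receives gradients

optimizer = torch.optim.SGD(student.parameters(), lr=0.1)

def placeholder_loss(t, s):
    # Stands in for the DINO cross-entropy term H(teacher_out, student_out).
    return ((t.detach() - s) ** 2).mean()

@torch.no_grad()
def ema_update(m=0.996):
    """Move every teacher parameter a small step toward the student (momentum teacher)."""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)

def train_step(x1, x2):
    """x1, x2: two augmented views of the same images; each teacher view supervises the *other* student view."""
    s1, s2 = student(x1), student(x2)
    with torch.no_grad():
        t1, t2 = teacher(x1), teacher(x2)
    loss = placeholder_loss(t1, s2) / 2 + placeholder_loss(t2, s1) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update()

train_step(torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32))
```

The cross-matching (t1 with s2, t2 with s1) is the consistency the transcript describes: one augmentation through one model must agree with the other augmentation through the other model.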
So essentially, if the student gets a local crop and the teacher gets a global crop, the goal is still that both predict the same representation. That means the student has somehow learned that whatever it sees is a little piece of whatever the teacher sees, even though, I should reformulate this, it doesn't actually see what the teacher sees. The student somehow has to, from a very small sub-patch, output something that it itself, or the teacher, which is an averaged version of itself, would also output if it saw more context in the image. So you train the network, for all of these crops and all the different augmentations, to output the same thing without knowing what the other crop is. I honestly think that is the advantage over contrastive representations: in contrastive learning you contrast with the negative samples, whereas here you really don't know anything, you need to output something, and it needs to match whatever you yourself would output if you saw a different part of the image. So you have no choice but to either output the same thing all the time, which is prevented here, or to output something that is about the image as a whole. You can't just output something that is only in your patch, because another patch wouldn't show it: if there is a little tiny structure here, you would not output that, because the other patches don't have it. However, if there is something big in the image, like our traditional cat right here, and you recognize it because you see a little cat ear, then if you output a representation for "cat", you would also do so for the other ear, the paws, the whiskers and so on, and then you win; your loss is small. So you are intrinsically pushed towards outputting something that describes the image as a whole and differentiates it from other images.

Now, what encourages you to be different? That is this centering, and there is also a sharpening in the softmax. First, the centering is simply something you do in the teacher: you keep a running average of all the representations the teacher has seen, and you simply subtract that from the logits down here. That is centering. It is something like a normalization, but not really; what it does is keep the logits in a range that is manageable and has some variance, and as a proxy it also does that to the student, because the student is trained to be like the teacher. So centering is a bit like a normalization here. The second thing is that there is a temperature parameter in the softmax. The softmax function is at the end, it has a temperature parameter, and that temperature is much lower for the teacher than for the student; they call this sharpening. Now, why is there even a softmax? That is what I asked myself. If you think of what you usually do with a representation when you have something like a contrastive or self-supervised loss, you might just apply the loss to the representation itself: an inner product, or an L2 distance between the representations, or something like that.
Here, instead, we do a cross-entropy, and the cross-entropy comes after a softmax. The way I interpret this is the following: a softmax gives you a normalized distribution. However, we have no class labels here, so you, as the implementer of this algorithm, simply choose a number, any number, for the output dimension. After the softmax, whatever you input becomes a distribution over that many things, and you can interpret those as classes: there is class zero, one, two, three and so on, and you get something like "class zero with probability ten percent, class one zero percent, class two forty percent", and so on. You don't know what the classes mean, but you get this as an output. The teacher, having this sharpening, will have a much more peaked distribution: for the same input, it might put very little mass on class zero and class one, and very much on class two. And since the teacher is the target for the student (you see here there is a stop-gradient), the student gets a better learning signal when the teacher is very sure; I guess this is a common trick in distillation. So this sharpening of the teacher makes the target less noisy for the student, and I think it also relates to collapse. They speak of sharpening and centering, and I might mix them up, but one of them furthers collapse and the other prevents it: the sharpening reduces noise but pushes towards a collapsed, peaked output, and the centering counteracts that; while, as they say in the text, the centering alone would bias the output towards the uniform distribution, and the sharpening counteracts that in turn, so the two balance each other.

I am more interested in why there is even a softmax in the first place. I interpret it as follows: you force the model to come up with a K-dimensional classification problem by itself, and it has to choose by itself what the classes are. It has to somehow build representations that allow it to come up with a classification problem it can solve, and I think that is pretty smart: instead of giving it a classification problem, you simply ask it to come up with one. This could go horribly wrong, but apparently, if you do it like this, it goes well.

So that is the DINO architecture. Again: we augment an image in different ways, we put all the augmented versions through the student and through the teacher, and the teacher is an exponential moving average of the student, which gives us different representations of different augmentations of the same image. We take those representations, ship them through a head and a softmax into a distribution, and we require the outputs of the student and the teacher to be the same.
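Here is a minimal sketch of that centered, sharpened cross-entropy and the running-average update, in the spirit of the paper's pseudocode. The output dimension, temperatures, and center momentum below are illustrative values I picked, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

K = 1024                      # output dimension you choose; the "classes" have no labels
center = torch.zeros(1, K)    # running average of teacher outputs

def dino_loss(teacher_out, student_out, tau_t=0.04, tau_s=0.1):
    """Teacher output is centered and sharpened (low temperature), then used as a soft target."""
    t = F.softmax((teacher_out - center) / tau_t, dim=-1)   # centered + sharpened target
    log_s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()                  # cross-entropy against the soft target

@torch.no_grad()
def update_center(teacher_out, m=0.9):
    """Exponential running average of everything the teacher has output."""
    global center
    center = m * center + (1 - m) * teacher_out.mean(dim=0, keepdim=True)

# Toy usage with random "logits" coming out of the teacher and student heads.
t_logits = torch.randn(8, K)
s_logits = torch.randn(8, K, requires_grad=True)
loss = dino_loss(t_logits.detach(), s_logits)   # stop-gradient on the teacher side
loss.backward()
update_center(t_logits)
```

The low teacher temperature is the sharpening, and subtracting the running average is the centering; as discussed above, the two push in opposite directions and together keep the output from collapsing.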
Meanwhile, the teacher has the centering, which centers the logits by an exponential running average of all the representations it has ever seen, it has the sharper softmax, and it has a stop-gradient, so we only train the student. All of this together gives us a system that comes up with good representations and does not collapse.

Now, what does this buy us? It buys us what I essentially showed you at the beginning, and it also buys us k-nearest-neighbor classification, which acts like a zero-shot classifier: right now, I can pump a dataset through the system, come with a new image, and simply do a k-nearest-neighbor lookup; I don't even have to train the network anymore. I can come with a new dataset and do image retrieval, or do linear classification on top of the representation, and all of this works much better than previous systems, no matter the architecture, although it seems to work especially well with vision transformers. If you compare to the best ResNets down here, there is this five percent difference in linear evaluation, which is 25% versus 20% error on ImageNet, and there is an even bigger difference when you look at k-nearest-neighbor classification, which is the rightmost column. They do a lot of experiments, as I said, in image retrieval and in copy detection, which is really interesting; that is where you want to figure out whether someone has taken an image and made another image out of it, although I don't know if that is such a good thing, given that the entire meme culture relies on it.
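As a concrete picture of that k-nearest-neighbor evaluation on frozen features, here is a minimal sketch with a plain majority vote; I believe the paper uses a weighted vote, but the idea is the same. The features here are random placeholders standing in for embeddings from the frozen backbone.

```python
import torch

def knn_classify(train_feats, train_labels, test_feats, k=20):
    """Classify each test feature by majority vote among its k nearest training features."""
    train_feats = torch.nn.functional.normalize(train_feats, dim=1)
    test_feats = torch.nn.functional.normalize(test_feats, dim=1)
    sims = test_feats @ train_feats.T                   # cosine similarities, (n_test, n_train)
    _, idx = sims.topk(k, dim=1)                        # indices of the k closest training images
    neighbor_labels = train_labels[idx]                 # (n_test, k)
    return torch.mode(neighbor_labels, dim=1).values    # majority vote per test image

# Toy usage; in practice these features come from the frozen DINO backbone.
train_feats = torch.randn(1000, 384)
train_labels = torch.randint(0, 10, (1000,))
test_feats = torch.randn(5, 384)
print(knn_classify(train_feats, train_labels, test_feats))
```

No training is involved at all: the labeled set is embedded once, and new images are classified purely by proximity in feature space.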
If you look at the CLS token, which is ultimately where the representation you take comes from, and you visualize the attention maps of its attention heads, it gives you not just a segmentation map: it does not only tell you where to look, it even seems to segment the individual objects. In the zebra (sorry, I said horse) you can see the straps, in the trucks you can see that the wheels are separate from the truck body, and so on. They do ablations and compare with supervised baselines, and you can see this works much better. What I think is pretty cool is down in the appendix somewhere, where they have more of these attention maps compared to supervised attention maps, and the comparison is very, very strong. What I think is happening is that if you give these networks a supervised problem, they do pay attention, for example here to the cat's face or the ear; you can see the cat shape. However, there is this shortcut learning, which I think is partly a dataset problem, but also a supervised system just stops learning once it has mastered the task, or it tries out various optimizations for the task you give it. These optimizations, I think, are what pops up all over the place as these little specks of attention: they might not make sense in this particular image, but the same attention pattern, the same thing to pay attention to, might make a lot of sense in three other images in the dataset. That is why they are there. Whereas if you do this unsupervised, there is no hyper-optimization on a single task, and, especially since you can also use many more images in an unsupervised setting, you can't hyper-optimize for individual samples either. So that is one thing.

And here is this complete map of ImageNet, I think. Maybe you can't read it, but here is "tractor", and right next to it are "harvester" and "thresher"; there is "minibus" down here, so all the vehicles are clustered together; and there are "butcher shop" and "grocery store" right next to each other. These appear to be really, really good representations.

Now the question is why. This was the paper; I encourage you to go read the experiment section, there are very cool ablations where they show why exactly they use this loss, what happens without the momentum teacher, and so on. But what interests me is: why does this give you such extraordinary representations in an unsupervised fashion? I have two hypotheses, two things that I think contribute most to this. The first is the augmentations. Augmentations have played a large role, not so much in NLP, where we do things a little differently, but augmentations in computer vision and self-supervised learning have a central role, and it is really important that you have the correct ones, which is something they also stress right here: they really stress that this multi-crop augmentation is quite important. So augmentations seem to be central, and to me, augmentations are where you put the human prior: that is where you tell the model what it should pay attention to and what it shouldn't. Everything you destroy with an augmentation, like making the colors brighter, tells the model that color doesn't matter, or that brightness variations don't matter. So by augmenting, you tell the model what it shouldn't pay attention to. It is the same as when you have a dataset of dogs and cats and you label it: by saying "this is a dog, this is a dog", you essentially tell the model not to pay attention to what is different between these images, but only to what is the same. With augmentations, that is where the knowledge goes in. So if we want to go towards fully autonomous self-supervised learning, that is what we need to get rid of: we need to get rid of us designing augmentations for the domain, if we want this to be domain-agnostic, and also if we want better image representations, because the probability that we as humans capture exactly the correct augmentations is zero. We seem to capture pretty good ones, but the probability that we have the best ones is zero.

The second thing, which I think is more hidden, is the dataset, and what I mean is how the dataset is constructed. These systems are often trained on something like the ImageNet dataset, and you can see in these pictures that there always seems to be an object of interest in them. Even if you train this from pictures in the wild, like pictures scraped from Instagram or wherever, people don't take pictures of random things.
It would be pretty weird to have a picture of just a dirt road with a bit of grass, post it on social media and say "whoa, look at this". So by how you construct the dataset, even if you scrape it from the internet, by how humanity takes pictures, you are implicitly telling the model what is important. How you make the dataset says a lot about where your attention goes, and that is what you feed the model. These self-supervised methods therefore rely a lot on dataset construction, and we shouldn't expect this to transfer to domains where we get random, i.i.d. data from the world, because these datasets aren't i.i.d.; we tell the model pretty clearly, by the data we give it, what is important and what isn't. That is a little bit of my opinion, and I think it is correct: with self-supervised learning, the information should be taken from the dataset, so the model should look at the data and figure out, given how the dataset is constructed, what the important things in it seem to be. I am more a fan of getting rid of the augmentations; that is my opinion. If you want more, go through the experiments; it is also faster and has fewer parameters, and so on. But again, DINO is a method of self-supervised learning, and their argument is that it combines naturally well with the vision transformer. That was it from me. Check out the paper, check out the blog, subscribe, share, and bye bye.
Info
Channel: Yannic Kilcher
Views: 54,006
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, deep learning tutorial, what is deep learning, introduction to deep learning, facebook, facebook ai, fair, byol, swav, self supervised learning, unsupervised feature learning, unsupervised machine learning, feature engineering, stop gradient, dino, self distillation, self-distillation, segmentation maps, visual transformer, visual transformer self supervised, imagenet
Id: h3ij3F3cPIk
Length: 39min 12sec (2352 seconds)
Published: Sat May 01 2021