Supervised Contrastive Learning

Video Statistics and Information

Captions
Hi there. Today we're looking at Supervised Contrastive Learning, by people from Google Research and MIT. This paper proposes a new loss for supervised learning, and you might recognize that this is a big claim: for basically forever now we have used the cross-entropy loss to do supervised training of neural networks, and this paper proposes to replace that with the supervised contrastive loss.

Let's jump straight into the results. They say their supervised contrastive loss outperforms the cross-entropy loss with standard data augmentations such as AutoAugment and RandAugment. These are some of the previous state-of-the-art data augmentation techniques used together with the cross-entropy loss, and they say their supervised contrastive loss outperforms them. You can see that on ImageNet, the biggest (or at least the most famous) vision benchmark, this new supervised contrastive loss outperforms these other methods by something like one percent, and one percent is a big improvement on ImageNet right now. So it is a big claim: if this is true, it could be a game-changer for basically all of supervised learning, and supervised learning is really the only thing in deep learning right now that works, so it could revolutionize the field.

But, and here's the but: it is not actually a new loss that replaces the cross-entropy loss. They do come around to this pretty quickly, and I don't think they're being dishonest or lying here, but if you just start reading you might think it is a new loss. It is not. It is a new way of pre-training the network for a classification task.

So let's look at what it means to build a classifier. This is what you usually do with supervised cross-entropy training: you have an image, here of a dog, you put it through your network, and you obtain a representation. The representation R is the last, or second-to-last, layer. You put that through a classification layer and then a softmax, and what you get as output is a probability distribution. Let's say you have three classes: dog, cat and horse. If the network isn't trained very well yet, the probability for dog might be fairly low. This is what the network thinks of that image: which class it belongs to, with what probability.

You also have the label, in this case "dog", and you turn that into a one-hot vector, where the one sits at the position of the correct class. The cross-entropy loss then does the following: there is a sum over all your classes, three in this case, and for each class you take the label l of that class times the log probability that the network assigns to that class. You can quickly see that whenever the label is 0, i.e. for all the incorrect classes, the entire term drops away; only where the label is 1, i.e. for the correct class, do you keep the log probability of that class. To make this a loss you put a negative sign in front, because you want something to minimize. So the entire thing reduces to the negative log probability of the correct class: minimizing it means maximizing the probability of the correct class.
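To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the paper) of how the cross-entropy loss with a one-hot label reduces to the negative log probability of the correct class:

import numpy as np

def softmax(logits):
    # subtract the max for numerical stability, then normalize to a distribution
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# three classes: dog, cat, horse; the true label is "dog" (index 0)
logits = np.array([1.0, 2.5, 0.3])   # raw network outputs for one image
probs = softmax(logits)
one_hot = np.array([1.0, 0.0, 0.0])  # one-hot encoding of the label "dog"

# full cross-entropy: sum over classes of label times log probability ...
loss_full = -np.sum(one_hot * np.log(probs))
# ... which reduces to the negative log probability of the correct class
loss_reduced = -np.log(probs[0])

assert np.isclose(loss_full, loss_reduced)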
If you've never looked at the cross-entropy loss like this, you might say: all this does is pull the correct class up, and it doesn't do anything to the other ones. But you have to realize that because of the softmax operation this is a probability distribution, everything is normalized to sum to one, so implicitly you push the other classes down through the normalization. So what this does is push the correct class up and push the other classes down. This way of looking at it is going to be important later.

Now look at what this representation does. Again, the network produces a representation, say 2048-dimensional, and on top of it sits the classification layer, which is simply a linear layer followed by a softmax. The way to imagine this is that there is a representation space, and the representations are shaped in such a way (let's have three classes again) that a linear classifier can separate the classes correctly: one decision boundary here, another one there, maybe another one over there. That is the goal. If you use the softmax cross-entropy loss, that is implicitly what happens in the representation space: all it cares about is that each class is on one side of its decision boundary and everything else is on the other side. If the network isn't trained very well at the beginning and you have, say, a sample of the green class on the wrong side, the loss will push the network so that the representation of that sample moves to the other side of the decision boundary, and it will push the decision boundary at the same time to make that happen more easily. Everything is optimized at the same time; that's how the representations are learned.

This work, and other work before it, asks: wouldn't it be great if the representations and the decision boundaries weren't trained at the same time, but we learned good representations first, such that classifying them becomes very simple? In essence this paper says: in the representation space, shouldn't images of the same class just be close together? Without caring about decision boundaries, we want same-class images to be close to each other and far apart from other classes. If that happens, a linear classifier is going to have a very easy time separating these classes later.

That's exactly what this paper does. It has a pre-training stage and a training stage. In the pre-training stage, the supervised contrastive stage, it simply tries to learn representations such that, without any decision boundaries, images of the same class are close together and images of different classes are far apart. Notice the subtle difference to the cross-entropy loss, where you just care about them being on one or the other side of a decision boundary. Then, in stage two, you freeze the network: these weights, including the representation layer, are frozen and not trained anymore. All you train is the one classification layer on top, and you train it using softmax and the cross-entropy loss, so the classifier is trained in the old cross-entropy way using just normal supervised learning. The difference is that the stage-one pre-training is what trains the network, and the cross-entropy loss only trains the classifier.
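Schematically, stage two is a linear probe on a frozen encoder. Here is a minimal PyTorch-style sketch of that stage; the ResNet-50 encoder, the feature dimension and the training-step helper are my own placeholders, not the authors' code:

import torch
import torch.nn as nn
import torchvision

# stand-in encoder: a ResNet-50 trunk with its final fc layer removed,
# imagined to have been trained in stage one with the contrastive objective
encoder = torchvision.models.resnet50(weights=None)
feat_dim = encoder.fc.in_features        # 2048 for ResNet-50
encoder.fc = nn.Identity()

# stage two: freeze the encoder and train only a linear classifier with cross-entropy
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

num_classes = 3                          # dog, cat, horse in the running example
classifier = nn.Linear(feat_dim, num_classes)
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def linear_probe_step(images, labels):
    # images, labels would come from a normal labeled DataLoader
    with torch.no_grad():
        representations = encoder(images)    # frozen features
    logits = classifier(representations)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()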
So let's look at how this pre-training actually works. What it uses is a method called contrastive pre-training, and they have a little diagram up here. To understand the classic way of doing contrastive pre-training, you have to go to the unsupervised pre-training literature. People have discovered that they can improve a neural network by first pre-training it in an unsupervised way; some of these methods are also called self-supervised. The advantage of self-supervised or unsupervised pre-training is that you don't need labels. What you want is simply to make the representation space somewhat meaningful: you want the network to learn representations of images that carry some meaning.

Here's how you do it. You take an image, like this dog, and you randomly augment it, which just means you produce different versions of the same image. Down here is a random crop; it's still the same image, but a different version of it. Here you can see it's flipped left-right and the brightness is slightly increased. These are just different versions of the same image. What you also need are so-called negatives. Negatives are simply different images from your dataset; you don't care which ones, as long as they're different, you just sample a bunch. You then have an embedding space, and they make a big deal of the fact that the embeddings are normalized, which seems to work better, but it's not necessary for the idea to work.

The big idea is this: if you have an anchor image, say the dog, then the blue dots here are the augmented versions of the same dog, and the green dots are all the other images in the dataset. You want all the versions that come from the same original image to be pulled close together, and everything else to be pushed apart. That's why the former are called positives and the latter negatives. Contrastive training means you always have a set that you pull together in representation space and a set, the negatives, that you push apart. The network thereby learns about these random transformations: it learns what it means to come from the same image, it learns to be robust to these kinds of transformations, and it learns about the data in general and how to spread it out in embedding space. This usually results in a pretty good representation space, and people have been using it in recent years to gain significant improvements.
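As an illustration of "different versions of the same image", here is a small torchvision sketch of a two-view augmentation pipeline with the transformations mentioned above (random crop, horizontal flip, brightness jitter); the exact augmentation policy is my assumption, not necessarily the one used in the paper:

from torchvision import transforms

# one random augmentation pipeline: crop, flip, brightness change
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4),
    transforms.ToTensor(),
])

def two_views(pil_image):
    # applying the same random pipeline twice gives two different "positive" views
    # that both come from the same original image
    return augment(pil_image), augment(pil_image)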
Now, the problem, if you do this specifically to pre-train a classifier, is the thing they show on the right. On the left you have a picture of a dog, but since you do this self-supervised, without the labels, it can happen that an image that is also of a dog shows up among the negatives. That image might end up here as a green dot, so it's going to get pushed apart from the anchor, and this makes the task for the later classifier much harder: if same-class samples are pushed apart from each other, how is a linear classifier going to put them on the same side of the decision boundary while having everything else on the other side? So the self-supervised objective implicitly makes the task for the later classifier harder by pushing apart samples that should be of the same class.

This does not happen if you introduce the labels into the pre-training objective, and that's what they do with the supervised contrastive objective. Again, draw the same embedding space, with the original dog image and its augmented versions. But now we also have other images of the same class, drawn in black, with their augmented versions as smaller black dots around them; you can augment them as well. And the negative samples are not just any images, but images of different classes. You go over your mini-batch: everything of the same class becomes a positive, including its augmentations, and everything that is not of the same class becomes a negative, which you can also augment.

So now we have a bunch of things in our embedding space, and the objective is simply: push away all images that are not of the same class as our original red image, which is called the anchor, and pull together all the augmented versions of the anchor, but also all the other images of the same class, including their augmented versions. So not only does the network learn about the augmentations, which, again, aren't even necessary for this idea; it learns a representation space where images of the same class are close together, which again makes the task of the later linear classifier that needs to separate this class from the other classes very easy. And the other images aren't just pushed away: if, say, these two images are from the same class as each other, they get pushed apart from the red anchor while themselves being pulled together into their own cluster of their own class. I hope this makes sense, and I hope the difference to the cross-entropy objective is clear: the cross-entropy objective, from the beginning, only cares about which side of the decision boundary you are on, while this pre-training objective first cares about putting things of the same class close together, and then the classifier has a much easier time.

Why this works better isn't entirely clear from the outset, because it's working with the same information. It's just that people have generally found that these contrastive pre-training objectives are somewhat better at exploiting the information in the dataset than just hammering on it with the cross-entropy loss from the beginning.
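Here is a minimal sketch of a supervised contrastive loss over a batch of embeddings, where the positives of each anchor are all other samples with the same label. This is my own simplified paraphrase of the idea, not the authors' reference implementation:

import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """features: (N, D) embeddings, labels: (N,) integer class ids."""
    features = F.normalize(features, dim=1)              # unit-norm embeddings
    sim = features @ features.t() / temperature          # pairwise similarities
    n = features.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=features.device)

    # log-softmax over all other samples in the batch (the anchor itself is excluded)
    sim = sim.masked_fill(self_mask, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # positives: same label as the anchor, but not the anchor itself
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    # average log-probability over each anchor's positives, then negate;
    # anchors with no positive in the batch contribute zero in this sketch
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0).sum(dim=1) / pos_counts)
    return loss.mean()

Here `features` would typically be the output of a projection head on top of the encoder; the normalization to the unit sphere and the temperature are the ingredients the video mentions as working better in practice.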
But it is not fully explained why this works better, since it's working with the same data. Again, the difference is that the previous contrastive pre-training methods, the self-supervised ones, did not have access to the labels. The advantage of that is that you can have a giant database of additional unlabeled data to do the pre-training on, whereas here the pre-training includes the labels: the label "dog" is an intrinsic part, because we need to know which samples to pull together. But that also means we cannot leverage additional unlabeled data, and unlabeled data is pretty cheap to obtain. So those are the advantages and disadvantages.

They do compare their loss formulation to the self-supervised one. Usually in these contrastive objectives you have two encoders, one for the anchor and one for the augmented versions, possibly with momentum or shared weights and so on; all of that isn't really important here. If you want to look into it, look at papers like Momentum Contrast, or I did a video on CURL for reinforcement learning. The general gist is clear. The self-supervised loss usually takes a form like this: z_i is the anchor, z_j is the positive example, and the inner product between the anchor and the positive should be high, because the loss is the negative of that term, so minimizing the loss means making the inner product between the anchor and the positive high, while the inner products with everything else in the denominator should be low. That is exactly the pull-together/push-apart objective from before; it is the standard objective people had.

They extend this, and it looks almost the same. Compared to the unsupervised objective, they first generalize it so that you can have an arbitrary number of positive samples, which would also be possible in the unsupervised setting, and then, crucially, they include the labels in the pre-training objective: everywhere that i and j have the same label, the inner product should be maximized, so those get pulled together, while everything else gets pushed apart. They also say the contrastive power increases with more negatives; I think that's just a finding they have, that when they add more negatives, i.e. increase the batch size, the method works better.
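Written out, the two objectives are roughly of the following form (my transcription of the standard formulations; the notation may differ slightly from the paper's):

\[
\mathcal{L}^{\mathrm{self}} \;=\; \sum_{i} -\log \frac{\exp(z_i \cdot z_{j(i)} / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)},
\qquad
\mathcal{L}^{\mathrm{sup}} \;=\; \sum_{i} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)},
\]

where \(z_i\) is the normalized embedding of the anchor, \(j(i)\) indexes its augmented view, \(A(i)\) is everything else in the batch, \(P(i)\) is the set of samples sharing the anchor's label, and \(\tau\) is a temperature.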
They also analyze the gradient of their loss, which I find pretty neat. Of course, if you formulate a loss, the gradient is going to point in its negative direction, but they make it explicit: if you look at the gradient for the positive cases, a (1 - P_ij) factor appears, where P_ij is essentially the normalized inner product between z_i and z_j. The gradient points in the negative direction of that for the positives, which means you pull them together, and in the opposite direction for the negative classes, which means you push them apart.

They also analyze what happens with relation to hardness. If you just look at the positive samples, there are two kinds. There are easy positives, where the network has already learned to match them closely and the inner product is almost one; that means P_ij is large, and the (1 - P_ij) term we saw in the gradient is close to zero, and the other factor is close to zero as well, so the gradient magnitude is almost zero. But if you have a hard positive, where the network hasn't yet learned to align the representations properly and the embeddings (again, these are normalized) are approximately orthogonal, then the inner product is close to zero, both factors are close to one, and the gradient magnitude is clearly larger than zero. So their loss focuses on the examples that the network cannot yet represent well according to the objective, which makes sense. But second of all, that is exactly the same thing as in the cross-entropy loss: if the network is already really good for a given sample, if it already confidently puts the dog into the dog class, the gradient will not pull much for that sample; the loss mainly focuses on where you're still wrong. So I appreciate the analysis, but it is not a notable difference. I think what they want to show is that their loss, under gradient descent, really does what it is supposed to do: it pulls together and pushes apart the inner products for positive and negative samples, and it mainly focuses on pairs that are not yet correctly close together or far apart.

They also connect this to the triplet loss: they show, after some approximation, that if their loss has only one positive and one negative sample, it is proportional to the triplet loss. The triplet loss is where you have an anchor image, one positive of the same class, and one negative of a different class, and you try to pull anchor and positive together while pushing anchor and negative apart. The problem there, they say, is hard negative sampling: for this to make sense, the negative needs to be what's called a hard negative, because with only one negative sample you'd better make it something the network can learn from; if it's too easy, the network can't learn anything. So you get the problem of hard negative mining, where you often have to filter through your mini-batch or even through your dataset to find a good negative sample to go along with the pair of positive samples. But I don't really see how their method avoids this, except that it has a bunch of positive and negative samples, which I guess you could also apply to the triplet loss.
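For reference, here is a minimal sketch of the standard margin-based triplet loss they compare against (a generic textbook formulation, not the paper's exact derivation):

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # anchor, positive, negative: (N, D) embeddings
    # pull anchor and positive together, push anchor and negative apart,
    # but the term is only nonzero while the negative is within `margin` of the
    # positive, i.e. only "hard enough" negatives produce a gradient
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()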
Again, if your method is a contrastive method, you do have the problem that if you simply sample at random, your negative samples are going to become easier and easier over the course of training, and at some point you're going to have to actively sample hard negatives. I think this paper just gets around that by having huge batch sizes.

That said, they do get state of the art on ImageNet for these types of networks and augmentation strategies, and they show that their loss appears to be more hyperparameter-stable: if they change the augmentation, the optimizer or the learning rate, the spread in accuracy is much smaller than for the cross-entropy loss, except in one case. But it is hard to compare variances of things that don't have the same mean accuracy, so take the plot on the right with a grain of salt.

They also evaluate on corrupted ImageNet, a version of the ImageNet dataset with several levels of corruption. Accuracy goes down with corruption, but the accuracy for the cross-entropy loss goes down faster than for the supervised contrastive loss: they start together and then drift apart. It is not clear to me whether that is really an effect of the difference between the losses, or just an effect of the fact that they don't start at the same accuracy; if you trained the cross-entropy model to the same level, maybe it would fall off at the same speed, or maybe it would match the curve. Again, you can't really compare things that have different means in the first place, but it is an interesting finding that their method is more stable to these corruptions.

I just want to point out their training details at the end: they train for up to seven hundred epochs during the pre-training stage, which is, I think, standard for this kind of pre-training but still a lot, and they train models with batch sizes up to 8192, so you need something like a giant TPU cluster to run this. I'm never entirely trusting of numbers like this: even though it's a decent improvement, it is still only about a one percent improvement, and at these margins I feel there might be such a big influence from things like batch size, how much compute you put in, and what else you're doing, that I first want to see this replicated multiple times across the field before I really trust that this is a good thing to do.

Alright, I hope you liked this. If you're still here, thank you; consider subscribing. If you have a comment, please leave it, I usually read them. And with that, bye bye.
Info
Channel: Yannic Kilcher
Views: 29,050
Keywords: deep learning, machine learning, supervised learning, classification, classifier, labels, pretraining, unsupervised, self-supervised, representation learning, representations, hidden space, loss function, google, mit, imagenet
Id: MpdbFLXOOIw
Length: 30min 8sec (1808 seconds)
Published: Fri Apr 24 2020