Learning To Classify Images Without Labels (Paper Explained)

Video Statistics and Information

Captions
Hi there. Check out these clusters of images and just look at how all of them show pretty much the same object: here are balloons, here are birds, here are sharks or other fish. These are images from the ImageNet dataset, and you can see that the clusters pretty much correspond to the object classes themselves: all the frogs are together, as are all the people who have caught a fish. The astonishing thing is that these clusters have been obtained without any labels. Of course the ImageNet dataset has labels, but this method doesn't use them; it learns to classify images without labels. So today we're looking at the paper "Learning To Classify Images Without Labels" by Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans and Luc Van Gool. On a high level they have a three-step procedure: first, they use self-supervised learning to get good representations; second, they do a clustering, a sort of nearest-neighbor-based clustering on top of those representations, but in a special way; and third, they refine it through self-labeling. If you know what all of these are, you basically understand the paper already, but there are a few tricky steps in there, and it's pretty cool that in the end it works out like you just saw. Before we dive in, as always: if you're here and not subscribed, please do, and if you liked the video, share it out and leave a comment if you feel like commenting.

The question they ask is: is it possible to automatically classify images without the use of ground-truth annotations, or even when the classes themselves are not known a priori? That might sound outrageous; how can you classify when you don't even know what the classes are? The way you have to imagine it, and they don't explicitly explain this but it's assumed, is that if you have a dataset and you "learn to classify" it, what that really means is that you cluster it: you put some of the data points into the same clusters. The dataset also has an actual labeling, say class 0, class 1 and class 2, but you can't possibly know what the classes are called or which one is which. So at test time, with a method that doesn't use labels, you are as generous as possible in the assignment: you ask, if I map this cluster to class 0, this one to class 2 and this one to class 1 and carry the labels over, what would my accuracy be under that mapping? You take the most favorable assignment of cluster labels to true labels. Keep that in mind going forward: we're developing an algorithm that gives us a clustering of the data, and if that clustering partitions the data the same way the actual test labels do, we call it a good algorithm.
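As a concrete illustration of this evaluation (my own sketch, not code from the paper), here is how such a most-generous matching is typically computed with the Hungarian algorithm; it assumes integer label arrays and as many predicted clusters as true classes:

```python
# Sketch: clustering accuracy via the most generous one-to-one assignment
# of predicted clusters to ground-truth classes (Hungarian matching).
# Assumes clusters and classes are both indexed 0..num_classes-1.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred, num_classes):
    counts = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1              # rows: predicted cluster, cols: true class
    # The Hungarian algorithm maximizes the matched counts
    # (SciPy minimizes, hence the negation).
    rows, cols = linear_sum_assignment(-counts)
    return counts[rows, cols].sum() / len(y_true)

# Example: three clusters that match three classes up to relabeling -> accuracy 1.0
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])
print(clustering_accuracy(y_true, y_pred, num_classes=3))
```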
In this paper they deviate from recent works and advocate a two-step approach (really a three-step approach) in which feature learning and clustering are decoupled. Why is that? What you could do, and what people have done, is simply cluster the data; nobody says you can't use clustering algorithms. But then the question is: what do you cluster by? You need a distance. If I have points in 2D, the Euclidean distance makes sense, but if I have images of cats and dogs, the Euclidean distance between the pixels is really not a good measure. So you might think: we could use a deep neural network, send the image through it, and take one of the hidden states, say the last hidden representation, and do the clustering on that. But then the question becomes which neural network to take and how to train it.

There have been approaches such as DeepCluster that formulate an objective for that network: you send a bunch of images through to get points in an embedding space, and you argue that the features in that space are somehow meaningful. The reasoning is that if this neural network were used to classify images, it would have a classification head on top, and a classification head, say a five-class one, is nothing other than a linear classifier on top of the hidden representation. So if the network is to be usable for classification, it must be possible to draw linear boundaries between classes in that space, and therefore distances like the inner product or the Euclidean distance must make sense there, even though they don't make sense in pixel space. The conclusion is that in this hidden space you should be able to cluster by Euclidean distance. DeepCluster alternates: start with a random neural network, get representations, cluster them, self-label the images based on the clusters (I'm oversimplifying the technique), then use those labels to learn better representations, cluster again, and so on. The argument is that the CNN itself acts as a prior: because it is translation invariant and works very well for natural images, it will lead to good representations even starting from random weights, and there are some good results there. This paper argues, however, that if you do that, the algorithm tends to latch onto very low-level features.
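For reference, here is a rough sketch of that naive baseline: embed images with a CNN and run Euclidean k-means on the features. The backbone choice (a torchvision ResNet-18) and the `images` tensor are my illustrative assumptions; DeepCluster itself is more involved than this.

```python
# Naive baseline sketch: cluster CNN feature vectors with k-means.
import torch
from torchvision.models import resnet18
from sklearn.cluster import KMeans

def cluster_cnn_features(images, num_clusters=10):
    backbone = resnet18(weights=None)       # random weights; could also be pretrained
    backbone.fc = torch.nn.Identity()       # keep the penultimate 512-d features
    backbone.eval()
    with torch.no_grad():
        feats = backbone(images)            # (N, 512)
    # Euclidean k-means in feature space.
    return KMeans(n_clusters=num_clusters, n_init=10).fit_predict(feats.numpy())
```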
For example, if the pixel in the bottom-right corner is blue, and the network by chance puts two images with a blue bottom-right pixel close together, then the next step will cluster them together and feed back that they should be in the same class, so the network will focus even more on that blue pixel. The procedure is very dependent on initialization and can easily latch onto low-level features that have nothing to do with the high-level task you ultimately want to solve, which is to classify these images.

What this paper does is eliminate that tendency of such methods to produce networks that focus on low-level features, and it does so through representation learning, which you may know as self-supervised learning. That is the task they solve in the first step. Here X is an image, and T is a transformation of that image. In self-supervised learning there are several ways to transform an image: you can random-crop it, cutting out a piece and scaling it up to the size of the original, or you can apply data augmentation, for example distorting or rotating the image. All of these, including the crops, are in the set of transformations T. You transform the image in some way, and then you send the original image and the transformed image through a neural network, each one by itself, and you require that the two hidden representations be close to each other. This is the self-supervised training task, and it has been shown to work very well as a pre-training method for classification networks: you take an image and its augmented version and you minimize the inner-product or Euclidean distance between the two in the hidden space. The rationale is the same as before: this hidden space should be linearly classifiable, so the distance between the two versions should be small. And the rationale for these augmentation tasks is that if I flip the image, the network cannot rely on the pixel in the bottom right anymore, because it won't be in the bottom right after the flip, and I won't always flip in the same direction; sometimes I crop, and then that pixel isn't even in the crop. What you're doing with these self-supervised methods is deliberately destroying low-level information: you build a pipeline with tasks that exclude that information from being useful. I think that's what's generally going on in self-supervised learning. So this is the network you train: you send both the original and the augmented version through the same network and minimize some distance, usually based on the inner product or the Euclidean distance, in the embedding space.
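As a rough sketch of this idea (my illustration; the paper uses an instance-discrimination pretext task such as SimCLR, and the function names, the `augment` callable and the temperature here are assumptions): pull the embeddings of an image and its augmented view together, while pushing apart the other images in the batch so the representation does not collapse.

```python
import torch
import torch.nn.functional as F

def augmentation_consistency_loss(model, images, augment, temperature=0.1):
    z1 = F.normalize(model(images), dim=1)           # embeddings of the originals
    z2 = F.normalize(model(augment(images)), dim=1)  # embeddings of augmented views
    logits = z1 @ z2.t() / temperature               # pairwise cosine similarities
    targets = torch.arange(len(images))              # positive pair = same index
    # High similarity between an image and its own augmentation,
    # low similarity to the other images in the batch.
    return F.cross_entropy(logits, targets)
```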
What you train are the parameters of this neural network: the transformations are fixed or sampled, the distance is fixed, and you train the network so that the embeddings minimize this objective. This is nothing new; it has been used for a couple of years now to get better representations, and it's called self-supervised because you don't need any labels for it. The authors' point is that they can use it as an initialization step for the clustering procedure, because without it you end up focusing on low-level features.

The second part is the clustering. They don't just cluster these representations directly; that doesn't perform very well in their experiments. Instead they minimize a dedicated objective, and we'll go through it step by step. They train a new neural network. In step one you obtained Phi_theta, which maps an image X to a representation of X via self-supervised learning. In step two you train an entirely new network, Phi_eta, with the same architecture, and I think they initialize it from the first one. You now train it to maximize an inner product between two things, which is the same as minimizing a distance between them if you use the dot-product distance. The two things are two images that go through the same network, as before, but with one difference: one input is an image from the dataset, just as before, while the other is not an augmented version of it but some K drawn from the neighbor set N_X of X. These neighbors are determined with respect to the first network. So after step one, you take the network with the good embeddings and your dataset X, which is basically the list of all your images, and you embed every image into the latent space where you did the self-supervised training. Then for each image X_i you find its K nearest neighbors; they use five as the default. Every image gets its five nearest neighbors, and in step two you will pull each image and its nearest neighbors together, not directly in that embedding space, but with neighbors determined by the first network, which you keep fixed.
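A sketch of this neighbor-mining step could look as follows (illustrative names; in practice you would embed the dataset in batches rather than as one tensor):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mine_nearest_neighbors(pretext_model, dataset_images, k=5):
    pretext_model.eval()
    feats = F.normalize(pretext_model(dataset_images), dim=1)  # (N, D)
    sims = feats @ feats.t()                                   # cosine similarities
    sims.fill_diagonal_(-1.0)                                  # exclude the image itself
    return sims.topk(k, dim=1).indices                         # (N, k) neighbor indices
```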
That's how you determine the nearest neighbors from the first task, and that set is your N_X for each X_i. In the second step you then make the representations of an image and its nearest neighbors closer to each other: you maximize the inner product between X, after this new network, and a nearest neighbor of X, where "nearest neighbor" was defined by the first task.

The way they cluster here is not just by embedding into a space as before: the output of this network is a C-dimensional vector in [0, 1], where C is the number of classes. You don't know which class is which, since you have no labels, but you may know how many classes there are, or you can guess; as long as you over-guess you can still build super-clusters later. They write that the output is in [0, 1] and that it is a soft assignment, so we'll also assume it is normalized. For each data point X you put the image through this new network and it gives you, essentially, a histogram over classes, say class 1, 2 or 3 if we guess there are three classes. You also take a nearest neighbor of that image from your dataset, say a dog (that's the best dog I can draw, sorry), and put it through the same network, which gives you another assignment over the three classes. Since the two were nearest neighbors in task one, they must share some interesting high-level features, because that's what the first task was for; therefore you want to make them closer together under this network, and you train it to match the two distributions. So this is now a classifier into C classes, but C is guessed and we have no labels; our "label" is simply that my neighbors from the first task must get the same assignment as I do.

They also have a second term, the entropy over assignments. They minimize an objective that contains the negative log of this inner product plus an entropy-based term, where the relevant probability is the probability, over the entire dataset, that an image is assigned to cluster c. They say: we include an entropy term (the second term in Equation 2), which spreads the predictions uniformly across the clusters C. So what we want is a uniform assignment over clusters across the dataset, which means we want to maximize the entropy.
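Putting the two terms together, here is a hedged sketch of this clustering loss (my reading of Equation 2; the batch mean stands in for the dataset-wide average, and the entropy weight is an assumed value, not the paper's exact setting):

```python
import torch

def scan_loss(probs_anchor, probs_neighbor, entropy_weight=5.0, eps=1e-8):
    # probs_*: (B, C) softmax outputs of the clustering head for an image
    # and for one of its mined nearest neighbors.
    consistency = torch.log((probs_anchor * probs_neighbor).sum(dim=1) + eps).mean()
    mean_probs = probs_anchor.mean(dim=0)      # batch estimate of cluster usage
    neg_entropy = (mean_probs * torch.log(mean_probs + eps)).sum()
    # Minimizing -consistency pulls neighbors onto the same cluster; minimizing
    # the negative entropy (i.e. maximizing entropy) spreads the clusters evenly.
    return -consistency + entropy_weight * neg_entropy
```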
The signs confused me for a moment, but it works out: the term they add is the negative entropy, the sum of P log P, and since they minimize the whole objective, minimizing the negative entropy means maximizing the entropy. What they want is that, over the whole dataset, not all images end up in the same cluster; the more evenly the dataset is spread over clusters one, two and three, the higher the entropy and the lower the negative entropy, and that is the goal of this term. Then you minimize the entire thing.

They also say something interesting here. The angle bracket denotes the dot-product operator, the dot product between these two assignment distributions, and the first term in Equation 2 forces the network to make consistent predictions for a sample X_i and its neighboring samples. They note that the dot product is maximal when the predictions are one-hot (confident) and assigned to the same cluster (consistent). So the objective encourages confidence, because it pushes predictions towards one-hot, and consistency, because the two distributions need to agree, i.e. land in the same cluster. I agree with the consistency part: if you make the inner product of two of these histograms high, they will of course agree on their predictions. But at first I disagreed that this encourages anything to be one-hot: two vectors that are both (1, 0) have inner product 1, and I thought two assignments of (0.5, 0.5) would too; working it out, though, 0.5 times 0.5 plus 0.5 times 0.5 is 0.5. An embarrassingly long time later: oh, it's because of the L1 norm. I was thinking of vectors normalized in L2 space, where the inner product with themselves is always 1, but of course if you have assignments over classes, a probability distribution, a histogram, then all possible assignments lie on the simplex. The inner product of a vector with itself is its squared length, and a vector pointing at one class is longer than one pointing in between. So that's where the "must be one-hot" comes from; I'll give them that. It really does encourage one-hot predictions as long as these vectors are normalized in L1 space, which they are, because they're histograms. That was dumbness on my part; my counterexample turned out to be a counterexample to my counterexample, and they are of course correct here.
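For completeness, the argument in one line, assuming the outputs are softmax probabilities (non-negative and L1-normalized):

```latex
\langle p, p \rangle \;=\; \sum_c p_c^2 \;\le\; \Big(\sum_c p_c\Big)^2 \;=\; 1,
\qquad \text{with equality iff } p \text{ is one-hot.}
```

So, for example, (0.5, 0.5) paired with itself gives 0.5, while (1, 0) paired with itself gives 1.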
Now to the first experiments. They show that after the first step, the self-supervised training, they can already retrieve nearest neighbors, and the nearest neighbors of the images on the left are the ones you see on the right. After only the self-supervised step these neighbors are already crazily good at sharing high-level features: the flute appears in different sizes, the fish aren't all exactly the same, the birds differ, so it really focuses on higher-level features, though I guess that depends on the pretext task. They also investigate this quantitatively, but I just wanted to highlight how good it is after only the self-supervised step. Then they do the clustering, and they could already evaluate at this point, because after this step they have a clustering: they've pulled the neighbors together and they have a neural network that assigns classes. They do evaluate it, but it's not good enough yet, so there is a third step: fine-tuning through self-labeling.

Self-labeling is exactly what it sounds like: you label your own data with your own classifier. That might sound outrageous; you might say, wait, if I label my own data and train a classifier on those labels, won't I just get the same thing back? The answer is no. First, whatever decision boundary your classifier happens to have, retraining on the self-labels will tend to move it, because most classifiers maximize the margins between classes. Second, they only use the points the network is actually confident about: we're very confident about this one and that one, much less so about the ones near the boundary, so we only learn the new classifier from the confident ones (you could also weight them, but they go by a confidence threshold). You can see this in the final algorithm, "Semantic Clustering by Adopting Nearest neighbors", their SCAN algorithm. Step one: solve the pretext task T, which is the self-supervised representation learning, by optimizing the neural network on it. Step two: determine the nearest-neighbor set for each X. (They also use heavy data augmentation here, and again in the third step, the self-labeling; there are a lot of tricks in this pipeline, but the base algorithm goes like this.) Then, while the clustering loss decreases, you update the clustering network with the loss we saw: make the nearest neighbors' assignments agree while keeping the entropy of the overall assignment high. After that, while the length of Y increases, where Y is the set of data points whose confidence is above a certain threshold, you filter the dataset down to those confident points.
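A sketch of this self-labeling step could look like the following (my illustration; the threshold default is an assumed value, and the real method trains on strongly augmented versions of the confident samples):

```python
import torch
import torch.nn.functional as F

def self_labeling_step(model, images, optimizer, threshold=0.99):
    with torch.no_grad():
        probs = F.softmax(model(images), dim=1)
        confidence, pseudo_labels = probs.max(dim=1)
        mask = confidence > threshold              # keep only confident samples
    if mask.sum() == 0:
        return None                                # nothing confident enough yet
    logits = model(images[mask])                   # in practice: augmented versions
    loss = F.cross_entropy(logits, pseudo_labels[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```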
That filtered set is your dataset Y, and you retrain, basically fine-tune, the same neural network with a cross-entropy loss on your own labels: not real labels, but the cross-entropy between the network's assignments and those filtered pseudo-labels. So you do essentially the same task, but filtered by confidence, with a threshold of, I think, 0.7 or so.

Now to the experiments. They do ablations to find out where the gains in their method come from, and we'll go through them quickly. If they just do the self-supervision at the beginning and then run k-means on top of it, they get 35.9% accuracy on CIFAR-10, so not very good; you can't just cluster on top of these representations and be done. If they use what they call the sample-and-batch entropy loss, which basically means you ignore the nearest neighbors and only pull an image towards itself and its augmentations, it also doesn't work; I wouldn't pay too much attention to whether those numbers are 10, 20 or 30, it just doesn't work. If you use the SCAN loss, you suddenly get into a regime with actual signal, significantly above random guessing. Strong data augmentation (as I said, a lot of this is in the tricks, which augmentations you use and so on; never forget that these papers put in every trick they can on top of the core idea) gives another 10% or so, and the self-labeling step adds roughly another 10% on top of that, ending around 83.5% without ever seeing labels. That's fairly respectable, though keep in mind there are only ten classes here; they do ImageNet later. They also investigate which self-supervision task to use at the beginning, comparing things like RotNet, feature decoupling and noise contrastive estimation, and noise contrastive estimation turns out best. NCE, roughly, is where you input an image together with versions of it augmented in various ways and learn to group them together; these methods have been very successful in the last few years.
They have various further investigations into their algorithm. I want to point out the plot of accuracy versus confidence after the complete clustering step, which is what the third step, the self-labeling, builds on. You can see that as the network's confidence goes up, the actual accuracy goes up as well: the network really is more confident about the points it classifies more accurately. There is a correlation between the network's confidence and the true label of a point, which is remarkable given that it has never seen the labels. But also notice how small the range is: with the standard augmentation it only goes from here to here, so where you set the threshold is fairly important and might be quite brittle. You need to set it so that some points are below and some above, and you don't want to pull in points from the low-confidence region, because there you only have the correct label for, say, 75% of them, and if you self-label and train on those, you learn the wrong signal. This step seems fairly brittle to me, honestly.

They also investigate how many nearest neighbors you need, the number K: with zero neighbors you do a lot worse than with, say, five, so the jump from zero is fairly big on all datasets, but after that it doesn't matter much; five nearest neighbors seem to be enough for most purposes. They also show that when they remove the false positives, their algorithm converges to the correct clustering, which is not surprising: if you remove the wrong samples, the remaining ones are right. I think it mostly shows that the method doesn't go into some kind of crazy downward spiral, but it's still kind of funny. Then they compare against previous methods, including things like k-means, GANs and DeepCluster, which we talked about, and they improve by quite a lot. Their method already gets fairly close to good accuracy, around 88.6% on CIFAR-10 without seeing the labels, which is fairly remarkable.

Then they move on to ImageNet, which of course has way more classes: a thousand, compared to CIFAR-10's ten. Clustering ten fairly distinct classes might work with various techniques, but a thousand classes is much harder, so they subsample to 50, 100 and 200 classes and get decent accuracy: 81% for 50 classes, where a supervised baseline would get 86%, and 69% for 200 classes, where a supervised baseline would get 76%. So it's there, and that's quite remarkable for this still small number of classes. They also find that if they look at the samples closest to the middle of each cluster, they get these prototypes.
Note that many ImageNet images show only part of the object, whereas with the prototypical examples you really get a centered, clear shot of the object with clearly visible features. This again suggests that the clustering really does pick up on semantic information; the class names shown here come from the test label set, of course, which the network can't know. Then they go to the full thousand classes, and there it doesn't really work, possibly because there are just too many confusions. But they show the confusion matrix of their method, and it is pretty much block-diagonal along these super-clusters: the network confuses dogs with other dogs, and insects with other insects, but not really across those groups, which is still quite remarkable. Then again, you'd probably get something similar from a lot of methods, so I don't know how different it would look elsewhere, but it's certainly interesting to look at.

One last thing: what if we don't know how many clusters there are? So far, they write, we have assumed knowledge of the number of ground-truth classes, and the model's predictions were evaluated using the Hungarian matching algorithm (we already saw this in the DETR paper by Facebook, if you remember). What happens if the number of clusters does not match the number of ground-truth classes anymore? Table 3 reports the results when they overestimate the number of ground-truth classes by a factor of two, so twenty clusters for CIFAR-10 instead of ten, and you can see in the bottom rows that there is a drop in accuracy. Now, they don't actually say how they do the matching when they over-cluster. If I have, say, six clusters but need to assign them to three classes, do they still use the most optimistic matching, where you consider the possible assignments and give every cluster the benefit of the doubt? If you imagine over-clustering to the point where every image is its own cluster and you evaluate with the most generous possible matching, you would get one hundred percent accuracy. So with over-clustering I would almost expect a better score, because the matching becomes more generous; that is counteracted by the fact that you can no longer group together things that share features because they belong to the same class, so there are two forces pulling here, but I was still a bit surprised that the number goes down. In my opinion, this matching-based evaluation somewhat breaks down when you have more clusters than classes; it's interesting that you can simply overshoot, but then you need some sort of heuristic to reconcile the extra clusters.
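Purely as an illustration of that point (a hypothetical heuristic, not necessarily what the paper does): a many-to-one evaluation that maps every predicted cluster to its majority true class becomes more generous the more clusters you have, and trivially reaches 100% when every image is its own cluster.

```python
import numpy as np

def majority_vote_accuracy(y_true, y_pred):
    correct = 0
    for cluster in np.unique(y_pred):
        members = y_true[y_pred == cluster]
        correct += np.bincount(members).max()   # cluster inherits its majority class
    return correct / len(y_true)
```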
In any case, I think this paper is pretty cool: it brings together a lot of things that were already present and introduces this multi-step approach (and there are lots of sample clusters further down in the paper, by the way). What you have to keep in mind is that there are a lot of hyperparameters in here: the confidence threshold, the number of classes, the architectures, the chosen data augmentations, and so on, and all of this has been tuned to make these numbers as high as possible. So interpreting this as "look, we can classify without knowing the labels" is true in this particular case, but the hyperparameter choices of the algorithm are all informed by the labels. It is still very unclear how this method would actually work when you really don't have labels and have to choose the hyperparameters in the absence of any ground truth; the future might tell, if they continue to work on this. All right, thanks for listening and watching, and for bearing with me through my wrestling with basic math in this video. I wish you a good day, and bye bye.
Info
Channel: Yannic Kilcher
Views: 30,406
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, ethz, clustering, self-supervision, self-labeling, entropy, dot product, representation learning, cnns, convolutional neural network, deep cluster, nce, noise contrastive estimation, unsupervised, overcluster, imagenet, cifar10, nearest neighbors
Id: hQEnzdLkPj4
Length: 45min 34sec (2734 seconds)
Published: Wed Jun 03 2020