VOS: Learning What You Don't Know by Virtual Outlier Synthesis (Paper Explained)

Video Statistics and Information

Captions
[Music] Outliers. We all know them, we all hate them. How can these data points just be out of distribution, not in the training data, things we haven't seen before, things we don't even expect? Well, they suck. So today we're going to look at what you can do about it. Specifically, we're going to look at the paper "Learning What You Don't Know by Virtual Outlier Synthesis". This paper presents a technique to generate what it calls virtual outliers, which are synthetic data points that are out of distribution. The core idea is that rather than trying to come up with data-space out-of-distribution samples, this paper comes up with latent-space out-of-distribution samples, which is much easier and much more useful. They then design a loss that pushes up the energy of the model wherever the outliers are and pushes it down wherever the data is. This paper is really interesting because it presents very successful results on a multitude of benchmarks, so this technique definitely looks like it works. However, when I read the paper I was quite critical: I had a lot of criticisms and a lot of open questions, and that's why I've invited the authors for an interview to the channel. So this video right here is a comprehensive paper review. I'll explain in detail what is in the paper, what the method does, what its contributions are, what its experimental results look like, what is good about it, and what I think is bad about it. Then, in the next video, released tomorrow, I'll interview the authors of the paper. The authors will have seen my review and are therefore able to respond to any criticism and any questions that I had. So be sure to check out the interview part as well, because it was really cool to get all my questions answered. As always, let me know how I can improve these videos by leaving a comment, leave a like if you do like, and I'll see you around. Bye bye. Do you have audio of someone talking? Do you want that transcribed? Boy, do I have the product for you. AssemblyAI builds accurate speech-to-text APIs, which means that developers can use these APIs to automatically transcribe and understand audio and video data in just a few lines of code. This works in the traditional way, where you upload audio and get back the transcription, but they can also do it in real time: you get a websocket to their neural-network-powered backend, and in real time it gives you back text for your speech. That's insane. But this is not all; they have a ton of features on top of that. For example, they can do summarization, topic detection, bad-word detection, and content moderation in your audio, and I have to say this is really good. In fact, I have uploaded this video right here to their APIs, and the text you see on screen is the raw output of that model, so judge for yourself how good it is. We'll actually try some Swiss German words on it. It is an English model, but we'll just give it a shot. Half a hass. Oh well, isn't that great? So give them a try. They even have a basic free tier, their documentation is super extensive, they give you walkthroughs and examples of all the parameters that you can send, and they have a great blog where they describe different feature sets and different ways of applying their technology. And yeah, it's a really cool thing. Now, I've only scratched the surface right here; they do much more, they have features upon features, but it's best you check them out yourself. So thank you very much to AssemblyAI for sponsoring this video, it's really great, please check them out, a link is in the description, and I wish you a lot of fun.
[Music] Hello there. Today we'll look at "VOS: Learning What You Don't Know by Virtual Outlier Synthesis" by Xuefeng Du, Zhaoning Wang, Mu Cai and Yixuan Li. This paper presents a model that can do out-of-distribution detection in object detection networks — and not only in object detection; they show it on object detection, but it is a general framework for detecting out-of-distribution data at inference time. If this really works, it could mean a lot, especially for safety-critical applications: networks that are deployed as a classifier or a detector somewhere would be able to recognize accurately when they are presented with something they didn't learn at training time, like some out-of-distribution class. In this particular case, on the left here you see an image from an object detection network at inference time. It has correctly recognized the car on the right-hand side; however, it thinks that the moose here is a pedestrian. It doesn't even classify all of the moose, but it recognizes there is an object, and the class is pedestrian, probably because it hasn't seen mooses — meese? what's the plural of moose? — in any case, it hasn't seen a moose or multiple moose at training time, and therefore it cannot classify it. And very often these networks make very high-confidence predictions for classes that they haven't seen. This paper tackles this and proposes the technique called virtual outlier synthesis, which we'll get to in a second. As I said, it's a general framework; they demonstrate it on object detection, which is a particularly hard task, but it could also be applied to image classification. They do make the point that if you have an image like this and you haven't seen the moose class during training, most of the image will still be in distribution — this will not be a particularly out-of-distribution image, except for that small part with the moose. However, if you do object detection, then the object itself here is out of distribution, and maybe that actually makes their task as researchers a bit easier, because they are less often in these ambiguous cases where half the data point is out of distribution. In any case, they mention here that the networks we currently have often struggle to handle the unknowns, and they assign high posterior probability to out-of-distribution test inputs. Now, why might that be? If you train a typical classifier, the classifier will just attempt to separate classes from each other. You see this here in the middle: this is a projection of the last layer of a neural network right before the classifier layer, so right before the softmax. All the classification layer can do is lay linear decision boundaries through the distribution of data points. So what the model does is it sees three classes right here — this is class one, this is class two, this is class three — and what it needs to do is linearly separate them. So it says, well, okay (this is not an ideal color for this), I'm going to just put my decision boundaries like this, and now I've essentially separated the classes, because all that is important to a classification loss is that points in class three are away from points in class one and away from points in class two. That also means that the further away from classes one and two I go, the better — the more likely it is to be class three — because all I've ever seen at training time is samples from class three, and my entire objective was just to push it away from, to discriminate it from, class one and class two. So obviously, if I go further in the direction of class three, the network will output a more and more confident number about this being class three, even though, as you can see, the data is all in this region right here, and out there there is no data, yet the network is still very, very confident.
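To make that concrete, here is a tiny numerical sketch (my own toy example, not from the paper) of how a linear classifier head only gets more confident the further a feature vector moves along a class direction, even into regions where there is no training data at all:

```python
import torch

# Toy 3-class linear classifier head (hypothetical weights, purely for illustration):
# each row of W is the direction the classifier associates with one class.
W = torch.tensor([[ 1.0,  0.0],   # class 1
                  [-0.5,  1.0],   # class 2
                  [-0.5, -1.0]])  # class 3

# Walk further and further along the class-1 direction, far past any training data.
for scale in [1.0, 5.0, 20.0]:
    h = scale * torch.tensor([1.0, 0.0])        # penultimate-layer feature
    probs = torch.softmax(W @ h, dim=0)
    print(f"distance {scale:5.1f} -> p(class 1) = {probs[0].item():.4f}")

# prints roughly 0.69, 0.999, 1.0: the softmax confidence only grows with distance
# from the decision boundaries, regardless of whether training data was ever there.
```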
Red here means quite confident. An ideal situation would be if the network was very confident where the training data is, right here — we'd still have the decision boundaries like this — but if you go further out, it would say something like: wait a minute, even though this is not class one for sure and not class two for sure, and it's most likely class three, I haven't seen any training data around that area, so I'm just going to output a low probability or a low confidence score. I'm going to say it's class three, but I'm going to assign it a low confidence, because I haven't seen actual training data in that vicinity. Now, this all seems intuitive and makes sense, but that is mostly because low-dimensional and high-dimensional data are very different, and a very simple projection like this can deceive you. As a human, you see this data and you go, of course, that makes total sense; however, it becomes very different if you look at high-dimensional data. Note that there is a reason why our classifiers do the thing on the left: the thing on the right essentially amounts to a probabilistic model of the data distribution. The thing on the right has an idea of where all the data is; the thing on the left just needs to separate data points from each other — three lines are enough for that. The thing on the right actually needs to model the data in the latent space, which can become pretty complicated in high dimensions, and it needs some very distinct assumptions to make it tractable. So the right thing is essentially a generative model of the data, a distributional model of the data, which needs a lot more resources and power and could pull resources away from the classification task to be solved. So what does this model do? First of all, they have some notation right here, which I found to be — well, let's just first look at the diagram. This is the whole model architecture. They have an input over here, an input x (I'm going to use the green highlighter for this stuff); you can see this is the input image. In general, first you have this proposal generator, and that proposal generator will generate bounding boxes. Some of these detection networks have two stages: first proposal generation, and then a sort of post-processing stage where they assign labels to the proposals. The proposal generator would simply ask: where are objects, any sort of object? The objectness property sort of generalizes between objects, so it makes sense to train the object detector to just predict where the bounding boxes are. In this case it will predict, well, there's an object here and there's an object here, and then it will pass on those two to the classifier to determine what's in the bounding boxes. You can already see the object detector has done a good job: it detected that this thing right here is an object. However, the classifier — what can it do? It has to assign a label. There is no option for it to say, no, actually this isn't an object.
Previous methods have tried this: they've just added an extra class for "outlier". It usually doesn't work too well, and the reason is pretty simple. In order to do that, here on the left you'd have to introduce another line and say, okay, I'm going to introduce another line (I'm running out of colors here) right here, and this would now be outlier space. Well, that doesn't cover this region, or this region, or the region back here. So having a single class for outliers is sort of useless, because there are just so many places where outliers could be, not just a single slice of the space. You'd actually have to have a lot of them, and ultimately that amounts to exactly the situation on the right, where you're going to train a classifier that thresholds between low- and high-density areas — and that's exactly a generative model of the data. All right. The first stage is the bounding box proposal, this thing right here. Then you pass on the bounding box to multiple things. First of all, there is a loss that's simply concerned with: did you detect the objects correctly? During training, the proposal generator would simply be trained with that loss right here. Now, everything here is back-propagated, obviously, but that would be the main loss to localize the bounding boxes. The second stage here would be the assignment of a label. This would be the so-called classification head, which takes the latent representation that is generated, including the bounding box. So we're going to feed this through a neural network, and that will give us a latent representation, this h thing — that's what they call the latent representation right before the classification layer — and the classification layer would assign a label to it. That would be the normal way of doing things, and now we augment that by a bit. They formulate this by saying we have a data set; the data set contains x, the data, b, the bounding boxes, and y, the labels. So b and y would be the labels, the things to predict. Then they say they split it up into two things: first the probability of the bounding box and then that of the label. And I don't think that's correct — I think that's a typo right here: I think this should be the probability of the bounding box given x, not the label, and this should probably be the probability of the label given x as well as the predicted bounding box. Let's call this b-hat right here, the predicted bounding box, so b-hat would be sampled from this. But this is minor, because the rest of the paper essentially treats it as I think it should be written. In any case, what they do in addition to that is they also have this classifier right here, a classifier that takes in a sample and the bounding box and tries to predict this number g, where g is one if the object is in distribution and g should be zero if it's out of distribution. So this is a binary classifier that classifies any sample into in- or out-of-distribution, independently of what class the classification head says it is. That would amount to the situation on the right, where if you're anywhere in this region right here, the classifier would still say, well, that's clearly class three, because that's the region of class three, but your other classifier would say, yes, but the outlier probability is very high and the inlier
probability is very low for that region. So you can do outlier detection at inference time. Now, how do we do this? We do this by generating these virtual outliers during training. Virtual outliers are essentially outlier data points that you synthesize. Now, what you could do — and they mention this — is simply train a generative model of the data and then use that to sample out-of-distribution data. However, they mention that synthesizing images in the high-dimensional pixel space can be difficult to optimize; instead, their key idea is to synthesize virtual outliers in the feature space. The feature space is: if you have your image (let's just talk about a classifier), you feed it through a bunch of neural network layers, and then here is the last layer, and all you do at the end is have a classification head that classifies it into multiple classes. This right here is just described by a matrix W — just a linear layer that goes from the number of features, say d, to the number of classes c; that's the dimensionality. So this space at the end, right here, is the space we've seen in the diagrams up there, and here is where we would sample the virtual outliers. What we would do is look at our training data: where does our training data fall? We say, aha, okay, there are classes one, two and three, as we had it. Then we build a Gaussian mixture model of the training data. Essentially, we'd assume that each class is described well by a high-dimensional multivariate Gaussian — they all share the covariance matrix, by the way — and then we would say, well, okay, given that that is the case (which ends up at the situation on the right), we'd sample data points from the outskirts of those Gaussians, points that have a sufficiently low probability. These would be the virtual outliers. We would sample them where our Gaussian mixture model says there is essentially no data, but still sample according to the Gaussians — we're not going to go way out here into undefined space just because it's in our support set; we're still going to sample from these Gaussians, but we're going to sample until we get a sample that has a very low likelihood. So we're deliberately going to sample outliers from these Gaussians, and those are going to serve as samples for our outlier classifier. The outlier classifier then needs to find a decision boundary between these virtual outliers and the data — you can see it drawn right here. There's going to be a decision boundary, and you can see this decision boundary gets quite a bit more complicated than the decision boundary between the classes, especially given that we do it in the last layer. So we'll go on in the paper a little bit; what we just said is going to come up in a second. They say: we assume the feature representation of object instances forms a class-conditional multivariate Gaussian distribution, and they state this right here. So every class has a mean, and all the classes share a covariance matrix. They don't learn these things, they just calculate them from the training data in an online fashion. This is in the penultimate layer of the neural network, as I just said. They compute the empirical class mean and covariance of training samples, and they do this in an online estimation fashion, which means that as they train the network, they collect the training data and compute these statistics on the fly so they're always up to date.
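As a rough sketch of what this estimation and sampling could look like, assuming you have already collected penultimate-layer features and labels (the paper keeps per-class feature queues and updates the estimates online during training, and phrases the selection as a density threshold epsilon; keeping the lowest-density candidates from a large pool is one way to realize that — all function names here are mine, not the paper's code):

```python
import torch
from torch.distributions import MultivariateNormal

def fit_class_gaussians(feats, labels, num_classes):
    """Empirical per-class means and one shared covariance matrix,
    estimated from penultimate-layer features (computed offline here
    for simplicity; the paper estimates them online during training)."""
    d = feats.shape[1]
    means, centered = [], []
    for c in range(num_classes):
        fc = feats[labels == c]
        means.append(fc.mean(dim=0))
        centered.append(fc - means[-1])
    centered = torch.cat(centered, dim=0)
    cov = centered.T @ centered / centered.shape[0]
    cov = cov + 1e-4 * torch.eye(d)          # keep the covariance well-conditioned
    return torch.stack(means), cov

def sample_virtual_outliers(means, cov, class_idx, n_candidates=10000, n_keep=100):
    """Sample many points from the class-conditional Gaussian and keep only
    the lowest-likelihood ones, i.e. points near the boundary of the class
    distribution -- these act as the virtual outliers."""
    dist = MultivariateNormal(means[class_idx], covariance_matrix=cov)
    candidates = dist.sample((n_candidates,))
    low_density = dist.log_prob(candidates).topk(n_keep, largest=False).indices
    return candidates[low_density]
```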
They do say here: we assume the feature representation is this Gaussian, see figure 3. And figure 3 is a UMAP visualization of feature embeddings of the Pascal VOC data set. I'm not sure what they mean by "look at figure 3": this is a UMAP, a non-linear projection into low-dimensional space — I'm not exactly remembering what UMAP does, but for sure this is a projection. This doesn't convince me that the data is Gaussian; it convinces me that the data is kind of in one place-ish, or that most of the blue points are closer to each other than they are to, for example, the green points here. That is what is convincing to me from this graphic. It is not at all convincing that in the original high-dimensional space they come from, they somehow form a cluster, or a Gaussian even, or that all of these classes would have the same covariance matrix even if they were Gaussians. So that is a wild assumption, but it seems to work: the results of the paper are that they are very good at this outlier detection, they reduce false positive rates by a lot. I'm just saying this figure does not convince me — or maybe I don't understand UMAP, maybe there is something to it. So here is where they say they sample the virtual outliers from, in this feature representation space, using the multivariate distributions. They would simply sample the virtual outliers from the Gaussians, but then evaluate them and only take them if their likelihood is smaller than some epsilon; they say it's sufficiently small so that the sampled outliers are near the class boundary. These outliers would then be converted to the output by the classifier matrix, so this would be the output of the classifier head. Now, that is how they sample the outliers, and all good so far, but I have a few concerns right here. For example, what you're going to teach the model is this: if in the last layer before the classifier there is a data point, and that data point is not where the training data is, then, if this model works, it will recognize it as an outlier. What will not work is the case where, for example, an earlier layer already confuses that moose right here with something else. If an earlier layer thinks, oh, this has four legs, it probably looks like a dog, then the moose will come to lie really inside of the dog class, because it would have the features of a dog, with which the lower layers confused it. You'd have had to apply this technique in one of the lower layers, where you could see that this is an outlier — but the lower the layers you go, the less your data looks like a Gaussian. Ultimately you'd have to do it in the input layer, and there it becomes clear that this is just a distribution of the data that you're trying to approximate, and in the input layer this is certainly not a Gaussian at all. So I think this only works for specific outliers: if there is an outlier that, as I say, has the same features as some in-distribution data, such that in the last
layer it lands inside of this cluster, then this method will not be able to detect it. That is kind of my one concern. The other concern, as I've already said, is that separating these outliers is naturally a harder task, because it essentially amounts to a generative or distributional model of the data rather than just a discriminative classifier. So how are they incorporating this into training? Up here we have our loss for the localization and we have a classification loss, which is fine, is good — the classification loss tells us if we got the class correctly — but we still need a third thing, which is this uncertainty loss. We are going to estimate the uncertainty, which is going to be our measure of how much the model thinks that this is an out-of-distribution data point or not. And how are they doing it? They are using the log partition function for that. The log partition function is this thing right here; it's essentially what is at the bottom of the softmax, if you use a softmax for classification. So if f here is the logit of class k — the output of your classifier — and you do a softmax across your logits in the last layer, the softmax would have the class-y term at the top and the sum over all the classes at the bottom; the log of that bottom part is the log-sum-exp over all the logits. The bottom right here is kind of a measure of how peaky your distribution is: if one of your logits is just standing out heavily, then that indicates low uncertainty — you're quite sure about what you're doing — and if all the logits are kind of the same, they're all more even, which indicates uncertainty. So this measure is an indicator of certainty, and it was already shown to be an effective uncertainty measurement for out-of-distribution detection. What we're going to do is use this in an uncertainty loss right here. We're going to have a logit-based loss: we're going to use a sigmoid, and we want this measure right here — one term is the logit and one is one minus the logit, I can't remember which one is which — in any case, we want this measure to be high for in-distribution data and low for out-of-distribution data, or the other way around: we want the uncertainty to be high for out-of-distribution data and low for in-distribution data. By the way, the negative of the log partition function is called the free energy — sorry, I forgot to mention that — which makes some connections to other fields of science. So if we get a data point, we're not going to plug it into the full classifier, but just into this bottom part of the classifier, to measure whether the distribution we're getting is very certain or very uncertain. And what we want is that if we have a true data point, the uncertainty should be very low, and if we have a fake data point, the uncertainty should be very high. So by adding this loss right here, what this does is it trains our classifier to be more certain if the data point is real and less certain if the data point is fake.
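Here is a hedged sketch of the free energy score and a sigmoid-based uncertainty loss as described above (the paper's actual formulation is a bit richer — I believe it adds a small learnable transform on this score before the logistic loss — and the function names here are my own):

```python
import torch
import torch.nn.functional as F

def free_energy(logits):
    """Negative log partition function: E(x) = -logsumexp_k f_k(x).
    Peaky logits (one class standing out) give low energy / low uncertainty."""
    return -torch.logsumexp(logits, dim=-1)

def uncertainty_loss(id_logits, outlier_logits):
    """Binary, sigmoid-based loss on the energy: real (in-distribution) features
    should get a high certainty score, virtual outliers a low one. Simplified
    relative to the paper, which learns extra weights on the energy score."""
    scores = torch.cat([-free_energy(id_logits), -free_energy(outlier_logits)])
    targets = torch.cat([torch.ones(id_logits.shape[0]),
                         torch.zeros(outlier_logits.shape[0])])
    return F.binary_cross_entropy_with_logits(scores, targets)

# The total training loss then mixes this with the usual detection losses,
# e.g. loss = loc_loss + cls_loss + beta * uncertainty_loss(...), where beta
# is the weighting hyperparameter mentioned in the video.
```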
Ultimately, that will result in decision boundaries like this, or certainty estimates like this on the right here. The certainty estimate on the left would just be what you get if we train only the classifier objective: the thing gets more and more certain as we go away from the classification boundaries, if we look at this certainty measure. Now we explicitly train the model to only be certain around the data and to be very uncertain around all the virtual outliers — that's why you see blue anywhere away from the data; we explicitly train the model to do that. So our uncertainty classifier that we talked about — where was it, this thing right here — is not in fact an additionally trained model; it is simply us plugging a data point into this uncertainty measure, and during training we make sure that this measure is low for fake data and high for clean data. Now, this uncertainty loss, if I see this correctly, will initially only directly affect this parameter set right here: since we only generate the fake data in the last layer, the only parameters that are really directly affected by this loss are the classification weights right here. However, implicitly, by saying that the true data here must have a high certainty, or a low uncertainty, and by contrasting this with the fake data in the last layer, it may also be that through backpropagation the entire network is shaped such that the latent space becomes more optimal for doing this classification. I cannot conceive super well how all the effects and counter-effects are going to work out, but it would be interesting to think that through a bit more clearly. So what we end up with is a probabilistic score for out-of-distribution detection. Our loss is going to be a mixture of the classification and localization losses, with the uncertainty loss added, weighted by a given hyperparameter. And this is going to be our detector: for in-distribution detection, we simply take an inference sample, take the predicted bounding box, plug it into this uncertainty estimate right here — this here is the free energy — plug that into the sigmoid formula, and that will give us one if the classifier is very certain and zero if it's very uncertain that this is in-distribution data. We can define a threshold, and that's going to be our out-of-distribution classifier. So that's it for the method. They go through a bunch of results; I'll shorten the results by saying they're just very good at everything — on the data sets they try, against the baselines. They do ablations, and particularly noteworthy, for example, is the false positive rate here, where lower is better: you can see that if they were just to add an outlier class, this would hurt the performance quite a bit, more than other modifications right here, which I found interesting to see. They compare against other outlier detection methods, and they do have, I believe, some samples right here. Needless to say, I have my concerns, but it does work pretty well. And I'm just a person that looks at this paper for the first time, hasn't worked in this field at all, and hasn't tried anything, so I'm going to give the right of way to the authors right here. But let me know what you think, and I'll see you next time. [Music]
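Going back to the detector described just before the results, here is a small sketch of the inference-time decision rule (the threshold handling and names are my own illustration, not taken from the paper):

```python
import torch

def id_score(logits):
    """Sigmoid of the negative free energy: close to 1 when the model is
    certain (looks in-distribution), close to 0 when it is uncertain.
    Since the sigmoid is monotone, thresholding this is equivalent to
    thresholding the free energy directly."""
    return torch.sigmoid(torch.logsumexp(logits, dim=-1))

def is_in_distribution(logits, threshold):
    """The threshold would be picked on held-out in-distribution data,
    e.g. so that some fixed fraction of it is accepted."""
    return id_score(logits) >= threshold
```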
Info
Channel: Yannic Kilcher
Views: 13,923
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, paper explained, virtual outliers, how to detect outliers, deep learning outliers, deep learning outlier detection, vos, deep learning energy, latent space outliers, density estimation, classification boundaries, generative models
Id: i-J4T3uLC9M
Length: 35min 57sec (2157 seconds)
Published: Sun Mar 13 2022