#040 - Adversarial Examples (Dr. Nicholas Carlini, Dr. Wieland Brendel, Florian Tramèr)

Captions
Hello folks, and welcome back to the Machine Learning Street Talk YouTube channel and podcast. Today we have an absolutely fascinating conversation about adversarial examples. The TL;DR is that we are screwed; we have fewer options than a Welsh fish and chip shop. Enjoy! By the way, remember to like, comment and subscribe. We love reading your comments; your comments are always amazing, and I think we punch above our weight in the comments section. Peace out.

Today we are speaking with three of the currently leading researchers in adversarial examples: Florian Tramèr, a PhD student from Stanford; Dr. Wieland Brendel, a machine learning researcher from the University of Tübingen; and Dr. Nicholas Carlini, a research scientist at Google Brain.

Adversarial examples were first demonstrated around 2013 and 2014, and they are easy to grasp: you can make a tiny change to an input data point, like an image, and completely fool a classifier, yet the change is so small that it's imperceptible to humans. A lot of mysticism has arisen around these adversarial examples, and we haven't gotten much farther since then. There is a lot of speculation about why adversarial examples exist and what exactly they are. Many people hypothesize that it has something to do with the way we build models, or the way we train them, or the input representation, or any of those things. A landmark paper from MIT, however, showed pretty convincingly that adversarial examples might actually be a property of the data itself. They demonstrated that they could separate any data point into its so-called robust features and its non-robust features; a data point usually has both kinds of features on top of each other. These researchers split those features apart and showed that you could actually train classifiers on each set of features separately. Now, the interesting thing is that whether you train on the robust or the non-robust features, both classifiers will perform very well on a test set, which means that those features generalize. Interestingly, however, if you train on the robust features, the resulting classifier will be robust to adversarial examples, whereas if you train on the non-robust features, it will not be. This is very convincing evidence that there are features in the data that generalize very well but are so small that we humans just don't usually use them to classify. In another experiment, these researchers took images and made adversarial examples out of them, thus changing what the classifier thinks their label is, even though humans would still assign the original label. They then trained another classifier on the adversarial examples, but instead of supplying the old labels they simply supplied the new labels. It turns out that this classifier now correctly classifies the original test set. That means the classifier has learned to identify the features that were added by the adversarial example generation process, which is again a step towards understanding what exactly makes an adversarial example adversarial and why they exist. So how many adversarial examples are there? Well, almost infinite; for all intents and purposes you can think of it as being infinite. So this is such an incredibly difficult problem.
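To make the idea of "a tiny change that fools the classifier" concrete, here is a minimal sketch of how such perturbations are typically found in the white-box setting: essentially gradient ascent on the classifier's loss, which is how the guests describe white-box attacks later in the conversation. The randomly initialised ResNet-18, the random "image", the label, and the pixel budget `eps` are illustrative assumptions for the sketch, not the setup of any paper discussed here.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

# Illustrative setup: an untrained ResNet-18 and a random "image" in [0, 1].
model = resnet18(num_classes=10).eval()
image = torch.rand(1, 3, 224, 224)
label = torch.tensor([3])  # class we want the model to move away from

def pgd_attack(model, x, y, eps=8 / 255, step=2 / 255, iters=10):
    """Iteratively maximise the loss while staying inside a small L-infinity ball."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Step in the direction that increases the loss ...
        x_adv = x_adv.detach() + step * grad.sign()
        # ... then project back into the eps-ball and the valid pixel range.
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0, 1)
    return x_adv.detach()

x_adv = pgd_attack(model, image, label)
print("max pixel change:", (x_adv - image).abs().max().item())  # at most 8/255
print("prediction:", model(image).argmax(1).item(), "->", model(x_adv).argmax(1).item())
```

With a budget of 8/255 per pixel the perturbation is invisible to a human, yet on trained classifiers this kind of procedure flips the predicted label for almost any input.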
So what are the main defense mechanisms? Well, the first one, which we all know about when we create robust machine learning models, is data augmentation. Data augmentation is when you take the existing data set you have and perform semantically equivalent transformations: you might apply rotations or reflections, or change the colour histogram, or something like that. You're trying to maintain the semantic meaning of the image or the example while randomizing over all of the other things that the model might be overfitting on. That seems like a fairly blunt instrument. Actually, the best thing you can do is just add more data, right? More natural data is better. But then you have to ask yourself why it is better, because it might be better in the sense that it's more natural data, but it's also worse in the sense that you're just adding even more adversarial examples; at least with data augmentation you're reducing the number of adversarial examples that you already have inside your data set. And of course adversarial training is almost like the worst-case scenario of data augmentation, because what you're doing is finding adversarial examples and then augmenting on them. You're telling your model, "don't look at those features, we don't want you to look at those features." But the problem is that it's a completely random process, because how can you direct your machine learning model to find features that are good versus not good? With adversarial training you're just stabbing in the dark: you might find a few adversarial features, but what about all of the other uncountably many adversarial features that the model might overfit on? The other thing you might want to do is randomized smoothing, where you just add a load of random noise to the data before you train on it. Generally speaking, what we're saying here is that when you shift inputs or add different types of natural noise, it shouldn't matter; you should train your model to be robust to all of those different things. But I think the main point here is that we're kind of on the road to nowhere: there's basically nothing we can do about this problem. This is a huge problem, and there's almost nothing we can do about it. Just let that sink in for a second.
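To illustrate what "the worst-case scenario of data augmentation" looks like in practice, here is a minimal sketch of an adversarial training loop: an inner attack (repeated here so the snippet is self-contained) generates worst-case inputs, and the outer loop trains on them. The random tensor data, the model choice and the hyperparameters are placeholders, not the exact recipe of any paper discussed in this episode.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet18

# Placeholder data: random images and labels standing in for a real dataset.
data = TensorDataset(torch.rand(64, 3, 224, 224), torch.randint(0, 10, (64,)))
loader = DataLoader(data, batch_size=16, shuffle=True)

model = resnet18(num_classes=10)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def pgd(model, x, y, eps=8 / 255, step=2 / 255, iters=7):
    """Find a perturbation within the eps-ball that maximises the loss."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        grad = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)[0]
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0, 1)
    return x_adv.detach()

for epoch in range(1):
    for x, y in loader:
        # Inner maximisation: craft adversarial examples against the current model ...
        x_adv = pgd(model, x, y)
        # ... outer minimisation: update the weights on those worst-case inputs.
        opt.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        opt.step()
```

The key design point is that the attack is re-run against the current weights at every step, so the model is always penalised for relying on whatever brittle features it is currently using within the chosen perturbation set.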
Okay, so let's talk about what these things actually are. Remember we were saying there's a dichotomy between robust features and non-robust features, and a lot of this is actually to do with the magnitude of the feature. I don't mean how strong the feature is as a representation; I mean how easy it is to mutate this feature into something else. A great example of this is the fur texture of a cat. A cat will have shape-like features and texture-like features, and it's very difficult to change the shape: you have to do so much to the pixel mass to change the shape of a cat into the shape of a dog. But the fur is completely different; that's why it's called a low-magnitude feature, because you don't need to do much to it to change it. So as long as your features include these low-magnitude features, which can easily be perturbed, you are vulnerable to adversarial examples. You might then ask the naive question: if we don't want the models to learn the cat fur, because it can be attacked so easily, why don't we just remove the cat fur and only train on the shape? Well, the problem is that it's an incredibly good feature; the magnitude of a feature is orthogonal to its predictive properties. So if we remove the cat fur, then all of a sudden our classifiers don't have very high accuracy anymore. There's a fundamental trade-off between adversarial robustness and the predictive accuracy of our classifiers.

Another thing that I was thinking about, and I asked this question on the show, is what we can do as security professionals to change the attack surface of our models, and what knobs and levers we should have to pull and tweak. It reminded me of the session we did with Lena Voita last week, where she was saying that machine learning is compression, and some data sets have strong regularity, which means they can be expressed using very few examples. Is that a good thing or a bad thing? Because if you can express your data set with a few examples, presumably that means fewer adversarial examples, but it also means those adversarial examples are more dangerous, because they have higher generalization power to other data that's likely to be seen in testing. My intuition here is that the reason the lottery ticket hypothesis works is that almost all of the representational capacity in neural networks is spent memorizing challenging examples or low-frequency attributes, and because those have essentially been memorized, the features learned are probably not very robust. So does that mean fewer adversarial examples, or more adversarial examples? I don't know the answer to that.

One of the things I've been wanting to talk about for a while is machine learning security. I'm involved in various governance processes at work around how we can apply and think about machine learning security, and think about some of the consequences of adversarial attacks on machine learning systems. I must admit today was quite a refreshing conversation, because Nicholas Carlini was basically saying there's nothing you can do about it. These folks were making the argument that we've just got bigger fish to fry with machine learning: it's hard enough getting the models to work robustly, and actually, by doing adversarial training and making them robust to some of these attacks, we are making the models less accurate, so they're even worse than they would have been; and if an adversary really wants to take on your system, they're going to get you anyway. So they were almost being quite fatalistic and defeatist about it, saying that we shouldn't be worrying that much about machine learning security. But anyway, with that caveat in place, I was still intending at some point to make an episode about this, and there's a really interesting article from Microsoft, "Failure Modes in Machine Learning". You might recognize Jeffrey Snover; he's the guy who invented PowerShell. It talks about intentional failures and unintentional failures, and just having a taxonomy of these failures is quite interesting: things like perturbation attacks, poisoning attacks, model inversion, membership inference, model stealing and so on. If there's sufficient interest I might make another show on that later. There's also another sub-article called "Threat Modeling AI/ML Systems and Dependencies", and this is really good: it gives you some examples of questions you might want to ask in a security review at work, because presumably, if you work in a large corporation, you will have a whole load of best practices around how you do peer review in your software engineering, or how you do technical design reviews and so on, and one of the things you should be thinking about is how you review the security of your machine learning applications.
This is a great starting point for some of the questions you might ask. For example: if your data is poisoned or tampered with, how would you know? Are you training from user-supplied inputs? How sensitive is the data you train from? Can your model output sensitive data? So there's a really cool list of things to think about here, and of course we get into the philosophical discussion of how much of this stuff generalizes, how relevant it really is, and whether we should be thinking at the general level or at the application level. But anyway, from a security point of view, I find this stuff really interesting.

Okay, right, well, in which case: welcome back to the Machine Learning Street Talk YouTube channel and podcast, with me, Tim Scarfe, and my compadre, just the one today, Yannic "Lightspeed" Kilcher. Today is about adversarial examples. Adversarial examples have attracted significant attention in machine learning, but the reasons for their existence and pervasiveness remain unclear. There's good reason to believe that neural networks look at very different features than we would have expected. As articulated in the 2019 paper "Adversarial Examples Are Not Bugs, They Are Features", adversarial examples can be directly attributed to the presence of non-robust features: features derived from patterns in the data distribution that are highly predictive, yet brittle and incomprehensible to humans. Adversarial examples don't just affect deep learning models, and an entire cottage industry has sprung up around threat modeling of artificial intelligence and ML systems and their dependencies.

Joining us this evening are some of the currently leading researchers in adversarial examples: Florian Tramèr, a fifth-year PhD student at Stanford University studying computer science; Dr. Wieland Brendel, a machine learning researcher at the University of Tübingen and co-founder of layer 7.ai; and Dr. Nicholas Carlini, a research scientist at Google Brain working in that exciting space between machine learning and computer security. Some of you may remember the landmark paper from 2018 which Brendel was associated with, "ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness". Carlini and Brendel wrote a paper in early 2019 called "On Evaluating Adversarial Robustness", where they established a methodological foundation: they reviewed commonly accepted best practices and suggested new methods for evaluating defenses against adversarial examples. Florian has done some work on adversarial examples in perceptual ad blocking; ad blocking has moved on a long way, because in the olden days it used to be about HTML structures, the document object model, CSS rules and so on, and now some ad blockers are even using perceptual models. Also, in collaboration with Nicholas, Florian investigated a new concept, a complementary failure mode if you like: invariance-based adversarial examples, as opposed to sensitivity-based ones, which introduce minimal semantic changes that modify an input's true label yet preserve the model's prediction. All three of the gentlemen with us today wrote a paper earlier last year called "On Adaptive Attacks to Adversarial Example Defenses", where they give guidance on how to properly perform adaptive attacks against defenses to adversarial examples. Anyway, gentlemen, it's an absolute pleasure to have you on the show, and welcome. Yeah, thank you. Welcome everyone. Thank you very much. I think, Tim, you've
already kind of uh some of what you said is already lighting a war in the community because um so i'd for people who don't know much about adversarial example community this is a community that's has a lot of opinions and no one i would say no one really really knows but you know everyone thinks the evidence is on their side of the opinion so if you think about adversarial examples it's pretty easy to explain what they are but the very next question that people have is why do they exist and so for for you three what are sort of your your leading like your best explanations for why do adversarial examples even exist what are they what's wrong with neural networks i think this sort of overarching idea that neural networks are just learning something very different than what we as humans are learning while still being somewhat predictive on average is i mean that's that's a fact right the fact that these neural networks can reach say ninety percent accuracy but not 100 accuracy is already proof that this is happening right they learn something that's somehow truthful but not really what we're learning otherwise they would get 100 accuracy and it's somewhat surprising that this this brittleness is sort of so um so strong that for any input you can really just perturb features in a very small way so that the model pretty much fails in any any way you want and there i think there's still a bunch of of competing hypotheses as to why this happens a lot of arguments from the sort of high dimensionality of these models and so on and i think a very clear explanation of what's exactly going on there is still is still out there i i think that it's easy to fool yourself in lots of areas of science it's especially area easy in areas where rigorous experiments are hard and i personally try not to believe any one of the reasons in part because if you believed that one of the reasons was certainly true this was the cause for every single examples once you have a perfect explanation it's usually possible to say there is a defense based on this explanation which removes them and most of the defenses that have said this is the reason why i have still examples exist therefore we do not that are wrong or broken and so most people who can come up with a very clear explanation for why they exist and then try and prevent them with that have not been able to do so and so there are good reasons to believe that they exist because of maybe multiple things high dimensionality non-robust features maybe absolute examples are just the closest test error in high dimensions you know there are all these reasons but i don't think that we can comfortably say exactly the problem is that i'm fine living in a world where i don't know what that is right now like i would rather in my head not know the answer and think i not know the answer then try and pretend that i know exactly what the cause is because it just makes it harder to do good research yeah i i second that um i mean the the interesting so i agree if we would know exactly what it is we would just solve the problem and obviously that's not what we're doing right now at least the field as a whole i mean as you see the progress in the last couple of years it's not huge simply because i mean we are trying a lot of things but um nothing really i mean or very few things have sticked but i mean there is kind of one i mean there are at least a few hints of what things might play a role and that's kind of why the box vs feature paper is um uh is so interesting because there was 
the big question in the field also for myself i also thought exactly about this the question is are adversarials just something that you know happened because of the architecture some peculiarity of neural networks something or is that actually something you know are these features that the network have learned that they're using for classification and that paper kind of shed some light on it that at least in part some of the brittleness that we see are due to the features that that the networks have learned and that they're using for classification no there's no doubt that like this paper i guess in wayland one of your papers on um just texture bias like are these two papers where if you say if you give the main result it's sort of no one was surprised by there are features that neural networks pick up on that are not robust that's not why that people like the paper people like the paper because that was a really nice experiment to show this um and same thing with with weyland's paper i think it's not the case that people are like surprised that neural networks have texture bias it's just the way that you show it is like is just a very very nice way of saying well let's just like not let the texture look at all about the the global features and just basically do bag of features or something and this is a very nice experiment to show a fact that people believed probably was true and everyone was sort of folklorists saying is true but no one could come up with a nice way of actually demonstrating this fact on this texture thing in a weird way it's not a disaster that neural networks could actually learn a texture that is imperceptible to humans the the alternative seems to be unthinkable there is a school of thought which is that neural networks just memorized the training data pedro domingo's just published a paper didn't he saying that they're super positions of the training data and did do you guys believe that or do you think they are actually learning very generalizable representations that's also a very very difficult question to answer i guess nicholas and i have this recent paper that shows that for for text models it is actually closer to the truth than than you'd want that these models very explicitly memorize some of their training data at the same time there has been quite a bit of work coming from the the privacy side on machine learning where you can probably show that you can train models that that do not memorize any of their training data in a very formal sense but this tends to hurt the model's accuracy and this is still a bit of an open problem in this space as well is how accurate can you make a model if you probably prevent it from memorizing any individual training point and i think this is still one of those grand open questions in in deep learning to really understand how these models learn sort of where the barrier between memorization and and generalization is right i mean just quickly nicholas and flurry and you did just publish a paper didn't you about the the propensity of language models to memorize data it was called extracting training data from large language models and yannick did a video on it and given sufficient prompting and given lots of sampling and then some internet searching you can kind of demonstrate just to some extent that these language models are are essentially memorizing uh the input yeah i mean this is true um so the caveat here is like this we haven't seen this on image models like for the language models in some sense label memorization 
is the same thing as training data memorization for image models you can imagine a world where the model memorizes the labels but it does it through some compressive hash function something and it doesn't memorize anything about the input in particular but it just memorizes all the labels like this is a completely consistent view with what we've seen and on language models because the task is predicted next word they're basically one and the same so it's conceivable that models you know most of the time do well sometimes memorize labels and maybe the world is just fine and machine learning does everything exactly as it's supposed to but uh again i'm not i don't like to speculate on things where i don't have good i mean there's a there's a an interesting nearly a bit sort of philosophical question there that i've i've been thinking about for some time now which i find kind of fascinating is that on on the robustness side of things um we as humans seem to be uh a proof of concept that you can do sort of what we want right like it we seem to be able to classify things in a robust manner when it comes to privacy and sort of not memorizing things we as humans are absolutely not the proof of concept and that i think the way that we learn as humans we don't quite understand how this works either but memorization plays an incredibly important role in it in that it's it's even hard to imagine like what it would mean for us to learn something without ever memorizing like a specific example and so there's there's this kind of interesting duality there that uh some of the things we might even want from machine learning models like to not memorize part of their training data it's not even clear whether that's really the right thing to ask for um in terms of learning ability from a privacy perspective it's the right thing to ask for but maybe you actually need to do this in order to learn i mean this also in a sense kind of mirrors two perspectives on robustness right if you're coming from a privacy uh or security kind of perspective then memorization can be a very obvious problem if you're coming you know from a human perspective if you're interested in how human vision works and kind of moving machine vision towards human vision then you don't really care about memorization uh you know they could memorize as much as they want to if they solve the problem in the right way and if they generalize correctly right the humans are actually are actually a good example of a thing that doesn't learn with privacy in mind but a thing that can actually gener like a human can at output time decide no wait a minute i shouldn't say that right um so it's also very thinkable that we build models that do memorize but you know in a privacy setting would actually refuse to sort of output information that could could be memorized where does semantics come in here could it is is it the case that there's a problem using these geometric spaces and the reason why humans don't have these robustness problems is we have this um semantic type system and when when we communicate with each other you know the exact human thought that initiated that communication and that there's almost no ambiguity in that human thought i suppose what i'm trying to say is if we anthropomorphize computer vision models is is that really going to help us because if we improve what we have now in deep learning are we just building a taller tower to get closer to the moon or can we actually use that insight so here's one important thing i think for a 
thing of episode examples they also exist on domains where humans are bad and humans aren't the answer for which what's true so so think malware classification a piece of malware is either malicious or it's not like a given binary like it's either you know it's microsoft word and it does everything fine or it's going to like destroy your hard drive and then maybe there's some gray area in the middle of things that like are slightly malicious and like our adware or something but they aren't straight malware but if you take a set of programs that are only benign instead of programs that are only malicious it's not up to the human to decide which one is going to eventually wreck your hard drive either will or it won't and you can train a classifier to try and predict is this program going to destroy my hard drive or not it doesn't matter what the human thinks there is a certain ground truth independent of human reality and machine learning classifiers on this task will still perform well but we'll still have adversarial examples and here you've removed the human completely the human no longer matters but the classifier still is going to have cases where you flip one bit of an unrelated piece of dead code and now all of a sudden the classifier decides that this previous thing which is classified as malware is now benign and so even if we're like this is why i don't like going too far into the human space because i don't know it's a little it's very hand wavy for me like i like to talk sort of about these the sort of things we can talk rigorously about and this is one of the areas where we know that we can talk rigorously because there is a ground truth and classifiers have these problems still that is a fascinating exam i was like the next question i was gonna ask i'll i'll have to rephrase it right now because so if if you look at just the domains where we usually you know look at as humans in adversarial examples like computer vision speech processing and so on um what are the dangers because people always say oh it's a security risk and so on and i'm wondering do you in these domains like now i can clearly see in in the in you know these in the classifying malware domain but in like the computer vision domain what's the harm so to say of adversarial examples yeah so this is this is my my big pet peeve with with a lot of the research that happens on on machine learning um security is that people will will show how to make a machine learning model do something weird or bad and claim it's a security issue and that's i think in in isolation now a machine learning model doesn't harm anyone and so it doesn't matter and you really have to think of like how these models get deployed as part of bigger systems to and and whether there's going to be any adversary there to actually do something for it to matter or not and so my my maybe go to example here of something that i don't really consider a security issue with address cell examples is self-driving cars where i mean yes we've shown that i mean there's been a lot of works that have shown that you can perturb stop signs and some computer vision model will think it's looking at uh turn left sign instead and so on um and the thing is you could also these these uh models also fail when there's like snow on the sign or when it's raining very hard or when the visibility is bad so at some point i think thinking about adversarial examples in this context is kind of maybe this will matter in in 20 years or in 10 years or but for now it's it's sort of 
there's there's things within the distribution that can cause your say your self-driving car to fail and that's kind of that's a problem enough there's a safety problem there and i'm not particularly concerned about a malicious adversary um trying to to do this to your car it's kind of if someone wants to kill you then you know um let them try they might like throw rocks at your car or they might take their own car and ram into you i think like perturbing pixels is not the the main threat there and so about two years ago um we're discussing this with um with my advisor and we're trying to come up with like a setting where this this way that we study adversarial examples um makes sense as a security threat so basically we're looking for a setting where you as an attacker would be interested in modifying an image so that to fool so that you could fool a classifier sort of any time you can't just find the worst parts of the distribution and that's enough and also where there's a human in the loop so you actually want the perturbations to be small and there's one area where we found this to make sense and that's in online uh content blocking so the the paper we wrote was about ad blocking and so there it's this threat model makes a lot of sense like what the adversary say facebook or some other ad network or publisher wants to do is to make sure that content is pushed onto users so it fools the ad blocker but with minimal changes to the content because humans have to see it the right way there's other examples of this like when youtube wants to take a video down um say uh i don't know someone someone's uploading some some terrorist content that they don't want to see it's it's kind of the same thing they have a classifier that wants to um check every video and see if they have to take it down and as an adversary of the system what you're trying to do is get past the machine learning model but while making the smallest amount of change so that the humans still see the the video so there's really there's there's these sort of settings that i would say are nearly a bit of a touring test for machine learning models where there's really there's a there's a human in the loop and the goal is to make the machine learning model fail the turing the sort of visual touring test say and those are the settings where i think today adversal examples can matter from a computer security perspective i thought quite a bit about other settings that have this similar property and it's it's hard to come by i really agree with you about some of the gratuitous examples of adversarial examples whether it's recognizing turtles and and so on and i think working in a large corporation myself i'm looking for ways that we can actually take this kind of security training and we can apply it to our applications through some application governance protocol and the extent to which those questions that we should ask the developers generalize i don't know so i'm looking online at a threat modelling taxonomy and you could say well you know if confidence levels of your model suddenly drop could you know why and and which data caused it and you know does the model only output results necessary to achieving its goal or some of them might be far more specific to a particular application like you were mentioning in in youtube but what you said before also made me think when when does overfitting and good machine learning design stop and when does adversarial example start because you gave the example of oh it might be recognizing snow 
you know it might it might see the snow and think that the snow means leopard or something isn't this just kind of machine learning hygiene at some point yeah so maybe one one extra extra point i would make there that i think is also is really the big difference between some of these settings is that their settings like self-driving where one mistake matters it's basically you you have to be right all the time and in those settings yeah i think the the big challenge today is just this distribution shift or low tails of your distribution like adversary examples might matter at some point but we have sort of bigger fish to fry before then and then there's settings like like for instance ad blocking where if your classifier has like 97 accuracy that's probably okay like users are not gonna be particularly happy that from time to time things break but it's not like anyone's gonna die in the process um and so i think in those settings where a classifier that's good on average is good enough and where the adversary really has the goal of getting control of the system fooling the the model every time they want that's a setting where address cell examples really matter from a computer security perspective and there's i think there's relatively few settings where it makes sense to deploy machine learning in this way it is an interesting story maybe i recently uploaded i upload my videos also to a a platform in hong kong called billy billy just because chinese people can't access youtube and i uploaded a recent meme review and it was the first video that got blocked there and i figured out because they gave me the timestamp i figured out it's because i had a meme of winnie the pooh which is a banned image in china because people make fun of xi jinping and with using that it's so crazy i hadn't even thought about actually building an adversarial example into the video i'm gonna do that uh that's gonna be awesome um yes sorry wieland you wanted to say that's a cool story uh no but but adding to what florian said i think um i mean from a from a pure security point of view probably adversarial examples is not so bad in if you view it in isolation but i think what we should forget is that adversarial examples are not an isolated phenomenon right so if we are thinking about autonomous driving we are talking about issues in generalizing to new situations and this is very much related also to adversarial examples at least to a certain degree because adversarial examples basically show you that the networks are not using using the right features they are processing the world in a different way they are basically extracting a certain set of patterns to solve the tasks they were taught to be solving but uh you know these patterns don't or not necessarily generalized new situations like let's say snow which then is like oh no whenever there was snow i saw a leopard so probably there's a leopard right so it is that's why studying at wizard examples is also partially so interesting not only from a security point of view but also because it really points to the wider problem we have as a community and it's so deeply connected with many other issues that we see in machine learning also by the way also data efficiency right because you know we we need to extract these statistical patterns and we need so much data to really extract something that is kind of useful on the data but that's really again pointing to the the phenomenon that they are kind of the networks are learning shortcuts they're solving the the task 
in the kind of wrong way or let's say in the unintended way and that's creating all these kinds of problems you actually had a paper called shortcut learning right that is that is related to adversarial examples but i think the conclusion there is a bit different and i think also a quite relevant you you want to shortly go into what sort of the the paper was about and what you figured out there so the shortcut paper i mean it yeah it's kind of it was a summary of our you know maybe also internal thought processes right um so um i mean at the end i mean basically what we said is that um you know when we are trying training neural networks on a task we usually kind of miss the many ways in which we can solve them right and imagenet is kind of a nice example because we humans have a very strong sense in how this problem has to be solved right we see the shapes we generalize in certain ways and it comes so natural to us that we are missing all the other ways that kind of the solution space that exists and you know other ways in how into solving the same problem to the same accuracy both on the training and the test set and so kind of if you look at the task we should not forget that there is this large space of solutions but in this large solution space there are only some sub solutions that actually generalize to out of distribution domains in the way that we intend them to and right now we are not good at you know basically choosing the right solutions from this large pool of possible solutions or even understanding what how large this pool of solutions even is and that's kind of one of the underlying problems that underlie you know problems in generalization distribution shifts data efficiency all these kinds of problems adversarial examples are all connected to this because from this large pool of solutions we're kind of picking to a degree the wrong one and now the question is how do we solve this how do we add the right inductive biases in the architecture and the training in the in the in the data set in order to move neural networks to find kind of the right way of solving the problems finding the intended solutions um that we that are then also generalized to new domains that's fascinating because you said we are picking but presumably you mean the machine learning algorithm is picking and as you say it's it's determined from the objective and the loss function and the inductive priors and so on so so there are so many potential solutions but it picks the wrong one yeah well there's even a recent paper from from google showing that this actually even happens when you just change the random seed right like you get to a solution that generalizes differently just it's not like a bit weaker or a bit stronger but it actually generalizes differently and i think what you're saying you know goes even beyond that so the space is huge yeah i mean that's i think it's still kind of an open question on how different the generalization is so we have um like very recent work which is uh where we looked at basically a lot of different models and how they perform on out of distribution data so um and basically so we compare humans and machines on several generalization data sets and we look at which samples they're doing right and on which samples they're actually failing which they're classifying incorrectly and then we basically look at how similar the the errors are um you know between cnns between humans and then between cnn's and humans and what we see is that whatever architecture you're using 
whether it's a convolutional network at a transformer network whether it's unsupervised supervised doesn't really matter they all make fairly similar errors it's actually quite consistent if you compare all of them in between and humans are also relatively consistent on which uh images they make errors but if you look at the cross-consistency between neural networks and humans you see a large gap so it is yeah it is an open human generalization paper from some folks out of berkeley where what they do is they just relabel imagenet for like tens of thousands of images or something and then they show that um humans for example on object classes on imagenet like a well-trained human can get like basically no errors like maybe i think like the best human made one mistake on like a thousand images but you ask a human to label the 136 kinds of dogs and it has the best human like they trained themselves they went to the the kennel like society they tried to do as like as best as possible and the best humans that among them who i mean the middle grad students who are not dog breeders but um they could only get you know 70 or something percent accuracy and the best image net models were substantially better than them and so i mean for imagenet models like they make these mistakes on like things that to humans are completely obvious you know is this a banana or a cup or something but can accurately distinguish the like craziest breeds of dogs that look almost essentially the same to us but we have hard time sometimes on these dogs but it's trivial for us to identify two objects from each other one to one another and yeah this is just generally one of the problems of machine learning is you can't it's not that it makes mistakes it's that you don't know when it's going to make a mistake and you can't trust that what's the mistake and what's not it's a similar thing with different types of mushroom as well i think we're incredibly bad at that but it does come back to this dichotomy though that on the one hand you could just say okay it's not a it's not a problem with the substrate of neural networks it's a problem with shortcuts and the way we're designing our models and i really like that idea we spoke with lena voiter the other week and she was saying how the accuracy probe is rubbish it doesn't tell you anything about the mutual information between the representations and the labels and if you think about it we're just in this quagmire where we actually have no idea what these models are learning and we're clearly not directing them to learn anything useful so maybe we just need to have better models or do you think that there is actually an existential problem with deep learning and it will never do what we want it to do i guess you could say that the the objective that we're typically optimizing for right which is just accuracy on on these tasks it does seem i mean i i believe that we're we're reaching maybe the the limits of what can be done with this because yeah on depending on how you measure things on on imagenet we're sort of already doing better than humans with with models this way as long as you don't change the distribution in in any way and the models are usually nearly fit the the training data perfectly and so it really seems like we need some new form of some some different objectives some different form of regularizing the the search space because just pushing for more accuracy is not the is not the way to go i think it's a fascinating question um which i think it's super hard to 
answer until we actually know what the solution is so in the sense that i mean we have we what we do know is that the solutions that that that the machine finds depends on its architecture it depends on what task we are training on what loss we are training on even the optimizer itself right whether you are using uh you know atom or something else it will all kind of change your solution the problem is i think we have very little idea very little understanding right now how each of the the pieces kind of influences the solution that we find and how much we actually have to change those in order to um in order to find much better solution and solve adversarial robustness i mean i don't know i think i guess nobody really knows um at least i don't and um it's uh i mean my my hunch would be that we have to change quite a bit but what it is exactly is is unclear so i think we are we are reaching kind of a certain limit uh which is also obvious from the fact that you know if you look at the past few years really the progress at least in robustness isn't huge right so i although a lot of people are working on this problem and are thinking about this problem so it is definitely something that is not you know uh solved by what i call a silver bullet right i'm always skeptical if a paper comes up and says oh i changed this little thing and now suddenly everything works um it's unlikely and and we probably have to change several things in concert in order to learn substantially better representations closer to humans but that's a hunch one of the interesting continuums here is presumably we are all fans of data-driven methods but we all admit and agree that we need to imbue more knowledge into our models but the question is how much more knowledge do you think that we're quite close to something which could work or do you think that we're going to have to impute so much knowledge into our models that they're going to resemble expert systems again i worry that if we speculate too much i don't know this is it goes back to the same thing where if we knew what to do then we would do it and i mean we know that adding more training data helps with episode robustness too by like five percentage points or something you know take cfar and augment most of tiny images and you go from 40 to robustness to 45 robustness with adversarial training and that's good but you know if you extrapolate the curve we'll need like 10 trillion images before you get anything reasonable uh i don't know it yeah maybe to to echo this i would i think i would speculate that even if we were to find the right solution right now say you know um some some grad student somewhere in his 10 000 iteration of trying to train a model on imagenet actually happens to like learn the model that is uh that that solves imagenet in the right way we wouldn't even necessarily be able to recognize this because i think we we have a bunch of benchmarks at this point to that people might measure and of course if you threw a very good model at some of these benchmarks they would probably do better than what we have now but yeah i think even even if we have the solution right now sort of convincing ourselves that it really works in all situations that we care about is something that's hard to do it by uh in itself so with with respect to convincing ourselves and and what works there there is this notion of defending against adversarial examples right and so that's that's kind of the the thought people have i want to make a method to defend and as you say if 
we knew why they happen, we would just be able to defend against them. And, correct me if I'm wrong, but adversarial training, which is simply "I add adversarial examples to my training data", is still the best method we have so far, and it was suggested in one of the first papers. Could you elaborate a bit on the space of different defenses and counter-attacks, and how that looks? I mean, the space of defenses is sufficiently broad that I don't think I can summarize it. Basically, if you can imagine the defense, someone has probably tried it; at least 500 or a thousand defenses have been tried, everything from trying to detect whether inputs are adversarial just by looking at them, to monitoring how the forward pass has gone to detect whether inputs are adversarial, to pre-processing them with various crazy preprocessors, to post-processing the output with whatever function you can come up with. Most things like this have been tried. As for how to attack, that's an easier question: the answer is that you just run gradient descent on a simple loss function that's easy to optimize, and this is the only thing that any attack does, and the only thing any attack should do. They have different names depending on which paper you look at, but they're all the same, up to maybe a couple of percentage points and which norm they're optimizing over; they all do essentially the same thing. So the way we generate attacks, I should say in the white-box setting where we compute gradients of models, they're all the same: you just run gradient descent with respect to a loss function. In the black-box setting, where you want to craft an adversarial example without ever having access to the weights, there are a lot of different approaches that behave very, very differently; most of them are some variant of a paper that Wieland proposed on decision boundary attacks. But I don't know if I want to get too much into the whole space right now. Yeah, well, one of the fascinating things, though, is that adversarial examples generalize between models and architectures. From my naive perspective, does that support the hypothesis that the model is just learning imperceptible features? So this is what the paper from the folks at MIT did: they explain why adversarial examples transfer, because it is sort of amazing that they transfer; you train two models on similar data sets, not even the same data, and adversarial examples on one transfer to the other. The argument from the paper is that there are well-generalizing features that are legitimately well generalizing, that just are not the things that we as humans want to think of as being well generalizing, and the classifier entirely correctly picks up on these well-generalizing features. So two classifiers trained on the same kinds of data will learn the same well-generalizing features, and as a result you can have an adversarial example just because you've attacked these well-generalizing features that are high-frequency or imperceptible to humans, and things work out there. What if we could have some kind of discrimination on which features are allowed and which features are not allowed? Now, the problem is that a neural network learns this weird, entangled hierarchy of features; I'm not sure that that would even be possible, but imagine it were possible, what would your approach be? You can view adversarial training as doing exactly that, in that it fixes some form of perturbation that it wants the model to be robust to, and by optimizing for this form of robustness, what you are essentially telling the model is: any feature that isn't robust to this form of perturbation, you can't use it, because you're going to be penalized for doing so. And of course the issue here is that you have to specify in advance what you want to be robust to, and that's one of the issues in this space: we don't know how to generalize beyond that and get robustness to more general forms of noise than those we can fix ahead of time. But adversarial training, you can really view as the way to force the model to do this. If you look at randomized smoothing, which is one of the other few defenses that actually seems like it works, it does exactly the same thing: you want to prevent some kind of noise, so you say that you must be able to classify inputs with extreme amounts of that noise correctly, and then you do some statistics on top of this. You say, I want to be robust to L2 noise, so I'm going to add Gaussian noise to my images and classify those.
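To make the randomized-smoothing idea just described concrete, here is a minimal sketch of the prediction side, in the spirit of certified smoothing approaches: classify many Gaussian-noised copies of the input and take a majority vote. The untrained base classifier, the noise level `sigma`, the sample count and the class count are placeholder assumptions, and the certification statistics mentioned in the conversation are omitted.

```python
import torch
from torchvision.models import resnet18

# Illustrative base classifier and input; in practice the base model would be
# trained on Gaussian-noised images so that it copes with the added noise.
base_model = resnet18(num_classes=10).eval()
x = torch.rand(1, 3, 224, 224)

def smoothed_predict(model, x, sigma=0.25, n_samples=100, n_classes=10):
    """Majority vote over predictions on Gaussian-perturbed copies of x."""
    with torch.no_grad():
        noisy = x.repeat(n_samples, 1, 1, 1) + sigma * torch.randn(n_samples, *x.shape[1:])
        votes = model(noisy).argmax(dim=1)
        counts = torch.bincount(votes, minlength=n_classes)
    return counts.argmax().item(), counts

pred, counts = smoothed_predict(base_model, x)
print("smoothed prediction:", pred, "vote counts:", counts.tolist())
```

The "statistics on top" referred to in the discussion use these vote counts to bound how much the prediction can change under small L2 perturbations of the input.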
It's good that we have things that work; that's better than having nothing that works. But it's a little annoying that the things that work basically just cut out one corner of the problem and say, "yeah, don't do that thing", and now we have an attempt at a solution. I would much rather have someone who could tell me "this is the problem with machine learning, and here is some nicely inspired way to solve it", and as a result, you know, section seven of the paper says adversarial examples are fixed. But currently the fix basically is: it looks like neural networks are bad at adversarial examples, so we'll train on lots of adversarial examples, and therefore they go away for this kind of adversarial example. And that's fine, but it's not as exciting as we would like; and every paper that tries to do the exciting thing and says "we've solved everything" most often turns out to have solved nothing. It also mirrors other advances we have in machine learning where, even though the numbers are good, you could say the process by which we get them is kind of unsatisfying, which is basically everything we do with data augmentation and getting more data; adversarial training you can think of as just the worst-case setting of this. With data augmentation it's the same thing: we say, well, we want our models to learn that shifting inputs or rotating inputs or adding different kinds of natural noise shouldn't matter, and one way you could do that is to really design your model so that it is robust to these things by design, and people have tried to do this, but consistently what works better is to just throw more data and more data at models. And we're still in a regime where it seems we haven't quite hit the limit of how well we can do by just adding yet another 10x in terms of data, and it's going to be interesting, once we hit the limit there, what will happen. That's right. I suppose, again, you know, as a
naive uh bystander just to help folks at home understand this if we were in a perfect universe where because essentially what we're talking about is that is the boundary between our data distribution and anything that's outside of it and if we were in a perfect universe where we had all the data in the world all of the data that would ever occur and we would ever you know that that would ever need to make a prediction then there would be no adversarial examples it's because we're dealing with this boundary where we train it inside the boundary and then the model gets used outside the boundary that's why we have adversarial examples but but then then it comes back to my my thought before about when you say adversarial training is it adversarial training because what you're really doing is you're just enumerating all of the features that the model has learned and you're kind of neutralizing them and if you keep doing that forever there'll be no features left and that the features are entangled so it's very difficult to say well let's not have those ones but let's have those ones it's not like some are good and some are bad i think there are objections to your first statement already go on go so if i mean if if if you had all the all the data in the world and you built a classifier on it there would presumably still be adversarial examples because you know you haven't you haven't actually you haven't specified a sort of a a data distribution model if you could accurately specify an actual distribution model of where your data is and where it isn't um maybe that could help but as as long as you're building classifiers um and especially with classifiers picking up on these features and i mean these that's the point these are features that are actually matter they are features right they're not aberrations and as long as they're picking up on that you know well i mean i guess there there is a thought experiment here it's it's a fault experiment you would never be able to implement this but say you took every single um pixel say you take your 300 by 300 by three images and you you take your 255 to the whatever combinations of pixels uh label each of these by a human um and then train an uh uh machine learning classifier on this one that that somehow reaches like good good training accuracy um well then at some point well yeah you're doing curve fitting uh if if you can classify everything perfectly then you don't have addressable examples um but that's i think also still still an open question there in terms of just the the yeah whether of course we're never going to do this but whether there is a point along the line we're just sort of collecting more data um in itself can help because of course as you collect more data the chances that there are features that generalize but are actually non-robust it this probably decreases at some point so you'd hope that classifiers are not going to pick them up if you just give them so much data to classify that the only solution is the right one but we don't really know when that's when that occurs hello from postscript i've had a little bit of time actually to think about florian's thought experiment here and imagine that you did have all of the possible images so we we have the entire space of pixels we got a load of humans to label every single permutation of pixel values we might have something that looks like this so first of all it assumes that the humans will know what the labels are and there's no ambiguity but it also assumes that there's a 
But it also assumes a regularity in the data: that within the space of cats, they are all cats. What if it were quite chaotic, and between a pair of cats you might have a dog — it suddenly goes cat, dog — and you have these challenging decision boundaries? Isn't that still creating the possibility of adversarial examples? Anyway, I'm naive on this, but that's my thought process after the show.

There's also the question of how interesting that is. For applications where you have that much data, okay, it might be interesting. But even adversarial training — when I first learned about it I wasn't really fascinated. It felt like the classic deep learning way of solving a problem: throw a lot of compute and a lot of data at it until it breaks somewhat. It's not a super interesting solution in the sense that I learn a lot from it. It's still cool what the representations are — they generalize a bit better in some directions, you can use them for certain things — so it's very interesting from an application perspective. But from the point of view of understanding representations, understanding how learning works, how learning the right features works, I didn't find it a very elegant solution, because it doesn't give us an understanding of why the inductive bias of the model was wrong in the first place — why it learned these brittle features in the first place. The cool solution that no one has found yet is really understanding how to change the inductive bias, similar to how the human brain works. The brain obviously doesn't need this much data: we have a lot of ingrained inductive biases and probably also very clever learning algorithms that extract the main features — the things that are relevant about the world — from a fairly small amount of supervised data and a bit more unsupervised data. Understanding how that works is, at least for me personally, the real goal — and it would solve a lot of problems in applications too.

This is not for lack of trying, though. I can't count the number of papers proposing a biologically inspired way of preventing adversarial examples. One of the first was one that Wieland broke back in 2016 or 2017, and since then, every couple of weeks, you get another biologically inspired adversarial example defense — it looks at the human vision system, or there are some crazy ones involving biology and antibodies and whatnot. Every attempt people have made here is a legitimate attempt: here is a thing that machines do differently from humans, I will change that, and hopefully it gives us some robustness. They evaluate it and find that it does; then someone else evaluates it and finds that it basically does nothing. That's generally been the problem with these biologically inspired approaches. I like them, I want one of them to work — but so far none of them have.
Though it's not clear that it's the biologically inspired piece that makes them not work — most things that get proposed don't work, the biologically inspired ones included — and it may just be fairly easy and fast to show that these ones don't work compared to some of the others.

At the very least, I would give those papers credit for trying to offer some explanation — to teach us something new about the problem. The kind of defense paper that I feel ultimately contributes little to our understanding of the space is the one that proposes some change to the model architecture or the loss function or whatever, and the robustness numbers jump from 40 to 50 to 60 percent. Very often, if you evaluate them correctly, it turns out they don't jump from 40 to 50 percent; they jump from 40 to 0 percent. And even if it had worked — even if you had gone from 40 to 60 percent — it's not entirely clear what you're really learning. Okay, maybe softmax cross-entropy plays a bit of a role and another loss function does slightly better — but so what? So trying for something a bit more ambitious — really finding an explanation for what's going on and proposing something that helps — I think that is ultimately going to be the way to go. Unfortunately nothing has worked yet, but I still think that's going to be the right kind of solution if we ever find one.

Maybe to second that and give an example: there was a very nice paper from Ankit Patel's group last year. They looked at the implicit bias that standard convolutional architectures actually have, and what they showed is that these architectures are biased towards features that are sparse in the frequency domain — high-frequency features. They reason that this is probably why high-frequency features get picked up by the network during learning: the architecture is biased towards exactly the kinds of features we also see in adversarial perturbations, and maybe that's why we learn these brittle features. They then used a different architecture that you can see has different biases — it learns more low-frequency, less sparse features. Unfortunately, that new network is not any more robust than the old one, so it didn't actually help in terms of robustness. But it's the kind of insight that would have been super cool if it had worked, if it had given at least a little bit of an advantage, because you're giving the model an inductive bias towards a different set of features — maybe ones closer to those humans actually learn, maybe not. That kind of work — understanding what the inductive biases of these networks actually are and changing them — would be super cool. But right now, as Nicholas and Florian said, there is unfortunately no work one could cite that does this and shows an improvement in adversarial robustness, so obviously we don't really understand it yet.
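As a rough illustration of the kind of frequency analysis being described — not the specific methodology of that paper — one can simply look at where the spectral energy of a perturbation sits. The cutoff fraction and the toy signals below are arbitrary choices for illustration.

```python
import numpy as np

def frequency_energy_split(perturbation, cutoff_frac=0.25):
    """Fraction of spectral energy of a (H, W) perturbation that lies below /
    above an (arbitrary) radial frequency cutoff."""
    spectrum = np.fft.fftshift(np.fft.fft2(perturbation))
    energy = np.abs(spectrum) ** 2
    h, w = energy.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    low_mask = radius <= cutoff_frac * min(h, w) / 2
    total = energy.sum()
    return energy[low_mask].sum() / total, energy[~low_mask].sum() / total

# A smooth (low-frequency) pattern versus white noise (spread across frequencies):
smooth = np.outer(np.sin(np.linspace(0, 3, 64)), np.sin(np.linspace(0, 3, 64)))
noise = np.random.randn(64, 64)
print(frequency_energy_split(smooth))   # most energy in the low band
print(frequency_energy_split(noise))    # energy mostly in the high band
```

Applied to the difference between a clean image and its adversarial counterpart, a check of this shape is one way to probe whether a given attack or architecture is leaning on high-frequency content.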
On the inductive bias: how much of it depends on the way we get the data? The data comes to us on a regularly sampled grid, in RGB — straight away that's nothing like a human — and all of our inductive priors are derived from that. Is there a fundamental problem even at the beginning of the pipeline, do you think?

People have tried this too. There's the thermometer encoding paper by Ian Goodfellow and others, and there are several papers that transform the input into Fourier space or something similar, and none of them have adversarial robustness. They make the defense a little harder to evaluate, but you do the right pre-processing tricks, you make it differentiable in the right way, and it's not robust anymore. So it may be the case that the input representation contributes — we can't rule that out — but of every input-representation paper that exists and has been evaluated, none of them work. And this goes for everything we've been saying so far: when we say something doesn't work, there probably exist lots of papers claiming that it does. There are no papers I'm aware of with a thorough evaluation showing that it does, or where someone else has gone on to evaluate it and found that it does. Any claim you want to believe is true, you can almost certainly find a paper saying it solves the problem completely. That's just a function of the fact that evaluating is really hard, and those papers are probably wrong.
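The "make it differentiable in the right way" step usually refers to what the literature calls BPDA (backward-pass differentiable approximation): keep the non-differentiable pre-processing in the forward pass, but pretend it is (approximately) the identity in the backward pass so that gradients flow to the input again. A minimal PyTorch sketch, with a made-up 4-level quantizer standing in for whatever pre-processing a defense might use:

```python
import torch

class StraightThroughQuantize(torch.autograd.Function):
    """Non-differentiable pre-processing (toy 4-level quantization) whose
    backward pass pretends it was the identity -- the BPDA trick."""

    @staticmethod
    def forward(ctx, x):
        return torch.round(x * 3) / 3        # gradient of round() is 0 almost everywhere

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                   # ...so approximate it by the identity

x = torch.rand(1, 3, 32, 32, requires_grad=True)
y = StraightThroughQuantize.apply(x)
y.sum().backward()
print(x.grad.abs().mean())   # non-zero: gradients reach the input despite quantization
```

With this in place, a standard gradient-based attack can be run straight through the pre-processing, which is one reason quantization- and encoding-style defenses tend to make evaluation harder without making the model robust.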
We've actually exchanged a couple of emails beforehand, and I want to quote you from one of them, where you talk about how researchers self-deceive into believing these defenses work. You wrote: "I completely understand why this self-deception happens, though. I think a big part comes down to the fact that when you do research there are already so many things that might go wrong that it's almost required to believe that you are right. All researchers necessarily have to have a bit of arrogance — that they're going to be the one to solve the problem that no one else has solved before; otherwise, why try? The difference is that in security work, after you've done that, you now have to completely switch your frame of mind and commit just as hard to trying to show your idea is wrong, after spending months trying to show that it's right. And this is hard, especially because if you succeed, you can't publish the result. To abuse the saying: it's hard to get a researcher to understand something when their publication depends on them not understanding it." You're also known as a bit of a defense breaker — you do break a lot of these defenses. What do people do wrong, and how could they do a better job?

Let me start by responding to that a little. I completely stole the beginning of that quote from Feynman, who has a very nice piece about self-deception in science; the opening is basically his point that the easiest person to fool is yourself, and that you must not fool yourself if you want to do good science. To come up with an idea in the first place, it's necessary to make some leap that isn't justified to begin with — and then the important part is to go back and actually justify it correctly, and that's the hard part. You come up with an idea, which usually means being inspired in some way rather than being driven by rigor, and then you have to make it very rigorous for things to work out. In most of machine learning it's fine if you skip the rigor step, because you can still get a paper that pushes ImageNet accuracy up by one percentage point; it gets accepted, and at least you've done something. The problem with adversarial example defenses is that if you skip the rigor step, your idea is probably broken, in reality it offers no robustness, and someone else has to come along and break it. That's two papers instead of zero. I'm not complaining — I get a number of papers out of doing this — but I would prefer a world in which the original evaluation was correct over one in which a paper initially claims 80 percent accuracy and we have to come along and say it's really 5 percent.

There are a couple of reasons. First, lots of people don't notice that the first paper was broken. I can't count the number of schemes that implicitly or explicitly build on a broken defense. This happens occasionally even with accepted papers: we'll see a paper that says "we build on the wonderful idea of the almost-perfect defense of so-and-so, and we extend it to make it even more perfect" — and it turns out the first paper didn't work at all to begin with, so the same attack that breaks the first thing breaks the second thing. If the first paper had had a proper evaluation, you wouldn't have had the second paper that got it wrong. But more important is simply the speed of science: if papers are mostly wrong to begin with and you need a second paper to come along and correct them, the time from when an idea is proposed until we know whether it works is on the order of months — we can break these things fairly quickly — whereas it would be better if the paper proposing the idea said, the first time around, "here is actually how robust it is, and we're pretty sure our evaluation is right", because then another paper could build on the idea. Even if it goes from zero to five percent, I'd much rather go from zero to five than from zero to eighty to five, which is what we currently do. And I guess that's what, with Wieland and Florian, we've been trying to do in various ways: help people get their evaluations right the first time, so we don't have to keep writing these one-off papers saying "here is yet another idea that does not work". We just apply out-of-the-box PGD with the right loss function and the accuracy goes to zero; it takes us an hour to do and two days to write the paper, and half the people who know about the original defense never find out about the break. It's frustrating in that way.
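For readers who haven't seen it spelled out, here is roughly what "out-of-the-box PGD with the right loss function" can look like: a standard L-infinity PGD loop, but maximizing a margin-style (Carlini-Wagner-type) loss rather than cross-entropy, which is often the difference between an attack that stalls and one that doesn't. Everything here is a generic sketch — `model`, the epsilon, step size, and step count are illustrative placeholders, not the settings used in any particular break; `y` is assumed to be a tensor of integer class labels.

```python
import torch
import torch.nn.functional as F

def pgd_margin_attack(model, x, y, eps=8/255, alpha=2/255, steps=40):
    """L-infinity PGD that maximizes the margin of the best wrong class
    over the correct class (a CW-style loss) instead of cross-entropy."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = model(x_adv)
        correct = logits.gather(1, y[:, None]).squeeze(1)
        # highest logit among all wrong classes
        wrong = logits.masked_fill(
            F.one_hot(y, logits.shape[1]).bool(), float("-inf")
        ).max(dim=1).values
        loss = (wrong - correct).mean()            # push a wrong class above the correct one
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

# Robust accuracy under this attack is then simply:
#   (model(pgd_margin_attack(model, x, y)).argmax(1) == y).float().mean()
```

Nothing in this loop is novel — which is exactly the point being made: for most published defenses, an attack of roughly this shape, run with care, is already enough.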
What are most people doing wrong here?

Maybe at the highest level, there's a difference in mindset between communities in how they approach results, and especially incorrect results. The most extreme case is the very formal areas like mathematics or theoretical computer science: if people there ever publish a result that's wrong, it can break their career. Writing up a theorem and having the proof turn out to be wrong is a mathematician's worst nightmare, even if they notice it themselves and retract the paper — it's something that should simply never happen to you. Security is somewhere in between: people try not to publish things that are obviously broken, and when something does get broken it's not good for your reputation, so people try to avoid it; reviewers also tend to be very pessimistic and assume everything is broken by design — and I'm not necessarily advocating that as the right mindset either. But in machine learning we treat things in a more permissive way: you have an idea, you run a bunch of experiments, you write a paper, and if a few months later it turns out it wasn't as robust as claimed — well, you're on to your next paper by then.

One thing I noticed about a year ago, because it actually happened to me: one of my first papers on adversarial examples was a defense paper, and I have to say it wasn't a particularly good one — the defense doesn't really work as well as we claimed, and at the time it was also genuinely hard; we didn't really know how to evaluate these things, so in hindsight I don't think we did as good a job as we could have. People have built better attacks over the years, so at some point we updated the paper. It now says: these are the ideas in this paper; it turns out some of the evaluations weren't as good as they should have been; here is a list of later papers showing that you can reduce the accuracy. I checked last year: of roughly 30 papers on adversarial example defenses published over the years that have all since been broken — most of them by Nicholas, actually — exactly one has a footnote, on page 2 I think, saying that it is broken. The other 29 sit on arXiv; many have been updated after the break, they're on version 3 or 4 at this point, and none of them mention that the defense is ineffective. I think that's one of the reasons, as Nicholas says, that you then sometimes see people building on defenses that are actually broken: they might not know the whole literature, and the papers don't make it explicit. It's a mindset to adopt: if your goal is really to build towards more robust machine learning, you have to be honest about results that ultimately aren't correct, or evaluations that weren't as thorough as they should have been. Even the adversarial training paper from the folks at MIT — one of the defenses that really has stood the test of time — did this. If you look at their paper on arXiv, it's also on version 3 or 4 at this point; they updated it a couple of times with caveats about evaluations that weren't done quite correctly the first time, and they set up a challenge that people could chip away at.
I think that is ultimately the right way of doing security: make falsifiable claims, and if someone falsifies them, acknowledge it.

Yeah, I think you're right — in security we need to have a lot of humility. But you're touching on an interesting problem in research in general, and especially in machine learning: we present papers, there's a credit-assignment system, there's groupthink in peer review, there are all sorts of problems. You did say something interesting a little while ago, though: that machine learning, compared to security, is more permissive, in the sense that we're allowed to make mistakes and it won't necessarily destroy our reputation. That fascinates me — is it because we don't really understand how these systems work?

I'm not advocating for destroying anyone's reputation, by the way.

No, no — neither am I. But I think the key concept is shortcuts. In security there are so many ways an adversary could attack your systems, and in data and machine learning there are so many shortcuts a model could take, and we don't fully grasp what those shortcuts are. Some of it is like dark matter: the space of possible shortcuts that we are currently unaware of. Is that the problem?

Well, I think the bigger problem in machine learning right now is that the attack techniques are known — we know exactly what to do — and people still get it wrong. That's the frustrating thing. There was a time when, to evaluate a defense, you needed to do something new and clever; there were research-quality ideas for how to break a defense. A fun example: in 2017–2018 I wrote a paper with Anish where we broke some defenses, and it got best paper at ICML — which, first of all, is crazy; I don't think it deserved that, but the ICML community as a whole decided it did, fine. The paper I wrote with Florian and Wieland, where we broke more defenses, we submitted to ICML and it was rejected. The reason for the rejection — I don't have the reviewer's quote in front of me — was that the paper introduces no new ideas; all we do is apply existing techniques to break existing papers, and as a result it doesn't matter that we broke 13 published defenses, it's no longer a publication. In the paper with Anish we did do the thing you do in papers where you make your ideas sound more important than they are — something I don't like about research, but something everyone does: you frame your approach as more generalizable than it really is. I don't think we should have done that, but we did, as all researchers do, and it made the paper sound more important. In this later paper we tried to be very explicit; we said: we have no new ideas, we're just going to use what's already known and show that it's sufficient to break everything. And so the reviewers agreed with us — you didn't do anything new, so we're rejecting you. But that's exactly the problem now: everyone agrees you don't need to do anything new, and yet the broken papers still get accepted, even though there's nothing new required to evaluate them — you just need to do it correctly.
Can I put that a slightly different way? We spoke to Kenneth Stanley recently, and he's a big advocate of open-endedness. The TL;DR is that there's a tyranny of objectives in our lives, and the peer review system is a kind of groupthink; it creates this system when science should be about exploration, about discovering new knowledge we don't already have. Because of this groupthink, this convergent behaviour, and the need to get papers published — getting papers published is like the extended interview for getting into Google these days — don't you think that's antithetical to researchers following their own gradient of interestingness? It should be a divergent search, where some of those threads lead somewhere interesting, but people can never follow their own gradient of interest because they'd be ostracized by the system immediately.

I just want to interject that I have a strong suspicion my defense is one of the defenses you broke, and if it's in that paper I'm going to have to send that reviewer a cake or something, just to thank them.

Sorry — you wrote one of the papers that we broke?

I think so; Florian sent me an email a while ago. It's a legitimate attack, though, in fairness — an attack I didn't know about when we wrote the paper. So I'm definitely guilty of doing what you described, Florian, of not going back. But then, maybe to add to Tim's question: what is the incentive for people to do rigorous analyses? You've now published all these papers saying "here is what you should actually do" — and, in the olden days, which in a bit of self-defense is when we wrote our paper, it wasn't entirely clear what you should do — but still: what is the incentive? Because, as you say, the broken papers get published, and in machine learning research, publishing in this really noisy system is everything. So what is the incentive, and how could we change the landscape to get to a place, as Tim said, where new knowledge is actually the goal?

I think in the grand scheme of things the incentives might already be somewhat in place, in that the few defenses that do work are among the most cited papers on adversarial examples — or even within some sub-areas of machine learning — over the past few years. So if you have an idea that actually works and you demonstrate it well, it will have a huge impact. Of course, a whole bunch of ideas presented along the way didn't work and were still impactful, and in a way the fact that these defenses got broken doesn't necessarily diminish the interesting knowledge in some of those papers — sometimes it just took two papers to get all the knowledge out, as Nicholas said. Ultimately, the one bad incentive we have in machine learning, and in science in general, is the reluctance to ever publish negative results. This actually happened two or three years ago: I was talking to people who had published a very interesting defense paper — some new idea that looked very fun — and it turned out it didn't work.
But breaking it also took real work — it was something very new and different — and that paper has simply never been published anywhere. I don't remember whether the authors tried, but I tried to convince them at some point to write it up as "here's a cool idea; it turns out it doesn't work". I think that paper would never have been accepted anywhere, even though it would have been one of the more interesting papers on adversarial examples.

And the problem is, I know that paper, and there was another paper that got accepted which does what they did, but worse — without the correct attack — and that one was accepted. If the original paper had been published with its negative result, then hopefully the second set of authors would have seen it and known not to do their thing that way. And this is really it: we should be optimizing for knowledge. The current problem is that everyone wants to know how good your defense is — what the accuracy number is, whether you get 60 percent robustness against adversarial examples — but that shouldn't be what we care about. We should be trying to write papers that teach us new things, independent of whether they give you 60 percent accuracy. We want correct new ideas, and that's much more important than a paper that gives you another couple of percentage points of accuracy.

Wieland, I think you wanted to add something.

Yes — in the general sense, if we're talking about evaluating robustness on ImageNet or CIFAR, which is what a lot of people are looking at, I think the incentive system of research is working against us for a simple reason: in adversarial robustness you need to solve an optimization problem in order to evaluate robustness, and that is inherently something you can fool yourself on. That's a bit different from most other fields in machine learning, where — say you look at ImageNet performance, or performance on out-of-distribution datasets — you can't really do that much wrong: you train, you push your data through, there's a clear benchmark, and then you evaluate. Those parts work reasonably well. But in adversarial robustness there's a huge problem, because the incentives of research are working against us. In the end it's also partly a matter of community spirit. The only thing that will really help us — and this is what Nicholas is saying — is to accept, or rather to focus on, problems where we can actually know whether we've made progress, whether we're fooling ourselves or not. That's the big issue: the metrics we usually look at, like 60 percent robust accuracy on CIFAR, are not something we can really evaluate correctly — or at least it's very, very difficult to do so — whereas other settings, maybe simpler datasets, or certified robustness, where you can actually prove that you've improved things, are the scenarios we should probably be looking at. But right now the community doesn't really accept those ideas very much.
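One way to see why it's so easy to fool yourself here: the number a defense paper reports is only an estimate of robust accuracy, which is defined through an inner maximization. Writing $f$ for the classifier, $L$ for some surrogate loss, and taking an $\ell_\infty$ ball of radius $\epsilon$ as the (illustrative) threat model:

```latex
\text{RobAcc}(f) \;=\; \frac{1}{n}\sum_{i=1}^{n}
  \mathbb{1}\!\left[\, f(x_i+\delta)=y_i \ \ \forall\,\delta : \|\delta\|_\infty \le \epsilon \,\right],
\qquad
\hat{\delta}_i \;\approx\; \operatorname*{arg\,max}_{\|\delta\|_\infty \le \epsilon}
  L\!\big(f(x_i+\delta),\, y_i\big)
```

An attack only ever exhibits particular perturbations $\hat{\delta}_i$, so the accuracy that survives it is merely an upper bound on $\text{RobAcc}(f)$; if the inner maximization is solved badly, the reported number can be arbitrarily optimistic. A certified defense works from the other side and gives a lower bound, which is why it is much harder to fool yourself with one.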
I remember when we published our defense using generative models: it only worked on MNIST, but we still found it to be a cool idea that maybe other people could improve on, even though it only worked on MNIST. We really had a hard time in the review process, though, because people said "oh, this is MNIST, I don't care — do it on natural datasets or we won't accept it." Why? My opinion was: if you cannot solve MNIST in terms of robustness, maybe that's exactly where we should start — or even with something simpler than that — and really try to understand whether we can solve it completely before moving on to more complex problems.

I completely agree, and it might have been a stepping stone that led somewhere interesting, but because of that myopia it never gets past the filter. I want to start rounding up, because I know we're at time, but as a final question, some actionable feedback: many of our listeners work in industry, building machine learning pipelines. They might even be using some of the vision classifiers you can get on Azure, AWS, GCP and so on — and to be honest, you're a bit stuck if you want to do any adversarial defense there; you pretty much have to take what you're given. So what kind of actionable advice could you give to the listeners?

I would say: if you're deploying, say, a computer vision model today in a setting where you can't trust the inputs you're getting — good luck. Again, there are probably only a few settings of interest where machine learning is already good enough that you're actually going to deploy it and where you really, really care about it being robust against attack. So for many situations, adversarial examples specifically are probably not what you should be worrying about primarily, and other out-of-distribution robustness benchmarks are probably more interesting to look at.

Just on that, there's a weird interplay between adversarial robustness and adversarial examples. You might be building a prediction pipeline — you could be following the testing playbook from Sameer Singh, with all of this testing in your pipeline doing counterfactual examples, data grouping, and all sorts of things like that. With robustness you actually want the model to generalize as much as possible, but with adversarial examples you kind of don't want it to generalize. Is that antithetical in some way?

I'm not sure what you mean by not wanting it to generalize for adversarial examples.

With adversarial examples, the model is learning features you don't want it to learn — so in a sense it's very robustly learning something you don't want it to learn. You don't want that robustness, but you do want robustness on features that actually generalize to the things you're trying to classify.

Yes — and that's what we see with the few defenses we have. I think I mentioned two: there are basically two classes of defenses — adversarial training, which gives you some empirical guarantees, and certified defenses, which give you provable guarantees, but again only for some very specific types of perturbation, say small in L2 norm. And if you try to defend against those perturbations, you're going to hurt your model's accuracy on clean data.
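For reference, this is the shape of guarantee a certified defense provides. For randomized smoothing with Gaussian noise of standard deviation $\sigma$, for instance, the standard certificate (Cohen et al., 2019) says the smoothed classifier's prediction cannot change within an $\ell_2$ radius

```latex
R \;=\; \frac{\sigma}{2}\left(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\right)
```

where $p_A$ and $p_B$ are (bounds on) the probabilities of the top and runner-up classes under the noise, and $\Phi^{-1}$ is the inverse standard normal CDF. A larger $\sigma$ buys a larger certifiable radius but degrades clean accuracy, which is exactly the robustness-versus-accuracy trade-off being described.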
So in pretty much any situation of interest today that I can think of, you're probably better off just deploying your model with the best possible average-case accuracy, because it's not as if you're going to be able to do much for its worst-case accuracy if you're actually facing a motivated adversary — you might as well do as well as you can on the rest of the data.

This is the common thing in security. If a nation state decided to attack me and tried to get into my computer, they would succeed, so I'm not going to try to defend against that. If I lived my life imagining that a nation state were out to get me, I'd be living in the middle of the forest. I'd much rather just presume that isn't going to happen, and if they decide to do it for some reason, I'll have to live with that fact. Similarly, for most applications of machine learning — ad blocking, malware detection, any of these things — make the benign accuracy as good as possible, because that's the case almost everyone lives in, and if someone really wanted to fool you with an adversarial example, there's nothing you can do. There's a wonderful slide by Dave Evans, a professor in computer security, about guarantees. In cryptography we talk about guarantees on the order of 2^-128: if an adversary can succeed with probability 2^-128, everything is fine, and if that goes up to 2^-127, the cryptosystem is considered broken and you rebuild it from scratch. In systems security it's maybe 2^-32 — guessing a stack canary or something. The best defenses we have in machine learning are 2^-1: the adversary succeeds half the time, in the best case. So why would you sacrifice ten percentage points of clean accuracy if the best possible outcome is that the adversary has to generate five adversarial examples before one of them works? Things are very bad here. For the most part, right now, you just have to accept the fact that if someone wanted to fool your classifier, they would be able to, and account for that — and if anyone fooling your classifier means someone would die, then just don't deploy your classifier.
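To put rough numbers on that comparison: if each attack attempt succeeds independently with probability $p$, the expected number of attempts an adversary needs is $1/p$:

```latex
p = 2^{-128}\!: \ \sim 3.4\times10^{38}\ \text{attempts} \qquad
p = 2^{-32}\!: \ \sim 4.3\times10^{9}\ \text{attempts} \qquad
p = 2^{-1}\!: \ 2\ \text{attempts}
```

which is the force of the argument: a guarantee at the $2^{-1}$ level barely inconveniences an attacker, so trading away clean accuracy to get it is a hard sell.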
I think that's a useful way of thinking about it. In security, people talk about the likelihood and the consequence of something happening, and you can multiply them together to gauge how much effort to spend. Another thing: neural networks do weird things, don't they — the pruning question, for instance. On the long tail, neural networks are just learning all of the difficult-to-classify instances by memorizing them. Is pruning a good thing or a bad thing here? If the model is memorizing examples rather than learning regular patterns, does that mean there's less chance of adversarial examples, or more?

Again, if you view it from a threat-analysis perspective: if someone wants to evade your classifier, they will. It might take them very little work, or a tiny bit more than a little — that's going to depend on the application — but there's very little we can do about it. And that doesn't mean people don't deploy machine learning in adversarial settings. Any company out there doing, say, malware detection is going to be using machine learning somewhere in its pipeline, but they'll pull in a whole bunch of obscure features, some of which the adversary probably can't even control, and their entire pipeline is going to be completely obscured. Even though this is something we don't like in computer security, there is merit to security through obscurity here: no one knows how, say, Google classifies Android malware, and presumably, if everyone had public knowledge of that, it would be much, much easier to create malware.

Could I slightly reframe the question? Machine learning is like compression: if you have very regular data, you can describe it with very few examples, and you get representations that generalize quite well. So in a way, the regularity of your training data is kind of your attack surface. Is that fair? Are there things you can do to change your attack surface when you're deploying machine learning models?

Yes, in the sense that if you have some influence over what kind of data you're going to be seeing, there might be some things you can do — although usually, when you have that kind of control over what the input data looks like, you might not even want to be using machine learning in the first place. The general case where you want to use machine learning is precisely when you don't have a good sense of what the distribution or the semantics of your inputs are — that's why you want to learn them in the first place — and that usually means there are going to be blind spots in what you learn, and there's going to be potential for attacks that exploit those mistakes.

Well, that's a very optimistic note to be ending on. Amazing. Gentlemen, it's been an absolute honour to have you on the show. Thank you so much for joining us this evening — honestly, I really appreciate it.

Thank you so much.

Remember to like, comment and subscribe — we love reading your comments — and we'll see you back next week.
Info
Channel: Machine Learning Street Talk
Views: 6,666
Id: 2PenK06tvE4
Length: 96min 15sec (5775 seconds)
Published: Sun Jan 31 2021