This Image Breaks AI

Captions
I'm at the University of Western Australia. That is one of the world's most advanced self-driving vehicles, and this pattern should confuse it into thinking that I'm a banana and running me over. Please, don't try this at home.

Before I recoup my university fees by committing insurance fraud, we first need to try to understand what's going on inside our autonomous bus's brain. We've been trying to get autonomous vehicles working for the better part of 80 years. In the early days we had some really exciting successes by embedding magnetised metal road spikes into the roads, which our vehicle could navigate along. While this was surprisingly successful, especially given the rudimentary level of our computing technology at the time, it's not a particularly scalable technology. Australia alone has about 800 thousand kilometres of roadways, only 40% of which are actually paved. While it might work to get you from A to B, if you want to get to any other location it's really not going to work out. For that we need something that is infinitely more scalable.

Maybe GPS could work. While it's great at letting you know where you are on the planet, it's not particularly precise: you know your car is on the road, but you don't know which bit of the road it's on. It might crash into a bus coming down the other lane.

We already use sight-based navigation to get around the roads today. We use it to stay within a lane, to stop ourselves from crashing into the bus ahead, and, importantly, for knowing what hazards are up ahead by looking at street signs. On top of that, cameras are getting pretty cheap. Because of this, most of the serious attempts at getting self-driving vehicles commercially available use visible light and high-definition cameras. However, this means that we really need to understand image recognition.

Given how easy it is for us to spot and identify objects, we often forget that this is actually quite an impressive feat. Even within the animal kingdom we have especially well-developed vision. There is a compelling theory that our unusually complex brains came about in order to facilitate vision processing. Individuals who could tell the subtle differences between the shape and hue of safe and unsafe berries, or could spot a lion hiding in wait in the long grasses, had a competitive advantage and were able to pass these successful vision genes on to future generations. According to this theory, science, literature, art and engineering, while all being very useful, are actually just unexpected byproducts of these complex brains developed for vision processing. When we try to replicate how we humans identify objects within a computer, especially a computer with limited processing power, we run into a few problems.

To see why that's the case, let's take the basic example of street signs. Fortunately for us, they are nice, regular, standardised shapes with high contrast and bold patterns. They're also pretty universal, so once you've programmed the street signs for one country it should mostly carry over to the next. This standardisation is a result of the 1924 National Conference on Street and Highway Safety, which laid out the ground rules for traffic signs that still influence the streetscapes around us today.
Apparently, the committee was really afraid of pointy objects, because when they elected to use a shape-based system of identifying hazards they reserved shapes with more corners to denote more dangerous things. Pedestrian warnings: not that dangerous, let's make it a triangle with three points. Speed limit: a bit more dangerous, let's go with four. Stop sign: oh boy, you get eight sides! The committee thought that train level crossings were the most dangerous of all, so those signs didn't get ten, twelve, or even fifty corners. They got infinity corners, or what we commonly refer to as a circle.

I'm in Australia, so some of the most important street signs for us to recognise are those that say 'Warning! Kangaroos, cows, bobtails and snakes crossing!'. Which of those animals do you think is most dangerous to hit going at 110 kilometres per hour? Regardless of which one you really don't want to hit, ultimately we need to be sure that our car is able to recognise the street sign while it's still quite far away and respond accordingly.

How are we going to get a computer to recognise our sign? Here is one approach which may be an option. OK. It's a yellow square with a nice thick black boundary, rotated 45 degrees, with a silhouette of a kangaroo in the middle. Let's start out with that boundary: four lines meeting at 90-degree angles. But what's a line? Well, it's straight? Fantastic! But that doesn't really correspond to computer code, so we're going to need to come up with some sort of algorithm.

Let's look at the pixels and just consider the ones that are dark. Pick one and consider the pixels which surround it. Are any of those also dark? Nope, OK, let's find a different one. This time, are any of those dark? Yes, cool. OK, now have a look at its adjacent tiles. Are any of those also dark? Great. Now look around its adjacent tiles. And that one's adjacent tiles. And another. And another. Now plot the coordinates together. Is the gradient roughly consistent? Perfect: that means we've found ourselves a line. Repeat this a bunch more times, and if the plotted lines match up in a roughly diamond shape then we've found ourselves a street sign.

Now that we've identified our square, we need to check whether it's a warning sign or just a random box on the side of the road. We do this by taking the average RGB value from inside our rectangle and classifying it as either yellow or not yellow. Now we need to work out what the caution sign is warning us against. If it's kangaroos, then our response should be to slow down and be a bit more cautious. If, on the other hand, it's warning us about drop bears, then the only appropriate response is to wind up the window and get out of there as fast as possible.

For this we need an even more advanced line-plotter algorithm, and maybe a few matrix transformation routines to take into account whether you're looking at our kangaroo from an angle. But not all kangaroo drawings look the same, so we need a database of potential designs. What if there are branches in the way? Well, that would mess up our shape, so we need to accept a range of kangaroo-like objects and maybe include a function that removes branches from the image.

Now do that for every single street sign, zebra crossing, set of traffic lights, and road marking that you can think of. And a few which you might have forgotten.
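To make that hand-coded approach concrete, here is a rough Python sketch of the line-finding and colour-check steps described above. It assumes the image has already been read into dictionaries mapping coordinates to brightness and RGB values, and every threshold and helper name is invented for illustration rather than taken from a real system.

# A toy version of the hand-coded idea above: trace chains of adjacent dark
# pixels, keep the chains whose gradient is roughly consistent (our "lines"),
# and test the average colour inside a candidate box against yellow.
# Thresholds and data layout are purely illustrative.

def trace_dark_chain(pixels, start, visited, dark_threshold=60):
    """Follow adjacent dark pixels outwards from a starting coordinate."""
    chain = [start]
    visited.add(start)
    frontier = [start]
    while frontier:
        x, y = frontier.pop()
        for neighbour in [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]:
            if (neighbour in pixels and neighbour not in visited
                    and pixels[neighbour] < dark_threshold):
                visited.add(neighbour)
                chain.append(neighbour)
                frontier.append(neighbour)
    return chain

def is_roughly_straight(chain, tolerance=2.0):
    """Crude check that the chain's points all sit near one straight line."""
    if len(chain) < 3:
        return False
    (x0, y0), (x1, y1) = min(chain), max(chain)   # leftmost and rightmost points
    if x1 == x0:                                  # vertical line: x barely varies
        return all(abs(x - x0) <= 1 for x, _ in chain)
    slope = (y1 - y0) / (x1 - x0)
    return all(abs((y - y0) - slope * (x - x0)) <= tolerance for x, y in chain)

def find_lines(pixels, dark_threshold=60):
    """pixels maps (x, y) -> brightness; return the dark chains that look like lines."""
    visited, lines = set(), []
    for point, brightness in pixels.items():
        if brightness < dark_threshold and point not in visited:
            chain = trace_dark_chain(pixels, point, visited, dark_threshold)
            if is_roughly_straight(chain):
                lines.append(chain)
    return lines

def is_warning_yellow(rgb_pixels, box):
    """Average the RGB values inside box = (x0, y0, x1, y1) and ask: yellow-ish?"""
    samples = [rgb_pixels[(x, y)]
               for x in range(box[0], box[2])
               for y in range(box[1], box[3])
               if (x, y) in rgb_pixels]
    r = sum(p[0] for p in samples) / len(samples)
    g = sum(p[1] for p in samples) / len(samples)
    b = sum(p[2] for p in samples) / len(samples)
    return r > 150 and g > 120 and b < 100        # crude "is it yellow" rule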
You can start to appreciate how this quickly becomes a very complex coding nightmare.

Rather than getting an underpaid, overworked, sleep-deprived, caffeine-addicted undergraduate student to manually program every single object that your car might come across, another option is to get the computer to do that for you. This is what's broadly defined as machine learning. One approach is what's called a neural net, similar to the neurons within a human brain. Rather than taking a set of instructions and just going through it one by one, we instead have a complex web of relations between inputs, analysis, and outputs. Here is how you might apply it to a 16-pixel image.

We start out with 16 neurons, each of which looks at the brightness value of a single pixel. They then send this value to a set of 20 nodes, which each compare these signals and give their own output based on some rule. For example, in this node it might be to find the average brightness between these four adjacent pixels, ignoring all of the other inputs. The results of all 20 of these comparisons are then sent through to the next layer of the network, which does another round of comparisons. Subsequent layers do more and more analysis, eventually leading to the final output layer, which by this point 'knows' that the input image was of the letter A and gives 'A' as the output result. Putting a different image into the exact same neural net would give the letter 'B'. While this may look like magic, this complex behaviour derives entirely from simple relations. Unlike with normal programming, no one is manually going through the net and changing how the nodes do their weighting. Instead, a special training program is doing that for us.

Our neural net training regime is basically pattern recognition turned up to 11. We start out with a few hundred randomly assigned networks, each with slightly different node weightings. Obviously, when we chuck in some sample images, the results we get are going to be pretty much gibberish. However, a few of them are going to be slightly less gibberish than the others. These are the most fit of their generation and, in a process similar to evolution, are allowed to combine with each other, giving us the next generation. We also include a little bit of random mutation, which means that the next generation can be even better than their parents… or alternatively complete garbage. We take the ones that are really good and combine those together again, giving us the next generation. We combine the best of those again, giving us another generation, and another, and another. Eventually, after maybe hundreds or even thousands of generations and a huge amount of processing power, we get a network that works pretty much all the time with pretty much all of the samples.

What happens during this training is that the net is slowly identifying key characteristics which correspond to each of the objects being analysed. If you or I were describing a bus, we'd probably say something about it being bright yellow, with lots of windows and big wing mirrors. The difference is that the net will occasionally identify characteristics which, while true in the training data, don't necessarily correspond to the real world.
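Here is a toy Python/NumPy sketch of both pieces of this idea: a tiny net of the shape described above (16 pixel inputs, 20 intermediate nodes, two outputs) and an evolutionary loop that keeps the fittest networks, combines them, and mutates them. The sample images, fitness rule, population size and mutation rate are all made up for illustration, not taken from any real self-driving system.

# A miniature neural net plus "combine the fittest and mutate" training loop.
import numpy as np

rng = np.random.default_rng(0)

def make_net():
    """Random weights: 16 pixel inputs -> 20 hidden nodes -> 2 outputs ('A'/'B')."""
    return {"w1": rng.normal(size=(16, 20)), "w2": rng.normal(size=(20, 2))}

def forward(net, pixels):
    """Each layer weights its inputs and passes the result on to the next layer."""
    hidden = np.tanh(pixels @ net["w1"])
    return np.tanh(hidden @ net["w2"])

def fitness(net, images, labels):
    """Fraction of the sample images the net classifies correctly."""
    predictions = [int(np.argmax(forward(net, img))) for img in images]
    return float(np.mean([p == l for p, l in zip(predictions, labels)]))

def combine(parent_a, parent_b, mutation=0.1):
    """Mix two parents' weights and add a little random mutation."""
    child = {}
    for key in parent_a:
        mask = rng.random(parent_a[key].shape) < 0.5
        child[key] = np.where(mask, parent_a[key], parent_b[key])
        child[key] = child[key] + mutation * rng.normal(size=child[key].shape)
    return child

# Toy training data: 4x4 images flattened to 16 brightness values,
# labelled by a made-up rule (top half brighter than bottom half).
images = [rng.random(16) for _ in range(40)]
labels = [int(img[:8].mean() > img[8:].mean()) for img in images]

population = [make_net() for _ in range(200)]
for generation in range(100):
    ranked = sorted(population, key=lambda n: fitness(n, images, labels), reverse=True)
    parents = ranked[:20]                          # keep the fittest of this generation
    children = [combine(parents[rng.integers(len(parents))],
                        parents[rng.integers(len(parents))])
                for _ in range(180)]
    population = parents + children                # the next generation

best = max(population, key=lambda n: fitness(n, images, labels))
print("best accuracy on the toy samples:", fitness(best, images, labels))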
Here is one particularly dangerous attack, created by an American research team. The team trained up their own neural net using some publicly available street sign data. After a couple of weeks of training they could reliably identify stop signs, such as this one. Next, they attached some white and black bars. You and I can still tell that it's a stop sign, but the team's neural net could not. Instead, it was certain that it was a speed-limit 45. If the neural net had been in control of an autonomous vehicle, it would have driven straight through the intersection.

Instead of recognising the shape, or the letters S, T, O and P that you and I might recognise in a stop sign, I think the neural net has latched onto something different. I think that because the number 4 always has a white patch in the middle and the number 5 some sort of line at the bottom, it's seen these characteristics, knowing that it only ever sees them in the context of a speed-limit 45 sign, and just run with that. This is always going to be an issue when we have limited processing power and storage. If we have a complex way of identifying a sign and a simple way, and both are equally successful with our training set, then the evolutionary pressures of this algorithm will prefer the simple option.

So what makes my T-shirt so dangerous? Here is a demonstration put together by the folks at Google, working on the Google Vision AI. Here they are identifying the subject of an image, and yep, they've correctly identified that it's a banana. And now they're adding in a picture of a toaster. But no simple toaster picture is going to make Google confuse what the subject of the image is. No sir, they are much smarter than that. Oh dear! Pretty much 100% certain that it's a toaster. And this thing doesn't even look like a toaster. What's going on?

Here is what the classifier algorithm thinks is the pure essence of toaster. When the neural net sees it, it ignores everything else and just looks for the toaster. This is what the Google team dubbed adversarial patches, and they're surprisingly simple to create. Here is the one that I have on my t-shirt. It's what the algorithm thinks is a banana. As long as it can get a clear shot of the patch, then it should ignore me entirely, and potentially run me over. All it thinks it's doing is running over a banana. Let's head to the real world to test it out for real.

So what's going to happen with the UWA bus? I have the right t-shirt on, it has the wrong sensors, and in a few metres I'm going to become a banana-flavoured pancake. Except… it stops. That's because although we may have been able to confuse the neural net, ultimately there was a backup system: there were some infrared proximity sensors which stopped the bus when I came too close.

No autonomous system should rely on just one level of redundancy. In addition to our proximity sensors, we could also cross-reference the government databases which store the location and type of all the country's street signs. We could then compare this with what we're observing on the streets around us. On top of that, Google appears to be improving their vision AI.

Here is a Python program I've written using the Google Vision AI. We take a photograph, and then the Google AI identifies the most likely candidates for the subject of that image.
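As a sketch of what such a program can look like, here is a minimal label-detection request using the google-cloud-vision Python client library; it is not necessarily the exact script from the video, it assumes Google Cloud credentials are already configured in the environment, and the image file names are placeholders rather than the actual test photos.

# Send a few photographs to the Google Vision API and print its top guesses.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

def describe(path):
    """Ask the Vision API for the most likely labels for one photograph."""
    with open(path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.label_detection(image=image)
    print(path)
    for label in response.label_annotations[:5]:
        print(f"  {label.description}: {label.score:.0%}")

# Placeholder file names: the originals plus a version with the patch added.
for photo in ["banana.jpg", "book.jpg", "tools.jpg", "banana_with_patch.jpg"]:
    describe(photo)

Each call returns a ranked list of labels with confidence scores, which is what lets us compare the patched and unpatched photographs.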
Here we can see that we're correctly identifying a banana. Here we have some sort of book. And finally we have a collection of tools. Now, we expected this to work: remember that the Google AI was trained for just this sort of use case. But what I want to know is what happens when we add in these adversarial patches designed specifically to trick the Google AI.

Here we see that the banana is still a banana, the book is still a book, and this collection of tools is still just a collection of tools. Either Google were lying to us when they said that they had been able to trick their own AI… or, more likely, the version of the AI that I'm using has been improved since the 2017 edition discussed in the paper.

Like a game of whack-a-mole, what I think must be happening is that whenever research comes out showing the AI can be fooled in some way, the engineers are, behind the scenes, quietly patching the issue. While this may work fine in the short term, I'd rather not trust my life to an AI which may or may not stop at the next stop sign before Google manages to catch and then patch the issue.

In order to get a truly safe autonomous vehicle, not only does it need to reliably identify objects, but it also needs to understand how these fit within the broader road context. As our vehicles become smarter and smarter, where do we draw the line between the human passenger and the vehicle becoming more like us?

This has been James Dingley from the Atomic Frontier. Keep Looking Up.
Info
Channel: Atomic Frontier
Views: 1,604,695
Keywords: AI, self driving car, tesla, tesla accident, stop sign, adversarial bananas, adversatial patch, google vision, google vision api, ai, stop sign fooled, two minute papers, deep learning, jordan harrod, tom scott, smarter every day, atomic frontier, perth science, adversarial perturbations fool deepfake detectors, adversarial pertubations, artificial intelligence, superintelligent ai, machine learning
Id: p6CfR3Wpz7Y
Length: 13min 53sec (833 seconds)
Published: Thu Apr 08 2021