I’m at the University of Western Australia. That
is one of the world's most advanced self-driving vehicles, and this pattern should confuse it enough that it thinks I'm a banana and runs me over. Please, don't try this at home. Before I recoup my university fees by committing insurance fraud, we first need to try to understand what's
going on inside our autonomous bus’ brain. We’ve been trying to get autonomous vehicles
working for the better part of 80 years. In the early days we had some really exciting successes
by embedding magnetised metal spikes into the road, which a vehicle could then navigate along. While this was surprisingly successful, especially given the rudimentary level of our
computing technology at the time, it's not a particularly scalable technology. Australia alone
has about 800 thousand kilometres of roadways, only 40% of which are actually paved. While
it might work to get you from A to B, if you want to get to any other location it's really not going
to work out. For that we need something that is infinitely more scalable. Maybe GPS could work.
While it's great at letting you know roughly where you are on the planet, it's not particularly precise:
you know your car is on the road but you don’t know which bit of the road it's on. It might
crash into a bus going through the other lane. We already use sight-based navigation to get
around the roads today. We use it to stay within a lane, to stop ourselves from crashing
into the bus ahead, and importantly for knowing what hazards lie ahead by looking at street signs.
On top of that, cameras are getting pretty cheap. Because of this most of the serious
attempts at getting self-driving vehicles commercially available use visible
light and high definition cameras. However, this means that we really
need to understand image recognition. Given how easy it is for us to spot and identify
objects we often forget that this is actually quite an impressive feat. Even within the animal
kingdom we have especially well-developed vision. There is a compelling theory that our unusually
complex brains came about in order to facilitate vision processing. Individuals who could tell
the subtle differences between the shape and hue of safe and unsafe berries, or could spot
a lion hiding in wait in the long grasses, had a competitive advantage and were able to
pass these successful vision genes on to future generations. We think, according to this theory,
that science, literature, art and engineering, while all being very useful, are actually just
unexpected byproducts of these complex brains developed for vision processing. When we try to
replicate how we humans identify objects within a computer, especially a computer with limited
processing power, we run into a few problems. To see why that's the case, let's take
the basic example of street signs. Fortunately for us, they are nice, regular,
standardised shapes with high contrast and bold patterns. They’re also pretty universal so once
you’ve programmed the street signs for one country it should mostly carry over to the next.
This standardisation is a result of the 1924 National Conference on Street and Highway
Safety which laid out the ground rules for traffic signs which still influence the streetscapes
around us today. Apparently, the committee was really afraid of pointy objects because when
they elected to use a shape-based system of identifying hazards they reserved shapes with
more corners to denote more dangerous things. Pedestrian warnings – not that dangerous, let’s
make it a triangle with three points. Speed limit, a bit more dangerous, let's go four. Stop sign,
oh boy, you get eight sides! The committee thought that train level crossings were the most
dangerous of all, so those signs didn't get ten, twelve, or even 50 corners – they got infinity
corners, or what we commonly refer to as a circle. I’m in Australia so some of the most
important street signs for us to recognise are those that say ‘Warning! Kangaroos,
cows, bobtails and snakes crossing!’. Which of those animals do you think is most
dangerous to hit going at 110 kilometres per hour? Regardless of which one you really don’t want to
hit, ultimately we need to be sure that our car is able to recognise the street sign when
it's quite far away and respond accordingly. How are we going to get a
computer to recognise our sign? Here is one approach which may be an option.
Ok. It’s a yellow square with a nice thick black boundary, rotated 45 degrees and
a kangaroo silhouette in the middle. Let's start out with that boundary: four lines
meeting at 90 degree angles. But what’s a line? Well, it's straight? Fantastic! But that
doesn’t really correspond to computer code so we’re going to need to come
up with some sort of algorithm. Let's look at the pixels and just consider the
ones that are dark. Pick one and now consider the pixels which surround it. Are any of those
also dark? Nope, ok let's find a different one. This time are any of those dark? Yes, cool,
ok now have a look at its adjacent pixels: are any of those also dark? Great. Now look around its adjacent pixels. And that one's adjacent pixels. And another. And another. Now plot the coordinates
together. Is the gradient roughly consistent? Perfect - that means we’ve found ourselves a
line. Repeat this a bunch more times and if the plotted lines match up in a roughly diamond
shape then we’ve found ourselves a street sign. Now that we’ve identified our square, we need to
check if it's a warning sign or just a random box on the side of the road. We do this by taking the average RGB value from inside our rectangle and classifying it as either yellow or not yellow: yellow means warning sign, not yellow means random box. Now we need to work
out what the caution sign is warning us against. If it’s kangaroos then our response should be to
slow down and be a bit more cautious. If, on the other hand, it's warning us about drop bears then
the only appropriate response is to wind up the window and get out of there as fast as possible.
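To make that concrete, here is a rough Python sketch of the naive pipeline described above: chain together adjacent dark pixels, check that the chain's gradient is roughly consistent, and then average the colour inside the candidate sign to decide yellow or not. Nothing here comes from a real self-driving stack; the thresholds, helper names and the crude rule of thumb for "yellow" are all made up for illustration.

```python
import numpy as np

def find_dark_chains(grey_image, dark_threshold=60):
    """Chain together adjacent dark pixels, as described above.
    grey_image is a 2-D array of brightness values (0 = black, 255 = white)."""
    dark = grey_image < dark_threshold
    visited = np.zeros_like(dark, dtype=bool)
    height, width = grey_image.shape
    chains = []
    for y in range(height):
        for x in range(width):
            if not dark[y, x] or visited[y, x]:
                continue
            chain, stack = [], [(y, x)]
            while stack:                      # walk outwards over adjacent dark pixels
                cy, cx = stack.pop()
                if visited[cy, cx]:
                    continue
                visited[cy, cx] = True
                chain.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if 0 <= ny < height and 0 <= nx < width and dark[ny, nx]:
                            stack.append((ny, nx))
            if len(chain) > 5:                # ignore tiny specks of noise
                chains.append(chain)
    return chains

def is_roughly_straight(chain, tolerance=0.95):
    """'Is the gradient roughly consistent?' Fit a line through the chained
    coordinates and check how well it explains them."""
    ys, xs = zip(*chain)
    if np.std(xs) < 1e-6 or np.std(ys) < 1e-6:
        return True                           # perfectly vertical or horizontal
    return abs(np.corrcoef(xs, ys)[0, 1]) > tolerance

def looks_yellow(rgb_patch):
    """Average the RGB values inside the candidate sign: high red and green
    with low blue is our crude stand-in for warning-sign yellow."""
    r, g, b = rgb_patch.reshape(-1, 3).mean(axis=0)
    return r > 150 and g > 120 and b < 100
```

Checking whether four such lines actually meet in a diamond, and doing all of this fast enough for a moving vehicle, is left as an exercise. That only gets us as far as "yellow diamond on the roadside"; working out which animal is drawn on it is harder.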
For this we need an even more advanced line-plotter algorithm, and maybe
include a few matrix transformation routines to account for viewing our kangaroo from an angle. But not all kangaroo drawings look the same
- so we need a database of potential designs. What if there are branches in the way? Well, that
would mess up our shape so we need to accept a range of kangaroo-like objects and maybe include
a function that removes branches from the image. Now do that for every single street sign,
zebra crossing, set of traffic lights, and road marking that you can think of. And
a few which you might have forgotten. You can start to appreciate how this quickly
becomes a very complex coding nightmare. Rather than getting an underpaid, overworked,
sleep-deprived, caffeine-addicted undergraduate student to manually program every
single object that your car might come across, another option is to get the computer to do that
for you. This is what's broadly defined as machine learning. One approach is what is called a neural net, loosely modelled on the neurons within a human brain. Rather than taking a set of instructions
and just going through it one-by-one, instead we have a complex web
of relations between inputs, analysis, and outputs. Here is how you
might apply it to a 16-pixel image. We start out with 16 neurons each of which look at
the specific brightness value of a single pixel. They then send this value to a set
of 20 nodes, each of which combines these signals and gives its own output based on some rule. For example, in this node the rule might be to find the average brightness of these four adjacent pixels, ignoring all of the other inputs. The
results for all 20 of these comparisons are then sent through to the next layer of the
network which does another round of comparisons. Subsequent layers do more and more analysis,
eventually leading to the final output layer, which by this point ‘knows’ that the input image
was of the letter A and gives 'A' as the output result. Putting a different image into the exact same neural net would give the letter 'B'.
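A toy version of that forward pass, with the layer sizes from the description above (16 pixels, 20 hidden nodes) and a made-up output layer of 26 letters, might look something like this in Python. The weights here are random, so until the network is trained its answers are meaningless.

```python
import numpy as np

rng = np.random.default_rng(0)

# 16 input "neurons" (one per pixel), 20 hidden nodes, 26 outputs (A-Z).
# Real networks are trained; these random weights just show the mechanics.
w_hidden = rng.normal(size=(16, 20))   # each hidden node weighs all 16 pixels
w_output = rng.normal(size=(20, 26))   # each letter weighs all 20 hidden nodes

def classify(pixels):
    """Push a flattened 4x4 image of brightness values (0..1) through the net."""
    hidden = np.tanh(pixels @ w_hidden)    # each node applies its "rule"
    scores = hidden @ w_output             # one score per letter
    return "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[int(np.argmax(scores))]

print(classify(rng.random(16)))   # prints some letter; which one depends on the random weights
```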
While this may look like magic, this complex behaviour derives entirely from simple relations. Unlike with normal programming, no one goes through the net by hand, changing how the nodes do their weighting. Instead, a special training program does that for us. Our neural net training regime is basically pattern recognition turned up to 11. We start out with a few hundred randomly assigned networks,
each with slightly different node weightings. Obviously when we chuck in some sample images the
results we’re going to get are going to be pretty much gibberish. However, a few of them are going
to be slightly less gibberish than the other ones. These are the most fit of their generation
and, in a process similar to evolution, are allowed to combine with each other - giving
us the next generation. We also include a little bit of random mutation which means that the next
generation can be even better than their parents… or alternatively completely garbage.
We take the ones that are really good and combine those together again; giving us the
next generation. Which we combine the best ones of again, giving us another generation, and another
and another and another. Eventually, after maybe hundreds or even thousands of generations, and a huge amount of processing power, we get a network that works pretty much all the time with pretty much all of the samples.
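In code, that training loop is surprisingly short. Here is a bare-bones sketch of the evolutionary scheme just described; the population size, mutation rate and especially the fitness function are placeholders (in reality, fitness would be "how many training images does this network label correctly?").

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(weights):
    # Placeholder: pretend weights near 0.5 are "less gibberish".
    # A real version would score the network on a set of labelled images.
    return -np.sum((weights - 0.5) ** 2)

# A few hundred randomly assigned networks (here just one weight matrix each).
population = [rng.normal(size=(16, 20)) for _ in range(200)]

for generation in range(100):
    # Rank this generation and keep the least-gibberish half as parents.
    population.sort(key=fitness, reverse=True)
    parents = population[:100]
    children = []
    for _ in range(100):
        mum, dad = rng.choice(len(parents), size=2, replace=False)
        # "Combine" two parents: take each weight from one or the other...
        mask = rng.random(parents[0].shape) < 0.5
        child = np.where(mask, parents[mum], parents[dad])
        # ...plus a little random mutation, for better or for garbage.
        child += rng.normal(scale=0.05, size=child.shape)
        children.append(child)
    population = parents + children

population.sort(key=fitness, reverse=True)
print("best fitness after training:", fitness(population[0]))
```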
What happens during this training is that the net slowly identifies key characteristics which correspond to each of the objects being analysed. If you or I were describing a bus we'd probably say something about it
being bright yellow, with lots of windows, and big wing mirrors. The difference is that
the net will occasionally identify characteristics which, while true in the training data, don’t
necessarily correspond to the real world. Here is one particularly dangerous attack
created by an American research team. The team trained up their own neural net using
some publicly available street sign data. After a couple of weeks of training they could
reliably identify stop signs, such as this one. Next, they attached some white and black bars.
You and I can still tell that it's a stop sign, but the team’s neural net could not. Instead,
it was certain that it was a 'Speed Limit 45' sign. If the neural net was in control of
an autonomous vehicle then it would have driven straight through the intersection.
Instead of recognising the shape or the letters S, T, O and P that you and I might recognise in a stop sign, I think the neural net has latched onto something different. I think that because the number four always has a white patch in the middle and the number five some sort of line at the bottom,
it’s seen these characteristics - knowing that it only ever sees them in the context of a
speed limit 45 sign - and just run with that. This is always going to be an issue when we
have limited processing power and storage. If we have a complex way of identifying a sign and a
simple way, and both are equally successful with our training set, then the evolutionary pressures
of this algorithm will prefer the simple option. So what makes my T-shirt so dangerous? Here is a demonstration that the folks at Google came up with while working on the Google Vision AI. Here they
are identifying the subject of an image, and yep they’ve correctly identified that it’s
a banana. And now they’re adding in a picture of a toaster. But no simple toaster picture is
going to make Google confuse what the subject of the image is. No sir, they are much smarter
than that. Oh dear! Pretty much 100% certain that it's a toaster. And this thing doesn't
even look like a toaster. What’s going on? Here is what the classifier algorithm thinks is
the pure essence of toaster. When the neural net sees it, it ignores everything else and just
looks for the toaster. This is what the Google team dubbed adversarial patches, and they're surprisingly simple to create. Here is the one that I have on my t-shirt: it's what the algorithm thinks is a banana. As long as it can get a clear shot of the patch, it should ignore me entirely, and potentially run me over. All it thinks it's doing is running over a banana.
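For the curious, here is roughly what "surprisingly simple" looks like in practice. This is not the Google team's code, just a sketch of the general recipe in PyTorch against a stock ImageNet classifier: keep a small square of pixels trainable, paste it onto ordinary images, and nudge its pixels until the model shouts "toaster" regardless of what else is in frame. The class index, patch size, learning rate and the use of random noise in place of real photographs are all stand-ins; real implementations also randomise the patch's position, scale and rotation and normalise inputs the way the model expects.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# A frozen, pretrained classifier to attack.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)

TOASTER = 859                                        # ImageNet's "toaster" class
patch = torch.rand(3, 50, 50, requires_grad=True)    # the trainable patch
optimizer = torch.optim.Adam([patch], lr=0.05)

def paste(images, patch, y=80, x=80):
    """Differentiably overlay the 50x50 patch at (y, x) on 224x224 images."""
    pad = (x, 224 - x - 50, y, 224 - y - 50)
    canvas = F.pad(patch.clamp(0, 1), pad)      # patch on a black background
    mask = F.pad(torch.ones_like(patch), pad)   # 1 where the patch sits
    return images * (1 - mask) + canvas * mask

for step in range(200):
    images = torch.rand(8, 3, 224, 224)         # stand-in for real photographs
    logits = model(paste(images, patch))
    # Push every image towards the "toaster" label, whatever it actually contains.
    loss = F.cross_entropy(logits, torch.full((8,), TOASTER, dtype=torch.long))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```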
Let's head to the real world to test it out. So what's going to happen with the UWA bus? I have
the right t-shirt on, it has the wrong sensors, and in a few metres I'm going to become a
banana-flavoured pancake. Except… it stops. That's because although we may have been able to
confuse the neural net, ultimately there was a backup system: there were some infrared proximity sensors which stopped the bus when I came too close. No autonomous system should rely on just
one level of redundancy. In addition to our proximity sensors, we could also cross-reference the government databases which store the locations and types of all the country's street signs. We could then compare these with what we're observing on the streets around us.
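As a toy illustration of that cross-check, imagine the register as a list of sign types with GPS coordinates: a detection only counts if the register lists a matching sign within a few tens of metres of where we think we are. The register contents, coordinates and the 30-metre tolerance below are all invented.

```python
import math

# Hypothetical register of signs and their GPS coordinates.
SIGN_REGISTER = [
    {"type": "stop", "lat": -31.9805, "lon": 115.8170},
    {"type": "kangaroo_warning", "lat": -31.9812, "lon": 115.8185},
]

def distance_m(lat1, lon1, lat2, lon2):
    """Rough distance in metres between two nearby points (flat-earth approximation)."""
    dy = (lat2 - lat1) * 111_320
    dx = (lon2 - lon1) * 111_320 * math.cos(math.radians(lat1))
    return math.hypot(dx, dy)

def confirmed_by_register(observed_type, lat, lon, tolerance_m=30):
    """Does the register list a sign of this type near where the camera saw one?"""
    return any(
        sign["type"] == observed_type
        and distance_m(lat, lon, sign["lat"], sign["lon"]) < tolerance_m
        for sign in SIGN_REGISTER
    )

print(confirmed_by_register("stop", -31.9806, 115.8171))   # True
```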
On top of that, Google appears to be improving its Vision AI. Here is a Python program I've written using the Google Vision AI: we take a photograph, and then the Google AI identifies the most likely candidates for the subject of that image.
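The exact script from the video isn't reproduced here, but with Google's google-cloud-vision client library (and a Google Cloud account with credentials set up), the core of such a program looks roughly like this; 'photo.jpg' is a placeholder filename.

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

# Read the photograph and ask the Vision API for its best guesses at the subject.
with open("photo.jpg", "rb") as f:
    image = vision.Image(content=f.read())

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(f"{label.description}: {label.score:.0%}")
```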
Here we can see that we're correctly identifying a banana. Here we have some sort of book. And finally we have a collection of tools. Now, we expected this to work: remember that the Google AI was trained for just this sort
of use-case. But what I want to know is what happens when we add in these adversarial patches
designed specifically to trick the Google AI. Here we see that the banana is still a banana,
the book still a book, and this collection of tools is still just a collection of
tools. Either Google were lying to us when they said that they had been able to trick their own AI… or, more likely, the version of the AI that I'm using has been improved since the 2017 edition discussed in the paper. Like in a game of whack-a-mole,
what I think must be happening is that whenever research comes out that says the
AI can be fooled in some way, the engineers are, behind the scenes, quietly patching the issue.
While this may work fine in the short term, I'd rather not trust my life to an AI which may or may not stop at the next stop sign, depending on whether Google has managed to catch and patch the issue yet.
In order to get a truly safe autonomous vehicle, not only does it need to reliably identify
objects, but it also needs to understand how these fit within the broader road context.
As our vehicles become smarter and smarter, where do we draw the line between the human
passenger and the vehicle becoming more like us? This has been James Dingley from the
Atomic Frontier. Keep Looking Up.