Dear Fellow Scholars, this is Two Minute Papers
with Dr. Károly Zsolnai-Fehér. Today we are going to see how Tesla uses no
less than a simulated game world to train their self-driving cars. And more. In their AI Day presentation video, they really
put on a clinic of recent AI research results, showing how they apply them to develop self-driving
cars. And of course, there is plenty of coverage
of the event, but, as always, we are going to look at it from a different angle. We’re doing it Papers style. Why? Because after nearly every Two Minute Papers
episode where we showcase an amazing paper, I get a question saying something like “okay,
but when do I get to see or use this in the real world?”. And rightfully so, that is a good question. And in this presentation, you will see that
these papers that you see here get transferred into real-world products so fast, it really
makes my head spin. Let’s see this effect demonstrated by looking
through their system. Now, first, their cars have many cameras,
no depth information, just the pixels from these cameras, and one of their goals is to
create this vector space view that you see here. That is almost like a map, or a video game
version of the real roads and objects around us. That is a very difficult problem. Why is that? Because the car has many cameras. Is that a problem? Yes… kind of. I’ll explain in a moment. You see, there is a bottom layer that processes
the raw sensor data from the cameras mounted on the vehicle. So here, in go the raw pixels, and out comes
more useful, high-level information that can be used to determine whether this clump of pixels is a car or a traffic light. Then, in the upper layers, this data can be
used for more specific tasks, for instance, trying to estimate where the lanes and curbs
are. So, what papers are used to accomplish this? Looking through the architecture diagrams,
we see transformer neural networks, BiFPNs, and RegNet. All papers from the last few years. For instance, RegNet is a neural network variant that is great at extracting high-quality features from the raw sensor data. And that is a paper from 2020. From just one year ago. Already actively used in training self-driving cars. That is unreal.
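To make this hierarchy a bit more tangible, here is a minimal sketch of the idea, not Tesla's actual code: one shared feature extractor standing in for a RegNet-style backbone, plus several small heads reusing its features for more specific tasks. All class names and layer sizes are my own simplifications, and the multi-scale BiFPN fusion step is omitted for brevity.

```python
# A minimal sketch (not Tesla's code) of the "shared features, many heads" idea:
# one feature extractor processes raw camera pixels, and several small heads
# reuse those features for different tasks. Names and sizes are made up.
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stands in for a RegNet-style feature extractor."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, pixels):       # (B, 3, H, W) raw camera pixels go in
        return self.stem(pixels)     # (B, 64, H/4, W/4) higher-level features come out

class MultiTaskHeads(nn.Module):
    """Each head reuses the same features for a more specific task."""
    def __init__(self, channels=64):
        super().__init__()
        self.object_head = nn.Conv2d(channels, 2, 1)  # e.g. car vs. traffic light
        self.lane_head = nn.Conv2d(channels, 1, 1)    # e.g. lane and curb mask

    def forward(self, features):
        return {"objects": self.object_head(features),
                "lanes": self.lane_head(features)}

backbone, heads = TinyBackbone(), MultiTaskHeads()
frame = torch.randn(1, 3, 128, 256)                   # one fake camera frame
outputs = heads(backbone(frame))
print({name: tuple(t.shape) for name, t in outputs.items()})
```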
Now, we mentioned that having many cameras is a bit of a problem. Why is that? Isn’t that supposed to be a good thing? Well, look! Each of the cameras only sees parts of the
truck. So how do we know where exactly it is, and
how long it is? We need to know all of this information to
be able to accurately put the truck into the vector space view. What we need for this is a technique that
can fuse information from many cameras together intelligently. Note that this is devilishly difficult due
to each of the cameras having a different calibration, location, viewing direction, and other properties. So who is to say which point in a different camera view a given point here corresponds to? And this is accomplished through, yes… a transformer neural network. A paper from 2017.
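Here is a minimal, hypothetical sketch of what such attention-based fusion can look like: a set of learned queries for the output vector space attends over feature tokens from every camera at once, so information can flow from whichever view actually saw the object. The shapes, dimensions, and use of PyTorch's MultiheadAttention are my assumptions, not Tesla's implementation.

```python
# A hypothetical sketch of attention-based multi-camera fusion: learned queries
# for the output "vector space" attend over feature tokens from every camera,
# so each query can pull information from whichever view saw the object.
import torch
import torch.nn as nn

num_cameras, tokens_per_camera, dim = 8, 100, 64
bev_queries = nn.Parameter(torch.randn(1, 200, dim))    # learned output-space queries
fuse = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

# Fake per-camera features, flattened into one token sequence: (batch, cameras * tokens, dim)
camera_features = torch.randn(1, num_cameras * tokens_per_camera, dim)

# Cross-attention: each query asks "which camera tokens describe my spot in the map?"
fused, attention_weights = fuse(bev_queries, camera_features, camera_features)
print(fused.shape)              # torch.Size([1, 200, 64]): one unified representation
```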
So, does this multi-camera technique work? Does this improve anything? Well, let’s see! Oh yes, the yellow predictions here are from the previous single-camera network, and as you see, unfortunately, things flicker in
and out of existence. Why is that? It is because a passing car is leaving the
view of one of the cameras, and as it enters the view of the next one, they don’t have
this correspondence technique that would say where it is exactly. And, look! The blue objects show the prediction of the
multi-camera network that can do that, and things aren’t perfect, but they are significantly
better than the single-camera network. That is great; however, we are still not taking
into consideration time. Why is that important? Let’s have a look at two examples. One, if we are only looking at still images
and do not take into consideration how they change over time, how do we know if this car is stationary? Is it about to park somewhere? Or, is it speeding? Also, two, this car is now occluded, but we saw it a second ago, so we should know what it is up to. That sounds great. And what else can we do if our self-driving system has a concept of time? Much like humans do, we can make predictions.
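One simple way to give a network a concept of time, sketched below purely for illustration, is a recurrent memory that carries scene features from frame to frame, so an object that slips behind an occluder for a moment is still represented by what was seen a second ago. The recurrent cell and the sizes here are assumptions, not Tesla's video module.

```python
# A toy sketch of adding a concept of time: a recurrent memory carries scene
# features from frame to frame, so an object occluded for a moment is still
# represented by what was seen a second ago. The cell and sizes are assumptions.
import torch
import torch.nn as nn

dim = 64
temporal_cell = nn.GRUCell(input_size=dim, hidden_size=dim)

memory = torch.zeros(1, dim)                  # persistent scene memory
for t in range(5):                            # five consecutive fused frames
    frame_features = torch.randn(1, dim)      # fake per-frame features
    if t == 3:
        frame_features = torch.zeros(1, dim)  # the frame where the car is occluded
    memory = temporal_cell(frame_features, memory)

# Even on the occluded frame, `memory` still carries information from earlier frames.
print(memory.shape)                           # torch.Size([1, 64])
```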
These predictions can take place both in terms of mapping what is likely to come, an intersection, a roundabout, and so on. But, perhaps even more importantly, we can
also make predictions about vehicle behavior. Let’s see how that works. The green lines show how far away the next
vehicle is, and how fast it is going. This green line tells us the real, true information
about it. Do you see the green? No? That’s right, it is barely visible, because
it is occluded by a blue line, which is the prediction of the new video network. That means that its predictions are barely
off from the real velocities and distances, which is absolutely amazing. And, as you see with orange, the old network
that was based on single images is off by quite a bit. So now, a single car can make a rough map
of its environment wherever it drives, and the readings of multiple cars can also be stitched together into an even more accurate map. Putting this all together, these cars have
a proper understanding of their environment, and this makes navigation much easier. Look at those crisp, temporally stable labelings. There is very little flickering. Still not perfect by any means, but this
is remarkable progress in so little time. And we are at the point where predicting the
behaviors of other vehicles and pedestrians can also lead to better decision making. But, we are still not done yet. Not even close. Look! The sad truth of driving is that unexpected
things happen. For instance, this truck makes it very difficult
for us to see, and the self-driving system does not have a lot of training data to deal
with that. So, what is a possible solution to that? There are two solutions. One is fetching more training data. One car can report an unexpected event and request that the entire Tesla fleet send over footage if they have encountered something similar. Since there are so many of these cars on the
streets, tens of thousands of similar examples can be fetched from them, and added to the
training data to improve the entire fleet. That is mind blowing. One car encounters a difficult situation, and then every car can learn from it. How cool is that? That sounds great.
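As a purely hypothetical illustration of this fleet idea, the sketch below describes the difficult event as a "trigger" and filters incoming clips against it; every class, field, and threshold is invented for this example, and the real pipeline is certainly far more involved.

```python
# A purely hypothetical sketch of the fleet-data idea: describe the difficult
# event as a "trigger" and collect matching clips from many cars. Every class,
# field, and threshold here is invented for illustration.
from dataclasses import dataclass

@dataclass
class Clip:
    car_id: str
    occlusion: float     # how badly the view was blocked, 0..1
    object_type: str     # what the automatic labeler saw in the scene

def matches_trigger(clip: Clip) -> bool:
    # The unexpected event from above: a heavily occluding truck.
    return clip.object_type == "truck" and clip.occlusion > 0.7

fleet_clips = [
    Clip("car_001", 0.9, "truck"),
    Clip("car_002", 0.1, "car"),
    Clip("car_003", 0.8, "truck"),
]

new_training_data = [clip for clip in fleet_clips if matches_trigger(clip)]
print(len(new_training_data), "similar examples fetched from the fleet")
```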
So what is the second solution? Not fetching more training data, but creating more training data. What, just make stuff up? Yes, that’s exactly right. And if you think that is ridiculous and are
asking how that could possibly work, well, hold on to your papers, because it does work… you are looking at it right now! Yes, this is a photorealistic simulation that
teaches self-driving cars to handle difficult corner cases better. In the real world, we can learn from things
that already happened, but in a simulation, we can make anything happen. This concept really works, and one of my favorite examples is OpenAI’s robot hand that we have showcased earlier in this series. This also learns its rotation techniques in
a simulation, and it does it so well, that the software can be uploaded to a real robot
hand, and it will work in real situations too. And now, the same concept for self-driving
cars. Loving it. With these simulations, we can even teach
these cars about cases that would otherwise be impossible or unsafe to test. For instance, in this system, the car can
safely learn what it should do if it sees people and dogs running on the highway. A capable artist can also create miles and
miles of these virtual locations within a day of work. This simulation technique is truly a treasure
trove of data, because it can also be procedurally generated, and the moment the self-driving
system makes an incorrect decision, a Tesla employee can immediately create an endless
set of similar situations to teach it.
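The sketch below is a toy illustration of that procedural idea: start from one failure case and spin out many randomized variations of it for training. The scenario fields and value ranges are invented for this example; they only mirror the concept, not Tesla's tooling.

```python
# A toy sketch of procedural scenario generation: take one failure case and spin
# out many randomized variations of it for training. The scenario fields and
# value ranges are invented for this example.
import random

base_scenario = {"road": "highway", "obstacle": "dog", "obstacle_count": 1,
                 "weather": "clear", "time_of_day": "noon"}

def make_variations(scenario, n=5):
    variations = []
    for _ in range(n):
        s = dict(scenario)
        s["obstacle_count"] = random.randint(1, 6)              # a dog, or a pack of dogs
        s["weather"] = random.choice(["clear", "rain", "fog"])
        s["time_of_day"] = random.choice(["dawn", "noon", "night"])
        variations.append(s)
    return variations

for variation in make_variations(base_scenario):
    print(variation)
```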
Now, I don’t know if you remember, but we talked about a fantastic paper a couple of months ago that looked at real-world videos, then took video footage from a game, and improved it to look like the real world. Convert video games to reality, if you will. This had an interesting limitation. For instance, since the AI was trained on the
beautiful lush hills of Germany and Austria, it hasn’t really seen the dry hills of LA. So, what does it do with them? Look, it redrew the hills the only way it
had seen hills exist, which is covered with trees. So, what does this have to do with Tesla’s
self-driving cars? Well, if you have been holding on to your
papers so far, now, squeeze that paper, because they went the other way around! Yes, that’s right! They take the video footage of a real, unexpected
event where the self-driving system failed, run it through the automatic labeler used for the vector
space view, and what do they make out of it? A video game version! Holy mother of papers. And, in this video game, it is suddenly much
easier to teach the algorithm safely. You can also make it easier, harder, replace
a car with a dog, or a pack of dogs, and make many similar examples so that the AI can learn
from these “what if” situations as much as possible. So, there you go. Full tech transfer into a real AI system in
just a year or two. So, yes, the papers you see here are for real. As real as it gets. And yes, the robot is not real, just a silly
joke. For now. And two more things that make all this even
more mind-blowing. One, remember, they don’t showcase the latest
and greatest that they have. Just imagine that everything that you heard
today is old news compared to the tech they have now. And two, we have only looked at one side
of what is going on, for instance, we haven’t even talked about their amazing Dojo chip. And if all this comes to fruition, we will
be able to travel cheaper, more relaxed, and also, perhaps most importantly, safer. I can’t wait. I really cannot wait. What a time to be alive! Thanks for watching and for your generous
support, and I'll see you next time!