[CVPR'20 Workshop on Scalability in Autonomous Driving] Keynote - Andrej Karpathy

Captions
Hi everyone, thank you for running the workshop; it's my pleasure to be here. I come to you from Hawaii, or wherever this Caribbean background is. It's great to talk about scalability at this autonomous driving workshop, especially with respect to how we see it at Tesla. So I'll tell you a little bit about Tesla Autopilot, the product we're building at Tesla, and then I'll talk a bit about some of the networks we have in production, and some of the new networks we're developing for full self-driving: what they look like and how they're different.

First, since this is a scalability workshop: I think no one has scaled autonomy quite as much as Tesla has. We have a massive, global fleet, roughly one million cars out there in the entire world. These are internet-connected devices, so we receive data from them to improve Autopilot, and of course we ship updates back out to all these countries. It's a really massive-scale product. These are somewhat outdated numbers, but the intention of these slides is mostly to communicate the scale: people have driven Autopilot for billions of miles.

Some of the features we currently support in production: of course the core Autopilot, keeping your lane and keeping a distance from the car in front of you; and Navigate on Autopilot, functionality that allows you to basically set an arbitrary pin in the world, and as long as you stick to the highway system the car will make all the right lane changes and take all the right forks to get you there. This works quite nicely. We also have Smart Summon from last year, where you can summon the car to you in a parking lot using your mobile app, so the car comes out of its parking spot and comes to find you. About a month and a half ago we released traffic light and stop sign control. What I'm showing here is a crop from the instrument cluster, a zoom into the user interface, and you can see that the car is registering all the traffic lights, stop signs, stop lines, and everything else needed to figure out where the car should come to a stop in the world.

So we're breaking up the entire problem of autonomy into small pieces, trying to solve them one by one, and releasing them to customers along the way. In this case we released functionality for recognizing traffic lights. Here we have two red on the left and two green on the right; we recognize them, and we understand not just that they're green or red, we actually understand which ones are relevant to your current direction of travel. That's the much harder problem, by the way: not just recognizing that this is a green traffic light, but whether or not this green traffic light applies to you. That is what makes this an extremely difficult challenge.

And even if Autopilot is not active, we have a lot of active safety features, for example automatic emergency braking, forward collision warning, lane departure avoidance, and so on. These are always running under the hood, monitoring the environment in 360 degrees, and if anything bad is about to happen we brake or slow down. These features work really well: I believe Tesla Model 3 actually achieved the highest rating for safety assist, and we also did really well on vulnerable road user detection.
These numbers are somewhat abstract, so what I like to show is exactly what this technology is doing on the roads right now, in production, where it is actually preventing really bad accidents from happening. Here, this person was not paying attention, and I think the driver was not paying attention either, but we were able to detect what was about to happen and slam on the brakes when it was appropriate. Here's another video: the person on the left is not paying attention, and the car slams on the brakes. Here's one more, and these get pretty crazy: there's a person walking in from the right, lots of occlusion, and we slam on the brakes. We see lots of these, actually tens to hundreds every day. Not all of them are true positives, but a good number of them are, and this is where detection really counts, because this stuff is running on people's cars and actually making a difference.

Now, our ambition of course is to produce full self-driving: not just stopping, or taking forks on the highway and so on, but getting people from point A to point B entirely. So your pin is not just on the highway system; your pin can be an arbitrary place in the world, and the car will do all the turns, take all the forks, stop for traffic lights, and so on. Here the car took a left at this intersection, and we're coming down toward the highway to merge onto it and continue on.

The thing I'd like to point out, because it's not clear if you're just a bystander looking from the outside in: here's a video I have of a Waymo doing a very similar maneuver; it takes a left at an intersection. Even though these two look the same, under the hood they are of course completely different, and that has to do with how we approach autonomy. Waymo and many others in the industry use high-definition maps. You have to first drive some car that pre-maps the environment, so you have a perfect lidar map that you localize against with centimeter-level accuracy, and then you are on rails: you know exactly how you're going to turn through the intersection, you know exactly which traffic lights are relevant to you, you know where they are positioned, everything. We do not make these assumptions. For us, every single intersection we come up to, we see for the first time. We have to figure out what the intersection looks like, which lanes connect, which traffic lights are relevant; everything has to be solved on the spot, similar to what a human would do in that situation.

And speaking of scalability, this is of course a much harder problem to solve, but if you actually do solve it, there's the potential of scaling this out to, again, millions of cars on the road. Whereas actually building out these lidar maps at the scale we operate at, with the sensing it would require, would be extremely expensive. And of course you can't just build the map, you also have to maintain it, and change detection on it is extremely difficult. Alex showed an example this morning from the UK where things are blocked off because of COVID social distancing, so your lane graph has changed, and your map has to adjust and correspond to it. You can't afford to just rescan all of these environments all the time. So to us that is not a scalable approach, and it's not the one we employ. Let me give you a sense of what these networks look like.
The ones we do have in production need to solve a lot of visual recognition tasks: they have to detect all the traffic lights, lane line markings, static objects, and things like that. There are actually two customers for all of these predictions. Number one is the planning and control module, which tries to wind its way around all these predictions. Number two is the instrument cluster: we like to show people as much as possible on the instrument cluster, to give them some confidence that Autopilot is doing the correct thing, so these detections are very useful for that as well. This is a video that a colleague put together showing some of the raw detections that Autopilot has to make. You see stop signs, stop lines, traffic lights, cars, pedestrians, lane line markings, curbs, static objects; there's a trash bin over there. What's not shown here are a lot of other things. For every one of these lines, for example, we know whether or not it's a parking line; that's an example of an attribute of a line. We know which ones are crosswalk lines, and so on. Those attributes aren't even shown here; there are tons of predictions under the hood, and they all have to work correctly.

Now, these predictions are actually really hard to achieve. Take even a very simple case like a stop sign. You would think a stop sign is a very simple thing. We've solved much harder visual recognition problems; ImageNet recognizes thousands of categories, so why would a stop sign be any more difficult? But when you actually try to scale this up and deploy it widely, you come across lots of variations of even the simplest stop sign. Stop signs can be on walls instead of on poles. They can be temporary, on various kinds of signs. They can have flashing lights associated with them. They can be held by a person instead of being on top of a pole. They can be heavily occluded by foliage, or occluded by other cars. Some stop signs are actually part of cars, and those can be in an enabled or disabled state, which you have to pay attention to if you want to respect them in your driving policy. Stop signs held by people may come with all kinds of modifiers, so they don't just apply all the time: they might apply only if you are going right, or only if you are going left, and so on. You have to actually recognize these modifiers because, again, we drive on vision, so we need to detect this and do the correct thing at the time we are there. That's not the same as saying we don't use maps. We do build maps, and we use all kinds of fusion with the maps, but our maps are certainly not centimeter-level accurate. We do not know the exact metric distance to the leaf on the tree over there, but we do know that this stop sign applies only when the gate is closed, this one only when the arm is down instead of raised, and this stop sign only applies in the conditional case of turning to the left; here we actually have to ignore the stop sign if we are going to the right.

Now, in order to get all these predictions to work, we employ what I call the data engine, and the rough idea is that there is a name of the game we play all the time on the team, and it's the bread and butter of what we do: you need to source, massage, and curate your data set so that it has all of these cases in it; otherwise the network has no chance.
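To make the earlier point about attributes and modifiers concrete, here is a minimal sketch of how such attribute-rich detections might be represented downstream. This is not Tesla's actual schema; all names (StopSignDetection, applies_to, the modifier strings) are hypothetical illustrations of the idea that a bare bounding box is not enough for the planner.

```python
from dataclasses import dataclass, field
from enum import Enum

class SignState(Enum):
    ACTIVE = "active"      # e.g. a school-bus stop arm that is extended
    INACTIVE = "inactive"  # e.g. the same arm folded away

@dataclass
class StopSignDetection:
    """One detected stop sign plus the attributes the planner needs.

    The same sign may be enabled or disabled, and may carry modifiers
    like 'except right turn' that make it conditional on the maneuver.
    """
    bbox: tuple[float, float, float, float]  # x1, y1, x2, y2 in image space
    score: float
    state: SignState = SignState.ACTIVE
    modifiers: set[str] = field(default_factory=set)  # e.g. {"except_right_turn"}

    def applies_to(self, intended_maneuver: str) -> bool:
        """Decide whether this sign constrains the given maneuver."""
        if self.state is SignState.INACTIVE:
            return False
        if f"except_{intended_maneuver}" in self.modifiers:
            return False
        return True

# Example: an 'except right turn' stop sign is ignored when turning right.
sign = StopSignDetection(bbox=(10, 20, 40, 60), score=0.97,
                         modifiers={"except_right_turn"})
assert not sign.applies_to("right_turn")
assert sign.applies_to("left_turn")
```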
So I believe we currently have probably the largest data set of stop signs occluded by foliage, and the largest data set of "except right turn" stop signs, and these are all real data; we don't actually utilize simulation all that much. Our bread and butter really is that we have a million cars out there, and they can send us triggers in arbitrary conditions that we specify. So how do you very quickly build up a data set of stop signs occluded by foliage? That's the kind of problem we're solving, and there's not a lot of research on it. For example, starting with just very few stop signs occluded by foliage, maybe ten of them, we've developed techniques to boost the amount of that kind of data in our data set: we can develop a classifier for that detection and then ask the fleet, while driving around, to look for that possible thing occurring, and when the classifier thinks it might be occurring, it sends us images. We do that all the time.

Of course, talking about scale: it's not just the U.S. where we care about these stop signs. These are Chinese signs, Korean signs, Japanese signs; they come with their own modifiers, they have their own challenges, their own rules for how they should be handled. So this becomes a very heavy project, to actually get something like a stop sign feature to work globally. And because this is a product in the hands of customers, you are forced to go through the long tail. You can't just do 95% and call it a day; we have to make this work, and that long tail brings all kinds of interesting challenges.

Recently we were working quite a bit with speed limits as well. Speed limits don't just look like the simple case; they come with lots of modifiers in all kinds of different countries. Sometimes you have speed limits that give you a minimum, and in different countries this minimum is not specified with the word "minimum": it's specified by a different color, say a blue background, or it has a line under it, and things like that. Those all have to be understood by the system, and we have to be able to specify the ontology for all these different signs, and all the possible attributes and detections, in a very lightweight, malleable way. Your data set is alive, and the labeling instructions are changing basically all the time, and in the face of that you have to curate this massive data set where you're encountering issues all the time. It becomes kind of a crazy challenge. Here's a bunch of other speed limit signs in China, Korea, Japan, and Europe; some of these, for example, are only to be followed in certain conditions, if the road is wet, or in sharp turns, and so on. It gets pretty crazy.

Now, we curate not just the training sets; we curate the test sets, and we spend just as much time on those, if not more, because you want to make sure your evaluation is really good. You can do arbitrary things on the training set, but you must have a really good evaluation, because that gates your release into the world. Basically, what I'm trying to get across is that it's a very difficult domain because of its complexity.
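As a rough sketch of the fleet-trigger idea described above (seed a small classifier on a handful of examples, run it as a trigger, and upload frames when it fires), here is a toy version. The embeddings are random stand-ins, and the trigger/upload mechanics are hypothetical; this only illustrates the loop, not the production system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Seed set: ~10 positives of the rare case (e.g. foliage-occluded stop
# signs) plus negatives, as embedding vectors from a pretrained backbone.
rng = np.random.default_rng(0)
X_seed = rng.normal(size=(40, 128))        # stand-in embeddings
y_seed = np.r_[np.ones(10), np.zeros(30)]  # 10 positives, 30 negatives

trigger = LogisticRegression(max_iter=1000).fit(X_seed, y_seed)

def should_upload(embedding, threshold=0.7):
    """Trigger that would run in shadow mode on the car: when the small
    classifier thinks the rare case might be present, queue the frame
    for upload so it can be labeled and added to the training set."""
    return trigger.predict_proba(embedding[None])[0, 1] > threshold

# Server side: uploads get human-reviewed, and confirmed positives grow
# the data set, after which the trigger can be retrained and redeployed.
stream = rng.normal(size=(1000, 128))      # stand-in fleet frames
uploads = [e for e in stream if should_upload(e)]
```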
These are slightly outdated numbers now, but I have a slide: we maintain roughly 48 neural networks in production, and they make roughly a thousand distinct predictions. None of these predictions can ever regress; all of them must improve over time. And it takes a long time to train these networks. We can get away with a lot of fine-tuning and things like that, and we do, but if you were to train this stack from scratch on a single node, it would train for something like a year.

So how do you actually get this to work without a very large team? Our team is not 500 people working on neural networks; it's more like a few dozen really, really strong people. The way this works is that we structure everything around the core infrastructure we're building out. We're building out a kind of general computer vision infrastructure in which it is very easy to curate data sets and create new types of tasks that fall into certain buckets. Maybe you want to create a new landmark task, or a new segmentation task, or a new detection task; you want to change the attributes around, or add an attribute; anyone can get that to work. Everything we do is almost on the meta layer: you're working on a general recognition system, and then we have a huge team of people who are not necessarily neural network experts, say the labeling team or the PM team, who actually use that infrastructure and do all the heavy lifting. I showed you that there are a billion details just with speed limits; you don't want your neural network engineers involved in all of that. You want them to create the general infrastructure that allows someone else to collect all this data and make it work. And to a large extent we're finding that this is actually tractable, and that we can create these generic computer vision systems that people can use to develop all these features and then deploy them on the car. So there's a division of labor going on, which I think is really interesting. Basically, automation is extremely important, the latency with which we deploy new features is extremely important, and we think of ourselves mostly as developing this core infrastructure, not pursuing individual tasks. The stop sign is not a task that a neural network engineer should worry about; the neural network engineers work on the segmentation prototype, or the detection prototype, and making that work, and on the active learning infrastructure for bubbling up these difficult data sets from the fleet.
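As a sketch of what such a declarative task-definition layer might look like, here is a toy registry. Nothing here is Tesla's actual infrastructure; the names (TaskSpec, TASK_REGISTRY) and the field choices are hypothetical, illustrating the idea that adding a task or an attribute is a configuration change rather than a model-engineering project.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    """Declarative description of one prediction task."""
    name: str
    kind: str                  # e.g. "detection", "landmark", "segmentation"
    classes: tuple[str, ...]
    attributes: tuple[str, ...] = ()

TASK_REGISTRY: dict[str, TaskSpec] = {}

def register(spec: TaskSpec) -> None:
    TASK_REGISTRY[spec.name] = spec

# A labeling or PM team could introduce a new task like this, without
# touching the model code:
register(TaskSpec(
    name="stop_signs",
    kind="detection",
    classes=("stop_sign",),
    attributes=("occluded_by_foliage", "except_right_turn", "inactive"),
))

# Downstream, shared infrastructure would build one head per registered
# task, sized from len(spec.classes) and len(spec.attributes).
```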
So far I've only talked about the image-level predictions, and that's already quite crazy, involved, and heavy. It gets even crazier once you go to the full self-driving stack, because you can't just afford to make image-level predictions and expect to actually drive on that. I think I've made this point before: I make a big deal out of the whole Software 2.0 framework for looking at feature development, because people don't look at neural network training as programming, but to me it's literally that. You have to make changes, you have errors, you have PRs, so you can borrow a lot of concepts. Basically, what's been happening on the team is that when I joined, we had small neural networks doing some detections, and these were stitched up in firmware, in the Software 1.0 code, the C++ code. Since then, the neural network stack has been taking on more and more of the functionality; everything is becoming more and more end-to-end.

For example, we don't treat lane line detection as a segmentation task. You could, but we actually have very few segmentation tasks, because segmentations on pixels aren't easily amenable to driving. A segmentation looks good on the image, but you need to project it out to make 3D sense of it, otherwise you can't drive through it, and that part is the hard part. Detecting these lane line markings individually and stitching them up is highly error-prone, so predicting them directly out of the network works much better. Cars are flagged as parked not based on heuristics, but when the neural network says so. Cut-ins happen not based on any heuristics; they happen when the neural network, trained on a lot of data, says so.

As for stitching up whole environments: for Smart Summon we basically have to lay out the parking lot so the car can wind its way around to the person summoning it. You can imagine breaking this down into, number one, a curb detection task in the images, and then stitching up these curbs in Software 1.0 land, which would look something like this: we fill an occupancy tracker that stitches the image-level predictions into a little map of the parking lot, and then we can see how the car drives around to wind its way to the person. This works to some extent, but you have to do the stitching, and the stitching is highly error-prone across the camera seams and across time.

So what we've been working toward are these bird's-eye-view predictions, which are actually relatively standard and well understood, but for us it's a step, because of the history of Autopilot: if you stick to the highway, you can actually get really far paying attention to just a single forward-facing camera. We basically have to infer our "lidar" stack from raw images, because of course we don't have lidar in our stack. So we have to stitch the images into these bird's-eye-view predictions, but we no longer have the occupancy tracker living in C++ land; the occupancy tracker now lives inside a neural network. Individual camera views go through backbones, we extract features, we have a fusion layer that does things like orthographic feature transforms to put everything into a top-down representation of space, and then you have to temporally smooth it, and that smoothing again happens inside the neural net: we don't want smoothing in the C++ code base, we want smoothing in the net. And then a decoder gives you all the predictions. This is just a slide, and I'm not going to go into too much detail in the interest of time, but basically this works significantly better.
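Here is a minimal, self-contained sketch of the kind of architecture described above: per-camera features warped into a shared top-down grid, fused, recurrently smoothed over time, and decoded. This is a toy stand-in, not the production network; the single-conv backbone, the gated exponential smoothing, and the assumption that per-camera BEV sampling grids are precomputed from calibration are all simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BEVFusion(nn.Module):
    """Toy bird's-eye-view net. `bev_grids` plays the role of an
    orthographic feature transform: for each camera, an (Hb, Wb, 2) grid
    mapping BEV cells to normalized image coordinates."""
    def __init__(self, n_cams, feat_ch=32, bev_ch=64, n_classes=3):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_ch, 3, padding=1)  # stand-in CNN
        self.fuse = nn.Conv2d(n_cams * feat_ch, bev_ch, 1)   # cross-camera fusion
        self.gate = nn.Conv2d(2 * bev_ch, bev_ch, 1)         # learned temporal gate
        self.decoder = nn.Conv2d(bev_ch, n_classes, 1)       # e.g. edges, dividers

    def forward(self, images, bev_grids, state=None):
        # images: (n_cams, 3, H, W); bev_grids: (n_cams, Hb, Wb, 2) in [-1, 1]
        feats = F.relu(self.backbone(images))                # (n_cams, C, H, W)
        warped = F.grid_sample(feats, bev_grids, align_corners=False)
        n, c, hb, wb = warped.shape
        fused = F.relu(self.fuse(warped.reshape(1, n * c, hb, wb)))
        if state is None:                                    # first frame
            state = fused
        else:                                                # in-network smoothing
            g = torch.sigmoid(self.gate(torch.cat([fused, state], dim=1)))
            state = g * fused + (1 - g) * state
        return self.decoder(state), state

# Usage on a short clip of 8-camera frames:
net = BEVFusion(n_cams=8)
grids = torch.rand(8, 64, 64, 2) * 2 - 1   # stand-in calibration grids
state = None
for _ in range(5):
    frames = torch.randn(8, 3, 96, 160)
    logits, state = net(frames, grids, state)  # logits: (1, 3, 64, 64)
```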
Let me go into it briefly. On the left we see the ground truth of this intersection, in terms of the road edges that make it up. On the right is what happens when you do a relatively good job, actually, of detecting the curbs in individual images and then project out the pixels: it looks terrible. And in the middle is what our bird's-eye-view networks produce as a prediction; it just comes straight out of the net. These networks rarely make really dumb errors like the one on the right, because they have a sense of what intersections can possibly look like, so they give you something sensible.

Here I have a video showing what these predictions look like temporally over time. We see a pretty smooth intersection: the road edges that make up the intersection, in green the dividers, and a bunch of attributes that I'm not going to go into; you can imagine there are a lot of attribute detections that I'm not showing here. Here I'm showing a few more attributes, coloring different parts of the intersection as we turn through it. Here's another one. Like I mentioned, we come to these intersections for the first time, always, so we don't actually know what these things look like. Is there a crosswalk? How many lanes are there? Where are the dividers? What is the connectivity structure of the lanes? Where are the traffic lights and stop signs, and how do they relate, that is, which traffic lights control which lanes? Everything has to come out of the net. It's a highly structured representation we're asking for in these cases, and it's quite hard to achieve, especially in the face of uncertainty. If you're coming up to an intersection, you're really just staring at a tiny sliver of the image in the middle, at the horizon line, and you're trying to guess: is that two or three lanes? You're not sure. So what is the output of the net when you're not sure? Are you outputting multiple samples? If you're outputting samples that are crisp, then you need to track them. But if you're outputting not samples but rasters, like I'm showing here, then you can have all these mode-averaging issues, where because of the uncertainty things become diffuse. These are really delicate, interesting challenges in terms of the raw neural network modeling, and they're hard to get right. But if you do get them right, in terms of scalability that has large implications for your velocity when we talk about a global deployment of this technology at world scale, because we don't have to pre-map everything in the world, which sounds like a lot of work, and keep it up to date. The challenge, of course, is that we are coming up to these arbitrary places, these arbitrary geometries, and we have to solve for what they look like on the spot: a very structured representation with a lot of uncertainty.

I think this is one of the most interesting challenges for us right now as a team: in terms of modeling approaches, how do you actually predict these complex intersections? For those of you who are in academia or industry, I would encourage you, in terms of scalability for pushing autonomy forward: do not assume that we can get away as an industry with lidar maps for global deployment of these features. I would take away the lidar maps, and especially the flow of all the lanes and traffic and so on, and think about how you can predict an intersection without assuming a lidar map. What are the approaches here? There's a set of lanes, and a set of lanes can be controlled by a set of traffic lights, and the pointer-network-style machinery that is necessary to actually make this work well, and what it should look like, is highly non-obvious, I would say. We've explored some of this on the team, but I think predicting these highly structured representations, and dealing with the uncertainty over them, is a very interesting, deep technical challenge that academia can definitely contribute to.
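To make the earlier point about rasters and mode averaging concrete, here is a tiny illustration (my own toy example, not from the talk): when the scene ahead could be either two lanes or three, a net trained with a pixelwise regression loss tends toward the conditional mean of the targets, which is a diffuse blend of both hypotheses rather than a crisp commitment to either.

```python
import numpy as np

# Two crisp hypotheses for lane centers along a cross-section of the
# road ahead, represented as 1-D rasters.
xs = np.linspace(0.0, 10.0, 200)

def raster(centers, width=0.25):
    return np.clip(sum(np.exp(-((xs - c) / width) ** 2) for c in centers), 0, 1)

two_lanes = raster([3.5, 6.5])
three_lanes = raster([2.5, 5.0, 7.5])

# The L2-optimal prediction under uncertainty is the average of the two
# modes: smeared, low peaks that describe neither 2 nor 3 crisp lanes.
blended = 0.5 * two_lanes + 0.5 * three_lanes
print(two_lanes.max(), blended.max())  # crisp peak vs. diffuse blend
```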
This is just an old video now showing that we apply exactly the same networks and representations not just to static infrastructure but to cars and objects: how they move around, what the assignment of cars to lanes is, all these different things we of course have to know about so that we can anticipate how people are going to move through these environments. And I have a slide here showing that this can get out of hand very quickly. The one nice benefit you do have, if you're trying to actually release this in the world, is that you are allowed to know that you don't know. In an actual product, you're not forced to handle every intersection; as long as you come to something and know that you can't handle it, you're allowed to, for example, route around it. Because of the complexity of what these things look like in the world, I actually suspect we're going to have to do that kind of thing. You're allowed to not handle something, but you'd prefer to handle most things, and when you do handle them, you have to do a really good job of it.

Okay, so those are roughly the technical challenges we're facing. To summarize: when I think about scalability, and what's tricky and hard to get to work in terms of the strategic landscape, and especially where Tesla finds itself, number one, we are dealing with a massive scalability challenge around finding the needle in the haystack. What this refers to, for us, is that the haystack is the fleet: you have cameras running at roughly 36 Hz, there are eight of them, and they're driving around all these interesting scenarios all the time, so the potential is there to catch the interesting scenarios and add them to your training set. The haystack is all the cars navigating the world's streets, and the needles are the tricky cases that actually make your network uncertain. We need to find them, and we need to make sure we upload and catch them. This active learning is a core part of what we do, and you have to do it repeatedly and simultaneously across a lot of labeling projects. In order to be successful, given such a wide breadth of challenges, you actually don't want engineers in the loop for each one: the engineers design the infrastructure for arbitrary tasks, and then the PMs and the labeling teams actually curate the individual tasks. That's how we try to approach it, but it's very tricky to develop this kind of infrastructure.

Here are some examples of needles. Top left is a chair. This is not a render, this is not a simulation, this is not inserted by some GAN; this is a real thing. It's a needle, and we need to make sure we catch it. And catching these needles is actually non-trivial. We've tried a lot of approaches, for example using the entropy of different neural network samples trained with bootstrap sampling, and things like that, and nothing works really well, I think. Detecting, basically, that a network doesn't know what it doesn't know, and doing it efficiently, is still kind of an open problem in my mind. On the right we have a person walking a dog; here we have mirrors; and here on the right I'm showing toppled cones. The toppled cone and the bar were actually recognized as a traffic light, a red traffic light, but of course it's just a toppled cone.
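As a concrete illustration of the ensemble-entropy approach mentioned just above (and which, per the talk, does not conclusively solve the problem), here is a minimal sketch: score each image by the disagreement across models trained with bootstrap sampling, using the standard mutual-information decomposition.

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of the mean prediction across an ensemble.
    probs: (n_models, n_classes) softmax outputs for one image."""
    mean = probs.mean(axis=0)
    return -(mean * np.log(mean + 1e-12)).sum()

def mutual_information(probs):
    """BALD-style disagreement: total uncertainty minus the average
    per-model entropy. High values mean the models are individually
    confident but conflicting, the signature of a 'needle'."""
    per_model = -(probs * np.log(probs + 1e-12)).sum(axis=1).mean()
    return predictive_entropy(probs) - per_model

# Toy example: 5 bootstrap-trained models on a 3-class decision.
agree = np.tile([0.90, 0.05, 0.05], (5, 1))
conflict = np.array([[0.90, 0.05, 0.05]] * 2 + [[0.05, 0.90, 0.05]] * 3)
print(mutual_information(agree), mutual_information(conflict))  # low vs. high
```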
So these are needles, and how you actually find needles in the haystack is kind of the most interesting question to us. If I were to pose this to academia as something that could be worked on: imagine giving yourself a data set of fifty thousand examples, but you're only allowed to train on ten thousand. You basically have to incorporate the fact that you can label an arbitrary image for some amount of cost, and you don't want to pay too much cost, but you do need to label images, because that's kind of the only certain way I know to actually get a computer vision task to work: through labeling of images. There are a lot of less certain things that maybe work to some extent, but the one sure way I've seen of making progress on any task is that you curate a data set that is clean and varied, you grow it, and you pay the labeling cost. I know that works. There are a lot of exotic approaches as well, through self-supervision, unsupervised learning, et cetera, but I think those are more hit and miss. So these are interesting questions for us. (A toy version of this benchmark is sketched after this passage.) And number two, in terms of scalability: like I mentioned, these intersections get out of hand, and we need to think in a structured way about how we can predict these complicated structured outputs, so that we don't have to represent them explicitly and maintain them over time, but can actually get our neural nets to output them directly. I think this is a very interesting challenge from a neural networks perspective.

So if some of these problems sound interesting to you, we're absolutely hiring and trying to grow the team, and these are some of the things you'd get to work on. I think Tesla offers a very interesting environment that is, I believe, unprecedented in the industry: we push these things to production, and it's very rewarding to see the fruits of your labor actually make it into the world, with a lot of your friends and family driving it and providing feedback. We also think this is the correct way, in terms of incrementality, to actually build out autonomy, just because the active frontier of the battle is so wide, so large, that you can't just develop it in binary fashion and then ship it. This takes time: time to develop, to come up against all the issues, to re-represent your tasks, and to collect all these data sets.
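Here is the toy version of that posed benchmark: a pool of 50,000 examples, a labeling budget of 10,000, and uncertainty sampling to decide which labels are worth paying for. The data and the margin-based acquisition rule are stand-ins; the point is only the shape of the loop.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(50_000, 32))            # unlabeled pool
y_pool = (X_pool[:, :3].sum(1) > 0).astype(int)   # oracle labels (cost!)

budget, batch = 10_000, 1_000
labeled = list(rng.choice(len(X_pool), size=batch, replace=False))

while len(labeled) < budget:
    clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
    p = clf.predict_proba(X_pool)[:, 1]
    margin = np.abs(p - 0.5)                      # low margin = uncertain
    margin[labeled] = np.inf                      # never re-pick labeled items
    labeled += list(np.argsort(margin)[:batch])   # pay the labeling cost

print(clf.score(X_pool, y_pool))  # compare against a random-10k baseline
```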
Info
Channel: CVPR'20 WSAD
Views: 83,175
Keywords: cvpr, cvpr2020, cvpr20, computer vision, machine learning, artificial intelligence, autonomous driving, self-driving
Id: g2R2T631x7k
Length: 28min 36sec (1716 seconds)
Published: Thu Jun 18 2020