Self-Driving Cars: Behavior Estimation (Benedikt Mersch)

Captions

Hi, and welcome to the lecture on behavior estimation for self-driving cars. This lecture is part of the course Techniques for Self-Driving Cars here at the Photogrammetry and Robotics Lab at the University of Bonn.

Before we start, I would like to motivate the topic and show you an example of how behavior estimation works in practice. This example is from the recent Tesla AI Day. On the right you will see the autonomous vehicle in red with its planned path, and you will see the perceived environment. The car is equipped with sensors, in this case cameras. This is the front camera on the left side, and this is what the car thinks about its environment. You will see other detected cars, and you can already guess that this is a very packed street with a lot of parked cars. It is not easy to navigate through this street, especially if other cars are involved.

In this example there is an oncoming car, detected here in yellow. Because of the higher velocity of the yellow car, the red car first plans to yield to it. But then it realizes that the yellow car is also yielding and slowing down, so our ego path changes and we plan to go past these parked cars. If we continue the situation, another car comes up in yellow which is even faster. Now we run our behavior estimation again and try to estimate what the yellow car is going to do. There are two possibilities. One is that the yellow car is going to yield, and our prediction method assigns it a low probability. The other is that the oncoming car is going straight and passing the parked cars, and we assign that a higher probability. Because of that, the ego plan will be to yield or to pull over in order not to block the road or crash into the yellow car. But if we run the behavior estimation again, we realize that the car yields instead and slows down, such that our prediction of the future of the yellow
car enables us to pass now, because we realize that the yellow car is going to yield and we can go because the way in front of us is free.

So where can we integrate behavior planning into our planning pipeline? If we think about planning in general, we can divide it into different modules, so it is a hierarchical approach. I show here the global planning, the behavior planning, and the local planning. These modules each have a different abstraction level and a different frequency, which is the frequency with which you update the plan. At the bottom there is a lower abstraction level and a higher frequency.

For the global planning, we put in our start and goal configuration, and the global planner returns a very coarse path we can take. For example, it takes a map of the streets of the city and then plans where we roughly want to go. This is why we have a high abstraction level: we are not considering traffic signs or other traffic participants. We can also run it at a low frequency, because we only need to update it if there is a traffic jam or a blocked road.

In the middle we have the behavior planning, which handles different situations, for example at an intersection or on a highway. The behavior planner decides whether you stay behind a vehicle in your lane or overtake it with a lane change. All these different maneuvers are handled in the behavior planner, which runs at a mid frequency and has a mid abstraction level.

Below that we have the local planning, which, based on the maneuver, outputs the actual trajectory the car wants to follow. Here we have a low abstraction level, because we need to take a lot of information into account, and we run it at a high frequency, because for example other vehicles may cross our path and we need to re-plan the local plan in order to
avoid collisions. So if we observe new things in our environment, we need to adapt our local plan more frequently.

Behavior estimation can now be considered a part of the behavior planning. It can support the decision making there, because in order to decide which maneuver to take, we need to know what other traffic participants are going to do in the future. In general, the behavior planning plans a maneuver to follow the global plan, and the transitions between maneuvers of course depend on the other traffic participants. We have seen this in the video before, where our own maneuver, yielding or going straight, depended on what the yellow car would do. We changed our decision from yielding to going straight just because we saw that the other car was yielding for us. This shows that behavior estimation can actually support the decision making.

Here is another example with an intersection. Here we also need to perceive the environment and check the different traffic lights, but we also need to check where the other cars are, because it can always be the case that other cars violate the traffic rules. Even if we have a green light and the others need to wait, one of the cars might just run the red light. So it is important to estimate the intention of the other cars: are they actually breaking the rule now, or are they still following the rule and yielding?

So what information can we use in order to estimate the behavior of others? One common input is the past states of other traffic participants, which can be position, velocity, acceleration, or orientation. These usually come from an on-board perception system with a detection and tracking module, or we can obtain them by communicating states among the traffic participants, if this is technically possible. Another input we can use is map information. If
we have an HD map of our environment, we know where the different lanes are and we know about traffic lights, and we can use this for an informed decision. For example, if a car is in a lane dedicated to lane changes, we know that the car is most likely going to do a lane change. Finally, we can also use sensor data in general, from camera, lidar, or radar sensors, as you can see here. For example, we could infer from a camera image that the car in front of us is going to turn just because of the turn signal, or if we see brake lights, we know that the car is probably decelerating.

To be more precise about the problem, we can model the driving scenario with n agents as a partially observable stochastic game. The game is partially observable because we do not observe the full environment: there is noise from our sensors, and we might have occlusions, so we are probably not able to see all traffic participants around us. It is stochastic because the state transition is probabilistic.

In this graph, the nodes denote the variables and the directed edges show the dependencies among these variables. The n agents in this model are stacked behind each other along the depth dimension, and the horizontal axis denotes time. First, we have a physical state x of an agent i at time t, which could be the position, the heading, the velocity, or the acceleration. This state is usually not directly observable. What we get from the environment are observations: each agent i gets an observation at time t of its own state but also of all the other states. This dependency shows that we can observe parts of these physical states but not all of them, depending on the observation model we have. After that, each agent i updates its internal state b_i based on the internal state at time t minus 1, so from the previous time step, and the received measurement. The internal state represents, for
example, the goal state the agent wants to reach, the desired velocity, or the driving style. In this model it depends, as I said, on the previous internal state and also on the new incoming measurements, so if we observe a different physical state of others, we might update this internal state. Finally, we have a control action u_i, which is selected based on a policy, and this policy depends on the updated internal state. Applying the control then leads to a new physical state: we have a transition function that models how the state evolves from time t to t plus 1, given the previous state and the control input we applied.

So what are possible estimation tasks we can perform on this graph? One is state estimation. Here we want to infer the states of all agents at a specific time step, and if we do this for all states in the past, we have the full past trajectory of states. This could be, for example, a localization of our own ego state, or, if you estimate the states of others, detection and tracking.

Another task is intention and trait estimation, where we are interested in the internal states of the n agents at time t. We will have a definition of intention and trait estimation later, but in general this is modeled by the internal state. The internal state can, for example, be used to integrate the navigational goals of others into the ego behavior planning, so one task might be to reason about where the other traffic participants want to go.

The last task is motion prediction. Motion prediction aims at estimating the physical states of the agents for future time steps, from t plus 1 up to a specific horizon t_f. You can see that this depends, of course, on the state we are
starting from, so the initial state, but also on the control policy and the internal states. In this lecture we will focus on intention and trait estimation as well as motion prediction.

Let's start with intention estimation. Here you see a situation with a self-driving vehicle in purple that wants to make a left turn, and another vehicle in white that might turn right or might go straight. Our goal is now to infer just this high-level behavior, that is, to answer the question: is the white car going to turn right, or is it going to continue driving straight? If we have inferred this behavior, we can adapt our own planning based on the estimate. If we estimate that the car is going to make a right turn with a high probability, we can already make our left turn. If we estimate that it is going straight, we need to wait and yield until the white car has passed us.

In general, there are different levels of interaction in intention estimation. Here, the white vehicle will most likely not change its goal based on our planning. But there are other situations, for example a highway merging scenario. If you want to merge onto the highway, there might be a cooperative driver on the left, and if you signal that you want to merge in front of him, he could decelerate and let you merge if the traffic is very packed and it would otherwise be hard to merge. But an aggressive driver might continue and even accelerate, so that you should then merge behind him. So here behavior estimation also depends on the interactions between the different participants.

In general, in intention estimation you want to infer what other drivers want to do in the future. As shown in the example, this can often be modeled with a probability distribution over high-level behavior modes. We have different actions like lane changing, turning, or overtaking, and we can assign a probability to
each of these actions, or rather estimate the probability, and then adapt our planning based on this estimate, for example by conditioning our motion prediction on the estimated intention and then using the future trajectory of the traffic participants to plan our path.

There are different intention estimation paradigms, and in general we can differentiate between recursive estimation and single-shot estimation. In recursive estimation, we have a probability distribution over the internal state of the agents at a specific time step, and we model it as a function of our belief at the previous time step and our new observations. These are the observations we as the ego vehicle got at time step t, and we always update our belief recursively. In single-shot estimation, we model the probability distribution of the internal state as a function of a specific window of observations from the past.

Some techniques you can apply here are, in general, Bayesian models, where we explicitly define conditional probability distributions and then infer the posterior probability given the observed variables. Another method is deep learning, where we define a mapping from input to output and optimize this mapping based on data from a data set. Finally, there are also game-theoretic methods. These model intention estimation as a game of players in which each player tries to optimize its own cost function. With game theory you can explicitly reason about the interactions, and especially about how our own actions will influence the decisions of others.

Okay, let's continue with trait estimation. The difference between intention and trait estimation is that with trait estimation we want to determine how the goal is reached or accomplished. So it is about different traits, and these traits depend on the skill of the driver or on preferences such as social preferences, for example more aggressive or more cooperative, and this
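The recursive estimation paradigm described above can be sketched as a discrete Bayes update over behavior modes. This is only a minimal illustration, not the method used in the lecture: the function name, the two modes (taken from the white-car example), and the likelihood numbers are all assumptions, and the sketch assumes the intention itself does not change between time steps.

```python
def update_intention_belief(prior, likelihoods):
    """One recursive step: posterior is proportional to likelihood times prior.

    prior:       dict mode -> probability from the previous time step
    likelihoods: dict mode -> p(new observation | mode), assumed to be given
    """
    unnormalized = {mode: prior[mode] * likelihoods[mode] for mode in prior}
    total = sum(unnormalized.values())
    if total == 0.0:
        # The observation is impossible under every mode; keep the old belief.
        return dict(prior)
    return {mode: p / total for mode, p in unnormalized.items()}


# Hypothetical example: start undecided, then see two observations that each
# fit a right turn better than going straight (made-up likelihood values).
belief = {"turn_right": 0.5, "go_straight": 0.5}
observation_likelihoods = [
    {"turn_right": 0.7, "go_straight": 0.3},
    {"turn_right": 0.8, "go_straight": 0.2},
]
for likelihood in observation_likelihoods:
    belief = update_intention_belief(belief, likelihood)
```

A single-shot estimator would instead map the whole window of past observations to the distribution in one step, without carrying a belief from one time step to the next.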
can be inferred with trait estimation. In this example, which I already mentioned, we have a highway merging scenario. Consider the blue car on the left that is in the lane, and we are the orange car that wants to merge. We can, for example, estimate whether the driver is cooperating with us, in which case he might even decelerate and let us pass in front, as shown here. Or, if he is more aggressive, we can estimate that he will probably not let us pass and might even accelerate, and we should then merge behind him.

Example traits are the policy parameters of a driver model. We have a fixed model and need to infer different parameters to use it. These can, for example, be the minimum desired gap a driver wants to maintain. In a driving scenario on a highway, each driver has a different minimum desired gap: a truck driver usually wants a larger gap because he needs more time to brake. Another one is the maximum feasible acceleration, since not all cars can accelerate in the same manner. These are parameters we can infer from observing driving behavior and then use in our driver model to model, for example, the future behavior.

We can also infer parameters of a cost function. If we assume that each player has a specific cost function and drives in a way that optimizes it, we can infer the different parameters or weights, for example how the driver weights progress on the street against the control effort or the acceleration needed for it.

In general, we can distinguish between offline and online methods. In an offline method, we estimate all these parameters in advance based on observations. In contrast, in online methods we can update these parameters: if we observe a new driver and his behavior does not fit our model, we can update the parameters to get a better estimate of his behavior. Some paradigms we can use here are,
again, Bayesian models, where we have, for example, a prior distribution over trait parameters and then condition this distribution online on new observations, for example when a new driver or a new type of driver comes up. You can also use optimization methods, for example inverse reinforcement learning, where we try to infer a reward function from expert driver demonstrations. Finally, it is also very common to use heuristics. In this case an expert defines and tweaks some parameters that are very interpretable and uses his expert knowledge to model how drivers behave on the street. Of course, this is usually less realistic and also happens offline: the expert needs to tune it in advance, and then you deploy it and the parameters are fixed.

Now we have a look at our third topic, which is motion prediction. If we go back to our example, in motion prediction we do not want to know just the high-level behavior, so we are not interested in just turning right or driving straight, but we want to know the exact trajectory, for example the x and y positions of the white car over time. This can be used to plan our own path and, for example, to check whether there is a collision if we want to turn left here. If we do the intention estimation before, this can of course inform the motion prediction: if we already know that the white car is going straight with a very high probability, it is easier for us to estimate the underlying trajectory.

So in motion prediction we want to predict the future states of n traffic participants for a specific time horizon. If we go back to our graphical model, we can see that this can be modeled by a state transition function, so the future state depends on the past state and the applied control of each agent. The state transition model can, for example, be physics or geometry based, or also learned from data. If you have this model and you know the states and the controls, you can model the future states of a car, but of course
in most situations you also want to capture the interactions among traffic participants.

So what are the different outputs we might be interested in? The one I already mentioned is just a single trajectory, for example x and y over time. We can also output multimodal trajectories: if there are different possibilities for the future, we can output more trajectories that reflect, for example, the two modes in the previous example. We would then output some trajectories that go straight and some that turn right, and the planner needs to account for these different possibilities. Another way is to output bounding boxes, 3D or 2D, which can also be used in planning to reason about the dimensions of the traffic participant, which is needed for avoiding collisions. It is also possible to output Gaussian distributions. Here we also have a notion of uncertainty, so we know how certain our model is about the future position of another agent.

It is also very common to use occupancy grid maps. Here we do not have a notion of what is a car, what is a truck, or what is an agent in general, and what is just a static obstacle like a wall. We just model everything in a grid map and reason about the occupancy of the different cells. We can easily use this in planning to plan a collision-free path: we do not want to traverse areas with a high occupancy probability.

Then there are forward and backward reachable sets. With the forward reachable set, we derive the future states we can reach from an initial state under specific constraints, so here we can only predict feasible trajectories. With backward reachable sets, we get the initial states from which a goal can be reached under specific constraints, and this can be used to check if a
state is, for example, unsafe because of a collision, or whether we can reach a specific state without collisions. We can also model adversarial human behavior by assuming that another car wants to hit us with the maximum velocity or acceleration it can achieve, and the backward reachable set can be used to plan a safe path with respect to this adversarial setup.

Finally, it is also possible to predict the raw sensor data, for example what the camera will see next or what the lidar will receive next. The advantage is that we can do this without needing the true trajectories to evaluate the performance, because we can always get the next sensor reading and check how good our prediction actually was. But of course this is harder to use in planning, because we still need to reason about what we are going to do with this information about the future sensor reading.

So let's look at some paradigms we can use for motion prediction. One is closed-loop forward simulation: you roll out a closed-loop control policy for each agent at each time step. If you look at this pseudocode, which is based on the graphical model we have seen before, we run these steps for each time step and for each agent. We first receive the observation based on the physical states of the n agents at that time step. With this new observation we update our internal state using the mapping h, so we take the previous internal state and update it with the new observation. We then feed this internal state to our control policy, which gives us the control for a specific agent at a specific time step. Then we can use the already mentioned motion model to derive the next state from the previous physical state and the control we apply. Depending on the observation function, we have an interaction-aware
approach. If we perceive a lot of the environment, so if our observation captures a lot of the real physical states, we can include this and reason about what other traffic participants are going to do when we define our internal state, and we can account for this in the prediction. But if, for example, a lot of occlusions happen, it is probably not easy to include this knowledge about future interactions, so this depends on the observation function. A drawback of the method is that we need to define all these mappings, functions, or models, and especially the control policy. We need to define which control action each agent is going to take at a specific time, and this of course depends on its own preferences, constraints, or goals, so we need to reason about these when we want to find a good policy for each agent.

Another paradigm for inferring the future motion is independent prediction. Here we do not consider the other agents in the scene, but predict the future motion of each agent individually. This is very fast and parallelizable, because we can run it for all agents in parallel, but of course we do not model any interactions among them. So we might predict trajectories for the agents that actually cross each other and would lead to a collision, because we did not consider any of the other agents. This outcome is usually less likely in reality, because traffic agents try to avoid each other and do not want to collide. This is one drawback of this method.

Finally, we also have game-theoretic approaches, which I already mentioned. Here we model the driving scenario as a game in which each agent is a player, and with this we can explicitly reason about how the others react to the future trajectory. So the interaction between the different traffic
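The closed-loop forward simulation described above (observe, update the internal state, select a control, apply the motion model, for every agent at every time step) can be sketched as follows. The pseudocode slide itself is not part of the captions, so this is a reconstruction under assumptions: the function names, the noise-free observation, the toy gap-keeping policy, and all numbers are hypothetical and only serve to make the loop runnable.

```python
DT = 0.5  # time step in seconds (assumed)


def forward_simulate(states, internals, observe, update_internal, policy, transition, steps):
    """Roll out a closed-loop policy for every agent and return all joint states."""
    rollout = [list(states)]
    for _ in range(steps):
        # Every agent observes the same joint state before anyone moves.
        observations = [observe(i, states) for i in range(len(states))]
        internals = [update_internal(b, o) for b, o in zip(internals, observations)]
        controls = [policy(b) for b in internals]
        states = [transition(x, u) for x, u in zip(states, controls)]
        rollout.append(list(states))
    return rollout


# Toy single-lane models. A state is (position, velocity).
def observe(i, states):
    # Assumed perfect observation: own speed and gap to the nearest agent ahead.
    pos, vel = states[i]
    gaps = [p - pos for j, (p, _) in enumerate(states) if j != i and p > pos]
    return {"vel": vel, "gap": min(gaps) if gaps else float("inf")}


def update_internal(internal, observation):
    # The internal state keeps the agent's desired speed and the latest observation.
    return {"desired_vel": internal["desired_vel"], **observation}


def policy(internal):
    # Brake if closer than a two-second gap plus 5 m, otherwise speed up if too slow.
    if internal["gap"] < 2.0 * internal["vel"] + 5.0:
        return -3.0
    return 1.0 if internal["vel"] < internal["desired_vel"] else 0.0


def transition(state, accel):
    pos, vel = state
    return (pos + vel * DT, max(0.0, vel + accel * DT))


# Agent 0 is a slow leader, agent 1 a faster follower starting 20 m behind.
rollout = forward_simulate(
    states=[(20.0, 5.0), (0.0, 10.0)],
    internals=[{"desired_vel": 5.0}, {"desired_vel": 10.0}],
    observe=observe, update_internal=update_internal,
    policy=policy, transition=transition, steps=10,
)
```

Because the follower's observation contains the gap to the leader, its predicted motion reacts to the leader, which is the interaction-aware property mentioned in the lecture. If the observation function dropped the gap (for example due to occlusion), the same loop would reduce to an independent prediction.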
participants is also considered for the future time steps. But it is usually not easy to solve this problem, especially with an increasing number of agents, because then we also have more interactions among these agents.

Now we will look at some motion prediction models in more detail. We start with the constant velocity model, which is an independent prediction method. You see that there is only one car involved in our prediction; this is the ego car, for which we want to predict the future motion. We know the past states of the car, and independent means that we do not consider any other cars in the environment. We just take the past velocity of the car and assume that it will keep this velocity and also its current orientation. This leads to a straight-line prediction for the future states. You can do this with a single trajectory, but you can also sample, for example, angular offsets and then output multiple trajectories.

This baseline is very easy to implement and usually achieves good results. If you consider a highway scenario, cars that do not merge or change lanes will mostly just drive in a straight line with a constant velocity and only sometimes decelerate or accelerate depending on the traffic situation, so for most of the time it is already a very good baseline. But if you consider non-linear motion, as here, where the road has a curvature, or if, as I said, the car decelerates or accelerates, the constant velocity model cannot capture this. You can then maybe use a constant acceleration or yaw rate model. But all these methods are independent: they do not consider any interaction between cars, which can lead to worse predictions if there are other cars in your environment.

One way to tackle this that was introduced in the literature is the social forces model. Here we consider different agents, in this
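The constant velocity model described above fits in a few lines. This is a minimal sketch; the function name, the argument layout, and the example numbers are assumptions, not taken from the lecture.

```python
import math


def predict_constant_velocity(x, y, speed, heading, horizon, dt, heading_offset=0.0):
    """Straight-line prediction: keep the last speed and orientation.

    heading and heading_offset are in radians; returns future (x, y) points
    at dt, 2*dt, ... up to the horizon.
    """
    theta = heading + heading_offset
    n_steps = int(round(horizon / dt))
    return [
        (x + speed * math.cos(theta) * k * dt, y + speed * math.sin(theta) * k * dt)
        for k in range(1, n_steps + 1)
    ]


# Single trajectory: a car at the origin driving along x at 10 m/s.
single = predict_constant_velocity(0.0, 0.0, 10.0, 0.0, horizon=3.0, dt=1.0)

# Multiple trajectories from sampled angular offsets (values chosen arbitrarily).
hypotheses = [
    predict_constant_velocity(0.0, 0.0, 10.0, 0.0, horizon=3.0, dt=1.0, heading_offset=offset)
    for offset in (-0.1, 0.0, 0.1)
]
```

Since each call only uses the state of one agent, the predictions for all agents can be computed in parallel, which is the speed advantage of independent prediction mentioned earlier.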
example agents A and B, and we assume that they act in a force field. We know from physics that if a force is applied to a point mass with mass m, you can derive the acceleration of this point mass, and from the acceleration, with some initial conditions, you can retrieve the trajectory by solving this differential equation.

Here you see an example with agents A and B, and we want to predict the future motion of agent A. The question is what these forces are and how we determine them. We can model the influence of agent B on agent A as a force acting in this direction, so against agent A, because we observe from human motion that pedestrians, for example, do not want to crash into each other and tend to avoid each other. At the same time, we have a goal we want to reach, so we can model this drive to the goal with another force that acts in the direction of the goal. We can do the same with static obstacles and model their influence on our motion as a force acting against us. In this example, the solution could be something like this: we will finally reach the goal, but we will take a small right turn in order to keep more distance to agent B.

Here is a simulation of the social forces model. There are point masses, or pedestrians, coming from all four sides. They first go to the middle, where they get assigned a goal, and then change direction. The goal acts like an attractive force, and at the same time the other participants around each pedestrian act against it. You can see that they avoid collisions, but sometimes, as here, you also see states where the pedestrians are blocked. So it is a good, easy-to-implement simulation, but it does not explain all the
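The idea of agent A moving in a force field, with an attractive goal force and a repulsive force from agent B, can be sketched with one Euler integration step of F = m * a. The lecture does not give the functional form of the forces, so the choices below are assumptions: a goal force that relaxes the velocity toward a desired speed, an exponentially decaying repulsion, and all parameter values are illustrative only.

```python
import math


def social_force_step(pos, vel, goal, others, dt=0.1, mass=1.0,
                      desired_speed=1.3, relax_time=0.5,
                      repulse_strength=2.0, repulse_range=0.5):
    """Advance one point-mass agent by one time step under goal and repulsive forces."""
    # Goal force: pull the velocity toward the desired speed in goal direction.
    gx, gy = goal[0] - pos[0], goal[1] - pos[1]
    dist_goal = math.hypot(gx, gy) or 1e-9
    fx = mass * (desired_speed * gx / dist_goal - vel[0]) / relax_time
    fy = mass * (desired_speed * gy / dist_goal - vel[1]) / relax_time
    # Repulsive forces: push away from every other agent, decaying with distance.
    for ox, oy in others:
        dx, dy = pos[0] - ox, pos[1] - oy
        dist = math.hypot(dx, dy) or 1e-9
        magnitude = repulse_strength * math.exp(-dist / repulse_range)
        fx += magnitude * dx / dist
        fy += magnitude * dy / dist
    # Newton's second law, then explicit Euler integration.
    new_vel = (vel[0] + fx / mass * dt, vel[1] + fy / mass * dt)
    new_pos = (pos[0] + new_vel[0] * dt, pos[1] + new_vel[1] * dt)
    return new_pos, new_vel


# Agent A walks along +x toward its goal; agent B stands slightly to A's left
# (positive y), so the repulsion pushes A to the right (negative y).
pos, vel = (0.0, 0.0), (0.0, 0.0)
path = [pos]
for _ in range(50):
    pos, vel = social_force_step(pos, vel, goal=(10.0, 0.0), others=[(3.0, 0.2)])
    path.append(pos)
```

Static obstacles could be added in the same way as additional repulsive terms. Finding force shapes and parameters that match observed behavior is the hard part, as the lecture points out.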
behaviors you can see in real life. The main challenge is to define and parameterize the forces that explain the behavior. It is probably less intuitive to model this with forces: each force needs to be modeled, and you need to find parameters and maybe even adapt these parameters to different agents. As you saw before, it also leads to less realistic predictions. This was an easy example with just four corners, but if you consider a street or a hall where people move in all directions, the approach is sometimes not very realistic; sometimes people also walk together in one direction or diverge. Another problem is that it does not apply to cars that follow a road structure, at least not the vanilla social forces model. In highway driving we have specific rules and a lane structure, and it is not easy to implement this in such a social forces model, which was developed more for pedestrian motion prediction.

So the question is whether we can model human driving behavior with a different model, and one solution is the intelligent driver model. As you see here in the bird's-eye view picture of a highway scene, the intelligent driver model is intended for scenarios where cars are following each other. You can already guess from the image that the future motion of cars that follow each other mainly depends on some parameters like the distance between them or the velocity they want to go at, and this is the main concept of the intelligent driver model. It is a car-following model with some parameters, which we will talk about on the next slide. The output of the model is the acceleration of an ego vehicle. Here we have this car-following situation with our ego vehicle on the left, and the model gives us the acceleration as a function of the current speed of the ego vehicle, the speed of the car in front of it,
and also the distance between the two cars, as well as the parameters. So let's have a look at these parameters and at how the acceleration is actually computed.

This is the function for the intelligent driver model. You see basically two parts here. The s star denoted here is the desired distance you want to keep, but it is a dynamic distance, so it also depends on the velocities. We will now go through the different parts in more detail.

First, let's talk about the parameters we need to obtain and put into the model. We have the maximum vehicle acceleration a, which is what the car is able to accelerate at maximum or what the driver wants to do at maximum. We have the desired velocity of the driver here, and in the numerator you see the actual velocity of the driver. In the desired dynamic distance you see the minimum spacing in congested traffic: when the car stands still, this is the minimum spacing a driver wants to keep. There is the desired time headway, roughly the time it would take you to reach the car in front. And there is the comfortable braking deceleration b. This also depends on each driver, so you need to infer or estimate it beforehand or somehow tune it, and you also need to find an exponent for this relation here.

In this model we can identify two different parts. One is the free-road behavior. If you consider this right part and you have a free road, so there is no car in front, then the distance to the next car goes to infinity, or is at least very large if you cannot see another car in front. This means that this term goes to zero, and the acceleration is determined only by the left part. This is the free-road behavior: if the actual velocity is close to the desired velocity, this ratio becomes one and we will not accelerate further, and if the actual velocity is, for example, lower, we will have an
acceleration to reach our desired velocity the right part is called the interaction term so if you look at this formula you can see that if there's a small net distance which means that this distance here is small the impact of the interaction term will increase so if there is a close car in front of you this term will get larger and will influence our acceleration so if you look here if we have a large speed difference this is the delta v alpha here so if the speed difference between the car in front and yourself or the ego car is large we will then brake more in order to increase the distance again this is the sort of safety mechanism most drivers apply and if the delta is very small you can see this interaction term more like a small repulsive force because only this part remains and this force basically results in an equilibrium net distance so for specific velocities of the two drivers we will then just keep a specific distance to the car in front and you can derive this from the function here some advantages of the intelligent driver model are that it's simple but effective so there are just a few parameters involved and it can model a lot of different situations and the parameters are very intuitive so you've seen that we can probably already define some of them just by hand from our experience but at the same time in some scenarios the driver model is less realistic so we probably cannot cover all situations like an intersection or a roundabout or emergency stopping so some of them are covered by the model but since it's a car following model we cannot use it for all scenarios we can imagine and it does not work well for pedestrians because pedestrians have different priorities than drivers and they're not paying attention to for example the distance and also the velocity when it comes to the pedestrian in front of them so there's just a different behavior for
pedestrians so the question arises how can we model more realistic motion so we have now seen some models but usually all these independent predictions or model based predictions with some parameters do give good predictions but are sometimes less realistic so it's hard to really make them look like real predictions or like trajectories we can observe in real data and one way to approach this problem is with deep learning based methods so here the goal is that we learn to predict a future trajectory from large real world data sets so we have a lot of examples of how drivers drive in the real world and we can learn to predict this future motion so the advantages of these models are that we have an implicit trait modeling so at least in the basic vanilla models we do not need to explicitly model the traits like high level behaviors or define transition functions for example this is all included in the model and these models usually have a high representational capacity so we can capture for example different traffic agents like cars or pedestrians or trucks and we do not need too many hand crafted rules or to decide for each agent which model we are going to apply there are also some disadvantages of the deep learning methods so the parameters are no longer interpretable so we cannot pick a parameter and say this specific weight is responsible for this or tune it by ourselves so in contrast to the intelligent driver model where the parameters are very intuitive here we have way more parameters and optimize them without actually saying what each parameter is going to do we are also usually not explicitly modeling interactions so in contrast to game theory methods this is not considered in a lot of deep learning based methods and it's also less robust in unseen scenarios so if you deploy your model which has been trained on some training data in an unknown
environment and for example you encounter a situation that was not in the training distribution then you do not have any guarantees about what the model will output so this is a challenge with the deep learning based methods here's a short reminder of how the main components of deep learning work together so we have our model here and we use past data as the input and output a future trajectory and with the ground truth trajectory you get from the observed data we can compute a loss signal for example with the root mean squared error and this loss can be used to optimize the model weights for example by backpropagation so we compute the influence of each weight here on the loss and by backpropagating we can then update the weights in order to improve the loss at the next step some paradigms that are used for motion prediction with deep learning are listed here so basically what we do is sequence to sequence prediction because we have a temporal sequence of input data and we want to output a temporal sequence of data and some methods for example are recurrent neural networks or convolutional neural networks or a combination of them it's also common to use a graph neural network so here you model each agent as a graph node and the dependencies or the interactions as edges and another way to model motion prediction with deep learning is with transformers so these came up very recently and they're also used in language processing to capture sequences of data and the main advantage is that they have an attention mechanism and can attend from a position in the output sequence to any position in the input sequence so they do not need to process the sequence step by step but have let's say a full view of the input sequence so this solves some problems that arise in the other models finally another way is to use generative adversarial networks here you have a generator
that generates trajectories and a discriminator that gets either the ground truth trajectory which is realistic and real or your predicted trajectory and needs to decide if it's real or not and in this adversarial setting the generator basically tries to fool the discriminator and the generator will learn to output very realistic trajectories so you can explicitly reason about getting trajectories that look real and are very close to actual human behavior and for these methods we can in general distinguish between deterministic and stochastic models so for a deterministic model if you use the same input for your model again you will also get the same output so there's no stochasticity involved and for stochastic models we can for example sample different trajectories for the same input and generate multiple proposals so we can capture multimodal behavior with this for example so in this lecture we will focus on the first two models so the recurrent neural networks and convolutional neural networks and we will now start with the recurrent neural networks so here you see the basic setup of a recurrent neural network so we have our network a in the middle which will be defined later and there is an input x at time t and an output h at time t and the network operates on the input xt but also on some part from the previous iteration so we always have some output and we also feed some part of the output back to the input and if you unroll this in time like here so this is the temporal axis now you see the structure that we use the same model a but we always pass something to the next iteration and with this we can make a prediction here for example based on the full input sequence because we always pass some information to the next time step so what actually happens in a so in the basic recurrent neural network we have this function here so in this case the output h is also the hidden state that gets passed to the next step and you can also
use another layer on top like an output layer to define what your output should look like but in this case we consider an easy example where we just use this hidden state h as output and also pass it to the next iteration so you see that at each time step in order to compute this ht we take the hidden state from the previous time step and our new input xt we concatenate these multiply with a weight matrix add a bias and finally apply an activation function to add some non-linearity and these weights and biases are learnable so these are optimized to fit the data so as you have seen before the optimization step can change these values in order to produce better outputs so the advantages of recurrent neural networks or rnns are that we have weight sharing so at each time step we use the same network a so the same weight matrix w and bias b so in contrast to a fully connected net where we input a whole sequence here we can use the same network for each time step whereas for the fully connected net it's very likely that you for example overfit to some specific positions in the sequence because each weight is dedicated to a specific position in the sequence another advantage is that we can process variable sequence lengths so the length of the input sequence is not fixed so compared to a fully connected net where you only have a specific input size and output size here we can run this process as long as we want or as long as we have input data and then consider the last output as our prediction some disadvantages of the method are that the prediction is usually slow so as you can see here each stage has to wait for the previous stage to output the hidden state or this internal state that gets passed to the next iteration so compared to other methods it's slower especially for long sequences and you can encounter the problem of vanishing or exploding gradients so the information
flow in this network especially for long sequences can be very long since it goes through the whole network so if there is an input here x0 that has a large influence on this output at time t and you compute the loss for this output and backpropagate it's a very long way through the network and these gradients can then accumulate and basically explode or they can also vanish and go to zero such that this input basically does not lead to any update of the weights in a because the gradient already vanished along the way so if you consider this recurrent network for language processing you can think of a sentence where you want to predict the next word at some point in the sentence so you first process all the words in the sentence and if the output depends on the first word in the sentence then this can be a problem because you would need to propagate through all the words here in this case it's not time but the position in the sequence in order to get the influence from the first word so it's in general hard to maintain information from the beginning of the sequence here so one way to resolve this problem was introduced with the long short term memory so this is the setup of the recurrent neural network with our network a and we now replace this network with this more sophisticated structure and we will now have a closer look at this block so you already see that we now have four neural network layers with different activation functions and there are also some pointwise operations on these vectors some vector transfers concatenation and copy operations but we will now go through each of these mechanisms in more detail the first thing you will realize is that there is a new vector which we use in this setup so this is the cell state so at each time step we have a cell state c and we use the previous cell state c t minus one in this model and this as you can see is a direct connection so from t minus one to t
with just some minor operations here so this makes it possible to maintain a cell state and addresses the problem of vanishing or exploding gradients because now it's easier for the network to learn to maintain a specific state throughout the sequence processing so the first part we want to discuss is the forget gate so you see that there is the input xt and the previous hidden state ht minus one and similar to the previous function we apply a projection with the matrix wf and the bias bf which are also learnable and an activation function to compute f and we use a sigmoid activation function such that the values of f are between zero and one and the goal is that this vector now determines which parts of the cell state we want to carry on to the next iteration and which ones we want to forget so if you predict a zero for a specific entry this means that you do not want to keep the cell state at this specific position you basically want to forget it and if you predict one you want to keep it because you will need it for the next iteration another gate is the input gate so here we again use our input xt and the hidden state ht minus one and we compute two things now one is the input gate with a similar structure to the forget gate so we again output a value between zero and one for each position in the vector which defines which parts of the computed cell update we want to keep and apply and which we want to discard and the cell update is computed with this function so we again concatenate these two apply the projection with the bias and apply a tanh so here we scale the values to between minus one and one so this will be our candidate for the new cell state or the cell state update and we then combine them on this line here so we use the forget gate to determine which parts of the cell state we want to keep and which we want to forget and with the input gate vector we
determine which parts of the new cell state update we keep and which we forget and then the cell state for the next step is the sum of these two finally we have the output gate so this gate determines what our actual output for this stage is so we compute the output vector here similarly to the input and forget gate with values between zero and one and the actual output is now based on the newly computed cell state with a tanh activation function and with the ot vector we define which parts we output as ht and which parts we want to throw away basically so with this lstm model you see that there are some more mechanisms involved and it enables us to maintain a cell state that can for example if you consider motion prediction summarize some traits or information you need for predicting the future motion and you can update this cell state based on the new inputs or the new observations you get so this is a newly introduced mechanism here and the question now is how can we use lstms for predicting a full sequence so here we have only considered the prediction of one single output and the question is how can we encode a full sequence and predict a full sequence so if we want to do sequence prediction we can consider an lstm encoder and decoder architecture so we have one lstm here which encodes the sequence so as we have seen before it processes all the inputs we have and finally outputs a hidden state and a cell state at the time step t so we want to make the prediction for t plus one and t plus two until t plus tf and as soon as we get this hidden state and cell state that summarize the input sequence we can use another lstm the decoder lstm that predicts these outputs in an autoregressive fashion so this means that we always take as input in this first case the last physical state we got we take these hidden and cell state representations and output the first future state at t plus one and then at inference or test
time we feed this back to the next iteration step and use this prediction as input and make the prediction for the next time step based on the previous prediction and the updated hidden and cell state during training however we can use a concept called teacher forcing which means that we do not feed back our actual prediction because during training the prediction in the beginning might not make too much sense since we still need to optimize the weights in the network so here it's common to use teacher forcing and to use the real ground truth future values here because at training time we have access to these values and this can speed up learning now if you consider this lstm encoder decoder structure one problem we might still face is that we do not consider the influence of neighbors so in the previous slide you saw that we just processed the input sequence for each agent individually now the question is can we somehow add information about other traffic participants in our environment so in this picture you see the different agents in the scene each is modeled with an lstm encoder decoder architecture you see the predictions here and now the question is how do we connect these predictions on the spatial level because we know these pedestrians are close to each other and they might try to avoid each other and this one is farther away so the influence of the others on this prediction is probably smaller compared to the crowded scene here so there is a model called social lstm that actually works on this problem so here the assumption is that the hidden state of the lstm contains some features about the motion of an agent so as i said before the hidden state could for example encode something like the goal the agent is going for or the overall style whether the agent is more aggressive and accelerating or slower
in general or maybe more cooperative so these features can be encoded in the hidden state and the authors of social lstm proposed the idea to share these hidden states among the lstms so here you see two different time steps t1 and t2 with the different lstms for each agent and in this time step here for the black agent in this example we take the hidden states of the neighbors in blue and in yellow and we pool these hidden states and then feed this as an additional input to the lstm so we share these hidden states or the encoded information about the motion with the neighboring agents and this is done by considering the local neighborhood around the black agent and then for the closest neighbors you take the hidden vectors at this specific time step and pool them into a new tensor h3 and provide this to the next lstm layer so each lstm in the decoder stage then predicts the next cell and hidden state based on the previous cell and hidden state its new input but also the hidden states of the neighboring agents and with this shared hidden state information the authors could show that for some situations you see a difference in the prediction so the method accounts for other agents in the scene and here's a qualitative comparison of the different methods so in yellow we have the ground truth future trajectory in blue is the social force model we discussed orange is the linear model so the constant velocity model and in red dashed we have the social lstm and in these cases you see that the prediction here for example is more accurate because it considers the other traffic participant here or here it predicts a left turn because i think there is something in the way for example so the spatial relation of other agents in the scene is included in the model but of course there are also some failures if the agent is decelerating or accelerating or
in some other modes that are not captured so you see some improvements for interactive scenarios but there are still a lot of challenges involved as you can see in this lower part here so besides the recurrent neural networks we discussed we can also use convolutional neural networks so you might know them from image processing where we use a filter to convolve along the spatial dimensions of the image so the height and width dimension and we can also apply convolutions across the time dimension which is usually easier to train so we do not have the problem of exploding or vanishing gradients since there's no recurrent structure in there but one challenge is that we need a sufficient receptive field so since the filters are usually not large and only look at a specific part of the temporal dimension we need to stack more and more layers to give the last output stage enough field of view on the input sequence to actually predict a meaningful future motion for example so here's an example for a one-dimensional temporal sequence so we have here an input sequence from x0 to xt and in this example you want to predict some outputs for the same temporal axis but you can also imagine that you predict future values here and how the temporal convolution works is that we have these filters here that for example derive a new feature based on these three inputs and then by stacking you can derive more and more high level features and in this case you see that these filters skip some inputs and this is because of dilated convolutions which are a common technique to increase the receptive field so by skipping these inputs at this level we can widen the part of the temporal axis we are looking at such that the output here looks at a much larger time horizon in the past so with stacked dilated convolutions we can counteract this problem of the receptive field one example of convolutional neural networks for motion prediction is the fast and
furious method so here the authors use lidar data and voxelize the lidar points into a bird's eye view then stack these 2d images along a third temporal dimension and then apply 3d convolutional networks as you can see here to detect the bounding boxes in the scene and then with the different detections you can also track the objects over time and predict the future motion as well so here's the detection at the current time step here are the detections for the future time steps and then with the tracking so you track these different objects over time you will get future motion predictions and the interesting thing here is that the whole method is trained in an end-to-end fashion so the method jointly does detection tracking and motion prediction the authors motivate this by the fact that in a perception system you can always have misperceptions so for example it's possible that you do not detect a car and if you do not detect it it will also not get tracked and predicted for the future so if you just assume that you always get perfect input data so perfectly tracked trajectories and you deploy your model in a real world scenario then there are situations where you do not get perfectly tracked trajectories or you miss trajectories and if you optimize your model jointly you do not have these hard decision boundaries where you need to decide at some point am i now going to output a detection to the next module or do i keep this and end-to-end methods try to solve this issue such that the tasks so detection tracking and prediction are jointly optimized and also support each other during training so this tries to answer the question what happens if the perception system fails and here you see the common pipeline we had before end-to-end learning so we have the sensor data we get detections from the data for example the states we use these states and track them over time to get trajectories and
with these trajectories most prediction methods basically consider just this part so they assume to have trajectories and then predict future trajectories which are then used in motion planning to output the plan of the ego car so basically the future ego motion and then this is passed to a controller to output control commands and as i said if you only consider this part with your model and the trajectories are not perfect because of some misperception then in reality your performance can degrade a lot so this is why people came up with end-to-end models that jointly optimize these tasks here you use raw sensor data and output future trajectories so this is close to what the previous method fast and furious does and you can also increase the span of the end-to-end model and go from sensor data to the future ego motion so you include the planning in the model or even go further and just go from sensor data to control commands but of course the more you put into the deep learning model and optimize end to end the less you can interpret the results so you usually cannot check intermediate results and it's also not clear how your model then performs in unknown situations so if you consider this case for example if you optimize this with some data you got from expert drivers that drive on a road since these drivers usually stayed on the road it can be hard to reason about how the model will perform if the car is actually leaving the road because if that's not covered in your training data then it's not clear if the correct control commands will be given such that you find your way back to the street another ongoing field of research is self-supervised prediction here we consider the fact that labeling trajectories is usually very expensive so if you have a large data set and you want to train your prediction method on it you need to label for example the bounding boxes of all the agents at each time step or the trajectories
to train your method and one idea to avoid this problem is that you predict the raw sensor data into the future so for example in this work the authors used the bird's eye view of the environment and predicted the future environment so here you see that there is a car which is approaching the ego vehicle and the prediction will be that it's passing us at this time step or here in this work the authors considered free space forecasting so here you see the classic way with instances you want to predict so you will predict that the green car is going to be here and with free space forecasting we do not need these detections but we just know that here was the last time we saw an occupied space and at the time step we are interested in we predict that this occupied space will be here at least from what we know from our sensor measurements and this can also be used in planning to for example avoid a path that would cross this red area so once you have developed your model the question is how do we evaluate the prediction this can be important for comparing the performance to different methods or baselines but also for training if you want to compute the loss there are two common metrics that are used so one is the final displacement error and the other one is the average displacement error so here we have our car with some past states so the past trajectory and our ground truth future states and if we now predict a trajectory we can evaluate the distance between the predicted and ground truth physical states at the last time step which yields our final displacement error or we can also compute the distances at all time steps and average them to get the average displacement error in general if you evaluate the methods you need to account for the fact that some output unimodal trajectories and some output multimodal trajectories so different hypotheses of the future motion and in this case you for example need to decide which trajectory to use for evaluation so if
the method assigns some probabilities to the future motions then you can for example use the one that has the highest estimated probability and finally i want to name some data sets and benchmarks so as i said the strength of the deep learning methods is that you can train them on real world data and this is why a lot of different data sets were developed and recorded with different sensors or different environments and different interactions for example some are in the city some are on the highway for example these consider highways intersections or roundabouts we have the kitti data set and also some from car manufacturers and at least some of these data sets also provide benchmarks so you can test your model on some data where you don't know the labels so you don't know the future trajectory and you can evaluate the performance of your method and compare to other methods and see how well you perform on this unknown data finally i would like to summarize this lecture so we talked about the estimation of intentions traits or also the future trajectory and we have seen that this can be very useful for planning our own behavior and our own motion and there were different solution strategies which depend on the model complexity the level of interaction among the agents and also on how realistic the trajectories look so some are more model based with some parameters we can use so these are easier models some are more complex with large networks that model human behavior so we can also use large data sets to predict behavior which is very close to actual human driving so we have the demonstrations in these data sets and optimize our methods in order to match the behavior there and with this i would like to thank you for your attention
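
Editorial sketch (not part of the spoken lecture): the captions describe the intelligent driver model only by pointing at a slide, so the formula itself never appears in the text. The block below assumes the slide shows the commonly published form of the model, with a free road term and an interaction term built from the desired dynamic distance s star. The function name `idm_acceleration` and all default parameter values are illustrative assumptions, not values given in the lecture.

```python
import math


def idm_acceleration(v, v_lead, gap,
                     v0=33.3, T=1.5, a=1.0, b=1.5, s0=2.0, delta=4.0):
    """Acceleration of the ego vehicle under the intelligent driver model.

    v      : current speed of the ego vehicle [m/s]
    v_lead : speed of the car in front [m/s]
    gap    : net distance to the car in front [m] (math.inf on a free road)

    The defaults are assumed, illustrative values:
    v0 desired velocity, T desired time headway, a maximum acceleration,
    b comfortable braking deceleration, s0 minimum spacing, delta exponent.
    """
    dv = v - v_lead  # approach rate: positive when closing in on the leader

    # Desired dynamic distance s*: minimum spacing, headway term, braking term.
    s_star = s0 + v * T + v * dv / (2.0 * math.sqrt(a * b))

    free_road = 1.0 - (v / v0) ** delta  # vanishes when v reaches v0
    interaction = (s_star / gap) ** 2    # vanishes when the gap is very large

    return a * (free_road - interaction)


# Usage: closing in on a slower car at a short distance gives strong braking,
# while an empty road below the desired speed gives positive acceleration.
braking = idm_acceleration(v=30.0, v_lead=20.0, gap=20.0)
cruising = idm_acceleration(v=20.0, v_lead=20.0, gap=math.inf)
```

With an infinite gap the interaction term is zero, which reproduces the free road behavior described in the lecture; with equal speeds only the spacing and headway parts of s star remain, which gives the equilibrium distance.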

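Editorial sketch (not part of the spoken lecture): the two evaluation metrics named near the end of the lecture, the average and the final displacement error, can be written down in a few lines. The block assumes trajectories are lists of 2d positions with one entry per future time step; the function names `displacement_errors` and `most_likely_trajectory` are hypothetical helpers introduced only for this illustration.

```python
import math


def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) for two trajectories given as lists of (x, y) points."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")

    # Euclidean distance between prediction and ground truth at each time step.
    distances = [math.hypot(px - gx, py - gy)
                 for (px, py), (gx, gy) in zip(predicted, ground_truth)]

    ade = sum(distances) / len(distances)  # average over all time steps
    fde = distances[-1]                    # distance at the last time step
    return ade, fde


def most_likely_trajectory(trajectories, probabilities):
    """Pick the hypothesis with the highest estimated probability."""
    best = max(range(len(trajectories)), key=lambda k: probabilities[k])
    return trajectories[best]


# Usage: for a multimodal predictor, evaluate only the most likely hypothesis.
ground_truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
hypotheses = [
    [(1.0, 0.0), (2.0, 1.0), (3.0, 2.0)],  # drifts away from the ground truth
    [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],  # matches the ground truth
]
chosen = most_likely_trajectory(hypotheses, [0.7, 0.3])
ade, fde = displacement_errors(chosen, ground_truth)
```

Choosing the most likely hypothesis is only one of the options mentioned in the lecture for multimodal outputs; other selection rules would change the reported numbers.
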
Info

Channel: Cyrill Stachniss

Views: 1,380

Keywords: robotics, photogrammetry

Id: GaCXxffLzX8

Length: 75min 39sec (4539 seconds)

Published: Sat Oct 09 2021