Autonomous Navigation with Deep Reinforcement Learning Using ROS2

Video Statistics and Information

Captions
In this tutorial we will use deep reinforcement learning to teach a robot how to navigate to a local goal in an unknown environment, that is, without using a previously created map. With the rapid evolution of reinforcement learning algorithms, it has become possible to learn complex tasks such as navigation. In this tutorial we will use the TD3 method. As you can see, using this method the robot successfully reaches the goal even when the environment changes with every test. This tutorial was made based on this repository; thanks to the authors for sharing this great work. Since this project initially worked only with ROS 1, I modified it to work with ROS 2. The code is in the description below.

Now let's see the theory. So, what is TD3? TD3 stands for Twin Delayed DDPG, and DDPG stands for Deep Deterministic Policy Gradient. TD3 is a relatively recent method and it is based on previously developed algorithms, so in order to understand how it works we should begin from the basics of the policy gradient method. Before describing TD3, we will briefly discuss the policy gradient method, the deterministic policy gradient method, and the deep deterministic policy gradient method. In this tutorial I will not explain derivations or proofs of the formulas, and we will go over the theory very briefly; for a more in-depth explanation, please see other materials.

Now let's begin. Here we consider the case of a stochastic parameterized policy π_θ. The policy gradient method estimates the weights of an optimal policy through gradient ascent, by maximizing the expected cumulative reward which the agent gets after taking the optimal action in a given state. To actually use this algorithm, we need an expression for the policy gradient which we can numerically compute. After several expression transformations we can get this formula. This is the basic policy gradient. In this form the policy gradient can be estimated by letting the agent act in the environment using the policy and collecting a set of trajectories. This is the simplest version of the computable expression.

In the case of deterministic policies, the policy takes in the state space S and outputs in the action space A. Instead of taking the integral over actions, we only need to integrate over the state space, since the action is deterministic. This is an expanded form of the deterministic policy gradient. The proof for the deterministic policy gradient is similar in structure to the proof for the policy gradient theorem. An advantage of the deterministic policy gradient is that, compared to the stochastic policy gradient, the deterministic version is simpler and can be computed more efficiently.

DDPG is an algorithm which concurrently learns a Q function and a policy. It uses off-policy data and the Bellman equation to learn the Q function, and uses the Q function to learn the policy. Now I will explain some facts behind the two parts of DDPG: learning a Q function and learning a policy. Let's begin from the Q-learning side of DDPG. In DDPG we are using the mean squared Bellman error function, which tells us roughly how closely Q_φ comes to satisfying the Bellman equation. Here, small d indicates whether the next state is terminal. So the objective of DDPG training is to minimize this mean squared Bellman error loss function.

There are several methods used to successfully train the network. The first is the replay buffer. This is a set of previous experiences. In order for the algorithm to have stable behavior, the replay buffer should be large enough to contain a wide range of experiences, but if we use too much experience, we may slow down the learning.
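To make the replay buffer idea concrete, here is a minimal sketch of an experience replay buffer of the kind used in DDPG and TD3. The class name, tuple layout, and default sizes are illustrative assumptions, not the code from the repository used in the video.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, done, next_state) tuples."""

    def __init__(self, max_size=1_000_000):
        # Oldest experiences are dropped automatically once the buffer is full.
        self.buffer = deque(maxlen=max_size)

    def add(self, state, action, reward, done, next_state):
        self.buffer.append((state, action, reward, done, next_state))

    def sample(self, batch_size=64):
        # Uniformly sample a mini-batch of past experiences.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, dones, next_states = map(np.array, zip(*batch))
        return states, actions, rewards, dones, next_states

    def __len__(self):
        return len(self.buffer)
```

During training, the agent would call add() after every environment step and sample() a mini-batch for each network update.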
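For reference, these are the standard forms of the expressions the video refers to on the slides (the basic policy gradient, the deterministic policy gradient, and the mean squared Bellman error); the exact notation in the video may differ slightly.

```latex
% Basic policy gradient, estimated from trajectories sampled with \pi_\theta:
\nabla_\theta J(\pi_\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T}
      \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]

% Deterministic policy gradient for a deterministic policy \mu_\theta:
\nabla_\theta J(\mu_\theta)
  = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[
      \nabla_\theta \mu_\theta(s)\,
      \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)} \right]

% Mean squared Bellman error minimized on the Q-learning side of DDPG,
% where d = 1 if the next state s' is terminal and 0 otherwise:
L(\phi, \mathcal{D})
  = \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{D}}\!\left[
      \Big( Q_\phi(s, a)
        - \big( r + \gamma (1 - d) \max_{a'} Q_\phi(s', a') \big) \Big)^{2} \right]
```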
The second is the target network. By using a target network, we can stabilize the learning in DDPG-style algorithms. The target network is updated once per main network update by Polyak averaging.

Now about the policy learning side of DDPG. Policy learning in DDPG is fairly simple: the objective is to learn a deterministic policy μ_θ which gives the action that maximizes Q_φ. The action space is continuous and we assume the Q function is differentiable with respect to the action, so gradient ascent is used to solve the above equation.

Finally we get to TD3. DDPG can achieve great performance sometimes, but it is very sensitive to hyperparameters and other kinds of tuning. Often the learned Q function begins to dramatically overestimate Q values, which then leads to the policy breaking, because it exploits the errors in the Q function. TD3 overcomes this issue by introducing three critical tricks (see the update-step sketch at the end of these captions). The first is clipped double Q-learning: TD3 learns two Q functions instead of one and uses the smaller of the two Q values to form the targets in the Bellman error loss functions. The second is delayed policy updates: TD3 updates the policy and target networks less frequently than the Q function; it is recommended to do one policy update for every two Q function updates. The third is target policy smoothing: TD3 adds noise to the target action, to make it harder for the policy to exploit Q function errors, by smoothing out Q along changes in action.

Here is the structure of the network we use in this simulation. The robot state consists of the distance between the robot and the goal, the angle θ between the robot's heading direction and the goal direction, and the translational and angular velocities of the robot. It also contains the distances to obstacles, sampled every nine degrees over a 180-degree range in front of the robot. As described on the previous slide, the clipped double Q-learning trick is used in this network, so there are two identical networks inside one critic network. The policy gets a reward of 100 if the robot reaches the goal and a reward of -100 if a collision happens. Also, at each time step the robot gets a reward which is the difference between the translational velocity and the angular velocity; this reward is applied to make the robot move forward as much as possible (see the reward sketch below).

Now let's execute the simulation. First, we have to train our network. To do this, run the training simulation launch.py script. To observe how training goes, TensorBoard is helpful: open the train velodyne node script in Visual Studio Code, launch TensorBoard by clicking on "Launch TensorBoard Session", and select the current directory. Note that the graph here describes the state after the training is complete. Here we can see that it takes about 3700 steps for the average Q value to converge. To test the training results, run the test simulation launch.py script. As we can see, the robot successfully navigates to the goal.
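The three TD3 tricks come together in a single update step. The following is a minimal PyTorch-style sketch under assumed names and typical default hyperparameters; it illustrates the technique, not the actual training code of the repository (which, for example, packs the two Q functions into one critic module).

```python
import torch
import torch.nn.functional as F

# Typical TD3 hyperparameters (assumed defaults, not the repository's settings).
GAMMA = 0.99          # discount factor
TAU = 0.005           # Polyak averaging coefficient
POLICY_NOISE = 0.2    # std of the target policy smoothing noise
NOISE_CLIP = 0.5      # clipping range of the smoothing noise
POLICY_DELAY = 2      # one policy update per two Q-function updates
MAX_ACTION = 1.0      # action space assumed to be scaled to [-1, 1]


def td3_update(it, batch, actor, actor_target, critic1, critic2,
               critic1_target, critic2_target, actor_opt, critic_opt):
    """One TD3 training iteration on a sampled mini-batch of torch tensors."""
    state, action, reward, done, next_state = batch

    with torch.no_grad():
        # Trick 3: target policy smoothing -- add clipped noise to the target action.
        noise = (torch.randn_like(action) * POLICY_NOISE).clamp(-NOISE_CLIP, NOISE_CLIP)
        next_action = (actor_target(next_state) + noise).clamp(-MAX_ACTION, MAX_ACTION)

        # Trick 1: clipped double Q-learning -- take the smaller of the two target Q values.
        target_q = torch.min(critic1_target(next_state, next_action),
                             critic2_target(next_state, next_action))
        target_q = reward + GAMMA * (1.0 - done) * target_q

    # Minimize the mean squared Bellman error for both critics.
    critic_loss = (F.mse_loss(critic1(state, action), target_q)
                   + F.mse_loss(critic2(state, action), target_q))
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Trick 2: delayed policy updates -- update the actor and targets less often.
    if it % POLICY_DELAY == 0:
        actor_loss = -critic1(state, actor(state)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # Polyak averaging of all target networks.
        for net, target in ((actor, actor_target),
                            (critic1, critic1_target),
                            (critic2, critic2_target)):
            for p, p_targ in zip(net.parameters(), target.parameters()):
                p_targ.data.mul_(1.0 - TAU).add_(TAU * p.data)
```

In this sketch, critic_opt is assumed to optimize the parameters of both critics together, and all batch entries are torch tensors with matching leading dimensions.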
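The reward scheme described above can be written as a small function. This is a hypothetical sketch: the function name and signature are assumptions, and while the per-step term follows the description in the video (translational velocity minus angular velocity), the absolute value on the angular term is an assumed detail so that turning in either direction is penalized equally.

```python
def compute_reward(reached_goal: bool, collided: bool,
                   linear_vel: float, angular_vel: float) -> float:
    """Reward described in the video: +100 at the goal, -100 on collision,
    otherwise encourage forward motion over turning."""
    if reached_goal:
        return 100.0
    if collided:
        return -100.0
    # Per-step shaping reward: difference between translational and angular
    # velocity (absolute value on the angular term is an assumption here).
    return linear_vel - abs(angular_vel)
```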
Info
Channel: robot mania
Views: 7,925
Id: KEObIB7RbH0
Length: 9min 47sec (587 seconds)
Published: Sun Apr 16 2023