AI Learns To Swing Like Spiderman

Captions
I'm sure you've seen this a million times: it's Spider-Man swinging through the city. But something's very different this time around. This Spider-Man is not human; it's actually an AI. This AI started off knowing nothing about its environment, and yet it went from falling flat on its face to swinging through the air at over a hundred kilometers an hour. So how did it accomplish this, and how did it learn to swing with no human assistance?

Let's go back to the beginning. The first thing our AI ever sees is a state. The state could be anything: a view, some numbers, or a picture of a beautiful house. The point is, the AI is given some sort of sensory input about the environment it's in. The AI then takes this state and outputs an action. The AI has a bunch of inputs it can interact with, and the action describes how they should be set. As humans, we do something similar with video game controllers. The actions can be discrete or continuous: discrete actions are like button presses, while continuous actions are more like joysticks or levers. For the sake of the Spider-Man AI, all the actions will be continuous. Finally, the action is applied to the environment, which steps forward in time and outputs a value called the reward. The reward is what motivates our AI, and is given out like a score in a video game: you get points for doing the right thing, and you lose points for doing the wrong thing. The AI's job is to figure out how to maximize the reward it gets by picking the right actions.

So: states, actions, and rewards. This little sequence describes how a single frame, or time step, goes for our AI. In the next time step the environment will change and a new state is produced, leading to a new action and reward. This process continues, gathering a state, action, and reward for each time step. Eventually we may reach a terminal state. This ends the cycle: maybe the AI ran out of time, or succeeded, or decided to go work at Joe's Pizza. The point is, the loop is broken, the total reward we have received is final, and the AI doesn't get to choose any more actions. This closed-off sequence of states, actions, and rewards forms an episode.

By arranging the data like this, we can now select any time step and see its future. The reason we do this is so that we can analyze the outcome of our actions more directly. If we isolate a specific time step, it's tempting to judge actions based on the immediate reward. But if we look into the future, we might see that the reward we got was taken away by a huge punishment. To solve this, we can add every future reward together, creating a new number called the return. This improves the judgment of our actions significantly, but there are still some problems: we aren't really considering the value of time. Yes, we are looking into the future, but how far should we look? Is getting twenty dollars now the same as getting twenty dollars in ten years? Unless you're really patient or broke, probably not. Time has value to us, and there's a good reason for that: the future will always have some sort of uncertainty to it. This means that any reward received in the future should be worth less, since it might not happen. The uncertainty also increases with time, so our future rewards should scale down with how much time has passed.

Let's introduce a new term: the discount factor. This is a value between 0 and 1 which represents how fast the rewards will decay over each time step. Typically this is set somewhere between 0.9 and 0.99. To use this number, we multiply it with the first future time step. Since the discount factor is less than one, this causes the reward to shrink slightly. The time step after that is multiplied with the discount factor twice, making it even smaller. Then we do it three times for the next one, four times for the one after that, and so on. Finally, we can take our new shrunken rewards and add them together. This creates a new value called the discounted return. Note how this differs from the regular return: the discounted return actually takes time into account, giving us a much more comprehensive view of how good our actions are.
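To make the discounted return concrete, here is a minimal Python sketch of the calculation just described. The reward values and the discount factor of 0.95 are made up for illustration; this is not code from the video.

```python
# Sketch of the discounted return: G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
# The episode rewards and gamma = 0.95 below are hypothetical.

def discounted_returns(rewards, gamma=0.95):
    """Compute the discounted return for every time step in an episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards: each return is the current reward plus the
    # discounted return of the following time step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

episode_rewards = [1.0, 1.0, -5.0, 2.0]  # a good start wiped out by a punishment
print(discounted_returns(episode_rewards))
```

Note how the big negative reward at the third step drags down the returns of the earlier steps, which is exactly why judging an action by its immediate reward alone is misleading.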
Speaking of which, how exactly are these actions being chosen? This is where the AI comes in, or more specifically, a neural network. Neural networks are like artificial brains, and their main job is to take input, process it, and make a decision. They accomplish this by linking artificial neurons together, allowing information to flow from one side to the other. Neural networks can also learn, which is accomplished by strengthening or weakening the connections between various neurons, changing how information flows through the network. We'll come back to the learning later; all we're really concerned with at the moment is how to plug this into the environment.

Our network can accept one number for each neuron in the first layer, and since our input will be 20 numbers, we will need 20 neurons. The 20 numbers we feed in will be the state. As I mentioned before, this will contain essential information about the environment: things like the position of Spider-Man, or the direction of his limbs, or the due date on his next assignment. As for the output, whatever the network spits out will become the action. The AI will have full control over both its arms, so for each arm we will need a yaw and a pitch angle, which gives them a 3D direction to point. We'll also need one additional output for each arm to control the web shooters. All tallied up, that's six numbers, so we'll need six neurons in the last layer. Yes, there isn't any leg control, which is unfortunate, but we need this to be as simple as possible so the AI can learn properly. In addition to input and output neurons, the network will also need some hidden neurons. Hidden neurons go in the middle of the network, and they're responsible for giving the AI its intelligence. Without them, the neural network may still learn, but it will be very limited in its capabilities. For the Spider-Man AI, 512 hidden neurons should be enough. This may sound like a lot, but that's actually about 10% of the neurons inside a jellyfish. So in reality, this may be the dumbest Spider-Man that's ever existed. Nevertheless, it's still a very capable learner, and we're going to take advantage of that.

Now, all of the things we've talked about so far describe the basis for a family of algorithms called deep reinforcement learning. Within this family there is a variety of different methods we can use to get the AI to learn, so we'll have to choose one. To get the AI learning for Spider-Man, we will use PPO, which stands for Proximal Policy Optimization. Here's a rundown of how that algorithm works. Our AI, which we will now call the actor, will interact with the environment, generating time steps and episodes. While the actor is running, randomness will be injected into its actions. The actor still gets to choose, but the randomness ensures that we explore all of our options. Finally, we will have another AI known as the critic, which judges how good it thinks our actions are. The critic's job is to try and estimate the discounted return at each time step.
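As a rough sketch of the two networks just described, here is how they might look in PyTorch. The sizes (20 state inputs, 512 hidden neurons, 6 action outputs, 1 value output) come from the video; the tanh activations and everything else are my assumptions, not the author's actual implementation.

```python
# Minimal PyTorch sketch of the actor and critic described above.
# Layer sizes follow the video; activation choices are assumptions.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim=20, hidden_dim=512, action_dim=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, action_dim),
            nn.Tanh(),  # bound each continuous action (yaw, pitch, web shooter) to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim=20, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),  # one number: the estimated discounted return
        )

    def forward(self, state):
        return self.net(state)

state = torch.randn(20)      # a made-up state vector
action = Actor()(state)      # 6 continuous action values
value = Critic()(state)      # the critic's estimate for this state
```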
So not only does the critic have to figure out how the environment works, it also has to figure out how the actor, which is also an AI, is going to react to it. Oh yeah, and it's got the same brain power as the actor. So if you ever thought your job was hard, try being this guy.

Once we have enough episodes and time steps, the actor and critic will look at all the data and learn from it. Due to the randomness we added, there will be a variety of different outcomes in each episode: some of the episodes will be good and some of them will be bad, much like Game of Thrones. The critic is constantly trying to guess a value that we will find out in the future, so we can train it by calculating that value later and comparing it to the critic's prediction, which tells us how wrong the critic was. We can then feed all these error values back into the critic, and through a process called backpropagation, it'll become slightly less wrong next time.

As for the actor, we can train it by maximizing the critic's estimate. In layman's terms, we're an athlete, and we want to make our coach as happy as possible by consistently doing what they want us to do. The critic is constantly trying to predict the discounted return the actor will get, but it can't predict the randomness we add to the actions. So what it ends up settling on is the average of all the possibilities, or the value with the smallest error compared to the randomized outcomes. This slightly changes what it calculates: from the discounted return to something called the value function, which is defined as the discounted return if the AI behaves normally, without any randomness.

We can use the value function as a baseline for performance. It essentially describes how well the actor should perform from a given state. So if the actor manages to surpass it, then the randomness in its actions must have made them better. This also applies the other way: if the actor's performance doesn't reach the baseline, then the randomness in its actions must have made them worse. We can quantify this idea by subtracting the baseline from the discounted return we calculate, creating a new term called the advantage function. We can use the advantage function as a guideline on how to improve the actor's actions: if the advantage function is positive, then we want to encourage the actions associated with it, and if it's negative, then we want to discourage them. For all the time steps we've gathered, we can calculate the advantage function and then multiply it with a particular fraction, giving us the error values to feed into the actor. This fraction contains the probability that the actor selects the action, which is possible because of the added randomness. This part is really confusing, and quite frankly a little too complicated for this video; if you're still interested, I've included some links in the description. So we feed all these values into the actor, and by using backpropagation again, we can make the actor slightly better at picking the right actions.

Our actor and critic are now slightly better. Now what? Well, we throw out all the episodes we gathered and do the whole thing again. In this new training session, the improved actor and critic will produce better episodes, which will lead to new experiences to learn from. We can keep repeating this process over and over again, improving the actor and critic a little each time. Eventually the actor and critic will reach a point where they excel at their tasks. At this point, the training is complete. We can now throw away our critic, and our fully trained actor will dominate the task we gave it.
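A small sketch may help pin down the advantage function and the critic's error described above. All the numbers here are made up for illustration, and a real PPO implementation would batch many time steps and typically use a more refined advantage estimator than this plain subtraction.

```python
# Sketch of the advantage function and the critic's training error.
# advantage = discounted return actually observed - critic's baseline estimate.
import torch

returns = torch.tensor([4.0, 2.5, 1.0])   # made-up discounted returns from episodes
values = torch.tensor([3.5, 3.0, 1.2])    # made-up critic predictions (the baseline)

advantages = returns - values              # positive: randomness helped; negative: it hurt
critic_loss = ((returns - values) ** 2).mean()  # how wrong the critic was, fed back via backprop

print(advantages)    # tensor([ 0.5000, -0.5000, -0.2000])
print(critic_loss)
```

Here the first action beat the baseline, so it would be encouraged; the other two fell short and would be discouraged.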
There's only one flaw with this process. Unfortunately, neural networks are prone to forgetting the things they've learned, which could lead to a disaster similar to Peter Parker's performance issues in Spider-Man 2. To fix this issue, PPO adds an extra step to the learning of the actor. Whenever the actor does a training session, we check how different its behavior has become from its old self. If the current AI has strayed too far from its old self, then we clamp it, effectively preventing it from changing any more in that direction. This simple step helps the actor retain the knowledge it gathered from previous training sessions, which prevents performance drops and allows the new training sessions to be more effective. And for PPO, that's actually all there is to it.

Now let's see what this AI can do. Maybe we'll just stick to swinging. The AI spent a total of 11 hours training, eventually being able to average about 1.2 kilometers per swing. At the beginning, the AI makes no progress at all. Webs are spun randomly, and there's no rhyme or reason to its actions; it just keeps hitting the ground over and over again. After a few training sessions, the AI learns that spinning webs forward is a great way to start. Unfortunately, there is rarely any follow-up, but it's still better than nothing. About half an hour later, it started to learn that shooting more than two webs allows it to go a bit further. It started making decent progress, but it was still very sloppy. One of the keys to swinging like Spider-Man is consistency: one bad web can easily nullify several good ones. We can also see the emergence of what I like to call the Inchworm Desperado. Occasionally the AI gets stuck on the wall and decides to sacrifice speed for survival, leading to this hilariously slow movement.

The AI's progress after that began to slow. Most of the learning that took place after this was mainly for the sake of consistency; the remaining training hours were less about exploring and more about fine-tuning. This AI was one of my personal favorites. It now opts for a backflip at the start for extra flair, and really loves this right wall for some reason. I mean, it really loved this wall. We're now six hours into training, and we can really see the experience of the AI starting to shine through. It now utilizes both of the walls to make the swings more consistent, as well as being a lot faster. There is a newfound sense of calmness to this AI; overall, it just seems more confident with its actions. And here is the final AI. This one has mastered swinging. It now swings down the middle at very high speeds, occasionally topping 150 kilometers an hour. It's so good, in fact, that it doesn't even need to look where it's going, or use its left hand. It has truly fulfilled its destiny. Now, due to some criticism I got on another video, I think I'll just let the AI swing for a little longer. So if you're still here, enjoy.
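As a footnote on the clamping step described above: it corresponds to PPO's clipped objective, sketched below. The probability-ratio form and the 0.2 clip range are standard PPO conventions rather than details stated in the video, so treat them as assumptions.

```python
# Sketch of PPO's clipped actor objective. The ratio compares how likely the
# updated actor is to pick an action versus its old self (the "fraction" from
# the transcript); clipping the ratio keeps the actor close to its old behavior.
# The clip range of 0.2 is a common default, assumed here.
import torch

def ppo_actor_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (minimum) objective; negate because optimizers minimize.
    return -torch.min(unclipped, clipped).mean()

# Made-up numbers for illustration:
loss = ppo_actor_loss(
    new_log_probs=torch.tensor([-1.0, -0.5]),
    old_log_probs=torch.tensor([-1.1, -0.7]),
    advantages=torch.tensor([2.0, -1.0]),
)
print(loss)
```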
Info
Channel: b2studios
Views: 3,109,827
Keywords: spiderman, spiderman ai, piderman ai, spiderman lerrns to swing, spiderman learns to swing, spiderman swinging, swinging ai, ai rope swinging, spiderman swinging ai, ppo, spiderman ppo, b2studios ppo, spiderman ai ppo, monkey swinging, monkey swinging ai, reinforcement learning
Id: Y48Vk77MoYg
Length: 15min 29sec (929 seconds)
Published: Wed Feb 01 2023