Good day everyone. This is Dr. Soper here, and today I’ll be
discussing the foundations of reinforcement learning, which is an important area within
the broader domain of artificial intelligence. Before we begin discussing the foundations
of reinforcement learning, let’s briefly review what you’ll learn in this lesson. By the time you have finished this video,
you will know: What reinforcement learning is; and
The five principles that underlie reinforcement learning-based artificial intelligence, including:
The input and output system; Rewards;
The environment; Markov decision processes; and
Training & inference. Once we understand all of these concepts,
we’ll be fully equipped to start building some real AI models. So, why wait? Let’s get started! To begin, let’s learn what is really meant
by “reinforcement learning”. Along with supervised learning and unsupervised
learning, reinforcement learning is one of the three primary paradigms of machine learning. In supervised learning, a machine uses matched pairs of input and output values to try to learn a general function for predicting the outputs based on the inputs. In unsupervised learning, a machine attempts to discover previously unknown patterns in a data set without the benefit of any prior knowledge about the data. By contrast, the goal of reinforcement learning
is to train a machine to understand its environment in such a way that it can learn to take actions
that will maximize its cumulative rewards. This is accomplished by trying to find an
optimal balance between continuing to explore the environment on the one hand, and exploiting
what we have already learned about the environment on the other hand.
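To make this exploration-exploitation balance a little more concrete, here is a small Python sketch. The lesson doesn't name a specific strategy, so the widely used "epsilon-greedy" rule below is just one illustrative choice, and the function name and parameters are hypothetical:

```python
import random

def epsilon_greedy_action(estimated_values, possible_actions, epsilon=0.1):
    """Pick an action, balancing exploration against exploitation."""
    if random.random() < epsilon:
        return random.choice(possible_actions)              # explore
    return max(possible_actions,
               key=lambda a: estimated_values.get(a, 0.0))  # exploit
```

With a small epsilon, the system mostly exploits what it has already learned while still occasionally exploring. As noted previously, all reinforcement learning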
systems are based on five principles: An input and output system, the notions of rewards
and cumulative rewards, an environment in which the system operates, Markov decision
processes, and training and inference modes. Let’s learn about each of these principles
in turn. The first principle that we’ll learn about
is the input and output system. It’s important to understand from the outset
that the input and output system is not just for reinforcement learning. Indeed, all artificial intelligence and cognitive
computing systems are based on the notion of converting inputs into outputs. Put differently, all AI and cognitive computing
systems take input data and use those data to generate output, and reinforcement learning
is no exception. This notion of converting inputs to outputs
is a very familiar part of the human experience. For example, if you are driving a car or riding
a bicycle, your senses are constantly gathering input about what’s happening around you. Your mind uses this information to continuously
generate a series of outputs, which in this example might be specific movements of your
arms, legs, hands, or feet, all of which are oriented toward achieving a larger goal, such
as arriving at your destination safely. While the input and output system is familiar
and simple to understand, we need to know the appropriate vocabulary for these concepts
in the context of reinforcement learning. Specifically, in reinforcement learning the
inputs are “states”, which you can think of as referring to the state of the environment,
and “rewards”, which we will discuss in a few moments. The outputs in reinforcement learning are
called “actions”. You can think of actions as answers to the
question, “what should I do next?” The goal of a reinforcement learning model,
then, is to identify an optimal policy that tells us which action to take in any given
state.
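To make this vocabulary concrete, here is a minimal sketch of a policy as a simple lookup from states to actions. The states and actions are purely hypothetical (in practice, a policy is learned rather than written by hand):

```python
# A minimal sketch of a policy: a mapping from states to actions.
# The states and actions here are purely illustrative.
policy = {
    "light_is_green": "keep_driving",
    "light_is_yellow": "slow_down",
    "light_is_red": "stop",
}

def choose_action(state):
    """Answer the question 'what should I do next?' in the given state."""
    return policy[state]

print(choose_action("light_is_red"))  # -> stop
```

Next, let’s talk about rewards. As with the input and output system, rewards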
are a part of every AI or cognitive computing system. More specifically, every cognitive computing
or artificial intelligence system is trained using some sort of reward function. So, what is a reward? Well, in the context of AI and cognitive computing,
a reward is simply a metric that tells the system how well it is performing. Many different types of reward functions are
possible, but two of the most common reward structures are based on maximizing gains or
minimizing losses. The “best” reward structure to use is
highly context dependent. Put differently, there is no universally ideal
reward; instead, the best reward function to use depends on the nature of the problem
that the system is trying to solve. For instance, consider this example of a reinforcement
learning system figuring out how to play the classic Atari game “Breakout”. Several different reward structures are possible
here, one of which might be “earn as many points as possible”, while another might
be “stay alive as long as possible”.
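To make those two candidate reward structures concrete, here is a small Python sketch. The event fields below are hypothetical stand-ins, not the actual Atari interface:

```python
def points_based_reward(event):
    """Reward structure 1: earn as many points as possible."""
    return float(event["points_scored"])

def survival_based_reward(event):
    """Reward structure 2: stay alive as long as possible."""
    return -10.0 if event["lost_a_life"] else 1.0  # reward each surviving step
```

Notice that the very same game events would produce quite different training signals under the two structures. Note that the term “reward” can be applied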
to both immediate (short-term) rewards and cumulative (long-term) rewards. It is often the case that an AI system must
learn to forgo immediate rewards – or even accept immediate losses – in order to maximize
its cumulative rewards. Now that we know about rewards, we can think
about the general goal of a reinforcement learning system a bit differently. That is, the general goal of such systems
is to maximize their total accumulated rewards over time.
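Since cumulative reward is just the sum of the rewards received over time, a toy calculation (with invented numbers) shows why forgoing an immediate reward can pay off:

```python
# Two hypothetical courses of action and the immediate reward at each step.
greedy = [5, 1, 1, 1]    # grabs the largest immediate reward up front
patient = [-2, 4, 4, 4]  # accepts an immediate loss at the first step

print(sum(greedy))   # 8
print(sum(patient))  # 10  <- the larger cumulative reward over time
```

The third element of reinforcement learning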
systems is the environment. You can think of the environment as the setting
or surroundings in which the system is operating. The environment is the system’s source of
information about states and rewards. It’s also important to understand that the
environment defines the “rules of the game”. What I mean by this is that the environment
determines what is possible or not possible at any point in time. As such, the environment defines the set of
possible actions that are available to the reinforcement learning system in any given
state. The environment also defines how the state
changes from one point in time to the next as a result of the system’s actions, as
well as the rewards that will accrue to the reinforcement learning system for each possible
action that it might take.
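Those responsibilities suggest a natural shape for an environment in code. The tiny, invented “LineWorld” below is only a sketch, but it shows an environment supplying states and rewards and defining the rules of the game:

```python
class LineWorld:
    """A toy environment: positions 0 through 4, with the goal at 4."""

    def __init__(self):
        self.state = 0  # the initial state

    def possible_actions(self):
        """The actions available in the current state (the rules of the game)."""
        actions = []
        if self.state > 0:
            actions.append("left")
        if self.state < 4:
            actions.append("right")
        return actions

    def step(self, action):
        """Apply an action; return the next state and the reward that accrues."""
        self.state += 1 if action == "right" else -1
        reward = 10.0 if self.state == 4 else -1.0  # reaching the goal pays off
        return self.state, reward
```

When the reinforcement learning system begins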
exploring its environment, it is entirely unaware of these “rules of the game” – that
is, it has no knowledge about what the consequences of each possible action might be. Instead, it must experiment by taking different
actions and observing the results. If all goes well, the system will get better
and better at deciding what to do next as it gains more and more experience. The fourth element of reinforcement learning
systems is the Markov Decision Process (or MDP), which is named in honor of the Russian
mathematician Andrey Andreyevich Markov, who did pioneering work on stochastic processes. Markov Decision Processes provide a mathematical
framework for modeling decision making in situations where outcomes are partly random
and partly under the control of a decision maker. In the case of reinforcement learning, the
“decision maker” is the AI system that is operating in the environment. Markov Decision Processes involve discrete units of time: time zero, time one, time two, and so on. In these processes, “time” can be many
different things. For example, a unit of time might refer to
one second, one turn in a game, one frame in a video, and so on. In light of our previous discussion, we can
thus describe a consistent pattern that reinforcement learning models use to transition from one
state to the next. First, the system will observe the current
state at time “t”, which we can refer to as “S” sub “t”. Next, the system takes an action “a” at
time “t”, which we can refer to as “a” sub “t”. As a result of its action at time “t”,
the system receives a reward “r”, which we can refer to as “r” sub “t”. Finally, the system enters the next state,
which we can refer to as “S” sub “t + 1”. The reward received and the next state thus
become inputs for the next iteration of the loop, wherein the system must decide which
action to take next. Graphically, our AI agent begins with the
environment in some initial state. The AI agent then takes an action, the result
of which is that the agent receives a reward and enters the next state. The cycle then repeats. This is the Markov decision process. The last principle of reinforcement learning
systems that we need to discuss is the difference between “training mode” and “inference
mode”. What you need to know here is that each reinforcement
learning system goes through two phases in its life cycle: training mode and inference
mode. In training mode, the system is learning and
is attempting to identify an optimal policy to guide its choices and actions. During this time, the system attempts to accomplish
its goal over and over again across many training cycles – sometimes hundreds of thousands
or even millions of training cycles. After each iteration, the system updates its
policy with what it most recently learned in order to try to do better during its next
attempt.
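Here is a sketch of training mode, again using the hypothetical LineWorld environment. The lesson doesn’t prescribe a particular learning algorithm, so the Q-learning-style update below, with its learning rate and discount factor, is just one common, illustrative choice:

```python
import random
from collections import defaultdict

# Training mode: run many training cycles, updating the policy data
# after every step with what was most recently learned.
Q = defaultdict(float)       # estimated value of each (state, action) pair
alpha, gamma = 0.1, 0.9      # learning rate and discount factor

for episode in range(5000):  # many training cycles
    env = LineWorld()
    state = env.state
    while state != 4:
        action = random.choice(env.possible_actions())  # keep exploring
        next_state, reward = env.step(action)
        future = 0.0 if next_state == 4 else max(
            Q[(next_state, a)] for a in env.possible_actions()
        )
        # update with what was just learned, to do better on the next attempt
        Q[(state, action)] += alpha * (reward + gamma * future - Q[(state, action)])
        state = next_state
```

In inference mode, the system has been fully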
trained, and is deployed to perform its task. When in this mode, the policy that guides
the system’s choices and actions is no longer updated. Instead, it simply uses the policy that it
has learned in order to make decisions about what to do, given the current state of the
environment in which the system has been deployed.
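Continuing that sketch, inference mode simply consults the learned values without ever changing them:

```python
# Inference mode: the learned values are frozen and merely consulted;
# nothing is updated while the system is deployed.
env = LineWorld()
state = env.state
while state != 4:
    action = max(env.possible_actions(), key=lambda a: Q[(state, a)])
    state, reward = env.step(action)
    print(f"took {action!r}, entered state {state}, received reward {reward}")
```

Now that we have a good understanding of the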
principles of reinforcement learning, we can begin to create some real reinforcement learning
models. In the next two videos in this series, we’ll
build reinforcement learning models that rely on Thompson Sampling to solve real problems. Our first model will demonstrate how reinforcement
learning-based AI can be used to solve an exploration-exploitation dilemma in the classic
multi-armed bandit problem. Our second model will demonstrate how this
type of artificial intelligence can be used in conjunction with simulations to help a
company achieve optimal results in a complex advertising campaign. We will definitely be getting a lot of practical,
hands-on experience in creating AI models in Python in these next two videos, so I hope
you’ll join me as we continue our adventures in cognitive computing and artificial intelligence! Well, my friends, thus ends our lesson on the
foundations of reinforcement learning. I hope that you learned something interesting
in this lesson, and until next time, have a great day.