Good day everyone. This is Dr. Soper here, and today I’ll be
discussing the foundations of reinforcement learning, which is an important area within
the broader domain of artificial intelligence. Before we begin discussing the foundations
of reinforcement learning, let’s briefly review what you’ll learn in this lesson. By the time you have finished this video,
you will know: What reinforcement learning is; and
The five principles that underlie reinforcement learning-based artificial intelligence, including:
The input and output system; Rewards;
The environment; Markov decision processes; and
Training & inference. Once we understand all of these concepts,
we’ll be fully equipped to start building some real AI models. So, why wait? Let’s get started! To begin, let’s learn what is really meant
by “reinforcement learning”. Along with supervised learning and unsupervised
learning, reinforcement learning is one of the three primary paradigms of machine learning. In supervised learning, a machine uses matched pairs of input and output values to try to learn a general function for predicting the outputs based on the inputs. In unsupervised learning, a machine attempts to discover previously unknown patterns in a data set without the benefit of any prior knowledge about the data. By contrast, the goal of reinforcement learning
is to train a machine to understand its environment in such a way that it can learn to take actions
that will maximize its cumulative rewards. This is accomplished by trying to find an
optimal balance between continuing to explore the environment on the one hand, and exploiting
what we have already learned about the environment on the other hand.
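To make this exploration-exploitation balance a little more concrete, here is a small Python sketch. The lesson doesn't name a specific strategy, so the widely used "epsilon-greedy" rule below is just one illustrative choice, and the function name and parameters are hypothetical:

```python
import random

def epsilon_greedy_action(estimated_values, possible_actions, epsilon=0.1):
    """Pick an action, balancing exploration against exploitation."""
    if random.random() < epsilon:
        return random.choice(possible_actions)              # explore
    return max(possible_actions,
               key=lambda a: estimated_values.get(a, 0.0))  # exploit
```

With a small epsilon, the system mostly exploits what it has already learned while still occasionally exploring. As noted previously, all reinforcement learning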
systems are based on five principles: An input and output system, the notions of rewards
and cumulative rewards, an environment in which the system operates, Markov decision
processes, and training and inference modes. Let’s learn about each of these principles
in turn. The first principle that we’ll learn about
is the input and output system. It’s important to understand from the outset
that the input and output system is not just for reinforcement learning. Indeed, all artificial intelligence and cognitive
computing systems are based on the notion of converting inputs into outputs. Put differently, all AI and cognitive computing
systems take input data and use those data to generate output, and reinforcement learning
is no exception. This notion of converting inputs to outputs
is a very familiar part of the human experience. For example, if you are driving a car or riding
a bicycle, your senses are constantly gathering input about what’s happening around you. Your mind uses this information to continuously
generate a series of outputs, which in this example might be specific movements of your
arms, legs, hands, or feet, all of which are oriented toward achieving a larger goal, such
as arriving at your destination safely. While the input and output system is familiar
and simple to understand, we need to know the appropriate vocabulary for these concepts
in the context of reinforcement learning. Specifically, in reinforcement learning the
inputs are “states”, which you can think of as referring to the state of the environment,
and “rewards”, which we will discuss in a few moments. The outputs in reinforcement learning are
called “actions”. You can think of actions as answers to the
question, “what should I do next?” The goal of a reinforcement learning model,
then, is to identify an optimal policy that tells us which action to take in any given
state.
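To make this vocabulary concrete, here is a minimal sketch of a policy as a simple lookup from states to actions. The states and actions are purely hypothetical (in practice, a policy is learned rather than written by hand):

```python
# A minimal sketch of a policy: a mapping from states to actions.
# The states and actions here are purely illustrative.
policy = {
    "light_is_green": "keep_driving",
    "light_is_yellow": "slow_down",
    "light_is_red": "stop",
}

def choose_action(state):
    """Answer the question 'what should I do next?' in the given state."""
    return policy[state]

print(choose_action("light_is_red"))  # -> stop
```

Next, let’s talk about rewards. As with the input and output system, rewards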
are a part of every AI or cognitive computing system. More specifically, every cognitive computing
or artificial intelligence system is trained using some sort of reward function. So, what is a reward? Well, in the context of AI and cognitive computing,
a reward is simply a metric that tells the system how well it is performing. Many different types of reward functions are
possible, but two of the most common reward structures are based on maximizing gains or
minimizing losses. The “best” reward structure to use is
highly context dependent. Put differently, there is no universally ideal
reward; instead, the best reward function to use depends on the nature of the problem
that the system is trying to solve. For instance, consider this example of a reinforcement
learning system figuring out how to play the classic Atari game “Breakout”. Several different reward structures are possible
here, one of which might be “earn as many points as possible”, while another might
be “stay alive as long as possible”.
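To make those two candidate reward structures concrete, here is a small Python sketch. The event fields below are hypothetical stand-ins, not the actual Atari interface:

```python
def points_based_reward(event):
    """Reward structure 1: earn as many points as possible."""
    return float(event["points_scored"])

def survival_based_reward(event):
    """Reward structure 2: stay alive as long as possible."""
    return -10.0 if event["lost_a_life"] else 1.0  # reward each surviving step
```

Notice that the very same game events would produce quite different training signals under the two structures. Note that the term “reward” can be applied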
to both immediate (short-term) rewards and cumulative (long-term) rewards. It is often the case that an AI system must
learn to forgo immediate rewards – or even accept immediate losses – in order to maximize
its cumulative rewards. Now that we know about rewards, we can think
about the general goal of a reinforcement learning system a bit differently. That is, the general goal of such systems
is to maximize their total accumulated rewards over time.
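Since cumulative reward is just the sum of the rewards received over time, a toy calculation (with invented numbers) shows why forgoing an immediate reward can pay off:

```python
# Two hypothetical courses of action and the immediate reward at each step.
greedy = [5, 1, 1, 1]    # grabs the largest immediate reward up front
patient = [-2, 4, 4, 4]  # accepts an immediate loss at the first step

print(sum(greedy))   # 8
print(sum(patient))  # 10  <- the larger cumulative reward over time
```

The third element of reinforcement learning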
systems is the environment. You can think of the environment as the setting
or surroundings in which the system is operating. The environment is the system’s source of
information about states and rewards. It’s also important to understand that the
environment defines the “rules of the game”. What I mean by this is that the environment
determines what is possible or not possible at any point in time. As such, the environment defines the set of
possible actions that are available to the reinforcement learning system in any given
state. The environment also defines how the state
changes from one point in time to the next as a result of the system’s actions, as
well as the rewards that will accrue to the reinforcement learning system for each possible
action that it might take.
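Those responsibilities suggest a natural shape for an environment in code. The tiny, invented “LineWorld” below is only a sketch, but it shows an environment supplying states and rewards and defining the rules of the game:

```python
class LineWorld:
    """A toy environment: positions 0 through 4, with the goal at 4."""

    def __init__(self):
        self.state = 0  # the initial state

    def possible_actions(self):
        """The actions available in the current state (the rules of the game)."""
        actions = []
        if self.state > 0:
            actions.append("left")
        if self.state < 4:
            actions.append("right")
        return actions

    def step(self, action):
        """Apply an action; return the next state and the reward that accrues."""
        self.state += 1 if action == "right" else -1
        reward = 10.0 if self.state == 4 else -1.0  # reaching the goal pays off
        return self.state, reward
```

When the reinforcement learning system begins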
exploring its environment, it is entirely unaware of these “rules of the game” – that
is, it has no knowledge about what the consequences of each possible action might be. Instead, it must experiment by taking different
actions and observing the results. If all goes well, the system will get better
and better at deciding what to do next as it gains more and more experience. The fourth element of reinforcement learning
systems is the Markov Decision Process (or MDP), which is named in honor of the Russian
mathematician Andrey Andreyevich Markov, who did pioneering work on stochastic processes. Markov Decision Processes provide a mathematical
framework for modeling decision making in situations where outcomes are partly random
and partly under the control of a decision maker. In the case of reinforcement learning, the
“decision maker” is the AI system that is operating in the environment. Markov Decision Processes involve discrete units of time: time zero, time one, time two, and so on. In these processes, “time” can be many
different things. For example, a unit of time might refer to
one second, one turn in a game, one frame in a video, and so on. In light of our previous discussion, we can
thus describe a consistent pattern that reinforcement learning models use to transition from one
state to the next. First, the system will observe the current
state at time “t”, which we can refer to as “S” sub “t”. Next, the system takes an action “a” at
time “t”, which we can refer to as “a” sub “t”. As a result of its action at time “t”,
the system receives a reward “r”, which we can refer to as “r” sub “t”. Finally, the system enters the next state,
which we can refer to as “S” sub “t + 1”. The reward received and the next state thus
become inputs for the next iteration of the loop, wherein the system must decide which
action to take next. Graphically, our AI agent begins with the
environment in some initial state. The AI agent then takes an action, the result
of which is that the agent receives a reward and enters the next state. The cycle then repeats. This is the Markov decision process. The last principle of reinforcement learning
systems that we need to discuss is the difference between “training mode” and “inference
mode”. What you need to know here is that each reinforcement
learning system goes through two phases in its life cycle: training mode and inference
mode. In training mode, the system is learning and
is attempting to identify an optimal policy to guide its choices and actions. During this time, the system attempts to accomplish
its goal over and over again across many training cycles – sometimes hundreds of thousands
or even millions of training cycles. After each iteration, the system updates its
policy with what it most recently learned in order to try to do better during its next
attempt.
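Here is a sketch of training mode, again using the hypothetical LineWorld environment. The lesson doesn’t prescribe a particular learning algorithm, so the Q-learning-style update below, with its learning rate and discount factor, is just one common, illustrative choice:

```python
import random
from collections import defaultdict

# Training mode: run many training cycles, updating the policy data
# after every step with what was most recently learned.
Q = defaultdict(float)       # estimated value of each (state, action) pair
alpha, gamma = 0.1, 0.9      # learning rate and discount factor

for episode in range(5000):  # many training cycles
    env = LineWorld()
    state = env.state
    while state != 4:
        action = random.choice(env.possible_actions())  # keep exploring
        next_state, reward = env.step(action)
        future = 0.0 if next_state == 4 else max(
            Q[(next_state, a)] for a in env.possible_actions()
        )
        # update with what was just learned, to do better on the next attempt
        Q[(state, action)] += alpha * (reward + gamma * future - Q[(state, action)])
        state = next_state
```

In inference mode, the system has been fully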
trained, and is deployed to perform its task. When in this mode, the policy that guides
the system’s choices and actions is no longer updated. Instead, it simply uses the policy that it
has learned in order to make decisions about what to do, given the current state of the
environment in which the system has been deployed.
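Continuing that sketch, inference mode simply consults the learned values without ever changing them:

```python
# Inference mode: the learned values are frozen and merely consulted;
# nothing is updated while the system is deployed.
env = LineWorld()
state = env.state
while state != 4:
    action = max(env.possible_actions(), key=lambda a: Q[(state, a)])
    state, reward = env.step(action)
    print(f"took {action!r}, entered state {state}, received reward {reward}")
```

Now that we have a good understanding of the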
principles of reinforcement learning, we can begin to create some real reinforcement learning
models. In the next two videos in this series, we’ll
build reinforcement learning models that rely on Thompson Sampling to solve real problems. Our first model will demonstrate how reinforcement
learning-based AI can be used to solve an exploration-exploitation dilemma in the classic
multi-armed bandit problem. Our second model will demonstrate how this
type of artificial intelligence can be used in conjunction with simulations to help a
company achieve optimal results in a complex advertising campaign. We will definitely be getting a lot of practical,
hands-on experience in creating AI models in Python in these next two videos, so I hope
you’ll join me as we continue our adventures in cognitive computing and artificial intelligence! Well, my friends, thus ends our lesson on the
foundations of reinforcement learning. I hope that you learned something interesting
in this lesson, and until next time, have a great day.