I recently completed the Coursera Reinforcement Learning Specialization. These are my notes, still under construction, on some of what I learned. The course was based on Sutton and Barto’s freely available reinforcement learning book, so images will be from there unless otherwise stated. All errors are mine, so please let me know about any in the comments.
Agents and Environments
What differentiates reinforcement learning from other forms of learning is the setting of an agent acting in an environment. The agent generates actions from a policy. In response, the environment occasionally delivers rewards (which can be negative, i.e. punishment). The agent searches for a policy that maximizes the total reward it receives now and into the future. Since a reward now is typically better than the same reward tomorrow, the quantity of interest is the discounted cumulative reward, called the return $$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$ The discount $\gamma$ adjusts the time-horizon of the agent. A myopic agent that only cares about immediate reward has a discount of 0. A far-sighted agent accumulates all future rewards by setting its discount to 1. Because the environment and the agent are not necessarily deterministic, we care about the expected return. The task of the reinforcement learning agent is then to find the policy with the highest expected return:
$$ \max_\pi\, \mathbb E \, (R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots).$$
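As a quick concrete check of the formula, here is a minimal Python sketch (the reward sequence and $\gamma$ are made up for illustration) that computes the return of a finite episode:

```python
def discounted_return(rewards, gamma):
    """Return r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a finite list of rewards."""
    g = 0.0
    # Work backwards so each step is one multiply-add: G_t = r + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```

With $\gamma = 0$ only the first reward counts; with $\gamma = 1$ every reward counts equally.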
How do we do this?
Markov Decision Processes
To determine how to find an optimal policy, we first need to flesh out a model of the environment and the agent’s interactions with it.
The first notion we need is that of state. This is meant to capture the properties of the agent and the environment relevant to the task. We will write the state at time $t$ as $s_t$ or $S_t$ (lowercase for a particular value, uppercase for the random variable).
The next notion we need is that of action. This describes what the agent did. It can be discrete, like moves in a chess game, or continuous, like steering a car. We will write the agent’s action at time $t$ as $a_t$ or $A_t$.
After the agent acts at time $t$, the clock ticks to $t+1$ and the environment generates two responses. First, it delivers a reward (possibly 0), which we write as $r_{t+1}$ or $R_{t+1}$.
Second, the state transitions to $S_{t+1}$, and the process repeats. This produces a history of interactions $(S_t, A_t, R_{t+1}), (S_{t+1}, A_{t+1}, R_{t+2}) \dots$
Rewards and transitions are often uncertain. Therefore, we can model the environment as a probability distribution over rewards and next states. This can depend on the current state, the action taken, and in principle, all previous interactions, $$ p(R_{t+1}, S_{t+1} | S_t, A_t, (S_{t-1}, A_{t-1}, R_{t}), (S_{t-2}, A_{t-2}, R_{t-1}) \dots ).$$
Carrying around an ever-growing history of interactions quickly becomes impractical. Therefore, we make the key simplifying Markov assumption that the reward received and the state transition only depend on the current state and action. That is, we model the environment as the probability distribution
$$ p(R_{t+1}, S_{t+1} | S_t, A_t).$$
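To make this concrete, here is one minimal way to represent such a distribution for a tiny environment with two states and two actions in Python. Everything in it (states, actions, probabilities, rewards) is made up for illustration; the dynamic-programming sketches later will assume this dictionary-of-outcomes format.

```python
# p[(s, a)] is a list of (probability, next_state, reward) outcomes.
# A made-up two-state problem: a room that can be "cold" or "warm".
p = {
    ("cold", "wait"): [(1.0, "cold", 0.0)],
    ("cold", "heat"): [(0.9, "warm", 1.0), (0.1, "cold", 0.0)],
    ("warm", "wait"): [(0.8, "warm", 1.0), (0.2, "cold", 0.0)],
    ("warm", "heat"): [(1.0, "warm", 0.5)],
}

states = ["cold", "warm"]
actions = ["wait", "heat"]
```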
This framework of sequential decision-making within a Markovian environment is called a Markov Decision Process, or MDP:

Choosing actions: Policy
The policy describes how the agent chooses its actions. In principle, this could depend on the entire history of interactions with the environment. But we make the simplifying assumption that the action the agent takes only depends on the current state. The policy then becomes a probability distribution over actions, given the current state. We typically write this as $$ \pi(A_t | S_t).$$ Although this is a regular probability distribution, we use $\pi$ instead of $p$ because (a) there are already so many $p$'s floating around, and (b) it distinguishes this particular distribution as the one we're going to optimize.
How good is a state? The state value function
We’re interested in maximizing our expected return. For any policy, we can rank states by the expected return when starting at that state and acting under that policy. We encode this information in the state value function,
$$ V_\pi(s) = \mathbb E_\pi (R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots | S_t = s).$$ The expectation is taken with respect to both the policy (hence the subscript $\pi$) and the environment.
The value of a state is then the expected return when starting from that state and pursuing the given policy.
How good is an action? The state-action value function
We don’t just want to know how good each state is, but how to act in whatever state we’re currently in. How do we end up in states with high expected return?
To determine that, let's relate the state value function to the actions we can take: \begin{align} V_\pi(s) &=\mathbb E_\pi (R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots | S_t = s)\\ &= \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) \left[r + \gamma\, \mathbb E_\pi (R_{t+2} + \gamma R_{t+3} + \dots | S_{t+1} = s')\right]\\&= \sum_a \pi(a|s)\, \mathbb E_\pi(R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots | S_t = s, A_t = a) \end{align}
This says that the value of the current state is the expected return after taking an action, averaged over actions according to the policy. We encode the expected return after taking a given action in a given state into the state-action value function. It gives us the expected return starting from any state-action pair,
$$ Q_\pi(s, a) = \mathbb E_\pi (R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots | S_t = s, A_t = a).$$
This is the famous Q-function of Q-learning and Deep Q Networks. Why they call it Q, I don’t know!
The two value functions are related through the policy:
$$ V_\pi(s) = \sum_a \pi(a|s) Q_\pi(s, a).$$
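In code this relationship is just a policy-weighted average of the Q-values; a minimal sketch with made-up array shapes:

```python
import numpy as np

n_states, n_actions = 4, 2
Q = np.random.rand(n_states, n_actions)    # Q[s, a]: state-action values
pi = np.full((n_states, n_actions), 0.5)   # pi[s, a]: probability of action a in state s

V = (pi * Q).sum(axis=1)                   # V[s] = sum_a pi(a|s) Q(s, a)
```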
Dynamic Programming
If we had the state-action value function, we would have a simple recipe for how to act: pick the action with the highest expected return!
$$ A_t = \argmax_a \, Q(S_t, a).$$
But how do we actually compute $Q$? Much of reinforcement learning is about addressing this question.
If we have access to a model of the environment, then we can do so, at least in principle, by relating the value of the current state to that of the next state.
Let's write the state value function again: \begin{align} V_\pi(s) &=\mathbb E_\pi (R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots | S_t = s)\\ &= \sum_{a, s', r} \pi(a|s) p(s', r | s, a) \left[r + \gamma\, \mathbb E_\pi (R_{t+2} + \gamma R_{t+3} + \dots | S_{t+1} = s')\right]\end{align}
Notice that the last term is the state value function for the next state, \begin{align} V_\pi(s) &= \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) \left[r + \gamma \color{red} V_\pi(s') \color{black} \right]\end{align}
This is the Bellman equation for the state value function.
There is also a similar Bellman equation for the state-action value function: \begin{align} Q_\pi(s,a) &= \sum_{s', r} p(s', r | s, a)\left[r + \gamma V_\pi(s')\right]\\&=\sum_{s', r} p(s', r | s, a)\left[r + \gamma \sum_{a'} \pi(a'|s') \color{red} Q_\pi(s', a') \color{black}\right]\end{align}
The Bellman equations show how determining the value of the current state/action depends on determining the value of subsequent states/actions. This is a hallmark of dynamic programming, in which larger problems are solved by combining solutions of subproblems of the same type.
Before we see how dynamic programming can be used to determine the value functions, notice that the Bellman equations are linear in the quantities we’re after. So we could solve them using standard techniques for linear systems.
However, dynamic programming provides alternative methods that are efficient and that extend to situations where we don't have a model of the environment.
These work by simply looping through states and applying the Bellman equations above as updates. For example, iterative policy evaluation estimates the state value function by sweeping through the states:

and continuing until the values are stable.
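A minimal sketch of iterative policy evaluation, assuming the made-up dictionary-of-outcomes model format from the earlier MDP snippet (the function and parameter names are mine, not the course's):

```python
def policy_evaluation(p, states, actions, pi, gamma=0.9, theta=1e-8):
    """Sweep through states, applying the Bellman equation as an update, until stable.

    p[(s, a)] -> list of (prob, next_state, reward); pi[(s, a)] -> action probability.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # v(s) <- sum_a pi(a|s) sum_{s', r} p(s', r | s, a) [r + gamma v(s')]
            v_new = sum(
                pi[(s, a)] * prob * (r + gamma * V[s2])
                for a in actions
                for prob, s2, r in p[(s, a)]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```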
Once we have the values for each state, we can improve our policy by acting greedily with respect to those values in each state. This is called policy improvement:

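Policy improvement can be sketched as a one-step lookahead that turns the state values into action values and picks the best action (same illustrative model format as above):

```python
def greedy_policy(p, states, actions, V, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to V."""
    pi = {}
    for s in states:
        # One-step lookahead: Q(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma V(s')].
        q = {a: sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[(s, a)])
             for a in actions}
        pi[s] = max(q, key=q.get)   # the action with the highest one-step value
    return pi
```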
Policy iteration chains these two steps together:
- Policy evaluation to estimate $V_\pi$ for the current policy;
- Policy improvement to improve the policy,
until policy improvement no longer changes the policy. At that point, our value function is already based on picking the best action at each state, so we've found the optimal policy and value function.

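Chaining the two gives a sketch of policy iteration, reusing the `policy_evaluation` and `greedy_policy` sketches above (so this snippet is only runnable alongside them):

```python
def policy_iteration(p, states, actions, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    # Start from a uniformly random policy in the pi[(s, a)] -> probability format.
    pi = {(s, a): 1.0 / len(actions) for s in states for a in actions}
    while True:
        V = policy_evaluation(p, states, actions, pi, gamma)
        greedy = greedy_policy(p, states, actions, V, gamma)
        new_pi = {(s, a): 1.0 if a == greedy[s] else 0.0
                  for s in states for a in actions}
        if new_pi == pi:        # improvement changed nothing: we are done
            return greedy, V
        pi = new_pi
```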
In policy iteration we start with value functions based on some suboptimal policy, and gradually update them towards their optimal values. What if we could aim for the optimal value function right off the bat?
To do so, we note that the optimal value function must correspond to picking the optimal action at each state: $$ V^*(s) = \max_a \sum_{s', r} p(s', r|s, a) (r + \gamma V^*(s')).$$ This is the Bellman optimality equation for the state value function.
In value iteration, we apply the Bellman optimality equation as an update until the state value function has converged to its optimal values. We then set our policy to be the greedy one with respect to those values.
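A sketch of value iteration in the same illustrative format; the only change from the policy evaluation sketch is the max over actions inside the sweep:

```python
def value_iteration(p, states, actions, gamma=0.9, theta=1e-8):
    """Apply the Bellman optimality equation as an update until the values are stable."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # v(s) <- max_a sum_{s', r} p(s', r | s, a) [r + gamma v(s')]
            v_new = max(
                sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Read off the greedy policy from the converged values.
    pi = {s: max(actions,
                 key=lambda a: sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[(s, a)]))
          for s in states}
    return pi, V
```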
Sampling, or what to do without a model
Dynamic programming has relied on a model of the environment to determine optimal policies and value functions. As such, it is an example of a model-based approach. What if we don't have a model?
Model-free approaches instead learn from interactions with the environment itself. They do so by noting that observed state transitions and rewards are samples from the probability distribution that describes the environment,
$$ (S_{t+1}, R_{t+1}) \sim p(S_{t+1}, R_{t+1} | S_t, A_t).$$ By replacing the probabilities in dynamic programming with samples, we arrive at the most common reinforcement learning algorithms.
TD Learning
Temporal difference (TD) learning is a model-free algorithm for estimating the value function. To derive it, we start with the Bellman equation for the state value,
$$ V(S_t) = \sum_a \pi(a|S_t) \sum_{s', r} p(s', r|S_t, a)(r + \gamma V(s')).$$
We then replace the sum over actions with the action we actually took, and the average over next states and rewards with what we actually observed,
$$ \pi(a|S_t) \approx \delta(a - A_t), \quad p(s', r|S_t, A_t) \approx \delta(s' - S_{t+1})\, \delta(r - R_{t+1}),$$ and get
$$ V(S_t) \approx R_{t+1} + \gamma V(S_{t+1}).$$
TD learning then simply updates the value function towards this target, damped by a learning rate $\alpha$:
$$ \Delta V(S_t) = \alpha (\underbrace{R_{t+1} + \gamma V(S_{t+1}) - V(S_t)}_{\text{TD error}}).$$
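As a sketch, the TD(0) update is a couple of lines on a table of values; the state count, `alpha`, and `gamma` below are made up for illustration:

```python
import numpy as np

n_states = 10
V = np.zeros(n_states)        # tabular value estimates, indexed by state
alpha, gamma = 0.1, 0.99

def td0_update(V, s, r, s_next):
    """Move V[s] toward the sampled target r + gamma * V[s_next]."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
```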
SARSA
The same approach can be applied to update Q-values. We again start with the Bellman equation for state-action values, \begin{align} Q(S_t, A_t) &= \sum_{s', r} p(s', r|S_t, A_t)\left(r + \gamma \sum_{a'} \pi(a'|s')Q(s', a')\right)\\ &\approx R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}).\end{align}
Just like TD learning, SARSA then performs a damped update of the Q-values towards their targets: $$ \Delta Q(S_t, A_t) = \alpha (R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)).$$
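A sketch of the SARSA update on a Q-table; the transition it consumes is exactly the $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ tuple that gives the algorithm its name (table sizes and step size are illustrative):

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def sarsa_update(Q, s, a, r, s_next, a_next):
    """Move Q[s, a] toward the sampled target r + gamma * Q[s', a']."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```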
Expected SARSA
In Expected SARSA we again start with the Bellman equation
\begin{align} Q(S_t, A_t) &= \sum_{s', r} p(s', r|S_t, A_t)\left(r + \gamma \sum_{a'} \pi(a'|s')Q(s', a')\right)\end{align} and note that the last term is an expectation over actions in the next state. So we modify our target to use this expectation over actions, given whatever state we arrived at:
$$ \Delta Q(S_t, A_t) = \alpha (R_{t+1} + \gamma \sum_{a'} \pi(a'|S_{t+1}) Q(S_{t+1}, a') - Q(S_t, A_t)).$$ This improves performance by reducing the variability caused by random action selection, though it comes at a higher computational cost.
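A sketch of the Expected SARSA update, assuming an $\epsilon$-greedy policy for the next-state action probabilities (the policy choice and constants are my own assumption, purely for illustration):

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def expected_sarsa_update(Q, s, a, r, s_next):
    """Target uses the policy's expected Q-value in the next state, not a sampled action."""
    # Action probabilities of an epsilon-greedy policy in s_next.
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon
    target = r + gamma * np.dot(probs, Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```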
Q-Learning
Finally, in Q-learning, we use the optimal Q-values as our target. The Bellman optimality equation for state-action values is
$$ Q^*(S_t, A_t) = \sum_{s', r} p(s', r | S_t, A_t)(r + \gamma \max_{a'} Q^*(s', a')),$$ so we modify our target accordingly:
$$ \Delta Q(S_t, A_t) = \alpha (R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)).$$
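And a sketch of the Q-learning update, which differs from SARSA only in taking the max over next actions instead of following the policy (same illustrative setup):

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_learning_update(Q, s, a, r, s_next):
    """Move Q[s, a] toward the optimality target r + gamma * max_a' Q[s', a']."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```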