Sutton RL: Day 3 - Dynamic Programming
Source: Sutton & Barto, Reinforcement Learning: An Introduction, Chapter 4.
Use this note as a review sheet. The goal is not to memorize every equation, but to see how Bellman backups turn a known MDP model into computation.
One-sentence model
Dynamic programming is what RL looks like when the full environment model is known:
known p(s', r | s, a)
-> Bellman backup
-> value function
-> improved policy
The later model-free methods in Sutton and Barto can be read as approximations of this picture when the model is missing or too expensive to use directly.
Assumption
DP assumes access to the transition and reward dynamics:
That is the probability of landing in state and receiving reward after taking action in state .
| Question | DP answer |
|---|---|
| Does it need a model? | Yes, the full MDP dynamics |
| Does it learn from sampled experience? | Not directly |
| Does it bootstrap? | Yes |
| What kind of backup? | Full backup over all next states and rewards |
| When is it practical? | Planning, small MDPs, known simulators |
| Main limitation | Large state/action spaces become expensive |
Value functions
State value under policy :
Action value under policy :
The useful intuition:
reward = immediate signal
value = long-term desirability
Bellman equations
For a fixed policy , the Bellman expectation equation is:
It says: value is the expected immediate reward plus discounted next value.
For the optimal value function:
The difference is the operator:
Bellman expectation: average over the policy
Bellman optimality: maximize over actions
Algorithm 1: policy evaluation
Given a policy , estimate its value function.
Update rule:
Pseudocode:
initialize V(s) arbitrarily
repeat:
delta <- 0
for each state s:
old <- V(s)
V(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma V(s')]
delta <- max(delta, |old - V(s)|)
until delta < theta
Policy evaluation answers: “How good is this policy?”
Algorithm 2: policy improvement
Once we have , improve the policy by one-step lookahead:
Then choose greedily:
The policy improvement theorem says that if the new action is at least as good as the old policy in every state, then the new policy is at least as good overall:
Policy improvement answers: “How should this policy change?”
Algorithm 3: policy iteration
Policy iteration alternates the two steps:
evaluate pi_k -> get v_{pi_k}
improve greedily -> get pi_{k+1}
repeat until policy stable
Pseudocode:
initialize V(s) and pi(s)
loop:
# policy evaluation
repeat:
for each state s:
V(s) <- sum_{s',r} p(s',r|s,pi(s)) [r + gamma V(s')]
until value change is small
# policy improvement
stable <- true
for each state s:
old_action <- pi(s)
pi(s) <- argmax_a sum_{s',r} p(s',r|s,a) [r + gamma V(s')]
if old_action != pi(s):
stable <- false
if stable:
return V, pi
Policy iteration is conservative: evaluate seriously, then improve.
Algorithm 4: value iteration
Value iteration skips full policy evaluation and directly applies the Bellman optimality backup:
Pseudocode:
initialize V(s) arbitrarily
repeat:
delta <- 0
for each state s:
old <- V(s)
V(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma V(s')]
delta <- max(delta, |old - V(s)|)
until delta < theta
return pi(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma V(s')]
Value iteration is more direct: push values toward optimality, then extract a greedy policy.
Policy iteration vs value iteration
| Dimension | Policy iteration | Value iteration |
|---|---|---|
| Maintains | Policy and value | Mainly value |
| Inner loop | Evaluate a policy | Apply optimality backup |
| Uses full evaluation? | Yes, or approximately | No |
| Has max in every value update? | No | Yes |
| Policy update timing | After evaluation | At the end, or implicit in max |
| Intuition | Evaluate, then improve | Improve while evaluating |
Minimal example
Two non-terminal states, and , with :
| State | Action | Reward | Next |
|---|---|---|---|
| safe | 0 | terminal | |
| go | 0 | ||
| exit | 2 | terminal | |
| back | -1 |
The optimal policy is:
s1: go
s2: exit
because:
Value iteration reaches this quickly:
| Iteration | ||
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 2 |
| 2 | 1.8 | 2 |
Relation to ordinary dynamic programming
Classic algorithm problems often use recurrences like:
or:
RL dynamic programming uses the same recurrence idea, but with stochastic transitions and expected returns:
| Ordinary DP | MDP/RL DP |
|---|---|
| Often acyclic | Often cyclic |
| Can often fill a table in order | Usually iterates to convergence |
| Transitions may be deterministic | Transitions are probabilistic |
| Objective may be count or shortest path | Objective is expected return |
Generalized policy iteration
Generalized policy iteration is the big abstraction:
policy evaluation:
value moves toward v_pi
policy improvement:
policy moves toward greedy(V)
The two processes interact until both stabilize. At that point, the value function and policy satisfy the Bellman optimality equation.
This is the template behind much of the book:
DP = model + bootstrap
MC = experience + no bootstrap
TD = experience + bootstrap
Sarsa = on-policy TD control
Q-learning = off-policy TD control
Interview answers
Why are PI and VI called dynamic programming algorithms?
They solve a known MDP by repeatedly applying Bellman recurrences to value functions, just as ordinary DP solves a larger problem through smaller state values.
What is the difference between PI and VI?
Policy iteration alternates evaluation and improvement. Value iteration applies the Bellman optimality backup directly, combining value estimation and greedy improvement in each update.
Why study DP if it requires a full model?
Because it is the conceptual parent of later RL algorithms. MC, TD, Sarsa, Q-learning, and actor-critic methods can all be read as ways to approximate Bellman backup and generalized policy iteration under weaker information.
Final takeaway
Dynamic programming is the cleanest version of reinforcement learning:
value function -> Bellman equation -> Bellman backup -> improved policy
Once that loop is clear, the rest of Sutton and Barto becomes easier to organize.