Skip to main content

Sutton RL: Day 3 - Dynamic Programming

Source: Sutton & Barto, Reinforcement Learning: An Introduction, Chapter 4.

Use this note as a review sheet. The goal is not to memorize every equation, but to see how Bellman backups turn a known MDP model into computation.

One-sentence model

Dynamic programming is what RL looks like when the full environment model is known:

known p(s', r | s, a)
        -> Bellman backup
        -> value function
        -> improved policy

The later model-free methods in Sutton and Barto can be read as approximations of this picture when the model is missing or too expensive to use directly.

Assumption

DP assumes access to the transition and reward dynamics:

p(s,rs,a)p(s', r \mid s, a)

That is the probability of landing in state ss' and receiving reward rr after taking action aa in state ss.

QuestionDP answer
Does it need a model?Yes, the full MDP dynamics
Does it learn from sampled experience?Not directly
Does it bootstrap?Yes
What kind of backup?Full backup over all next states and rewards
When is it practical?Planning, small MDPs, known simulators
Main limitationLarge state/action spaces become expensive

Value functions

State value under policy π\pi:

vπ(s)=Eπ[GtSt=s]v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]

Action value under policy π\pi:

qπ(s,a)=Eπ[GtSt=s,At=a]q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]

The useful intuition:

reward = immediate signal
value  = long-term desirability

Bellman equations

For a fixed policy π\pi, the Bellman expectation equation is:

vπ(s)=aπ(as)s,rp(s,rs,a)[r+γvπ(s)]v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma v_\pi(s')\right]

It says: value is the expected immediate reward plus discounted next value.

For the optimal value function:

v(s)=maxas,rp(s,rs,a)[r+γv(s)]v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma v_*(s')\right]

The difference is the operator:

Bellman expectation: average over the policy
Bellman optimality:  maximize over actions

Algorithm 1: policy evaluation

Given a policy π\pi, estimate its value function.

Update rule:

V(s)aπ(as)s,rp(s,rs,a)[r+γV(s)]V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma V(s')\right]

Pseudocode:

initialize V(s) arbitrarily

repeat:
    delta <- 0
    for each state s:
        old <- V(s)
        V(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma V(s')]
        delta <- max(delta, |old - V(s)|)
until delta < theta

Policy evaluation answers: “How good is this policy?”

Algorithm 2: policy improvement

Once we have vπv_\pi, improve the policy by one-step lookahead:

qπ(s,a)=s,rp(s,rs,a)[r+γvπ(s)]q_\pi(s,a) = \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma v_\pi(s')\right]

Then choose greedily:

π(s)=argmaxas,rp(s,rs,a)[r+γvπ(s)]\pi'(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma v_\pi(s')\right]

The policy improvement theorem says that if the new action is at least as good as the old policy in every state, then the new policy is at least as good overall:

qπ(s,π(s))vπ(s)    vπ(s)vπ(s)q_\pi(s,\pi'(s)) \ge v_\pi(s) \implies v_{\pi'}(s) \ge v_\pi(s)

Policy improvement answers: “How should this policy change?”

Algorithm 3: policy iteration

Policy iteration alternates the two steps:

evaluate pi_k -> get v_{pi_k}
improve greedily -> get pi_{k+1}
repeat until policy stable

Pseudocode:

initialize V(s) and pi(s)

loop:
    # policy evaluation
    repeat:
        for each state s:
            V(s) <- sum_{s',r} p(s',r|s,pi(s)) [r + gamma V(s')]
    until value change is small

    # policy improvement
    stable <- true
    for each state s:
        old_action <- pi(s)
        pi(s) <- argmax_a sum_{s',r} p(s',r|s,a) [r + gamma V(s')]
        if old_action != pi(s):
            stable <- false

    if stable:
        return V, pi

Policy iteration is conservative: evaluate seriously, then improve.

Algorithm 4: value iteration

Value iteration skips full policy evaluation and directly applies the Bellman optimality backup:

V(s)maxas,rp(s,rs,a)[r+γV(s)]V(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma V(s')\right]

Pseudocode:

initialize V(s) arbitrarily

repeat:
    delta <- 0
    for each state s:
        old <- V(s)
        V(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma V(s')]
        delta <- max(delta, |old - V(s)|)
until delta < theta

return pi(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma V(s')]

Value iteration is more direct: push values toward optimality, then extract a greedy policy.

Policy iteration vs value iteration

DimensionPolicy iterationValue iteration
MaintainsPolicy and valueMainly value
Inner loopEvaluate a policyApply optimality backup
Uses full evaluation?Yes, or approximatelyNo
Has max in every value update?NoYes
Policy update timingAfter evaluationAt the end, or implicit in max
IntuitionEvaluate, then improveImprove while evaluating

Minimal example

Two non-terminal states, s1s_1 and s2s_2, with γ=0.9\gamma = 0.9:

StateActionRewardNext
s1s_1safe0terminal
s1s_1go0s2s_2
s2s_2exit2terminal
s2s_2back-1s1s_1

The optimal policy is:

s1: go
s2: exit

because:

0+0.9×2=1.8>00 + 0.9 \times 2 = 1.8 > 0

Value iteration reaches this quickly:

V(s1)=max{0,0.9V(s2)}V(s_1) = \max\{0, 0.9V(s_2)\} V(s2)=max{2,1+0.9V(s1)}V(s_2) = \max\{2, -1 + 0.9V(s_1)\}
IterationV(s1)V(s_1)V(s2)V(s_2)
000
102
21.82

Relation to ordinary dynamic programming

Classic algorithm problems often use recurrences like:

f(i)=f(i+1)+f(i+2)f(i) = f(i+1) + f(i+2)

or:

V(i)=minj[c(i,j)+V(j)]V(i) = \min_j \left[c(i,j) + V(j)\right]

RL dynamic programming uses the same recurrence idea, but with stochastic transitions and expected returns:

V(s)=maxaE[r+γV(s)]V(s) = \max_a \mathbb{E}\left[r + \gamma V(s')\right]
Ordinary DPMDP/RL DP
Often acyclicOften cyclic
Can often fill a table in orderUsually iterates to convergence
Transitions may be deterministicTransitions are probabilistic
Objective may be count or shortest pathObjective is expected return

Generalized policy iteration

Generalized policy iteration is the big abstraction:

policy evaluation:
    value moves toward v_pi

policy improvement:
    policy moves toward greedy(V)

The two processes interact until both stabilize. At that point, the value function and policy satisfy the Bellman optimality equation.

This is the template behind much of the book:

DP       = model + bootstrap
MC       = experience + no bootstrap
TD       = experience + bootstrap
Sarsa    = on-policy TD control
Q-learning = off-policy TD control

Interview answers

Why are PI and VI called dynamic programming algorithms?

They solve a known MDP by repeatedly applying Bellman recurrences to value functions, just as ordinary DP solves a larger problem through smaller state values.

What is the difference between PI and VI?

Policy iteration alternates evaluation and improvement. Value iteration applies the Bellman optimality backup directly, combining value estimation and greedy improvement in each update.

Why study DP if it requires a full model?

Because it is the conceptual parent of later RL algorithms. MC, TD, Sarsa, Q-learning, and actor-critic methods can all be read as ways to approximate Bellman backup and generalized policy iteration under weaker information.

Final takeaway

Dynamic programming is the cleanest version of reinforcement learning:

value function -> Bellman equation -> Bellman backup -> improved policy

Once that loop is clear, the rest of Sutton and Barto becomes easier to organize.