Category: rl

Sutton RL Chapter 6: Temporal-Difference Learning
TLDR: TD learning updates from incomplete episodes by bootstrapping from current value estimates, combining Monte Carlo sampling with dynamic-programming-style updates.
10 min read · May 30, 2026
2026 · sutton-rl · learning · rl
Sutton RL Chapter 5: Monte Carlo Methods
TLDR: Monte Carlo methods learn value from complete sampled episodes, trading model-free simplicity for delayed updates and return variance.
7 min read · May 29, 2026
2026 · sutton-rl · learning · rl
Sutton RL Day 2: Multi-Armed Bandits
TLDR: Multi-armed bandits isolate the exploration/exploitation problem by removing state transitions, which makes action-value estimation the central task.
6 min read · May 28, 2026
2026 · sutton-rl · learning · rl
Sutton RL Day 3: Dynamic Programming
TLDR: Dynamic programming turns known MDP dynamics into iterative policy evaluation and improvement through Bellman updates.
7 min read · May 28, 2026
2026 · sutton-rl · learning · rl
Sutton RL Day 1: RL Problem and MDP Basics
TLDR: RL is interaction for long-term reward: policy chooses actions, reward gives feedback, value estimates future return, and Bellman equations connect the pieces.
7 min read · May 27, 2026
2026 · sutton-rl · learning · rl