Category: rl
-
Sutton RL Chapter 6: Temporal-Difference Learning
TLDR: TD learning updates from incomplete episodes by bootstrapping from current value estimates, combining Monte Carlo sampling with dynamic-programming-style updates.
-
Sutton RL Chapter 5: Monte Carlo Methods
TLDR: Monte Carlo methods learn values from complete sampled episodes, gaining model-free simplicity at the cost of delayed updates and high-variance returns.
-
Sutton RL Day 2: Multi-Armed Bandits
TLDR: Multi-armed bandits isolate the exploration/exploitation problem by removing state transitions and putting action-value estimation at the center.
-
Sutton RL Day 3: Dynamic Programming
TLDR: Dynamic programming turns known MDP dynamics into iterative policy evaluation and improvement through Bellman updates.
-
Sutton RL Day 1: RL Problem and MDP Basics
TLDR: RL is learning from interaction to maximize long-term reward: the policy chooses actions, the reward gives feedback, the value function estimates future return, and the Bellman equations connect the pieces.