Heuristic Learning: Maintaining Verifiable Heuristic Systems with Code

0. In One Sentence

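Heuristic Learning treats repeated trial and error with a coding agent as a learning process that continuously maintains a heuristic system: feedback comes from tests, logs, the environment, and human judgment, and what gets updated is not neural-network parameters but code, state representations, rules, evaluators, and memory.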

1. What Heuristic Learning Is

After more iteration with Codex, I started calling this process Heuristic Learning (HL).

Like typical Deep RL, HL has a loop of state, action, feedback, and update. The difference is that Deep RL updates neural-network parameters, while HL updates software structure.

HL is built out of program code.
Its feedback is consumed by a coding agent, and can come from environment reward, test cases, logs, videos, replays, or human feedback.
Its updates do not use backpropagation. The coding agent directly edits policies, state detectors, tests, configuration, or memory.
HL is the learning and update process. The object maintained by HL over time can be called a Heuristic System (HS).
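
The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the author's implementation: the environment, the proportional-controller policy, and the `propose_edit` function (a stand-in for the coding agent, which here just nudges one constant) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Policy = Callable[[float], float]  # maps an observed state to an action


def make_policy(gain: float) -> Policy:
    """A programmatic policy: a proportional controller written as plain code."""
    def policy(state: float) -> float:
        return -gain * state
    return policy


def run_episode(policy: Policy, start: float = 1.0, steps: int = 20) -> float:
    """Hypothetical toy environment: the action is added to the state.
    Reward is the negative distance from 0 at the end of the episode."""
    state = start
    for _ in range(steps):
        state = state + policy(state)
    return -abs(state)


def propose_edit(gain: float, log: List[Dict[str, object]]) -> float:
    """Stand-in for the coding agent. A real agent would read the log
    (and tests, replays, human notes) before editing; this stub only
    increases one constant by a fixed amount."""
    return gain + 0.25


def heuristic_learning(rounds: int = 4) -> Tuple[float, List[Dict[str, object]]]:
    gain = 0.0
    best_reward = run_episode(make_policy(gain))
    log: List[Dict[str, object]] = [
        {"gain": gain, "reward": best_reward, "accepted": True}
    ]
    for _ in range(rounds):
        candidate = propose_edit(gain, log)            # update: an edit to code
        reward = run_episode(make_policy(candidate))   # feedback: environment reward
        accepted = reward > best_reward
        log.append({"gain": candidate, "reward": reward, "accepted": accepted})
        if accepted:                                   # keep only edits that help
            gain, best_reward = candidate, reward
    return gain, log


final_gain, history = heuristic_learning()
```

No gradient is computed anywhere: each round is a discrete edit that is either kept or discarded based on feedback, and the log is the only thing that carries over between rounds.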

2. Components of a Heuristic System

An HS is more than an isolated policy.py. It contains at least a programmatic policy, state representation, feedback channels, experiment records, replays or tests, memory, and an update mechanism executed by a coding agent.

A single rule is not enough. Rules, feedback, history, and the next update path all need to connect before it becomes an HS.
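
To make the "connected parts" idea concrete, here is a minimal sketch of an HS as a data structure. All names (`HeuristicSystem`, `Trial`, `golden_cases`, the braking example) are hypothetical and chosen for illustration; a real HS would hold far richer feedback channels and would be edited by a coding agent rather than by a direct function swap.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]


@dataclass
class Trial:
    """One remembered experiment: which attempt ran and what came back."""
    version: int
    feedback: str
    passed: bool


@dataclass
class HeuristicSystem:
    policy: Callable[[State], str]                                        # programmatic policy
    golden_cases: List[Tuple[State, str]] = field(default_factory=list)   # replays / tests
    memory: List[Trial] = field(default_factory=list)                     # experiment records
    version: int = 0                                                      # counts attempted edits

    def check(self) -> bool:
        """Feedback channel: replay every golden case against the current policy."""
        return all(self.policy(s) == expected for s, expected in self.golden_cases)

    def update(self, new_policy: Callable[[State], str], feedback: str) -> bool:
        """Update path: try an edit, record it, keep it only if golden cases pass."""
        old_policy = self.policy
        self.policy = new_policy
        self.version += 1
        passed = self.check()
        self.memory.append(Trial(self.version, feedback, passed))
        if not passed:
            self.policy = old_policy  # roll back a regressing edit
        return passed


def cautious(state: State) -> str:
    return "brake" if state["distance"] < 5.0 else "cruise"


def reckless(state: State) -> str:
    return "cruise"


hs = HeuristicSystem(
    policy=cautious,
    golden_cases=[({"distance": 2.0}, "brake"), ({"distance": 50.0}, "cruise")],
)
accepted = hs.update(reckless, "agent tried removing the brake rule")
```

Here the rejected edit still leaves a trace in `memory`, which is the point: the rule, the tests that guard it, and the record of what was tried live in one object.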

3. How It Differs from Deep RL

As a table:

Axis	Deep RL	HL
Policy	Neural network parameters	Code: rules, state machines, controllers, MPC, macro-actions
State	Usually explicit observations	Usually explicit variables, detectors, caches, and other readable representations
Action	Produced by a neural network forward pass	Produced by executing code logic
Feedback	Mainly fixed reward	Provided through coding-agent context: tests, environment feedback, logs, and replays all count
Update	Gradient-based updates to neural-network parameters in a Deep RL algorithm	Direct code edits by a coding agent
Memory	On-policy methods basically have none; off-policy methods have replay buffers	Can explicitly store trials, summaries, failure reasons, replays, and version diffs

4. Why It Is Worth Maintaining

Heuristic Learning has several useful properties compared with Deep RL:

Explainability: neural networks are hard to explain, while HL policies can often be translated into plain language.
Sample efficiency: one effective code update can jump directly to a new policy, rather than climbing slowly through many small, learning-rate-limited gradient steps.
Regression-testability: old capabilities can become tests, replays, or golden cases.
Overfitting can be constrained: code heuristics can still overfit to seeds, environment details, or test loopholes, but simplification, regression checks, and multi-seed evaluation provide an engineering form of regularization.
It can avoid part of catastrophic forgetting: old capabilities do not have to live only inside model weights; they can be written into rule sets and tests.

The point is that a class of heuristics that used to be too expensive to maintain may now be worth owning.
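
As a rough illustration of regression checks and multi-seed evaluation, the sketch below guards a tiny threshold policy with golden cases and scores it across several random seeds. The policy, the cases, and the synthetic "ground truth" (alert when a reading exceeds 0.8) are assumptions made up for this example.

```python
import random
from statistics import mean
from typing import Callable, Iterable, List, Tuple

Policy = Callable[[float], str]


def threshold_policy(reading: float) -> str:
    """Hypothetical heuristic: raise an alert on high readings."""
    return "alert" if reading > 0.8 else "ok"


def loose_policy(reading: float) -> str:
    """A variant that alerts too early; used to show what the checks catch."""
    return "alert" if reading > 0.5 else "ok"


# Old capabilities frozen as golden cases: (input, expected output).
GOLDEN_CASES: List[Tuple[float, str]] = [(0.95, "alert"), (0.10, "ok"), (0.80, "ok")]


def regression_failures(policy: Policy, cases: List[Tuple[float, str]]):
    """Return (input, expected, actual) for every golden case the policy breaks."""
    return [(x, exp, policy(x)) for x, exp in cases if policy(x) != exp]


def accuracy_on_seed(policy: Policy, seed: int, n: int = 200) -> float:
    """Score the policy on n synthetic readings drawn with a fixed seed."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        reading = rng.random()
        truth = "alert" if reading > 0.8 else "ok"  # assumed ground truth
        correct += policy(reading) == truth
    return correct / n


def multi_seed_score(policy: Policy, seeds: Iterable[int]) -> Tuple[float, float]:
    """Mean and worst-case accuracy, so one lucky seed cannot hide a weakness."""
    scores = [accuracy_on_seed(policy, s) for s in seeds]
    return mean(scores), min(scores)


good_failures = regression_failures(threshold_policy, GOLDEN_CASES)
loose_failures = regression_failures(loose_policy, GOLDEN_CASES)
good_mean, good_worst = multi_seed_score(threshold_policy, range(5))
```

Reporting the worst seed alongside the mean is one simple way to apply the "engineering regularization" mentioned above: an edit that only helps on a single seed does not raise the minimum.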

5. The Minimal Maintenance Loop

A healthy HS needs at least two operations:

Absorb feedback: write new failures, logs, and rewards back into the system.
Compress history: fold local patches back into simpler, more maintainable representations.

That turns Continual Learning from “how do we update parameters?” into “how do we maintain a software system that keeps absorbing feedback?”
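
The two operations, absorbing feedback and compressing history, can be sketched on a deliberately small example. The `PatchedRule` class and its single-threshold compression are hypothetical: in practice a coding agent would perform the compression step as a refactor, not a fixed formula.

```python
from typing import Dict, List, Tuple


class PatchedRule:
    """A threshold rule that accumulates local patches, then folds them back in."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.patches: Dict[float, bool] = {}          # local overrides for failing inputs
        self.history: List[Tuple[float, bool]] = []   # every piece of feedback seen

    def decide(self, x: float) -> bool:
        if x in self.patches:
            return self.patches[x]
        return x >= self.threshold

    def absorb(self, x: float, expected: bool) -> None:
        """Absorb feedback: record it, and patch the rule if it currently fails."""
        self.history.append((x, expected))
        if self.decide(x) != expected:
            self.patches[x] = expected

    def compress(self) -> bool:
        """Compress history: replace the patches with one threshold that agrees
        with all recorded feedback, if such a threshold exists."""
        positives = [x for x, e in self.history if e]
        negatives = [x for x, e in self.history if not e]
        if not positives or not negatives or max(negatives) >= min(positives):
            return False  # no single threshold fits; keep the patches
        self.threshold = (max(negatives) + min(positives)) / 2
        self.patches.clear()
        return True


rule = PatchedRule(threshold=10.0)
rule.absorb(7.0, True)    # failure: becomes a patch
rule.absorb(8.0, True)    # failure: becomes a patch
rule.absorb(3.0, False)   # already correct: only recorded
patches_before = len(rule.patches)
compressed = rule.compress()
```

After compression the rule has no special cases left, yet it still agrees with every recorded observation, and it now generalizes to inputs such as 6.0 that no patch covered.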

6. Takeaway

Whatever can be verified is starting to become solvable.