Category: systems
-
Crafting Interpreters: Chapter 4 - Scanning
Scanning is the first structural boundary in an interpreter: raw characters become tokens, so the parser can work with language units instead of individual characters.
-
CS336: Lecture 3 - LM Architecture and Hyperparameters
TLDR: LM architecture is a stack of trade-offs across normalization, activations, attention, positional encoding, hyperparameters, stability, and inference cost.
-
CS336: Lecture 4 - Mixture of Experts
TLDR: MoE scales parameter count through sparse expert routing, but the real work is balancing tokens, capacity, communication cost, and specialization.
-
Compression Is All You Need: measuring mathematical progress
TLDR: A mathematical abstraction is valuable when it compresses downstream work: proofs become shorter, repeated patterns disappear, and the library becomes easier to extend.
-
Heuristic Learning: maintaining a learning system in code
TLDR: Heuristic Learning treats iterative agent work as maintaining a verifiable software system. Feedback updates code, tests, rules, state representations, and memory rather than neural network weights.
-
CS336: Lecture 1 - Language Modeling as Engineering
TLDR: Modern LM work is easiest to understand by building the stack yourself, because tokenization, data, compute, and evaluation are all leaky engineering choices.
-
CS336: Lecture 2 - PyTorch and resource accounting
Lecture 2 is about making training cost concrete: tensors, dtypes, memory, FLOPs, autograd, optimizers, data loading, checkpoints, and mixed precision all have resource prices.
-
AMP: automatic mixed precision as a dispatch policy
TLDR: AMP is not "turn the model into half precision." It is a runtime policy that runs safe, high-throughput ops in lower precision while protecting numerically sensitive paths.