CS336 Lecture 1：Language Modeling as Engineering

Takeaways

This lecture frames language modeling as an engineering problem under resource constraints. The main message is: to understand modern LMs, you need to build the stack yourself, because high-level APIs hide leaky, still-evolving abstractions.

The big ideas:

Frontier models are industrial-scale, expensive, and mostly opaque, so small models will not reproduce everything about frontier behavior.
What transfers from small-scale work is mostly mechanics and mindset: how Transformers, training, hardware, data, and scaling interact.
The “bitter lesson” is not “scale is all that matters.” It is: algorithms that scale matter. A useful framing is accuracy = efficiency x resources.
Modern LM design is driven by efficiency: avoid wasting compute on bad data, overly long token sequences, inefficient architectures, bad hyperparameters, or poor hardware utilization.

Tokenization

The technical center of the lecture is tokenization: converting strings into integer token sequences and back.

Key tradeoff: tokenizers balance vocabulary size against sequence length.

Character tokenization round-trips cleanly, but Unicode has around 150K characters, many rare.
Byte tokenization has a tiny fixed vocabulary of 256, but produces long sequences, which is bad for Transformers because attention cost grows roughly quadratically with sequence length.
Word tokenization is intuitive, but vocabularies become huge and open-ended; unseen words require awkward UNK handling.
BPE is the practical compromise: start from bytes, repeatedly merge the most frequent adjacent token pairs, and learn a vocabulary from corpus statistics.

The lecture’s BPE intuition: common strings should become short token sequences; rare strings can remain decomposed into smaller pieces. Tokenization is described as a “necessary evil”: useful for today’s compute constraints, but maybe eventually replaced by scalable byte-level models.

Enjoy Reading This Article?