Posts tagged moe
-
CS336: Lecture 4 - Mixture of Experts
TLDR: MoE scales parameter count through sparse expert routing, but the real work is balancing tokens, capacity, communication cost, and specialization.