
MoE vs Dense Models: What Actually Happens When a Model Thinks

When you send a prompt to a language model, tokens come back. The format is identical regardless of the architecture that produced them. From the outside, a dense model and a mixture-of-experts model look like the same kind of machine.

They are not. The compute path that produces each token is fundamentally different — not in some abstract theoretical sense, but in which matrix multiplications execute, how much silicon activates, and how many parameters participate in the result.

This matters for anyone building on top of these models, paying for inference, or trying to understand why two models with similar benchmark scores behave differently in production. What follows is a description of what actually happens inside each architecture, where they diverge, and what remains uncertain.

Dense Models: The Full-Width Pass

A dense transformer processes every token through every parameter in every layer. No conditional path. No selection. Every matrix multiplication fires for every token.

The architecture is a stack of identical layers. Each layer has two main components: a self-attention mechanism and a feed-forward network. Attention allows the model to relate tokens to each other across the sequence. The feed-forward network transforms each token's representation independently — an up-projection to a wider dimension, a nonlinear activation, and a down-projection back.

For a single token during autoregressive generation, the path is: embedding lookup, then attention and feed-forward through every layer, then a final projection to vocabulary logits, then sampling. Every parameter in the model fires. A 70-billion-parameter dense model uses all 70 billion parameters per token, every token, without exception.
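The feed-forward stage described above can be sketched in a few lines. This is a minimal illustration, not any specific model's implementation: the dimensions are arbitrary, and the GELU approximation stands in for whatever activation a given model uses.

```python
import numpy as np

# Illustrative dimensions -- real models are far larger.
d_model, d_ff = 512, 2048

rng = np.random.default_rng(0)
W_up = rng.standard_normal((d_model, d_ff)) * 0.02    # up-projection
W_down = rng.standard_normal((d_ff, d_model)) * 0.02  # down-projection

def dense_ffn(x):
    """One dense feed-forward pass: up-project, nonlinearity, down-project.
    Every weight in W_up and W_down participates for every token."""
    h = x @ W_up
    # tanh approximation of GELU
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ W_down

x = rng.standard_normal(d_model)   # one token's hidden state
y = dense_ffn(x) + x               # residual connection
```

The point of the sketch is the absence of any branch: nothing in `dense_ffn` depends on what the token is.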

Simple, well-understood, and expensive in direct proportion to model size.

Mixture-of-Experts: Conditional Compute

An MoE transformer replaces the single feed-forward network in each layer — or in selected layers — with a bank of smaller feed-forward networks (the experts) and a routing mechanism that selects which experts process each token.

The attention layers remain shared. Every token still passes through the same attention mechanism at each layer. The sparsity is specific to the feed-forward stage.

At each MoE layer, a small router network takes the token's hidden state and produces a score for each expert. The top-k experts (usually k=1 or k=2) are selected. Only those experts contribute to the computation. Their outputs are combined, weighted by the router's confidence scores.

The result: a model can have 8, 16, or 64 expert feed-forward networks per layer, but each token only activates a small fraction. Mixtral 8x7B has roughly 47 billion total parameters but activates approximately 13 billion per token. The model's total capacity is large, but any given token only touches a fraction of it.
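A rough sketch of one MoE feed-forward stage, under illustrative assumptions: the expert count, k, and dimensions are arbitrary, and the experts use a plain ReLU rather than any particular model's activation.

```python
import numpy as np

d_model, d_ff, n_experts, k = 512, 1024, 8, 2  # illustrative sizes

rng = np.random.default_rng(0)
W_router = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d_model)) * 0.02)
           for _ in range(n_experts)]

def moe_ffn(x):
    """Route one token to its top-k experts; only those experts execute."""
    # Router: a single linear projection followed by softmax.
    logits = x @ W_router
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Select the top-k experts by router score.
    top = np.argsort(probs)[-k:]
    weights = probs[top] / probs[top].sum()  # renormalize over selected experts
    out = np.zeros(d_model)
    for w, i in zip(weights, top):
        W_up, W_down = experts[i]
        out += w * (np.maximum(x @ W_up, 0.0) @ W_down)  # ReLU expert
    return out

x = rng.standard_normal(d_model)
y = moe_ffn(x)  # only 2 of the 8 experts ran for this token
```

The six unselected experts' weights were never read during this forward pass; they exist only as memory-resident capacity.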

One Token, Two Paths

Trace what happens to a single token through both architectures.

Dense. The token enters as an embedding vector. At layer 1, it passes through self-attention — queries, keys, and values are computed, attention scores calculated, context mixed in. Then the full feed-forward network fires: up-projection, activation, down-projection. Residual connection. Normalize. Repeat for every layer. At the end, project to logits, sample the next token. Every parameter touched. Every multiplication executed.

MoE. Same embedding. Same attention at layer 1 — identical mechanism, identical cost. Then the hidden state reaches the MoE feed-forward stage. The router evaluates the state, produces scores across all experts, selects the top two. Those two experts fire. Their outputs are blended according to the router weights. Residual. Normalize. Repeat. At dense layers (not all layers need to be MoE), the full feed-forward network fires as usual. At MoE layers, only the selected experts participate.

The attention computation is identical. The difference is entirely in the feed-forward stage — how much of the model's capacity processes each token.

The Router

The gating network is small relative to the experts it governs. A typical router is a single linear layer: it takes the hidden state (dimension d) and projects it to a vector of length N (number of experts), followed by softmax to produce a probability distribution.

This is a learned function. During training, the router learns which patterns of hidden state should be sent to which expert. In principle, this allows specialization — one expert might handle syntactic structure, another numerical reasoning, another entity knowledge. In practice, the specialization patterns observed in research are real but noisy. Experts develop tendencies, not clean jurisdictions.

Training MoE models requires an auxiliary load-balancing loss to prevent expert collapse — a failure mode where the router learns to send all tokens to a small number of experts while the rest go unused. This loss penalizes uneven routing distributions. Without it, MoE models degenerate into smaller dense models with wasted parameters.
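One common formulation is the Switch Transformer-style auxiliary loss: N multiplied by the sum over experts of f_i times P_i, where f_i is the fraction of tokens whose top choice is expert i and P_i is the mean router probability assigned to expert i over the batch. Uniform routing gives the minimum value of 1.0; collapse onto few experts drives it up. A minimal sketch:

```python
import numpy as np

def load_balance_loss(router_probs):
    """Switch-style load-balancing loss.
    router_probs: (tokens, n_experts) softmax outputs from the router.
    Returns N * sum_i(f_i * P_i); minimized at 1.0 by uniform routing."""
    n_tokens, n_experts = router_probs.shape
    top1 = router_probs.argmax(axis=1)
    f = np.bincount(top1, minlength=n_experts) / n_tokens  # routing fractions
    P = router_probs.mean(axis=0)                          # mean router probability
    return n_experts * float(f @ P)

# Uniform router output over 4 experts: loss is exactly 1.0.
uniform = np.full((16, 4), 0.25)

# Collapsed routing -- nearly all probability on expert 0: loss rises toward N.
collapsed = np.tile([0.97, 0.01, 0.01, 0.01], (16, 1))
```

Because both f and P enter the product, the loss penalizes concentration in the hard routing decisions and in the soft probabilities, pushing the router toward spreading tokens across experts.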

The router adds a decision point at every MoE layer that does not exist in dense models. That decision introduces variance: semantically similar tokens can be routed to different experts, and the downstream effect is difficult to predict or interpret.

Parameter Count vs Active Compute

When someone says a model has 400 billion parameters, the instinct is to compare it to a 70-billion-parameter model and conclude it is roughly six times more powerful or six times more expensive. For dense models, parameter count correlates directly with per-token FLOPs. For MoE models, it does not.

An MoE model with 400 billion total parameters and top-2 routing across 16 experts per layer might activate 50 billion parameters per token. Its per-token compute cost is comparable to a 50-billion-parameter dense model. Its total representational capacity — the space of functions the model can express — is closer to the 400-billion figure. But that capacity is distributed across experts and accessed conditionally, not uniformly.

Two numbers matter: total parameters (which determines memory footprint) and active parameters per token (which determines per-token FLOPs). Dense models have one number. MoE models have two. Conflating them is the source of most misconceptions about MoE architectures.
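The two-number bookkeeping is simple enough to write down directly, using the hypothetical figures from this section (16-bit weights, decimal gigabytes, and ignoring activations and KV cache):

```python
def describe(total_b, active_b, bytes_per_param=2):
    """Back-of-envelope model stats. total_b / active_b in billions of
    parameters; 2 bytes per parameter assumes 16-bit weights."""
    return {
        "total_params_B": total_b,     # drives memory footprint
        "active_params_B": active_b,   # drives per-token FLOPs
        # 1e9 params * bytes_per_param bytes = bytes_per_param GB (decimal)
        "weight_memory_GB": total_b * bytes_per_param,
    }

dense_70b = describe(70, 70)    # dense: the two numbers coincide
moe_400b = describe(400, 50)    # hypothetical MoE: they diverge by 8x
```

For the dense model there is one number to quote; for the MoE model, quoting either one alone misleads in a different direction.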

The Physics

I use "physics" here to mean the computational substrate — what the hardware does, measured in operations and data movement.

For autoregressive generation, the bottleneck in dense models is typically memory bandwidth, not arithmetic throughput. The model's parameters must be read from memory for each token, and modern accelerators can compute faster than they can move data. MoE models intensify this imbalance: the total parameter footprint is larger, so more weights must stay memory-resident, even though fewer bytes are read per token.
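A roofline-style sanity check makes the bandwidth claim concrete. The hardware numbers below are round illustrative assumptions, not measurements of any specific accelerator:

```python
# Assumed accelerator: ~1000 TFLOP/s of 16-bit compute, ~3 TB/s of
# memory bandwidth. Both figures are illustrative assumptions.
flops_per_s = 1.0e15
bytes_per_s = 3.0e12
machine_intensity = flops_per_s / bytes_per_s  # ~333 FLOPs per byte moved

# Decoding one token at batch size 1: each 2-byte weight is read once
# and used in roughly 2 FLOPs (one multiply, one add) -> ~1 FLOP/byte.
decode_intensity = 2 / 2

# Far below the machine's balance point: the pass is bandwidth-limited.
memory_bound = decode_intensity < machine_intensity
```

Batching raises arithmetic intensity (one weight read serves many tokens), which is why throughput-oriented serving can escape the memory-bound regime while single-stream decoding usually cannot.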

To put numbers on it: a dense 70B model requires roughly 140 GB in 16-bit precision. An MoE model with 47B total parameters requires roughly 94 GB. A hypothetical 400B-parameter MoE model requires around 800 GB — regardless of how few parameters activate per token. All experts must be memory-resident because the router decides which to use at runtime.

Serving across multiple accelerators introduces a different cost. Expert parallelism means different experts reside on different devices, and tokens must be routed across the network to reach their selected experts. This all-to-all communication pattern is expensive and becomes a bottleneck at scale. Dense models use tensor or pipeline parallelism, which have more predictable communication patterns.

The router computation itself is cheap — a single matrix multiplication per MoE layer. The real cost is dispatching: gathering tokens for each expert, sending them to the correct device, collecting results. This is an engineering problem more than a mathematical one, and it is largely solved in well-optimized serving infrastructure. But it adds latency and system complexity.

Finally, utilization patterns differ. Dense models produce predictable, uniform load across compute units. MoE models produce uneven load — some experts activate frequently, others rarely. This reduces average hardware utilization and complicates capacity planning.

Why MoEs Can Feel Uneven

MoE models have high total capacity but uneven access to it.

If certain token patterns consistently route to the same experts, those experts are well-trained on that distribution. Other experts may see less diverse traffic and develop narrower competence. The result is a model that performs very well on some tasks and less well on others — not because it lacks parameters, but because the relevant experts for a given input may be undertrained relative to the model's overall scale.

This is externally observable: MoE models sometimes show higher variance across benchmark subtasks than dense models of comparable active parameter count. The aggregate score may be competitive, but the distribution of per-task performance is wider.

Dense models distribute all training signal across all parameters. No routing bottleneck starves part of the network. This tends to produce more uniform performance across tasks — though not necessarily higher peak performance.

Common Misconceptions

"Most of the model is asleep." The attention layers — a substantial fraction of total compute — are fully active for every token. Only the feed-forward experts are conditionally activated. And the non-selected experts are not idle in any meaningful sense; they occupy memory, consume power, and sit ready to activate. They simply were not selected for this token.

"MoE is fake parameter count." The parameters are real. They are trained. They encode learned representations. The model genuinely has more representational capacity than its active parameter count suggests. What is misleading is comparing total MoE parameters to total dense parameters as if they imply equivalent per-token cost. The parameter count is real. The comparison is wrong.

"Dense is always more stable." Dense models have their own training instabilities — loss spikes, sensitivity to learning rate and batch size. MoE adds routing instability and the risk of expert collapse, but modern implementations have substantially mitigated these through auxiliary losses, capacity factors, and expert-level dropout. Stability is a function of engineering effort, not architecture alone.

"MoE means multiple independent minds." Experts are feed-forward sub-networks that process individual tokens. They do not maintain independent context, memory, or understanding. The shared attention layers build the contextual representation that all experts operate on. Experts are specialized matrix multiplications, not agents.

What We Observe vs What We Infer

This distinction matters because most people discuss model internals with more confidence than the evidence supports.

Directly observable. Output quality on benchmarks and production tasks. Latency and throughput under controlled conditions. Cost per token from API providers. Behavioral differences across prompt types — consistency, instruction following, failure patterns.

Inferred from published research. Architecture details for open models — layer count, expert count, routing strategy. Training procedures, data composition, and hyperparameters where disclosed. Expert specialization patterns from activation analysis.

Inferred from external behavior. Whether a proprietary model uses MoE — sometimes visible through latency patterns, parameter count disclosures, or informed reporting. Rough expert utilization patterns from probing studies. Whether routing variance contributes to observed behavioral inconsistencies.

Unknown for proprietary models. Exact architecture. Number of experts and routing strategy. Which layers use MoE and which remain dense. Training data composition. Load balancing implementation. Whether expert specialization is deliberately shaped or emergent.

Epistemic status matters. "GPT-4 is widely reported to use a mixture-of-experts architecture" is different from "GPT-4 uses 16 experts with top-2 routing." The first is well-sourced reporting. The second is specific enough to be wrong.

Why This Matters for Real Systems

MoE models offer more capacity per FLOP, which can translate to lower compute cost per token at a given quality level. But they require more memory, which means more hardware. The cost equation depends on whether the deployment is compute-bound or memory-bound. Most autoregressive inference is memory-bound, which limits MoE's cost advantage in practice.

For single-request latency, MoE and dense models with similar active parameter counts are roughly comparable — routing and dispatch overhead is small. Throughput under load is where MoE pulls ahead, because lower per-token compute allows more concurrent requests on the same hardware, assuming memory is sufficient.

Determinism is subtler. Dense models produce deterministic outputs for a given input at temperature zero with fixed seeds. MoE models do the same in principle — the router is a deterministic function. In practice, floating-point non-determinism in distributed expert computation can introduce subtle variance. Systems requiring exact reproducibility should verify this empirically.

On long tasks — and this is inference, not established fact — MoE models sometimes show less consistent performance over extended multi-step work. One hypothesis is that routing variance compounds across many tokens, producing higher output divergence over long sequences. Dense models, applying the same full transformation at every step, may accumulate less path-dependent drift. Plausible, but not conclusively demonstrated.

For local and edge deployment, MoE is harder. All experts must fit in memory. A 47B-parameter MoE model requires the same memory as a 47B dense model, despite a per-token compute cost comparable to that of a 13B dense model. Dense models offer a better ratio of capability to memory footprint. Expert offloading — loading experts from disk on demand — exists in research but adds significant latency.

Where Each Architecture Sits

Dense models are well-understood, predictable, and efficient when memory is the binding constraint. They produce consistent behavior across tasks. They are easier to serve, easier to quantize, and easier to reason about. For a given compute budget, they are the simpler engineering choice.

MoE models offer more capacity per FLOP. They scale total parameters more cheaply during training and can achieve strong benchmark performance with lower per-token compute. They are the right choice when serving cost at scale is the primary constraint and the engineering team can absorb the complexity of expert parallelism, routing, and load balancing.

Neither architecture is universally superior. Dense models trade compute for consistency. MoE models trade complexity for efficiency. The right choice depends on deployment constraints — and those constraints are specific to each system, each budget, each latency target.

Both architectures are evolving. The tradeoffs are real and measurable. Anyone who tells you one is strictly better than the other is simplifying past the point of usefulness.