PaperTrace — Interactive ML Paper Deep-Dives

TL;DR

DeepSeek-V4-Pro is a 1.6T-parameter MoE model (49B active) pre-trained on 32T+ tokens. Key innovations: (1) CSA/HCA compressed attention — only 10% KV cache and 27% inference FLOPs vs DeepSeek-V3.2 at 1M context; (2) mHC Birkhoff-constrained residual connections for stable trillion-scale training; (3) the Muon optimizer replacing Adam for consistent gradient spectral norm. Achieves 80.6% SWE Verified, 93.5 LiveCodeBench, 3206 Codeforces Rating (Think Max mode).

DeepSeek-V2 (2024)

Introduced MLA (Multi-head Latent Attention) to compress KV cache via low-rank projection

Scales

DeepSeek-V3 (2025, 671B)

Auxiliary-loss-free MoE load balancing + multi-token prediction; trained on 14.8T tokens

V4 adds

V4 Innovation 1+2: CSA/HCA + DSA

Two-level attention overhaul — 10× KV cache reduction + 2× long-context compute reduction

V4 Innovation 3: Engram Memory

O(1) hash-based static knowledge store — decouples factual recall from FFN reasoning

V4 Innovation 4: mHC

Birkhoff-constrained residual streams via Sinkhorn-Knopp projection — stable gradient flow at 1.6T

= V4

DeepSeek-V4-Pro (2026)

1.6T / 49B active · 1M context · 32T+ tokens · MIT license

3% params active per token

10× smaller KV cache vs V3

No Nvidia GPUs needed

1. Background: Two Scaling Walls

Frontier LLMs hit two fundamental walls as context grows. The memory wall: the KV cache for a 1M-token context can exceed 100 GB, bottlenecking batch size and GPU utilisation. The compute wall: dense attention is O(L²) in sequence length, so at 1M tokens each attention head would have to score on the order of 10¹² query-key pairs per layer, which is infeasible in practice.
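To make the two walls concrete, the back-of-envelope sketch below estimates the KV cache size and the number of query-key scores for plain dense attention. The layer count, KV-head count, head dimension, and 2-byte element size are hypothetical values chosen for illustration; they are not DeepSeek-V3 or V4 hyperparameters.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache one key and one value vector per head, layer, and token."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem


def dense_attention_scores(tokens):
    """Query-key pairs scored per head per layer by causal dense attention."""
    return tokens * (tokens + 1) // 2


# Hypothetical dense-attention configuration (assumed, not taken from the paper).
TOKENS = 1_000_000
cache_gb = kv_cache_bytes(TOKENS, layers=60, kv_heads=8, head_dim=128) / 1e9
pairs = dense_attention_scores(TOKENS)

print(f"KV cache at 1M tokens: {cache_gb:.1f} GB")
print(f"Query-key pairs per head per layer: {pairs:.2e}")
```

Under these assumed sizes the cache is well above 100 GB, and the pair count grows quadratically with context length while the cache grows only linearly.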

DeepSeek-V4 attacks both walls simultaneously with two attention innovations (CSA/HCA and DSA), then adds Engram memory to make 1M-context knowledge retrieval reliable, and mHC to keep a 1.6T-parameter model stable during training.

2. Foundation: MoE at 1.6T / 49B Active

V4-Pro uses Mixture-of-Experts to keep per-token compute feasible at 1.6T total parameters. Each token activates only the top-K expert FFN layers — roughly 3% of all weights — leaving the rest idle. The routing is learned end-to-end.

\text{MoE}(x) = \sum_{i \in \text{Top-K}(x)} g_i(x) \cdot E_i(x), \quad g_i(x) = \frac{e^{s_i(x)}}{\sum_{j \in \text{Top-K}(x)} e^{s_j(x)}}

x \in \mathbb{R}^d

Input token hidden state

\text{Top-K}(x)

Indices of the K experts with highest routing scores — K=2 in V4 (out of 64 routed experts per layer)

s_i(x)

Routing score for expert i — computed by a small gating network

g_i(x)

Gating weight — softmax re-normalised over the top-K selected experts only

E_i(x)

Output of expert i (a 2-layer FFN with its own weights)
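The routing formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not DeepSeek's implementation: the linear gate, the four toy experts, and the names moe_forward and gate_weights are all assumptions made for the example.

```python
import math


def moe_forward(x, gate_weights, experts, k=2):
    """Top-k MoE layer: route x to the k highest-scoring experts and mix their outputs."""
    # Routing score s_i(x): here a linear gate with one weight vector per expert.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    # Top-K(x): indices of the k largest routing scores.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # g_i(x): softmax over the selected experts only (max-subtracted for stability).
    m = max(scores[i] for i in top)
    exp_scores = {i: math.exp(scores[i] - m) for i in top}
    total = sum(exp_scores.values())
    gates = {i: e / total for i, e in exp_scores.items()}
    # Weighted sum of the selected experts' outputs; the others are never evaluated.
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)
        out = [o + gates[i] * yi for o, yi in zip(out, y)]
    return out, gates


# Toy setup: 4 experts that each scale the input by a different constant.
experts = [lambda v, c=c: [c * vi for vi in v] for c in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
output, gates = moe_forward([1.0, 2.0], gate_weights, experts, k=2)
print(sorted(gates), output)
```

Only the two selected experts are called, which is where the per-token compute saving comes from; the gate weights of the selected experts always sum to 1.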

Interactive figure: MoE routing with 8 illustrative experts (Code, Math, Science, Lang, Logic, Fact, Style, Misc), of which the top 2 are active for each token.

Why 1.6T total but only 49B active? MoE separates model capacity from inference cost. The 1.6T parameters store a vast library of specialised knowledge; the 49B active parameters per token are what you actually pay for in compute at inference time. A dense 49B model would have the same per-token compute but roughly 33× fewer parameters in which to store knowledge.
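The capacity-versus-cost arithmetic is easy to check directly. The two constants below are the parameter counts quoted in this article; nothing else is assumed.

```python
TOTAL_PARAMS = 1.6e12   # total parameters, as stated in the text
ACTIVE_PARAMS = 49e9    # parameters active per token, as stated in the text

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
capacity_ratio = TOTAL_PARAMS / ACTIVE_PARAMS

print(f"Active fraction per token: {active_fraction:.1%}")
print(f"Total vs active parameters: {capacity_ratio:.1f}x")
```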

3. Innovation 1: Compressed Attention (CSA + HCA)

Standard multi-head attention caches one key and one value vector per head per token. V4 replaces this with a two-level scheme: CSA (Compressed Sparse Attention) for medium-range context, and HCA (Heavily Compressed Attention) for ultra-long-range layers — each with a different learned compression ratio.

\tilde{K}_l = W_K^c \cdot X,\quad \tilde{V}_l = W_V^c \cdot X,\quad d_c \ll d_h \cdot n_h
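A minimal sketch of the low-rank idea in the equation above, for a single token: project the hidden state down to d_c dimensions and cache only the result. The toy sizes and helper names are assumptions, and the sketch does not model how CSA and HCA differ or how the compressed entries are later used by attention.

```python
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix, stored as a list of rows, by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Gaussian matrix standing in for learned projection weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, n_heads, d_head, d_c = 64, 8, 8, 4   # toy sizes with d_c << n_heads * d_head

W_K_c = random_matrix(d_c, d_model, rng)      # key compression matrix W_K^c
W_V_c = random_matrix(d_c, d_model, rng)      # value compression matrix W_V^c

x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]   # one token's hidden state
k_tilde = matvec(W_K_c, x)                    # compressed key, length d_c
v_tilde = matvec(W_V_c, x)                    # compressed value, length d_c

full_floats = 2 * n_heads * d_head            # uncompressed K and V entries per token
cached_floats = len(k_tilde) + len(v_tilde)   # entries the compressed cache stores
print(f"Cached values per token: {cached_floats} instead of {full_floats}")
```

The cache footprint per token depends only on d_c, not on the number of heads, which is why a small latent dimension translates directly into a smaller KV cache.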

KV Cache Size: V3 vs V4-Pro (interactive figure; context length selectable from 8K to 1M tokens)

At 128K context: DeepSeek-V3 needs about 1.0 GB and DeepSeek-V4-Pro about 0.10 GB, a saving of 0.9 GB (a 10× reduction).

Approximate estimates based on reported 10% KV cache ratio (CSA/HCA vs V3 MLA).
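The figure's estimate can be reproduced roughly as follows. The sketch assumes that the cache grows linearly with context length, that "128K" means 128,000 tokens, and that the 10% ratio holds at every context length; only the 128K data point (1.0 GB vs 0.10 GB) comes from the figure itself, so the other rows are extrapolations rather than reported numbers.

```python
BASELINE_CONTEXT = 128_000   # tokens at the figure's 128K setting (assumed exact value)
BASELINE_V3_GB = 1.0         # V3 estimate shown at 128K context
V4_RATIO = 0.10              # reported V4-Pro KV cache size relative to the baseline


def kv_cache_estimate_gb(context_tokens):
    """Return (V3, V4-Pro) KV cache estimates in GB, assuming linear growth in context."""
    v3 = BASELINE_V3_GB * context_tokens / BASELINE_CONTEXT
    return v3, v3 * V4_RATIO


for label, tokens in [("8K", 8_000), ("32K", 32_000), ("128K", 128_000),
                      ("512K", 512_000), ("1M", 1_000_000)]:
    v3, v4 = kv_cache_estimate_gb(tokens)
    print(f"{label:>4}: V3 {v3:6.2f} GB, V4-Pro {v4:5.2f} GB, saved {v3 - v4:5.2f} GB")
```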

4. Innovation 2: DeepSeek Sparse Attention (DSA)

Even after compressing the KV cache, the attention computation itself is still O(L²). DSA adds a second efficiency layer: instead of attending to all L tokens, each query selects only the top-k most relevant tokens via a lightweight Lightning Indexer, reducing the main attention cost from O(L²) to O(L·k), where k ≪ L is the number of tokens kept per query.

Dense Attention

O(L²)

Every query attends to all L tokens

DSA (Sparse)

O(L · k)

Each query selects top-k relevant tokens

The Lightning Indexer is a small, fast network (FP8 precision, few attention heads) that scores each token's relevance to the query. It runs before the main attention layer and returns the k token indices to attend to. Result: ~1.5× per-layer speedup, ~2× end-to-end GPU cost reduction at 100K+ tokens.
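The two-stage pattern described above (cheap scoring, then attention over the selected tokens only) can be sketched as follows. This is an illustrative toy, not the Lightning Indexer: the dot-product indexer over 2-dimensional index vectors, the function name sparse_attention, and all data are assumptions.

```python
import heapq
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_attention(query, keys, values, index_query, index_keys, k):
    """Attend to only the k tokens that a cheap indexer scores as most relevant."""
    # Stage 1 (indexer): score every token using low-dimensional index vectors.
    index_scores = [dot(index_query, ik) for ik in index_keys]
    selected = heapq.nlargest(k, range(len(keys)), key=lambda i: index_scores[i])
    # Stage 2 (main attention): full-width softmax attention over the selected tokens.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, selected):
        out = [o + (w / total) * v for o, v in zip(out, values[i])]
    return out, sorted(selected)


# Toy example: 6 cached tokens, 2-dim index vectors, 4-dim keys and values.
keys = [[float(i == j) for j in range(4)] for i in range(4)] + [[0.5] * 4, [0.1] * 4]
values = [[float(i)] * 4 for i in range(6)]
index_keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0], [0.2, 0.2], [0.8, 0.0]]
output, selected = sparse_attention([1.0, 0.0, 0.0, 0.0], keys, values,
                                    index_query=[1.0, 0.0], index_keys=index_keys, k=3)
print(selected)   # indices of the tokens the indexer kept
```

In this sketch the indexer still touches every cached token, so the saving comes from the index vectors being much narrower than the full keys, while the expensive softmax attention runs over only k tokens.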

DSA vs CSA/HCA: what's the difference? CSA/HCA reduce memory by compressing the KV representation (less cached data per token). DSA reduces compute by sparsifying the attention pattern (fewer tokens attended to). They are orthogonal and both applied in V4.

5. Innovation 3: Engram — Conditional Memory

Transformers store factual knowledge in FFN weights — but this conflates two different tasks: dynamic reasoning (which needs the full network) and static fact retrieval (which is just a lookup). Engram (arXiv 2601.07372) adds a separate O(1) hash-based memory module alongside the FFN.

m(x) = \mathcal{M}[\text{hash}(W_q x)], \quad h' = h + \alpha \cdot m(x)

W_q x

Linear query projection — maps hidden state to a low-dimensional lookup key

\text{hash}(\cdot)

Locality-sensitive hash — similar queries map to nearby addresses; enables approximate nearest-neighbour retrieval

\mathcal{M}[\cdot]

Static memory table — a large lookup table of knowledge vectors fixed after pre-training

\alpha

Learned blending coefficient — controls how much memory to mix in; optimal allocation is ~20–25% memory vs 75–80% FFN
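The lookup defined above can be illustrated with a small sketch. The hash used here is sign-random-projection hashing (one bit per random hyperplane), a standard locality-sensitive hash swapped in for illustration; the Engram paper's actual hash, table size, and training procedure are not reproduced, and the value of alpha is arbitrary.

```python
import random


def lsh_address(key, hyperplanes):
    """Sign-random-projection hash: one bit per hyperplane, packed into an integer."""
    address = 0
    for plane in hyperplanes:
        bit = sum(p * k for p, k in zip(plane, key)) >= 0.0
        address = (address << 1) | int(bit)
    return address


def engram_step(x, h, W_q, hyperplanes, memory, alpha):
    """Compute h' = h + alpha * M[hash(W_q x)] with a single table read."""
    key = [sum(w * xi for w, xi in zip(row, x)) for row in W_q]   # W_q x
    address = lsh_address(key, hyperplanes)                       # hash(.)
    m = memory[address]                                           # M[.]
    return [hi + alpha * mi for hi, mi in zip(h, m)], address


rng = random.Random(0)
d_model, d_key, n_bits = 8, 4, 3
W_q = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_key)]
hyperplanes = [[rng.gauss(0, 1) for _ in range(d_key)] for _ in range(n_bits)]
memory = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(2 ** n_bits)]

h = [rng.gauss(0, 1) for _ in range(d_model)]
h_new, address = engram_step(h, h, W_q, hyperplanes, memory, alpha=0.25)
print(address, len(h_new))
```

The cost of the lookup is one small projection plus one table read, independent of how many entries the table holds, which is the sense in which the retrieval is O(1).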

Needle-in-a-Haystack (NiAH) accuracy at 1M tokens

Without Engram

84.2%

→

With Engram

97.0%

+12.8pp

Task: recall a specific fact injected at a random position in a 1M-token document.

6. Innovation 4: Manifold-Constrained Hyper-Connections (mHC)

Standard residual connections use a fixed identity skip: h_out = h_in + F(h_in). Hyper-Connections (HC) generalise this with learnable mixing matrices across multiple residual streams. mHC (arXiv 2512.24880) adds a hard constraint: the residual mixer must be a doubly stochastic matrix (Birkhoff polytope), preventing any stream from amplifying signal.

\mathbf{x}_{l+1} = \mathcal{H}_l^{\mathrm{res}}\mathbf{x}_l + (\mathcal{H}_l^{\mathrm{post}})^\top \mathcal{F}(\mathcal{H}_l^{\mathrm{pre}}\mathbf{x}_l,\, \mathcal{W}_l)

\mathbf{x}_l \in \mathbb{R}^{n \times C}

Expanded hidden state — n parallel residual streams of dimension C (n=4 typically)

\mathcal{H}_l^{\mathrm{res}} \in \mathbb{R}^{n \times n}

Residual stream mixer — routes information between streams; constrained to Birkhoff polytope

\mathcal{H}_l^{\mathrm{pre}},\, \mathcal{H}_l^{\mathrm{post}} \in \mathbb{R}^{1 \times n}

Input/output projections — compress n streams → 1 before layer, expand 1 → n after layer

\mathcal{F}(\cdot,\, \mathcal{W}_l)

Layer function — attention or FFN with weights W_l

\mathcal{H}_l^{\mathrm{res}} \in \left\{M \in \mathbb{R}^{n \times n} \;\middle|\; M\mathbf{1}_n = \mathbf{1}_n,\;\mathbf{1}_n^\top M = \mathbf{1}_n^\top,\; M \geq 0\right\}
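A minimal sketch of how an unconstrained matrix can be pushed onto the Birkhoff polytope with Sinkhorn-Knopp normalisation, the projection named earlier in this article. The exponential parameterisation, the iteration count, and the function names are assumptions for illustration, not details confirmed by the mHC paper.

```python
import math


def sinkhorn_knopp(logits, iterations=100):
    """Map an n x n matrix of real logits to an approximately doubly stochastic matrix."""
    # Exponentiate so every entry is strictly positive (M >= 0).
    m = [[math.exp(v) for v in row] for row in logits]
    n = len(m)
    for _ in range(iterations):
        # Normalise rows so that M 1 = 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Normalise columns so that 1^T M = 1^T.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def mix_streams(h_res, streams):
    """Apply the residual mixer: each output stream is a convex blend of input streams."""
    n, width = len(streams), len(streams[0])
    return [[sum(h_res[i][j] * streams[j][c] for j in range(n)) for c in range(width)]
            for i in range(n)]


logits = [[0.5, -1.0, 0.2, 0.0],
          [1.5, 0.3, -0.7, 0.1],
          [-0.2, 0.8, 0.4, -1.1],
          [0.0, 0.6, -0.3, 0.9]]
H_res = sinkhorn_knopp(logits)
streams = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]   # n=4 streams, C=2
mixed = mix_streams(H_res, streams)

print([round(sum(row), 6) for row in H_res])                   # row sums
print([round(sum(r[j] for r in H_res), 6) for j in range(4)])  # column sums
```

Because every column of the mixer sums to 1, the total signal summed across streams is unchanged by mixing, and because every row is a convex combination, no output stream can exceed the largest input stream. That is the non-amplification property the constraint is meant to guarantee.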

7. Training Details

The following specs are sourced from the official HuggingFace model page (deepseek-ai/DeepSeek-V4-Pro) and the paper. Training hardware is not stated in the paper.

Pre-training tokens

32T+

Total parameters

1.6T

Active parameters

49B

Context length

1M

Precision

FP4 + FP8

License

MIT

Precision: MoE expert parameters use FP4; most other parameters use FP8. Post-training follows a two-stage paradigm: Stage 1 cultivates domain-specific experts via SFT and RL with GRPO; Stage 2 consolidates them into a unified model through on-policy distillation.

V4 supports three reasoning modes: Non-Think (fast responses), Think High (deliberate reasoning), and Think Max (maximum reasoning effort, requires ≥384K context).

8. Training Algorithm: Muon Optimizer

V4 uses the Muon optimizer in place of Adam. The key difference: instead of tracking per-parameter gradient magnitudes, Muon applies a Newton-Schulz orthogonalisation to the gradient matrix, producing updates with a consistent spectral norm regardless of gradient scale. This improves training stability at 1.6T-parameter scale.

Algorithm 1: Muon Optimizer

Verbatim from DeepSeek-V4 technical report

Require: Learning rate η, momentum μ, weight decay λ, rescaling factor γ

vs Adam: Adam tracks gradient magnitude per-parameter (scale-dependent). Muon orthogonalises the gradient matrix (scale-independent) — every weight matrix gets an update of consistent spectral norm, preventing large matrices from dominating training dynamics at 1.6T scale.
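The orthogonalisation idea can be sketched in plain Python. This is not the report's Algorithm 1: it uses the textbook cubic Newton-Schulz iteration (practical Muon implementations typically use a tuned higher-order polynomial with only a few steps), it assumes a full-rank matrix, and the way the learning rate, momentum, weight decay, and rescaling factor enter muon_step is an assumption made for illustration.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalise(g, steps=20):
    """Approximate the nearest orthogonal matrix to g with cubic Newton-Schulz."""
    # Scale by the Frobenius norm so all singular values lie in (0, 1].
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # X <- 1.5 X - 0.5 X X^T X pushes every singular value towards 1.
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, xxt_x)]
    return x


def muon_step(w, grad, momentum, lr, mu, weight_decay, gamma):
    """One Muon-style update: momentum, orthogonalise, then a decayed, rescaled step."""
    momentum = [[mu * m + g for m, g in zip(mr, gr)] for mr, gr in zip(momentum, grad)]
    update = newton_schulz_orthogonalise(momentum)
    w = [[wv * (1.0 - lr * weight_decay) - lr * gamma * uv for wv, uv in zip(wr, ur)]
         for wr, ur in zip(w, update)]
    return w, momentum


grad = [[3.0, 1.0], [1.0, 2.0]]
small = newton_schulz_orthogonalise(grad)
large = newton_schulz_orthogonalise([[1000.0 * v for v in row] for row in grad])
print([[round(v, 4) for v in row] for row in small])
```

Scaling the gradient by 1000 leaves the orthogonalised update essentially unchanged, which is the scale independence the comparison with Adam refers to.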

9. V4-Pro vs V4-Flash

DeepSeek released two V4 variants alongside each other. V4-Flash is a smaller model optimised for throughput and latency; V4-Pro is the flagship. Both share the 1M context window and all architectural innovations. Specs below are from the official HuggingFace model pages.

Spec	V4-Pro	V4-Flash
Total params	1.6T	284B
Active params	49B	13B
Context length	1M	1M
Precision	FP4 + FP8	FP4 + FP8
License	MIT	MIT
LiveCodeBench Max (Pass@1)	93.5	91.6
Codeforces Max (Rating)	3206	3052
SWE Verified Max (Resolved)	80.6%	79.0%
SWE Verified Non-Think	73.6%	73.7%
MMLU-Pro Max (EM)	87.5	86.2
GPQA Diamond Max (Pass@1)	90.1	88.1
HLE Max (Pass@1)	37.7	34.8
MRCR 1M Max (MMR)	83.5	78.7

Source: official HuggingFace READMEs for DeepSeek-V4-Pro and DeepSeek-V4-Flash. Training hardware is not stated in either README or the paper.

10. Key Results

DeepSeek-V4-Pro-Max benchmark comparison: SimpleQA Verified, HLE, Apex Shortlist, Codeforces, SWE Verified, Terminal Bench 2.0, Toolathlon

Figure from official HuggingFace model page — DeepSeek-V4-Pro-Max vs Claude Opus 4.6 Max, GPT-5.4 xHigh, Gemini-3.1-Pro High across Knowledge & Reasoning and Agentic benchmarks.

All benchmark numbers below are from the official README (huggingface.co/deepseek-ai/DeepSeek-V4-Pro), which mirrors the tables in the technical report. V4-Pro Max = maximum Think mode.

27%

single-token inference FLOPs vs V3.2

10%

KV cache size at 1M context vs V3.2

Source: paper Figure 1 (right panel).

Coding & Math (V4-Pro Max vs frontier, from README)

Benchmark	V4-Pro Max	Claude Opus 4.6	GPT-5.4	Gemini-3.1-Pro
LiveCodeBench (Pass@1)	93.5 ★	88.8	—	91.7
Codeforces (Rating)	3206 ★	—	3168	3052
SWE Verified (Resolved)	80.6	80.8 ★	—	80.6
IMOAnswerBench (Pass@1)	89.8	75.3	91.4 ★	81.0
HMMT 2026 Feb (Pass@1)	95.2	96.2	97.7 ★	94.7
Apex Shortlist (Pass@1)	90.2 ★	85.9	78.1	89.1
MMLU-Pro (EM)	87.5	89.1	87.5	91.0 ★
GPQA Diamond (Pass@1)	90.1	91.3	93.0	94.3 ★

★ = best in row. Source: official README, DeepSeek-V4-Pro HuggingFace page.

11. Why It Matters

Efficiency: frontier quality at a fraction of the inference cost

27% inference FLOPs and 10% KV cache vs DeepSeek-V3.2 at equivalent quality is a significant systems engineering advance. It enables true 1M-token serving — full codebases, book-length documents — at deployable cost.

Technical: 10× smaller KV cache at frontier quality

CSA/HCA cut the KV cache roughly 10× and DSA cuts long-context compute roughly 2×, at equivalent or better quality. Beyond full codebases and book-length documents, this makes multi-session memory over a 1M-token window practical to serve.

Architectural: Engram as a new primitive

Separating static knowledge lookup from dynamic FFN computation is a conceptual advance that could become a standard LLM component — analogous to how MoE evolved from a niche technique to a universal practice. If Engram generalises, all frontier models may adopt it.

Open: MIT license, 1.6T weights public

Releasing a 1.6T-parameter frontier model under the MIT license puts extreme capability in the hands of the research community. The full model weights (865 GB across 64 SafeTensors files) are publicly available on HuggingFace.

12. Related Papers

GRPO

DeepSeek's RL algorithm used in V4 post-training: group-relative rewards, no separate critic network.

Attention Is All You Need

The Transformer that CSA, HCA, and DSA all build on. V4's compressed attention is a direct extension of scaled dot-product attention.

FlashAttention

IO-aware attention kernel that V4's sparse patterns rely on; tiling and kernel fusion are prerequisites for efficient sparse attention at scale.

Scaling Laws

Motivated V4's 32T+ token pre-training budget for a 1.6T-parameter model (Chinchilla-style optimal compute allocation).

13. Additional Resources

DeepSeek-V4 Technical Report (PDF): HuggingFace
DeepSeek-V4-Flash Model Page: HuggingFace
Engram GitHub Repository: Conditional Memory via Scalable Lookup
Engram arXiv 2601.07372: Formal paper on conditional memory
mHC arXiv 2512.24880: Manifold-Constrained Hyper-Connections
DeepSeek-V3 arXiv 2512.02556: V4's predecessor (MoE load balancing, multi-token prediction)