DeepSeek-V4 Technical Report

DeepSeek AI · 2026 · Technical Report (HuggingFace)

TL;DR

DeepSeek-V4-Pro is a 1.6T-parameter MoE model (49B active) pre-trained on 32T+ tokens. Key innovations: (1) CSA/HCA compressed attention, which needs only 10% of the KV cache and 27% of the inference FLOPs of DeepSeek-V3.2 at 1M context; (2) mHC, Birkhoff-constrained residual connections for stable trillion-scale training; (3) the Muon optimizer, which replaces Adam and gives updates a consistent spectral norm. In Think Max mode it achieves 80.6% on SWE Verified, 93.5 on LiveCodeBench, and a Codeforces rating of 3206.

DeepSeek Architecture Evolution
DeepSeek-V2 (2024): introduced MLA (Multi-head Latent Attention) to compress the KV cache via low-rank projection.
DeepSeek-V3 (2025, 671B), which scales V2: auxiliary-loss-free MoE load balancing plus multi-token prediction; trained on 14.8T tokens.
V4 adds:
V4 Innovation 1+2 (CSA/HCA + DSA): two-level attention overhaul; 10× KV cache reduction plus 2× long-context compute reduction.
V4 Innovation 3 (Engram Memory): O(1) hash-based static knowledge store; decouples factual recall from FFN reasoning.
V4 Innovation 4 (mHC): Birkhoff-constrained residual streams via Sinkhorn-Knopp projection; stable gradient flow at 1.6T.
Result, DeepSeek-V4-Pro (2026): 1.6T total / 49B active · 1M context · 32T+ tokens · MIT license.
Highlights: 3% of parameters active per token · 10× smaller KV cache vs V3 · No Nvidia GPUs needed.

1. Background: Two Scaling Walls

Frontier LLMs hit two fundamental walls as context grows. The memory wall: the KV cache for a 1M-token context can exceed 100 GB, bottlenecking batch size and GPU utilisation. The compute wall: dense attention is O(L²) in sequence length, so at 1M tokens standard attention is infeasible in practice.

DeepSeek-V4 attacks both walls simultaneously with two attention innovations (CSA/HCA and DSA), then adds Engram memory to make 1M-context knowledge retrieval reliable, and mHC to keep a 1.6T-parameter model stable during training.

2. Foundation: MoE at 1.6T / 49B Active

V4-Pro uses Mixture-of-Experts to keep per-token compute feasible at 1.6T total parameters. Each token activates only its top-K expert FFNs, roughly 3% of all weights, and leaves the rest idle. The routing is learned end-to-end.

MoE routing: weighted sum of selected experts

\[ \text{MoE}(x) = \sum_{i \in \text{Top-K}(x)} g_i(x) \cdot E_i(x), \quad g_i(x) = \frac{e^{s_i(x)}}{\sum_{j \in \text{Top-K}(x)} e^{s_j(x)}} \]

$x \in \mathbb{R}^d$: input token hidden state
$\text{Top-K}(x)$: indices of the K experts with the highest routing scores; K=2 in V4 (out of 64 routed experts per layer)
$s_i(x)$: routing score for expert i, computed by a small gating network
$g_i(x)$: gating weight, a softmax re-normalised over the top-K selected experts only
$E_i(x)$: output of expert i (a 2-layer FFN with its own weights)
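The routing formula above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function `moe_forward`, the helper `make_expert`, the toy experts, and the score values are all hypothetical, and the routing scores are passed in directly rather than produced by a gating network.

```python
import math

def moe_forward(x, router_scores, experts, k=2):
    """Top-k MoE: softmax re-normalised over the k selected experts only."""
    # Indices of the k experts with the highest routing scores s_i(x).
    top = sorted(range(len(router_scores)),
                 key=lambda i: router_scores[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected experts.
    m = max(router_scores[i] for i in top)
    exps = {i: math.exp(router_scores[i] - m) for i in top}
    z = sum(exps.values())
    gates = {i: exps[i] / z for i in top}
    # Weighted sum of the selected experts' outputs; the other experts never run.
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)
        for d in range(len(x)):
            out[d] += gates[i] * y[d]
    return out, gates

def make_expert(scale):
    # Hypothetical stand-in for an expert FFN: scales its input.
    return lambda v: [scale * t for t in v]

experts = [make_expert(i + 1) for i in range(8)]        # toy setup: 8 experts
scores = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.0]     # made-up routing scores
out, gates = moe_forward([1.0, -2.0], scores, experts, k=2)
```

Because the softmax runs over the selected experts only, the gating weights always sum to 1 regardless of how many experts exist in the layer.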

[Interactive figure: MoE Routing, 8 Experts, Top-2 Active. Expert labels: E1 Code, E2 Math, E3 Science, E4 Lang, E5 Logic, E6 Fact, E7 Style, E8 Misc. 2 of 8 experts are active for each token.]

Why 1.6T total but only 49B active? MoE separates model capacity from inference cost. The 1.6T parameters store a vast library of specialised knowledge; the 49B parameters active per token are what you actually pay for at inference time. Compare this with a dense 49B model: same inference cost, but roughly 32× less knowledge capacity.
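The two ratios quoted in this section follow directly from the reported parameter counts. The short check below uses only the 1.6T and 49B figures from the text; the variable names are illustrative.

```python
TOTAL_PARAMS = 1.6e12   # total parameters reported for V4-Pro
ACTIVE_PARAMS = 49e9    # parameters active per token

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS   # share of weights used per token
capacity_ratio = TOTAL_PARAMS / ACTIVE_PARAMS    # capacity relative to a dense 49B model

print(f"active fraction: {active_fraction:.1%}")          # prints "active fraction: 3.1%"
print(f"capacity vs dense 49B: {capacity_ratio:.1f}x")    # prints "capacity vs dense 49B: 32.7x"
```

For comparison, selecting 2 of 64 routed experts is 3.125% of the routed experts in a layer, which is in the same range as the overall active fraction.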

3. Innovation 1: Compressed Attention (CSA + HCA)

Standard multi-head attention caches one key and one value vector per head per token. V4 replaces this with a two-level scheme: CSA (Compressed Sparse Attention) for medium-range context and HCA (Heavily Compressed Attention) for ultra-long-range layers, each with a different learned compression ratio.

CSA: compressed KV projection (schematic; V4 paper not yet public)

\[ \tilde{K}_l = W_K^c \cdot X, \quad \tilde{V}_l = W_V^c \cdot X, \quad d_c \ll d_h \cdot n_h \]

KV Cache Size: V3 vs V4-Pro

[Interactive chart: context length selectable from 8K, 32K, 128K, 512K, 1M.] At 128K context: DeepSeek-V3 1.0 GB, DeepSeek-V4-Pro 0.10 GB. V4 saves 0.9 GB, a 10× reduction.

Approximate estimates based on reported 10% KV cache ratio (CSA/HCA vs V3 MLA).
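The estimates can be reproduced with a small calculation. This is a rough sketch under two assumptions that are not stated in the source: that KV cache size grows linearly with context length, and that the reported 10% ratio applies uniformly at every context length. The 1.0 GB baseline at 128K is the approximate figure quoted above, and `kv_cache_gb` is a hypothetical helper name.

```python
V3_GB_AT_128K = 1.0   # approximate baseline quoted above for DeepSeek-V3 at 128K
V4_RATIO = 0.10       # reported KV cache ratio for V4-Pro

def kv_cache_gb(context_tokens, ratio=1.0):
    """Estimated KV cache in GB, assuming linear growth with context length."""
    return V3_GB_AT_128K * (context_tokens / (128 * 1024)) * ratio

# Context lengths offered by the chart (K = 1024 tokens is an assumption).
for label, tokens in [("8K", 8 * 1024), ("32K", 32 * 1024), ("128K", 128 * 1024),
                      ("512K", 512 * 1024), ("1M", 1024 * 1024)]:
    v3 = kv_cache_gb(tokens)
    v4 = kv_cache_gb(tokens, V4_RATIO)
    print(f"{label:>4}: V3 ~{v3:.3f} GB, V4-Pro ~{v4:.3f} GB, saved ~{v3 - v4:.3f} GB")
```

Under these assumptions the absolute saving grows with context length while the ratio stays fixed at 10×.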

4. Innovation 2: DeepSeek Sparse Attention (DSA)

Even after compressing the KV cache, the attention computation itself is still O(L²). DSA adds a second efficiency layer: instead of attending to all L tokens, each query attends only to the top-k most relevant tokens, selected by a lightweight Lightning Indexer. This reduces attention cost from O(L²) to O(L·k).

Dense attention: O(L²). Every query attends to all L tokens.
DSA (sparse): O(L·k). Each query selects the top-k relevant tokens.

The Lightning Indexer is a small, fast network (FP8 precision, few attention heads) that scores each token's relevance to the query. It runs before the main attention layer and returns the k token indices to attend to. Result: ~1.5× per-layer speedup and ~2× end-to-end GPU cost reduction at 100K+ tokens.
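A minimal single-query sketch of this two-stage pattern is shown below. It is not DeepSeek's kernel: the indexer is replaced by a plain dot product over small, hypothetical index vectors (`index_q`, `index_keys`), causal masking is omitted, and everything runs in ordinary floating point rather than FP8.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sparse_attention(q, keys, values, index_q, index_keys, k):
    """Attend only to the k tokens that a cheap indexer scores highest."""
    # 1) Indexer pass: one cheap relevance score per token.
    idx_scores = [dot(index_q, ik) for ik in index_keys]
    selected = sorted(range(len(keys)),
                      key=lambda t: idx_scores[t], reverse=True)[:k]
    # 2) Main attention restricted to the selected tokens (k terms, not L).
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[t]) * scale for t in selected]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    out = [0.0] * len(values[0])
    for w, t in zip(weights, selected):
        for d in range(len(out)):
            out[d] += (w / z) * values[t][d]
    return out, sorted(selected)

# Toy data: 4 tokens, 2-dim keys, 1-dim values, 1-dim index vectors.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
index_keys = [[0.9], [0.1], [0.8], [-0.5]]
out, picked = sparse_attention([1.0, 0.0], keys, values, [1.0], index_keys, k=2)
```

In this sketch the indexer still touches every token once per query; the saving comes from running the full-width attention over only k tokens. Setting k equal to the sequence length recovers dense attention.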

DSA vs CSA/HCA: what's the difference? CSA/HCA reduce memory by compressing the KV representation (fewer bits per token). DSA reduces compute by sparsifying the attention pattern (fewer tokens attended to). The two are orthogonal, and V4 applies both.

5. Innovation 3: Engram (Conditional Memory)

Transformers store factual knowledge in FFN weights, but this conflates two different tasks: dynamic reasoning (which needs the full network) and static fact retrieval (which is just a lookup). Engram (arXiv 2601.07372) adds a separate O(1) hash-based memory module alongside the FFN.

Engram memory retrieval (schematic, from arXiv 2601.07372)

\[ m(x) = \mathcal{M}[\text{hash}(W_q x)], \quad h' = h + \alpha \cdot m(x) \]

$W_q x$: linear query projection; maps the hidden state to a low-dimensional lookup key
$\text{hash}(\cdot)$: locality-sensitive hash; similar queries map to nearby addresses, which enables approximate nearest-neighbour retrieval
$\mathcal{M}[\cdot]$: static memory table; a large lookup table of knowledge vectors fixed after pre-training
$\alpha$: learned blending coefficient; controls how much memory to mix in. The optimal allocation is ~20 to 25% memory vs 75 to 80% FFN
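The schematic formula can be illustrated with a toy lookup. The sketch below follows only the formula as written here, not the implementation in the Engram paper: the sign-random-projection hash, the table size, the random table contents, and the value alpha=0.2 are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random

def matvec(W, x):
    return [sum(w * t for w, t in zip(row, x)) for row in W]

def lsh_address(key, planes):
    """Sign-random-projection hash: one address bit per hyperplane."""
    addr = 0
    for plane in planes:
        bit = 1 if sum(p * t for p, t in zip(plane, key)) >= 0 else 0
        addr = (addr << 1) | bit
    return addr

def engram_step(h, x, W_q, planes, table, alpha):
    """h' = h + alpha * M[hash(W_q x)]: a single table read per token."""
    key = matvec(W_q, x)                       # low-dimensional lookup key
    m = table[lsh_address(key, planes)]        # O(1) in the table size
    return [hi + alpha * mi for hi, mi in zip(h, m)]

rng = random.Random(0)
d, d_key, n_bits = 4, 3, 5                     # toy sizes (assumed)
W_q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d_key)]
planes = [[rng.gauss(0, 1) for _ in range(d_key)] for _ in range(n_bits)]
table = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(2 ** n_bits)]

x = [0.5, -1.0, 0.25, 2.0]
h_new = engram_step(h=x, x=x, W_q=W_q, planes=planes, table=table, alpha=0.2)
```

The cost of the lookup depends on the key dimension and the number of hash bits, not on how many rows the table holds, which is the sense in which the retrieval is O(1).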

Needle-in-a-Haystack (NiAH) accuracy at 1M tokens: 84.2% without Engram → 97.0% with Engram (+12.8pp).

Task: recall a specific fact injected at a random position in a 1M-token document.

6. Innovation 4: Manifold-Constrained Hyper-Connections (mHC)

Standard residual connections use a fixed identity skip: h_out = h_in + F(h_in). Hyper-Connections (HC) generalise this with learnable mixing matrices across multiple residual streams. mHC (arXiv 2512.24880) adds a hard constraint: the residual mixer must be a doubly stochastic matrix (Birkhoff polytope), preventing any stream from amplifying signal.

mHC layer update (Eq. 1 from arXiv 2512.24880)

\[ \mathbf{x}_{l+1} = \mathcal{H}_l^{\mathrm{res}}\mathbf{x}_l + (\mathcal{H}_l^{\mathrm{post}})^\top \mathcal{F}(\mathcal{H}_l^{\mathrm{pre}}\mathbf{x}_l,\, \mathcal{W}_l) \]

$\mathbf{x}_l \in \mathbb{R}^{n \times C}$: expanded hidden state; n parallel residual streams of dimension C (n=4 typically)
$\mathcal{H}_l^{\mathrm{res}} \in \mathbb{R}^{n \times n}$: residual stream mixer; routes information between streams and is constrained to the Birkhoff polytope
$\mathcal{H}_l^{\mathrm{pre}},\, \mathcal{H}_l^{\mathrm{post}} \in \mathbb{R}^{1 \times n}$: input/output projections; compress n streams → 1 before the layer, expand 1 → n after the layer
$\mathcal{F}(\cdot,\, \mathcal{W}_l)$: layer function; attention or FFN with weights $\mathcal{W}_l$

Birkhoff polytope constraint (doubly stochastic)

\[ \mathcal{H}_l^{\mathrm{res}} \in \left\{ M \in \mathbb{R}^{n \times n} \;\middle|\; M\mathbf{1}_n = \mathbf{1}_n,\; \mathbf{1}_n^\top M = \mathbf{1}_n^\top,\; M \geq 0 \right\} \]
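One common way to obtain a doubly stochastic matrix from unconstrained parameters is Sinkhorn-Knopp normalisation: exponentiate, then alternately normalise rows and columns. The sketch below assumes that parameterisation; the logits, the iteration count, and the helper names `sinkhorn_knopp` and `mix_streams` are illustrative and not taken from the mHC paper.

```python
import math

def sinkhorn_knopp(logits, n_iters=100):
    """Approximately map a square matrix of logits into the Birkhoff polytope."""
    # exp() makes every entry strictly positive, so M >= 0 holds by construction.
    M = [[math.exp(v) for v in row] for row in logits]
    n = len(M)
    for _ in range(n_iters):
        for row in M:                                  # rows sum to 1
            s = sum(row)
            for j in range(n):
                row[j] /= s
        for j in range(n):                             # columns sum to 1
            s = sum(M[i][j] for i in range(n))
            for i in range(n):
                M[i][j] /= s
    return M

def mix_streams(H_res, streams):
    """Residual mixing: output stream i is sum_j H_res[i][j] * streams[j]."""
    C = len(streams[0])
    return [[sum(H_res[i][j] * streams[j][c] for j in range(len(streams)))
             for c in range(C)] for i in range(len(H_res))]

H_res = sinkhorn_knopp([[0.3, -1.2, 0.8, 0.0],
                        [1.5, 0.2, -0.4, 0.9],
                        [-0.7, 0.6, 0.1, -1.1],
                        [0.0, 1.0, -0.5, 0.4]])
streams = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]   # n=4 streams, C=2
mixed = mix_streams(H_res, streams)
```

Because every column of the mixer sums to 1, the total signal summed across streams is unchanged by mixing: the mixer can redistribute signal between streams but cannot amplify it.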

7. Training Details

The following specs are sourced from the official HuggingFace model page (deepseek-ai/DeepSeek-V4-Pro) and the paper. Training hardware is not stated in the paper.

Pre-training tokens: 32T+
Total parameters: 1.6T
Active parameters: 49B
Context length: 1M
Precision: FP4 + FP8
License: MIT

Precision: MoE expert parameters use FP4; most other parameters use FP8. Post-training follows a two-stage paradigm: Stage 1 cultivates domain-specific experts via SFT and RL with GRPO; Stage 2 consolidates them into a unified model through on-policy distillation.

V4 supports three reasoning modes: Non-Think (fast responses), Think High (deliberate reasoning), and Think Max (maximum reasoning effort, requires ≥384K context).

8. Training Algorithm: Muon Optimizer

V4 uses the Muon optimizer in place of Adam. The key difference: instead of tracking per-parameter gradient magnitudes, Muon applies a Newton-Schulz orthogonalisation to the gradient matrix, producing updates with a consistent spectral norm regardless of gradient scale. This improves training stability at 1.6T-parameter scale.

Algorithm 1: Muon Optimizer (verbatim from the DeepSeek-V4 technical report)

[Interactive algorithm listing; only the Require line is preserved here.]

Require: learning rate η, momentum μ, weight decay λ, rescaling factor γ

vs Adam: Adam tracks gradient magnitude per parameter (scale-dependent). Muon orthogonalises the gradient matrix (scale-independent), so every weight matrix gets an update of consistent spectral norm, which prevents large matrices from dominating training dynamics at 1.6T scale.
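To make the scale-independence concrete, here is a minimal sketch of a Muon-style step for a single weight matrix. It is not the report's Algorithm 1. Two things are swapped in or assumed: the orthogonalisation uses the classic cubic Newton-Schulz iteration with many steps (practical Muon implementations typically use a tuned higher-order polynomial with only a few steps), and the way the rescaling factor gamma and the weight decay enter the update is a guess. All function names are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def newton_schulz(G, steps=30, eps=1e-12):
    """Approximate the orthogonal polar factor of G via X <- 1.5 X - 0.5 (X X^T) X."""
    # Frobenius normalisation puts every singular value in (0, 1], where the
    # iteration converges towards singular values of exactly 1.
    norm = math.sqrt(sum(v * v for row in G for v in row)) + eps
    X = [[v / norm for v in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X

def muon_step(W, G, M, lr, mu, weight_decay, gamma):
    """One Muon-style update (assumed form) for weight matrix W with gradient G."""
    # Momentum accumulation on the raw gradient.
    M = [[mu * m + g for m, g in zip(rm, rg)] for rm, rg in zip(M, G)]
    # Orthogonalise: the update direction no longer depends on gradient scale.
    O = newton_schulz(M)
    # Decoupled weight decay plus the rescaled orthogonal update.
    W = [[w * (1.0 - lr * weight_decay) - lr * gamma * o for w, o in zip(rw, ro)]
         for rw, ro in zip(W, O)]
    return W, M

G = [[2.0, 0.0, 1.0], [0.0, 3.0, 1.0]]
O = newton_schulz(G)
O_big = newton_schulz([[1000.0 * v for v in row] for row in G])   # same direction
W0 = [[0.0] * 3 for _ in range(2)]
W1, M1 = muon_step(W0, G, [[0.0] * 3 for _ in range(2)],
                   lr=0.1, mu=0.9, weight_decay=0.0, gamma=1.0)
```

Multiplying the gradient by 1000 leaves the orthogonalised update essentially unchanged, which is the property the comparison with Adam above refers to.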

9. V4-Pro vs V4-Flash

DeepSeek released two V4 variants alongside each other. V4-Flash is a smaller model optimised for throughput and latency; V4-Pro is the flagship. Both share the 1M context window and all architectural innovations. Specs below are from the official HuggingFace model pages.

Spec | V4-Pro | V4-Flash
Total params | 1.6T | 284B
Active params | 49B | 13B
Context length | 1M | 1M
Precision | FP4 + FP8 | FP4 + FP8
License | MIT | MIT
LiveCodeBench Max (Pass@1) | 93.5 | 91.6
Codeforces Max (Rating) | 3206 | 3052
SWE Verified Max (Resolved) | 80.6% | 79.0%
SWE Verified Non-Think | 73.6% | 73.7%
MMLU-Pro Max (EM) | 87.5 | 86.2
GPQA Diamond Max (Pass@1) | 90.1 | 88.1
HLE Max (Pass@1) | 37.7 | 34.8
MRCR 1M Max (MMR) | 83.5 | 78.7

Source: official HuggingFace READMEs for DeepSeek-V4-Pro and DeepSeek-V4-Flash. Training hardware is not stated in either README or the paper.

10. Key Results

[Figure: DeepSeek-V4-Pro-Max benchmark comparison on SimpleQA Verified, HLE, Apex Shortlist, Codeforces, SWE Verified, Terminal Bench 2.0, and Toolathlon.]

Figure from the official HuggingFace model page: DeepSeek-V4-Pro-Max vs Claude Opus 4.6 Max, GPT-5.4 xHigh, and Gemini-3.1-Pro High across Knowledge & Reasoning and Agentic benchmarks.

All benchmark numbers below are from the official README (huggingface.co/deepseek-ai/DeepSeek-V4-Pro), which mirrors the tables in the technical report. V4-Pro Max = maximum Think mode.

Single-token inference FLOPs vs V3.2: 27%
KV cache size at 1M context vs V3.2: 10%

Source: paper Figure 1 (right panel).

Coding & Math (V4-Pro Max vs frontier, from README)

Benchmark | V4-Pro Max | Claude Opus 4.6 | GPT-5.4 | Gemini-3.1-Pro
LiveCodeBench (Pass@1) | 93.5 ★ | 88.8 | n/a | 91.7
Codeforces (Rating) | 3206 ★ | n/a | 3168 | 3052
SWE Verified (Resolved) | 80.6 | 80.8 ★ | n/a | 80.6
IMOAnswerBench (Pass@1) | 89.8 | 75.3 | 91.4 ★ | 81.0
HMMT 2026 Feb (Pass@1) | 95.2 | 96.2 | 97.7 ★ | 94.7
Apex Shortlist (Pass@1) | 90.2 ★ | 85.9 | 78.1 | 89.1
MMLU-Pro (EM) | 87.5 | 89.1 | 87.5 | 91.0 ★
GPQA Diamond (Pass@1) | 90.1 | 91.3 | 93.0 | 94.3 ★

★ = best in row; n/a = no score listed. Source: official README, DeepSeek-V4-Pro HuggingFace page.

11. Why It Matters

Efficiency: frontier quality at a fraction of the inference cost

Needing only 27% of the inference FLOPs and 10% of the KV cache of DeepSeek-V3.2 at equivalent quality is a significant systems engineering advance. It enables true 1M-token serving (full codebases, book-length documents) at deployable cost.

Technical: 10× smaller KV cache at frontier quality

CSA, HCA, and DSA together deliver a 10× KV cache reduction and a 2× long-context compute reduction at equivalent or better quality. That makes uses such as multi-session memory over 1M-token contexts practical to serve.

Architectural: Engram as a new primitive

Separating static knowledge lookup from dynamic FFN computation is a conceptual advance that could become a standard LLM component, much as MoE evolved from a niche technique into a universal practice. If Engram generalises, all frontier models may adopt it.

Open: MIT license, 1.6T weights public

Releasing a 1.6T-parameter frontier model under the MIT license puts extreme capability in the hands of the research community. The full model weights (865 GB across 64 SafeTensors files) are publicly available on HuggingFace.

12. Related Papers

GRPO

DeepSeek's RL algorithm used in V4 post-training: group-relative rewards, no separate critic network.

Attention Is All You Need

The Transformer that CSA/HCA/DSA all build on. V4's compressed attention is a direct extension of scaled dot-product attention.

FlashAttention

IO-aware attention kernel that V4's sparse patterns rely on; tiling and kernel fusion are prerequisites for efficient sparse attention at scale.

Scaling Laws

Motivated V4's 32T+ token pre-training budget for a 1.6T-parameter model, following Chinchilla-style optimal compute allocation.

13. Additional Resources