TL;DR
DeepSeek-V4-Pro is a 1.6T-parameter MoE model (49B active) pre-trained on 32T+ tokens. Key innovations: (1) CSA/HCA compressed attention, which needs only 10% of the KV cache and 27% of the inference FLOPs of DeepSeek-V3.2 at 1M context; (2) mHC Birkhoff-constrained residual connections for stable trillion-scale training; (3) the Muon optimizer replacing Adam, giving updates with a consistent spectral norm. It achieves 80.6% on SWE Verified, 93.5 on LiveCodeBench, and a 3206 Codeforces rating (Think Max mode).
1. Background: Two Scaling Walls
Frontier LLMs hit two fundamental walls as context grows. The memory wall: KV cache for a 1M-token context can exceed 100 GB, bottlenecking batch size and GPU utilisation. The compute wall: dense attention is O(L²) in sequence length, so at 1M tokens standard attention is simply infeasible.
DeepSeek-V4 attacks both walls simultaneously with two attention innovations (CSA/HCA and DSA), then adds Engram memory to make 1M-context knowledge retrieval reliable, and mHC to keep a 1.6T-parameter model stable during training.
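The two walls above can be made concrete with back-of-the-envelope arithmetic. The sketch below uses a hypothetical dense-attention configuration (60 layers, 8 KV heads of dimension 128, 2-byte elements); these numbers are assumptions for illustration, not the architecture of any DeepSeek model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key vector and one value vector per head, per layer, per token.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical dense-attention configuration, chosen only for illustration.
seq_len = 1_000_000
total = kv_cache_bytes(seq_len, n_layers=60, n_kv_heads=8, head_dim=128)
print(f"KV cache: {total / 1e9:.1f} GB")  # 245.8 GB for this assumed config

# Dense attention scores every (query, key) pair, per head and per layer.
pairs = seq_len ** 2
print(f"query-key pairs: {pairs:.1e}")  # 1.0e+12
```

Memory grows linearly with sequence length while the number of query-key pairs grows quadratically, which is why the two problems call for different remedies.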
2. Foundation: MoE at 1.6T / 49B Active
V4-Pro uses Mixture-of-Experts to keep per-token compute feasible at 1.6T total parameters. Each token activates only the top-K expert FFN layers, roughly 3% of all weights, leaving the rest idle. The routing is learned end-to-end.
Figure: MoE routing with 8 experts, top-2 active. Expert labels: E1 Code, E2 Math, E3 Science, E4 Lang, E5 Logic, E6 Fact, E7 Style, E8 Misc.
Why 1.6T total but only 49B active? MoE separates model capacity from inference cost. The 1.6T parameters store a vast library of specialised knowledge; the 49B parameters active per token are what you actually pay for at inference time. A dense 49B model would have the same per-token inference cost but roughly 32× fewer parameters in which to store knowledge.
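Top-k routing as described above can be sketched in a few lines. This is a minimal illustration with 8 toy experts and top-2 selection; the softmax-over-selected-logits gating, the function names, and the toy experts are assumptions, since the source does not specify V4's router.

```python
import math


def top_k_route(router_logits, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    # Pick the k experts with the largest router logits (ties keep index order).
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the gate weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]


def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    out = [0.0] * len(x)
    for idx, gate in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


# Toy setup: 8 "experts" that simply scale the input by (index + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]
routes = top_k_route(logits, k=2)
y = moe_forward([1.0, -1.0], logits, experts, k=2)

# Fraction of weights touched per token, from the figures quoted in the text.
active_fraction = 49e9 / 1.6e12  # about 0.03
```

Only the two selected experts are evaluated, so compute scales with k rather than with the total number of experts.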
3. Innovation 1: Compressed Attention (CSA + HCA)
Standard multi-head attention caches one key and one value vector per head per token. V4 replaces this with a two-level scheme: CSA (Compressed Sparse Attention) for medium-range context, and HCA (Heavily Compressed Attention) for ultra-long-range layers, each with a different learned compression ratio.
Figure: KV cache size, V3 vs V4-Pro. Approximate estimates based on the reported 10% KV cache ratio (CSA/HCA vs V3 MLA).
4. Innovation 2: DeepSeek Sparse Attention (DSA)
Even after compressing the KV cache, the attention computation itself is still O(L²). DSA adds a second efficiency layer: instead of attending to all L tokens, each query selects only the top-k most relevant tokens via a lightweight Lightning Indexer, reducing attention from O(L²) to O(L·k).
| Attention | Cost | Pattern |
|---|---|---|
| Dense | O(L²) | Every query attends to all L tokens |
| DSA (sparse) | O(L·k) | Each query selects the top-k relevant tokens |
The Lightning Indexer is a small, fast network (FP8 precision, few attention heads) that scores each token's relevance to the query. It runs before the main attention layer and returns the k token indices to attend to. Result: ~1.5× per-layer speedup, ~2× end-to-end GPU cost reduction at 100K+ tokens.
DSA vs CSA/HCA: what's the difference? CSA/HCA reduce memory by compressing the KV representation (fewer bits per token). DSA reduces compute by sparsifying the attention pattern (fewer tokens attended to). They are orthogonal and both applied in V4.
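The select-then-attend pattern described in this section can be sketched for a single query as follows. The indexer here is a plain dot product over small vectors, standing in for the Lightning Indexer; that stand-in, the function names, and the toy data are assumptions, not the actual design.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_attention(q, keys, values, index_q, index_keys, k):
    """Attend to only the k tokens that a cheap indexer scores highest."""
    # Stage 1: cheap relevance scores. This still touches all L tokens,
    # but with small index vectors instead of full-width keys.
    scores = [dot(index_q, ik) for ik in index_keys]
    selected = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Stage 2: ordinary softmax attention over the k selected tokens only.
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in selected]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w / z * values[i][d] for w, i in zip(weights, selected)) for d in range(dim)]
    return out, sorted(selected)


# Toy example: 4 cached tokens, keep the 2 most relevant.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
index_keys = [[1.0], [0.2], [0.9], [-1.0]]
out, idx = sparse_attention([1.0, 0.0], keys, values, [1.0], index_keys, k=2)
```

The expensive stage costs O(k) per query instead of O(L), which is where the O(L·k) total comes from.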
5. Innovation 3: Engram β Conditional Memory
Transformers store factual knowledge in FFN weights, but this conflates two different tasks: dynamic reasoning (which needs the full network) and static fact retrieval (which is just a lookup). Engram (arXiv 2601.07372) adds a separate O(1) hash-based memory module alongside the FFN.
Needle-in-a-Haystack (NiAH) accuracy at 1M tokens

| Configuration | NiAH accuracy |
|---|---|
| Without Engram | 84.2% |
| With Engram | 97.0% (+12.8pp) |

Task: recall a specific fact injected at a random position in a 1M-token document.
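To illustrate what an O(1) hash-based memory means in practice, here is a minimal sketch. The source says only that Engram is a hash-based lookup module next to the FFN; keying on the trailing n-gram of token ids, the polynomial hash, and the class name `HashedNgramMemory` are all illustrative assumptions.

```python
class HashedNgramMemory:
    """Fixed-size table of vectors addressed by a hash of the last n token ids."""

    def __init__(self, num_slots, dim, n=2):
        self.num_slots, self.dim, self.n = num_slots, dim, n
        # In a real model these rows would be learned; zeros keep the sketch simple.
        self.table = [[0.0] * dim for _ in range(num_slots)]

    def slot(self, token_ids):
        # Deterministic polynomial hash of the trailing n-gram.
        # The cost depends on n, not on how long the context is.
        h = 0
        for t in token_ids[-self.n:]:
            h = (h * 1_000_003 + t) % self.num_slots
        return h

    def write(self, token_ids, vector):
        self.table[self.slot(token_ids)] = list(vector)

    def read(self, token_ids):
        return self.table[self.slot(token_ids)]


mem = HashedNgramMemory(num_slots=1024, dim=4)
mem.write([17, 42], [1.0, 2.0, 3.0, 4.0])
hit = mem.read([5, 9, 17, 42])   # same trailing bigram, longer context
miss = mem.read([42, 17])        # different bigram, different slot here
```

A lookup like this does no attention over the context at all, which is the sense in which static retrieval is separated from dynamic computation. Distinct n-grams can collide in a hashed table; how a real system handles that is outside this sketch.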
6. Innovation 4: Manifold-Constrained Hyper-Connections (mHC)
Standard residual connections use a fixed identity skip: h_out = h_in + F(h_in). Hyper-Connections (HC) generalise this with learnable mixing matrices across multiple residual streams. mHC (arXiv 2512.24880) adds a hard constraint: the residual mixer must be a doubly stochastic matrix (Birkhoff polytope), preventing any stream from amplifying signal.
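A doubly stochastic matrix has non-negative entries with every row and every column summing to 1. One standard way to obtain such a matrix from unconstrained parameters is Sinkhorn normalisation, sketched below; whether V4 uses exactly this procedure, and the iteration count, are assumptions.

```python
import math


def sinkhorn(logits, n_iters=100):
    """Map a square matrix of logits to an approximately doubly stochastic matrix."""
    # Exponentiate so every entry is strictly positive.
    m = [[math.exp(v) for v in row] for row in logits]
    n = len(m)
    for _ in range(n_iters):
        # Normalise each row to sum to 1, then each column to sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        col = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col[j] for j in range(n)] for i in range(n)]
    return m


logits = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5], [0.0, 1.0, 0.3]]
mix = sinkhorn(logits)

# Mix three residual streams (one scalar each, for simplicity).
streams = [4.0, -2.0, 1.0]
mixed = [sum(w * s for w, s in zip(row, streams)) for row in mix]
```

Because each output stream is a weighted average of the inputs (rows sum to 1) and the total is conserved (columns sum to 1), no stream can come out larger in magnitude than the largest input, which is the stability property the constraint is meant to provide.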
7. Training Details
| Spec | Value |
|---|---|
| Pre-training tokens | 32T+ |
| Total parameters | 1.6T |
| Active parameters | 49B |
| Context length | 1M |
| Precision | FP4 + FP8 |
| License | MIT |
Precision: MoE expert parameters use FP4; most other parameters use FP8. Post-training follows a two-stage paradigm: Stage 1 cultivates domain-specific experts via SFT and RL with GRPO; Stage 2 consolidates them into a unified model through on-policy distillation.
V4 supports three reasoning modes: Non-Think (fast responses), Think High (deliberate reasoning), and Think Max (maximum reasoning effort, requires ≥384K context).
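The "group-relative" part of GRPO, used in Stage 1 above, refers to scoring each sampled response against the other responses to the same prompt rather than against a learned critic. A minimal sketch of that normalisation follows; the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each response's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to one prompt: two judged correct (1.0), two incorrect (0.0).
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
flat = group_relative_advantages([1.0, 1.0])
```

Responses that beat their group average get a positive advantage and the rest a negative one; a group with identical rewards contributes no learning signal.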
8. Training Algorithm: Muon Optimizer
V4 uses the Muon optimizer in place of Adam. The key difference: instead of tracking per-parameter gradient magnitudes, Muon applies a Newton-Schulz orthogonalisation to the gradient matrix, producing updates with a consistent spectral norm regardless of gradient scale. This improves training stability at 1.6T-parameter scale.
Algorithm 1: Muon Optimizer, from the DeepSeek-V4 technical report.
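The orthogonalisation step can be sketched in dependency-free Python. The quintic coefficients and the five-step default below come from the widely used open-source Muon reference implementation; whether V4 uses the same values is an assumption, and momentum, learning rate, and weight decay are omitted.

```python
import math


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalise g by pushing its singular values towards 1."""
    a, b, c = 3.4445, -4.7750, 2.0315  # assumed: coefficients from the public Muon code
    flip = len(g) > len(g[0])          # work on the wide orientation so x @ x.T is small
    x = transpose(g) if flip else [row[:] for row in g]
    # Dividing by the Frobenius norm guarantees all singular values are at most 1.
    norm = math.sqrt(sum(v * v for row in x for v in row)) + eps
    x = [[v / norm for v in row] for row in x]
    for _ in range(steps):
        xxt = matmul(x, transpose(x))
        xxt2 = matmul(xxt, xxt)
        bmat = [[b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(xxt, xxt2)]
        bx = matmul(bmat, x)
        # x <- a*x + (b*xxt + c*xxt^2) @ x, an odd polynomial in the singular values.
        x = [[a * v + w for v, w in zip(r1, r2)] for r1, r2 in zip(x, bx)]
    return transpose(x) if flip else x


g = [[3.0, 0.0, 0.0], [0.0, 1.0, 0.0]]                   # singular values 3 and 1
out = newton_schulz(g)
out_scaled = newton_schulz([[1000.0 * v for v in row] for row in g])
```

The result is only approximately orthogonal, but it is nearly unchanged when the gradient is scaled by 1000, which is the scale-invariance property the paragraph above describes.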
9. V4-Pro vs V4-Flash
DeepSeek released two V4 variants alongside each other. V4-Flash is a smaller model optimised for throughput and latency; V4-Pro is the flagship. Both share the 1M context window and all architectural innovations. Specs below are from the official HuggingFace model pages.
| Spec | V4-Pro | V4-Flash |
|---|---|---|
| Total params | 1.6T | 284B |
| Active params | 49B | 13B |
| Context length | 1M | 1M |
| Precision | FP4 + FP8 | FP4 + FP8 |
| License | MIT | MIT |
| LiveCodeBench Max (Pass@1) | 93.5 | 91.6 |
| Codeforces Max (Rating) | 3206 | 3052 |
| SWE Verified Max (Resolved) | 80.6% | 79.0% |
| SWE Verified Non-Think | 73.6% | 73.7% |
| MMLU-Pro Max (EM) | 87.5 | 86.2 |
| GPQA Diamond Max (Pass@1) | 90.1 | 88.1 |
| HLE Max (Pass@1) | 37.7 | 34.8 |
| MRCR 1M Max (MMR) | 83.5 | 78.7 |
Source: official HuggingFace READMEs for DeepSeek-V4-Pro and DeepSeek-V4-Flash. Training hardware is not stated in either README or the paper.
10. Key Results

Figure from the official HuggingFace model page: DeepSeek-V4-Pro-Max vs Claude Opus 4.6 Max, GPT-5.4 xHigh, and Gemini-3.1-Pro High across Knowledge & Reasoning and Agentic benchmarks.
| Metric | V4-Pro relative to V3.2 |
|---|---|
| Single-token inference FLOPs | 27% |
| KV cache size at 1M context | 10% |

Source: paper Figure 1 (right panel).
Coding & Math (V4-Pro Max vs frontier, from README)
| Benchmark | V4-Pro Max | Claude Opus 4.6 | GPT-5.4 | Gemini-3.1-Pro |
|---|---|---|---|---|
| LiveCodeBench (Pass@1) | 93.5 ★ | 88.8 | - | 91.7 |
| Codeforces (Rating) | 3206 ★ | - | 3168 | 3052 |
| SWE Verified (Resolved) | 80.6 | 80.8 ★ | - | 80.6 |
| IMOAnswerBench (Pass@1) | 89.8 | 75.3 | 91.4 ★ | 81.0 |
| HMMT 2026 Feb (Pass@1) | 95.2 | 96.2 | 97.7 ★ | 94.7 |
| Apex Shortlist (Pass@1) | 90.2 ★ | 85.9 | 78.1 | 89.1 |
| MMLU-Pro (EM) | 87.5 | 89.1 | 87.5 | 91.0 ★ |
| GPQA Diamond (Pass@1) | 90.1 | 91.3 | 93.0 | 94.3 ★ |

★ = best in row. Source: official README, DeepSeek-V4-Pro HuggingFace page.
11. Why It Matters
Efficiency: frontier quality at a fraction of the inference cost
Running at 27% of the inference FLOPs and 10% of the KV cache of DeepSeek-V3.2 at equivalent quality is a significant systems engineering advance. It enables true 1M-token serving (full codebases, book-length documents) at deployable cost.
Technical: 10× smaller KV cache at frontier quality
Together, CSA, HCA, and DSA deliver a 10× KV cache reduction and a 2× long-context compute reduction at equivalent or better quality. That combination is what makes genuine 1M-token serving (full codebases, book-length documents, multi-session memory) deployable.
Architectural: Engram as a new primitive
Separating static knowledge lookup from dynamic FFN computation is a conceptual advance that could become a standard LLM component, analogous to how MoE evolved from a niche technique to a universal practice. If Engram generalises, all frontier models may adopt it.
Open: MIT license, 1.6T weights public
Releasing a 1.6T-parameter frontier model under the MIT license puts extreme capability in the hands of the research community. The full model weights (865 GB across 64 SafeTensors files) are publicly available on HuggingFace.
12. Related Papers
GRPO, DeepSeek's RL algorithm used in V4 post-training: group-relative rewards, no separate critic network.
The Transformer architecture that CSA, HCA, and DSA all build on. V4's compressed attention is a direct extension of scaled dot-product attention.
The IO-aware attention kernel that V4's sparse patterns rely on: tiling and kernel fusion are prerequisites for efficient sparse attention at scale.
Motivated V4's 32T+ token pre-training budget for a 1.6T-parameter model, following Chinchilla-style compute-optimal allocation.