DeepSeek-V4:
Towards Highly Efficient Million-Token Context Intelligence
DeepSeek-AI
research@
Abstract
We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-
Experts (MoE) language models — DeepSeek-V4-Pro with parameters (49B activated) and
DeepSeek-V4-Flash with 284B parameters (13B activated) — both supporting a context length of
one million tokens. DeepSeek-V4 series incorporate several key upgrades in architecture and op-
timization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA)
and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-
Constrained Hyper-Connections (mHC) that enhance conventional residual connections; (3)
and the Muon optimizer for faster convergence and greater training stability. We pre-train
both models on more than 32T diverse and high-quality tokens, followed by a comprehensive
post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-
Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for
open models, outperforming its predecessors in core tasks. Meanwhile, DeepSeek-V4 series are
highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-
V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared
with . This enables us to routinely support one-million-token contexts, thereby
making long-horizon tasks and further test-time scaling more feasible. The model checkpoints
are available at
SimpleQA
Verified
(Pass@1)
HLE
(Pass@1)
Apex
Shortlist
(Pass@1)
Codeforces
(Rating)
SWE
Verified
(Resolved)
Terminal
Bench
(Acc)
Toolathlon
(Pass@1)
0
20
40
60
80
100
A
cc
u
ra
cy
/
P
as
s@
1
(
%
)
32063168
3052
Knowledge & Reasoning Agentic Capabilities
DeepSeek-V4-Pro-Max -Max -xHigh -Pro-High
0 256 512 768 1024
Token Position (K)
S
in
g
le
-T
ok
en
F
L
O
P
s
(T
)
× lower
× lower
DeepSeek-V4-Pro
DeepSeek-V4-Flash
0 256 512 768 1024
Sequence Length (K)
0
10
20
30
40
50
A
cc
u
m
u
la
te
d
K
V
C
ac
h
e
(G
B
)
× smaller
× smaller
DeepSeek-V4-Pro
DeepSeek-V4-Flash
Figure 1 | Left: benchmark performance of DeepSeek-V4-Pro-Max and its counterparts. Right:
inference FLOPs and KV cache size of DeepSeek-V4 series and .
Contents
1 Introduction 4
2 Architecture 6
Designs Inherited from DeepSeek-V3 . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Manifold-Constrained Hyper-Connections . . . . . . . . . . . . . . . . . . . . . . 7
Hybrid Attention with CSA and HCA . . . . . . . . . . . . . . . . . . . . . . . . . 9
Compressed Sparse Attention . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Heavily Compressed Attention . . . . . . . . . . . . . . . . . . . . . . . . . 11
Other Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Efficiency Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Muon Optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 General Infrastructures 15
Fine-Grained Communication-Computation Overlap in Expert Parallelism . . . . 15
Flexible and Efficient Kernel Development with TileLang . . . . . . . . . . . . . . 16
High-Performance Batch-Invariant and Deterministic Kernel Libraries . . . . . . 18
FP4 Quantization-Aware Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Training Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Efficient Implementation of Muon . . . . . . . . . . . . . . . . . . . . . . . 20
Cost-Effective and Memory-Efficient Implementation of mHC . . . . . . . 21
Contextual Parallelism for Long-Context Attention . . . . . . . . . . . . . 21
Extended Automatic Differentiation for Flexible Activation Checkpointing 21
Inference Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
KV Cache Structure and Management . . . . . . . . . . . . . . . . . . . . . 22
On-Disk KV Cache Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4 Pre-Training 24
Data Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Pre-Training Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Model Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Training Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Mitigating Training Instability . . . . . . . . . . . . . . . . . . . . . . . . . 26
Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Evaluation Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2
5 Post-Training 29
Post-Training Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Specialist Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
On-Policy Distillation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
RL and OPD Infrastructures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
FP4 Quantization Integration . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Efficient Teacher Scheduling for Full-Vocabulary OPD . . . . . . . . . . . 34
Preemptible and Fault-Tolerant Rollout Service . . . . . . . . . . . . . . . 34
Scaling RL Framework for Million-Token Context . . . . . . . . . . . . . . 35
Sandbox Infrastructure for Agentic AI . . . . . . . . . . . . . . . . . . . . . 35
Standard Benchmark Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Evaluation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Performance on Real-World Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Chinese Writing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
White-Collar Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Code Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6 Conclusion, Limitations, and Future Directions 44
A Author List and Acknowledgment 54
Author List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
B Evaluation Details 55
3
1. Introduction
The emergence of reasoning models (DeepSeek-AI, 2025; OpenAI, 2024c) has established a
new paradigm of test-time scaling, driving substantial performance gains for Large Language
Models (LLMs). However, this scaling paradigm is fundamentally constrained by the quadratic
computational complexity of the vanilla attention mechanism (Vaswani et al., 2017), which
creates a prohibitive bottleneck for ultra-long contexts and reasoning processes. Concurrently,
the emergence of long-horizon scenarios and tasks — from complex agentic workflows to
massive cross-document analysis — has also made efficient support for ultra-long contexts
critical for future progress. While recent open-source efforts (Bai et al., 2025a; DeepSeek-AI,
2024; MiniMax, 2025; Qwen, 2025) have advanced general capabilities, this core architectural
inefficiency in handling ultra-long sequences remains a key impediment, limiting further gains
from test-time scaling and hindering further exploration into long-horizon scenarios and tasks.
In order to break the efficiency barrier in ultra-long contexts, we develop the DeepSeek-V4
series, including the preview versions of DeepSeek-V4-Pro with parameters (49B activated)
and DeepSeek-V4-Flash with 284B parameters (13B activated). Through architectural innova-
tions, DeepSeek-V4 series achieve a dramatic leap in computational efficiency for processing
ultra-long sequences. This breakthrough enables efficient support for a context length of one
million tokens, ushering in a new era of million-length contexts for next-generation LLMs. We
believe our capability to efficiently handle ultra-long sequences unlocks the next frontier of
test-time scaling, paves the way for deeper research into long-horizon tasks, and establishes a
necessary foundation for exploring future paradigms like online learning.
Compared with the DeepSeek-V3 architecture (DeepSeek-AI, 2024), DeepSeek-V4 series
retain the DeepSeekMoE framework (Dai et al., 2024) and Multi-Token Prediction (MTP) strategy,
while introducing several key innovations in architecture and optimization. To enhance long-
context efficiency, we design a hybrid attention mechanism combining Compressed Sparse
Attention (CSA) and Heavily Compressed Attention (HCA). CSA compresses the KV caches
along the sequence dimension and then performs DeepSeek Sparse Attention (DSA) (DeepSeek-
AI, 2025), whereas HCA applies more aggressive compression to the KV caches but keeps
dense attention. To strengthen modeling capability, we incorporate Manifold-Constrained
Hyper-Connections (mHC) (Xie et al., 2026) that upgrade conventional residual connections.
Additionally, we introduce the Muon (Jordan et al., 2024; Liu et al., 2025) optimizer to the
training of DeepSeek-V4 series, leading to faster convergence and improved training stability.
To enable efficient training and inference for DeepSeek-V4 series as well as productive de-
velopment, we introduce several infrastructure optimizations. First, we design and implement
a single fused kernel for MoE modules that fully overlaps computation, communication, and
memory access. Second, we employ TileLang (Wang et al., 2026), a Domain-Specific Language
(DSL) to balance development productivity and runtime efficiency. Third, we provide efficient
batch-invariant and deterministic kernel libraries to ensure bitwise reproducibility across train-
ing and inference. Fourth, we incorporate FP4 quantization-aware training for MoE expert
weights and the indexer QK path to reduce memory and computation. Fifth, for the training
framework, we extend the autograd framework with tensor-level checkpointing for fine-grained
recomputation control; and we enhance training efficiency with a hybrid ZeRO strategy for the
Muon optimizer, cost-effective mHC implementations via recomputation and fused kernels, and
two-stage contextual parallelism to manage compressed attention. Finally, for the inference
framework, we design a heterogeneous KV cache structure with on-disk storage strategies to
enable efficient shared-prefix reuse.
4
By employing hybrid CSA and HCA, along with precision optimizations on computation
and storage, DeepSeek-V4 series achieve significantly lower inference FLOPs and a substantially
reduced KV cache size compared with , especially in long-context settings. The
right part of Figure 1 demonstrates the estimated single-token inference FLOPs and accumulated
KV cache size of and DeepSeek-V4 series. In the scenario of 1M-token context,
even DeepSeek-V4-Pro, which has a larger number of activated parameters, attains only 27%
of the single-token FLOPs (measured in equivalent FP8 FLOPs) and 10% of the KV cache
size relative to . Furthermore, DeepSeek-V4-Flash, with its smaller number of
activated parameters, pushes efficiency even further: in the 1M-token context setting, it achieves
only 10% of the single-token FLOPs and 7% of the KV cache size compared with .
Additionally, for DeepSeek-V4 series, the routed expert parameters utilize FP4 precision. While
the peak FLOPs for FP4 × FP8 operations are currently the same as FP8 × FP8 on existing
hardware, they can theoretically be implemented to be 1/3 more efficient on future hardware,
which will further enhance the efficiency of DeepSeek-V4 series.
During pre-training, we train DeepSeek-V4-Flash on 32T tokens and DeepSeek-V4-Pro on 33T
tokens, respectively. After pre-training, these two models can natively and efficiently support
1M-length contexts. In our internal evaluations, DeepSeek-V4-Flash-Base already surpasses
-Base across a majority of benchmarks with its more parameter-efficient design.
DeepSeek-V4-Pro-Base further extends this advantage to set a new performance standard among
DeepSeek foundation models, achieving comprehensive superiority across reasoning, coding,
long-context, and world knowledge tasks.
The post-training pipeline of DeepSeek-V4 series features a two-stage paradigm: the inde-
pendent cultivation of domain-specific experts, followed by unified model consolidation via
on-policy distillation (Lu and Lab, 2025). Initially, for each target domain — such as mathematics,
coding, agent, and instruction following — a separate expert model is trained independently.
The base model first undergoes Supervised Fine-Tuning (SFT) on high-quality, domain-specific
data to establish foundational capabilities. Subsequently, Reinforcement Learning (RL) is ap-
plied using Group Relative Policy Optimization (GRPO) (DeepSeek-AI, 2025), which further
optimizes the model for domain-aligned behaviors guided by reward models tailored to specific
success criteria. This phase yields a diverse set of specialized experts, each excelling in its
respective field. Finally, to integrate these distinct proficiencies, a single unified model is trained
through on-policy distillation, wherein the unified model acts as the student learning to optimize
the reverse KL loss with teacher models.
Summary of Core Evaluation Results
• Knowledge: In assessments of broad world knowledge, DeepSeek-V4-Pro-Max, the maxi-
mum reasoning effort mode of DeepSeek-V4-Pro, significantly outperforms leading open-
source models on the SimpleQA (OpenAI, 2024d) and Chinese-SimpleQA (He et al., 2024)
benchmarks. Regarding educational knowledge — evaluated via MMLU-Pro (Wang et al.,
2024b), HLE (Phan et al., 2025), and GPQA (Rein et al., 2023) — DeepSeek-V4-Pro-Max
shows a marginal lead over its open-source counterparts. DeepSeek-V4-Pro-Max has
significantly closed the gap with the leading proprietary model, -Pro, despite
still trailing it in these knowledge-based evaluations.
• Reasoning: Through the expansion of reasoning tokens, DeepSeek-V4-Pro-Max demon-
strates superior performance relative to and -Pro on standard reasoning
benchmarks. Nevertheless, its performance falls marginally short of and Gemini-
-Pro, suggesting a developmental trajectory that trails state-of-the-art frontier models by
approximately 3 to 6 months. Furthermore, DeepSeek-V4-Flash-Max achieves comparable
5
Input Tokens
Embedding
CSA / HCA
Prediction Head
MTP Modules
LM Loss
MTP Loss
Residual Mixing
Pre-Block Mixing
Post-Block Mixing
Transformer Block ×𝐿𝐿
DeepSeekMoE
Residual Mixing
Pre-Block Mixing
Post-Block Mixing
Figure 2 | Overall architecture of DeepSeek-V4 series. We use hybrid CSA (Compressed Sparse
Attention) and HCA (Heavily Compressed Attention) for attention layers, DeepSeekMoE for
feed-forward layers, and strengthen conventional residual connections with mHC.
performance to and -Pro, establishing itself as a highly cost-effective
architecture for complex reasoning tasks.
• Agent: On public benchmarks, DeepSeek-V4-Pro-Max is on par with leading open-source
models, such as and , but slightly worse than frontier closed models.
In our internal evaluation, DeepSeek-V4-Pro-Max outperforms Claude Sonnet and
approaches the level of Opus .
• Long-Context: DeepSeek-V4-Pro-Max delivers strong results on synthetic and real use
cases with a 1-million-token context window, surpassing even -Pro on academic
benchmarks.
• DeepSeek-V4-Pro . DeepSeek-V4-Flash: DeepSeek-V4-Flash-Max exhibits lower per-
formance in knowledge evaluations due to its smaller parameter scale. However, it
achieves comparable results on reasoning tasks when allocated a larger thinking bud-
get. In agent evaluations, while DeepSeek-V4-Flash-Max matches the performance of
DeepSeek-V4-Pro-Max on several benchmarks, it still trails its larger counterpart on more
complex, high-difficulty tasks.
2. Architecture
Overall, DeepSeek-V4 series retain the Transformer (Vaswani et al., 2017) architecture and Multi-
Token Prediction (MTP) modules (DeepSeek-AI, 2024; Gloeckle et al., 2024), while introducing
several key upgrades over DeepSeek-V3: (1) firstly, we introduce the Manifold-Constrained
Hyper-Connections (mHC) (Xie et al., 2026) to strengthen conventional residual connections;
6
(2) secondly, we design a hybrid attention architecture, which greatly improves long-context
efficiency through Compressed Sparse Attention and Heavily Compressed Attention. (3) thirdly,
we employ Muon (Jordan et al., 2024; Liu et al., 2025) as the optimizer. For the Mixture-of-
Experts (MoE) components, we still adopt the DeepSeekMoE (Dai et al., 2024) architecture, with
only minor adjustments from DeepSeek-V3. The Multi-Token Prediction (MTP) (DeepSeek-AI,
2024; Gloeckle et al., 2024; Li et al., 2024; Qi et al., 2020) configuration remains identical to
that of DeepSeek-V3. All other unspecified details follow the settings established in DeepSeek-
V3 (DeepSeek-AI, 2024). Figure 2 illustrates the overall architecture of DeepSeek-V4, and the
details are described below.
. Designs Inherited from DeepSeek-V3
Mixture-of-Experts. As previous DeepSeek-series models (DeepSeek-AI, 2024; DeepSeek-AI,
2024), DeepSeek-V4 series also adopt the DeepSeekMoE paradigm (Dai et al., 2024) for Feed-
Forward Networks (FFNs), which sets fine-grained routed experts and shared experts. Different
from DeepSeek-V3, we change the activation function that computes the affinity scores from
Sigmoid(·) into Sqrt(Softplus(·)). For load balancing, we also employ the auxiliary-loss-free
strategy (DeepSeek-AI, 2024; Wang et al., 2024a), augmented by a slight sequence-wise balance
loss that prevents extreme imbalance within individual sequences. For DeepSeek-V4, we remove
the constraint on the number of routing target nodes, and carefully redesign the parallelism
strategy to maintain training efficiency. Furthermore, compared with DeepSeek-V3, we replace
the dense FFN layers in the initial several Transformer blocks with MoE layers that employ
Hash routing (Roller et al., 2021). The Hash routing strategy determines the target experts of
each token according to a predefined hash function with regard to the input token ID.
Multi-Token Prediction. As DeepSeek-V3, DeepSeek-V4 series also set MTP modules and
objectives. Given that the MTP strategy has been validated in DeepSeek-V3, we adopt the same
strategy for DeepSeek-V4 series without modification.
. Manifold-Constrained Hyper-Connections
As shown in Figure 2, DeepSeek-V4 series incorporate Manifold-Constrained Hyper-Connections
(mHC) (Xie et al., 2026) to strengthen the conventional residual connections between adjacent
Transformer blocks. Compared with naive Hyper-Connections (HC) (Zhu et al., 2025), the core
idea of mHC is to constrain the residual mapping onto a specific manifold, and thus enhance the
stability of signal propagation across layers while preserving model expressivity. This subsection
briefly introduces the standard HC and describes how we design mHC for stable training.
Standard Hyper-Connections. The standard HC expands the width of the residual stream
by a factor of 𝑛hc. Specifically, the shape of the residual stream is expanded from R𝑑 to R𝑛hc×𝑑 ,
where 𝑑 is the hidden size of the actual layer input. Let 𝑋𝑙 = [x𝑙,1; . . . ; x𝑙,𝑛hc]
𝑇 ∈ R𝑛hc×𝑑 be the
residual state before the 𝑙-th layer. HC introduces three linear mappings: an input mapping
𝐴𝑙 ∈ R1×𝑛hc , a residual transformation 𝐵𝑙 ∈ R𝑛hc×𝑛hc , and an output mapping 𝐶𝑙 ∈ R𝑛hc×1. The
update of the residual state is then formulated as:
𝑋𝑙+1 = 𝐵𝑙𝑋𝑙 + 𝐶𝑙F𝑙 (𝐴𝑙𝑋𝑙), (1)
where F𝑙 denotes the 𝑙-th layer (., an MoE layer), whose input and output shapes are both
R𝑑 . Note that the actual layer input 𝐴𝑙𝑋𝑙 ∈ R𝑑 is also 𝑑-dimensional, so the expanded residual
7
width does not influence the design of the inner layers. HC decouples the residual width from
the actual hidden size, offering a complementary scaling axis with minimal computational
overhead, as 𝑛hc is typically much smaller than the hidden size 𝑑. However, even though HC
has demonstrated potential in improving model performance, we find that the training will
frequently exhibit numerical instability when stacking multiple layers, which hinders the scaling
of HC.
Manifold-Constrained Residual Mapping. The core innovation of mHC is to constrain the
residual mapping matrix 𝐵𝑙 to the manifold of doubly stochastic matrices (the Birkhoff polytope)
M, and thus enhance the stability of signal propagation across layers:
𝐵𝑙 ∈ M ≔ {𝑀 ∈ R𝑛×𝑛 | 𝑀1𝑛 = 1𝑛, 1𝑇𝑛𝑀 = 1
𝑇
𝑛 , 𝑀 ⩾ 0}. (2)
This constraint ensures that the spectral norm of the mapping matrix ∥𝐵𝑙∥2 is bounded by 1, so
the residual transformation is non-expansive, which increases the numerical stability during both
the forward pass and backpropagation. Besides, the set M is closed under multiplication, which
guarantees stability in the scenarios of deep stacks of mHC. In addition, the input transformation
𝐴𝑙 and output transformation 𝐶𝑙 are also constrained to be non-negative and bounded via a
Sigmoid function to avoid the risk of signal cancellation.
Dynamic Parameterization. The parameters of three linear mappings are dynamically gen-
erated, which are decomposed into a dynamic (input-dependent) component and a static
(input-independent) component. Given the input 𝑋𝑙 ∈ R𝑛hc×𝑑 , it is first flattened and normal-
ized: �̂�𝑙 = RMSNorm(vec(𝑋𝑙)) ∈ R1×𝑛hc𝑑 . Then, we follow the conventional HC to generate the
unconstrained raw parameters �̃�𝑙 ∈ R1×𝑛hc , �̃�𝑙 ∈ R𝑛hc×𝑛hc , and 𝐶𝑙 ∈ R𝑛hc×1:
�̃�𝑙 = 𝛼
pre
𝑙
· ( �̂�𝑙𝑊
pre
𝑙
) + 𝑆pre
𝑙
, (3)
�̃�𝑙 = 𝛼
res
𝑙
· Mat( �̂�𝑙𝑊res𝑙 ) + 𝑆
res
𝑙
, (4)
𝐶𝑙 = 𝛼
post
𝑙
· ( �̂�𝑙𝑊
post
𝑙
)𝑇 + 𝑆post
𝑙
, (5)
where 𝑊
pre
𝑙
,𝑊
post
𝑙
∈ R𝑛hc𝑑×𝑛hc and 𝑊res
𝑙
∈ R𝑛hc𝑑×𝑛
2
hc are learnable parameters for generating the
dynamic components; Mat(·) reshapes a vector of size 1 × 𝑛2hc into a matrix of size 𝑛hc × 𝑛hc;
𝑆
pre
𝑙
∈ R1×𝑛hc , 𝑆post
𝑙
∈ R𝑛hc×1, and 𝑆res
𝑙
∈ R𝑛hc×𝑛hc are learnable static biases; and 𝛼pre
𝑙
, 𝛼res
𝑙
, 𝛼
post
𝑙
∈ R
are learnable gating factors initialized to small values.
Applying Parameter Constraints. After obtaining the unconstrained raw parameters �̃�𝑙, �̃�𝑙,𝐶𝑙,
we then apply constraints described earlier to them to enhance the numerical stability. To be
specific, for the input and output mappings, we employ a Sigmoid function 𝜎(·) to ensure their
non-negativity and boundedness:
𝐴𝑙 = 𝜎( �̃�𝑙), (6)
𝐶𝑙 = 2𝜎(𝐶𝑙). (7)
As for the residual mapping �̃�𝑙, we project it onto the manifold of doubly stochastic matrices M.
This is achieved by the Sinkhorn-Knopp algorithm, which first applies an exponential function
to �̃�𝑙 to ensure positivity, getting 𝑀 (0) = exp( �̃�𝑙), and then iteratively performs column and row
normalization:
𝑀
(𝑡)
= T𝑟 (T𝑐 (𝑀 (𝑡−1) )), (8)
where T𝑟 and T𝑐 denote row and column normalization, respectively. This iteration converges to
a constrained doubly stochastic matrix 𝐵𝑙 = 𝑀 (𝑡max ) . We choose 𝑡max = 20 as a practical value.
8
…Hidden States of KV Tokens
…
Compressed
Indexer Keys
Hidden State of Query Token
Multi-Query
Attention
Compressed
KV Entries
…
Top-k
Selector
Selected
Compressed
KV Entries
…
Shared Key-Value Multi-Query Attention
Indexer Queries Queries
Sliding Window
KV Entries
Concatenation
Token-Level
Compressor
…
Token-Level
Compressor
Lightning Indexer
Index Scores
Figure 3 | Core architectures of CSA. It compresses the number of KV entries to 1
𝑚
times, and
then applies DeepSeek Sparse Attention for further acceleration. Additionally, a small set of
sliding window KV entries is combined with the selected compressed KV entries to enhance
local fine-grained dependencies.
. Hybrid Attention with CSA and HCA
As the context length reaches extreme scales, the attention mechanism emerges as the dominant
computational bottleneck in a model. For DeepSeek-V4, we design two efficient attention
architectures — Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA)
— and employ their interleaved hybrid configuration, which substantially reduces the compu-
tational cost of attention in long-text scenarios. CSA integrates both compression and sparse
attention strategies: it first compresses the Key-Value (KV) cache of every 𝑚 tokens into one
entry, and then applies DeepSeek Sparse Attention (DSA) (DeepSeek-AI, 2025) where each query
token attends to only 𝑘 compressed KV entries. HCA aims for extreme compression by consol-
idating the KV cache of every 𝑚′ (≫ 𝑚) tokens into a single entry. The hybrid architecture of
CSA and HCA remarkably improves the long-context efficiency of DeepSeek-V4 series, making
one-million-token context feasible in practice. This subsection describes the core techniques
of our hybrid attention architecture, and we also provide an open-source implementation1 to
specify more details unambiguously.
. Compressed Sparse Attention
The core architecture of CSA is illustrated in Figure 3, which first compresses the KV cache of each
𝑚 tokens into one entry, and then applies DeepSeek Sparse Attention for further acceleration.
Compressed Key-Value Entries. Let 𝐻 ∈ R𝑛×𝑑 be a sequence of input hidden states, where
𝑛 is the sequence length and 𝑑 is the hidden size. CSA first computes two series of KV entries
𝐶𝑎,𝐶𝑏 ∈ R𝑛×𝑐 and their corresponding compression weights 𝑍𝑎, 𝑍𝑏 ∈ R𝑛×𝑐, where 𝑐 is the head
1
9
dimension:
𝐶
𝑎
= 𝐻 ·𝑊𝑎𝐾𝑉 , 𝐶𝑏 = 𝐻 ·𝑊𝑏𝐾𝑉 , (9)
𝑍
𝑎
= 𝐻 ·𝑊𝑎𝑍, 𝑍𝑏 = 𝐻 ·𝑊𝑏𝑍, (10)
where 𝑊𝑎𝐾𝑉 ,𝑊𝑏𝐾𝑉 ,𝑊𝑎𝑍,𝑊𝑏𝑍 ∈ R𝑑×𝑐 are trainable parameters. Next, each 𝑚 KV entries in 𝐶𝑎 and
𝐶𝑏 will be compressed into one entry according to their compression weights and learnable
positional biases 𝐵𝑎, 𝐵𝑏 ∈ R𝑚×𝑐, producing 𝐶Comp ∈ R
𝑛
𝑚
×𝑐. Each compressed entry 𝐶
Comp
𝑖
∈ R𝑐 is
computed by
[𝑆𝑎
𝑚𝑖:𝑚(𝑖+1)−1; 𝑆
𝑏
𝑚(𝑖−1) :𝑚𝑖−1] = Softmaxrow( [𝑍
𝑎
𝑚𝑖:𝑚(𝑖+1)−1 + 𝐵
𝑎; 𝑍𝑏
𝑚(𝑖−1) :𝑚𝑖−1 + 𝐵
𝑏]), (11)
𝐶
Comp
𝑖
=
𝑚(𝑖+1)−1∑︁
𝑗=𝑚𝑖
𝑆
𝑎
𝑗 ⊙ 𝐶
𝑎
𝑗 +
𝑚𝑖−1∑︁
𝑗=𝑚(𝑖−1)
𝑆
𝑏
𝑗 ⊙ 𝐶
𝑏
𝑗 , (12)
where ⊙ denotes the Hadamard product; Softmaxrow(·) denotes the softmax operation along
the row dimension, which performs normalization across the total of 2𝑚 elements from both
𝑍𝑎 and 𝑍𝑏. When 𝑖 = 0, 𝑍𝑏
𝑚(𝑖−1) :𝑚𝑖−1 is padded with negative infinity and 𝐶
𝑏
𝑚(𝑖−1) :𝑚𝑖−1 is padded
with zeros. Note that each 𝐶
Comp
𝑖
is derived from 2𝑚 KV entries, but the indexes of 𝐶𝑏 used for
𝐶
Comp
𝑖
and the indexes of 𝐶𝑎 used for 𝐶
Comp
𝑖−1 are overlapped. Therefore, CSA in fact compresses
the sequence length to 1
𝑚
times.
Lightning Indexer for Sparse Selection. After obtaining the compressed KV entries 𝐶Comp,
CSA applies the DSA strategy to select top-k compressed KV entries for core attention. First,
CSA performs the same compression operation used for 𝐶Comp to get compressed indexer keys
𝐾IComp ∈ R
𝑛
𝑚
×𝑐𝐼 , where 𝑐𝐼 is the indexer head dimension. Then, for a query token 𝑡, we produce
the indexer queries {q𝐼
𝑡,1; q
𝐼
𝑡,2; ...; q
𝐼
𝑡,𝑛𝐼
ℎ
} in a low-rank manner:
c𝑄𝑡 = h𝑡 ·𝑊
𝐷𝑄, (13)
[q𝐼
𝑡,1; q
𝐼
𝑡,2; ...; q
𝐼
𝑡,𝑛𝐼
ℎ
] = q𝐼𝑡 = c
𝑄
𝑡 ·𝑊
𝐼𝑈𝑄, (14)
where h𝑡 ∈ R𝑑 is the input hidden state of the query token 𝑡; c𝑄𝑡 ∈ R𝑑𝑐 is the compressed
latent vector for queries; 𝑑𝑐 denotes the query compression dimension; 𝑛𝐼ℎ denotes the number
of indexer query heads; 𝑊𝐷𝑄 ∈ R𝑑×𝑑𝑐 and 𝑊 𝐼𝑈𝑄 ∈ R𝑑𝑐×𝑐
𝐼𝑛𝐼
ℎ are the down-projection and up-
projection matrices for indexer queries, respectively. Next, the index score 𝐼𝑡,𝑠 ∈ R between the
query token 𝑡 and a preceding compressed block 𝑠 (𝑠 < Floor( 𝑡
𝑚
)) is computed by
[𝑤𝐼
𝑡,1;𝑤
𝐼
𝑡,2; ...;𝑤
𝐼
𝑡,𝑛𝐼
ℎ
] = w𝐼𝑡 = h𝑡 ·𝑊
𝑤, (15)
𝐼𝑡,𝑠 =
𝑛𝐼
ℎ∑︁
ℎ=1
𝑤
𝐼
𝑡,ℎ · ReLU
(
q𝐼
𝑡,ℎ · 𝐾
IComp
𝑠
)
, (16)
where 𝑊𝑤 ∈ R𝑑×𝑛
𝐼
ℎ is a learnable matrix; 𝑤𝐼
𝑡,ℎ ∈ R is the weight of the ℎ-th indexer head. For a
query token 𝑡, given its index scores 𝐼𝑡,:, we employ a top-k selector to selectively retain a subset
of compressed KV entries CSprsComp𝑡 for subsequent core attention:
CSprsComp𝑡 =
{
𝐶
Comp
𝑠
��� 𝐼𝑡,𝑠 ∈ Top-k(𝐼𝑡,:)} . (17)
10
…
Hidden States of KV Tokens
…
Hidden State of Query Token
Heavily
Compressed
KV Entries
Shared Key-Value Multi-Query Attention
Queries
Sliding Window
KV Entries
Concatenation
Token-Level
Compressor
Figure 4 | Core architectures of HCA. It performs heavier compression, where the KV entries of
𝑚′ (≫ 𝑚) tokens will be consolidated into one. Also, we additionally introduce a small set of
sliding window KV entries to enhance local fine-grained dependencies.
Shared Key-Value MQA. After selecting the sparse KV entries, CSA then performs core
attention in a Multi-Query Attention (MQA) (Shazeer, 2019) manner, where each compressed
KV entry in CSprsComp𝑡 serves as both attention key and value. To be specific, for a query token 𝑡,
we first produce attention queries {q𝑡,1; q𝑡,2; ...; q𝑡,𝑛ℎ} from the compressed latent vector c
𝑄
𝑡 :
[q𝑡,1; q𝑡,2; ...; q𝑡,𝑛ℎ] = q𝑡 = c
𝑄
𝑡 ·𝑊
𝑈𝑄, (18)
where 𝑛ℎ denotes the number of query heads; 𝑊𝑈𝑄 ∈ R𝑑𝑐×𝑐𝑛ℎ is the up-projection matrices for
queries. Note that the latent query vector c𝑄𝑡 is shared with that used for the indexer queries.
Next, we perform MQA on {q𝑡,𝑖} and C
SprsComp
𝑡 :
o𝑡,𝑖 = CoreAttn
(
query=q𝑡,𝑖, key=C
SprsComp
𝑡 , value=C
SprsComp
𝑡
)
, (19)
where o𝑡,𝑖 ∈ R𝑐 is the core attention output of the 𝑖-th head at the 𝑡-th token; CoreAttn(·) denotes
the core attention operation.
Grouped Output Projection. In the configuration of DeepSeek-V4, 𝑐𝑛ℎ is quite large. Therefore,
directly projecting the outputs of the core attention operation [o𝑡,1; o𝑡,2; ...; o𝑡,𝑛ℎ] = o𝑡 ∈ R𝑐𝑛ℎ to a
𝑑-dimensional hidden state will impose a substantial computational burden. To mitigate this
cost, we design a grouped output projection strategy. To be specific, we first split 𝑛ℎ outputs
into 𝑔 groups, and then for each group of output o𝐺
𝑡,𝑖 ∈ R
𝑐
𝑛ℎ
𝑔 , we project it to a 𝑑𝑔-dimensional
intermediate output o𝐺
′
𝑡,𝑖 ∈ R
𝑑𝑔 , where 𝑑𝑔 < 𝑐
𝑛ℎ
𝑔
. Finally, we project the intermediate output
[o𝐺′
𝑡,1; o
𝐺′
𝑡,2; ...; o
𝐺′
𝑡,𝑔] ∈ R𝑑𝑔𝑔 to the final attention output ô𝑡 ∈ R𝑑 .
. Heavily Compressed Attention
The core architecture of HCA is illustrated in Figure 4, which compresses the KV cache in a
heavier manner, but does not employ sparse attention.
Compressed Key-Value Entries. By and large, the compression strategy of HCA is similar to
that of CSA, but employs a larger compression rate 𝑚′ (≫ 𝑚) and does not perform overlapped
11
compression. Let 𝐻 ∈ R𝑛×𝑑 be a sequence of input hidden states, HCA first computes the
original KV entries 𝐶 ∈ R𝑛×𝑐 and their corresponding compression weights 𝑍 ∈ R𝑛×𝑐:
𝐶 = 𝐻 ·𝑊𝐾𝑉 , (20)
𝑍 = 𝐻 ·𝑊𝑍, (21)
where𝑊𝐾𝑉 ,𝑊𝑍 ∈ R𝑑×𝑐 are trainable parameters. Next, each𝑚′ KV entries in 𝐶 will be compressed
into one according to the compression weights and learnable positional biases 𝐵 ∈ R𝑚′×𝑐,
producing 𝐶Comp ∈ R
𝑛
𝑚′ ×𝑐. Each compressed entry 𝐶
Comp
𝑖
∈ R𝑐 is computed by
𝑆𝑚′ 𝑖:𝑚′ (𝑖+1)−1 = Softmaxrow(𝑍𝑚′ 𝑖:𝑚′ (𝑖+1)−1 + 𝐵), (22)
𝐶
Comp
𝑖
=
𝑚′ (𝑖+1)−1∑︁
𝑗=𝑚′ 𝑖
𝑆 𝑗 ⊙ 𝐶 𝑗. (23)
Through this compression operation, HCA compresses the sequence length to 1
𝑚′ times.
Shared Key-Value MQA and Grouped Output Projection. HCA also employs the shared KV
MQA and grouped output projection strategies as CSA does. After the KV compression, for a
query token 𝑡, HCA first produces attention queries {q𝑡,1; q𝑡,2; ...; q𝑡,𝑛ℎ} in a low-rank manner:
c𝑄𝑡 = h𝑡 ·𝑊
𝐷𝑄, (24)
[q𝑡,1; q𝑡,2; ...; q𝑡,𝑛ℎ] = q𝑡 = c
𝑄
𝑡 ·𝑊
𝑈𝑄, (25)
where h𝑡 ∈ R𝑑 is the input hidden state of the query token 𝑡; 𝑛ℎ denotes the number of query
heads; 𝑊𝐷𝑄 ∈ R𝑑×𝑑𝑐 and 𝑊𝑈𝑄 ∈ R𝑑𝑐×𝑐𝑛ℎ are the down-projection and up-projection matrices for
queries, respectively. Next, we perform MQA on {q𝑡,𝑖} and 𝐶Comp:
o𝑡,𝑖 = CoreAttn
(
query=q𝑡,𝑖, key=𝐶Comp, value=𝐶Comp
)
, (26)
where o𝑡,𝑖 ∈ R𝑐 is the core attention output of the 𝑖-th head at the 𝑡-th token. Next, as CSA does,
HCA splits 𝑛ℎ outputs into 𝑔 groups, and for each group of output o𝐺𝑡,𝑖 ∈ R
𝑐
𝑛ℎ
𝑔 , HCA projects it
to a 𝑑𝑔-dimensional intermediate output o𝐺
′
𝑡,𝑖 ∈ R
𝑑𝑔 , where 𝑑𝑔 < 𝑐
𝑛ℎ
𝑔
. Finally, HCA projects the
intermediate output [o𝐺′
𝑡,1; o
𝐺′
𝑡,2; ...; o
𝐺′
𝑡,𝑔] ∈ R𝑑𝑔𝑔 to the final attention output ô𝑡 ∈ R𝑑 .
. Other Details
In addition to the core architectures of CSA and HCA described above, our hybrid attention
incorporates several other techniques. For writing clarity, we omit these additional techniques
from the above introduction and will briefly describe them in this subsection. Also, this subsec-
tion focuses only on the core ideas of them and may omit some tiny details for simplicity. We
encourage the readers to refer to our open-source implementation for unambiguous details.
Query and Key-Value Entry Normalization. For both CSA and HCA, we perform an addi-
tional RMSNorm operation on each head of the queries and the only head of the compressed KV
entries, just before the core attention operation. This normalization avoids exploding attention
logits and may improve training stability.
12
Partial Rotary Positional Embedding. For both CSA and HCA, we partially employ the Rotary
Positional Embedding (RoPE) (Su et al., 2024) to the attention queries, KV entries, and the core
attention outputs. To be specific, for each query vector and KV entry vector used in CSA and
HCA, we apply RoPE to its last 64 dimensions. Since the KV entries serve as both attention
keys and values, the naive core attention outputs {o𝑡,𝑖} will carry absolute position embeddings,
derived from the weighted sum of KV entries. As a countermeasure, we also apply RoPE with
position −𝑖 on the last 64 dimensions of each o𝑡,𝑖. In this way, the output of the core attention
will also carry relative position embeddings — the contribution of each KV entry to the core
attention outputs will also be related to the distance between the query and the KV entry.
Additional Branch of Sliding Window Attention. In order to strictly preserve causality in
CSA and HCA, each query attends to only preceding compressed KV blocks. Consequently, a
query cannot access information from other tokens within its own compressed block. Meanwhile,
recent tokens usually possess greater relevance to the query token in language modeling. For
these reasons, we introduce a supplementary attention branch to both CSA and HCA in a sliding
window manner, for better modeling of local dependencies. To be specific, for each query token,
we additionally produce 𝑛win uncompressed KV entries corresponding to the recent 𝑛win tokens.
In the core attention of CSA and HCA, these KV entries in the sliding window will be used
along with the compressed KV entries.
Attention Sink. In the core attention of CSA and HCA, we employ the trick of attention
sink (OpenAI, 2025; Xiao et al., 2024). To be specific, we set a series of learnable sink logits
{𝑧′1, 𝑧
′
2, ..., 𝑧
′
𝑛ℎ
}. For the ℎ-th attention head, Exp(𝑧′
ℎ
) will be added to the denominator of the
attention score:
𝑠ℎ,𝑖, 𝑗 =
Exp(𝑧ℎ,𝑖, 𝑗)∑
𝑘 Exp(𝑧ℎ,𝑖,𝑘) + Exp(𝑧′ℎ)
, (27)
where 𝑠ℎ,𝑖, 𝑗, 𝑧ℎ,𝑖, 𝑗 ∈ R denote the attention score and attention logit of the ℎ-th attention head
between the 𝑖-th query token and the 𝑗-th preceding token or compressed block. This technique
allows each query head to adjust its total attention scores to be not equal to 1, and even to be
near 0.
. Efficiency Discussion
Due to the employment of hybrid CSA and HCA, together with low-precision computation
and storage, the attention module of DeepSeek-V4 series achieves remarkable efficiency in both
attention FLOPs and KV cache size, especially in long-context scenarios. First, we adopt a
mixed storage format for KV entries: BF16 precision is used for the rotary positional embedding
(RoPE) dimensions, while FP8 precision is applied to the remaining dimensions. This hybrid
representation reduces the KV cache size by nearly half compared with pure BF16 storage.
Second, attention computation within the lightning indexer is performed in FP4 precision,
which accelerates the attention operation under extremely long contexts. Third, relative to
, a smaller attention top-k is chosen in DeepSeek-V4 series, thereby improving
model efficiency on short- and medium-length texts. Finally, and most importantly, compressed
attention and hybrid attention techniques substantially reduce both the KV cache size and the
computational FLOPs.
Taking BF16 GQA8 (Ainslie et al., 2023) with a head dimension of 128 as the baseline — one
of the common configurations of LLM attention — the KV cache size of DeepSeek-V4 series can
be dramatically reduced to approximately 2% times of that baseline in the 1M-context setting.
13
Algorithm 1 Muon Optimizer for DeepSeek-V4
Require: Learning rate 𝜂, momentum 𝜇, weight decay 𝜆, update rescaling factor 𝛾
1: for each training step 𝑡 do
2: for each logically independent weight 𝑊 ∈ R𝑛×𝑚 do
3: 𝐺𝑡 = ∇𝑊L𝑡 (𝑊𝑡−1) ⊲ Compute gradients
4: 𝑀𝑡 = 𝜇𝑀𝑡−1 + 𝐺𝑡 ⊲ Accumulate momentum buffer
5: 𝑂′𝑡 = HybridNewtonSchulz(𝜇𝑀𝑡 + 𝐺𝑡) ⊲ Nesterov trick and hybrid Newton-Schulz
6: 𝑂𝑡 = 𝑂
′
𝑡 ·
√︁
max(𝑛,𝑚) · 𝛾 ⊲ Rescale the update RMS
7: 𝑊𝑡 =𝑊𝑡−1 · (1 − 𝜂𝜆) − 𝜂𝑂𝑡 ⊲ Perform weight decay and update
8: end for
9: end for
Moreover, even when compared with (DeepSeek-AI, 2025) — already an efficient
baseline — DeepSeek-V4 series still exhibits substantial advantages in efficiency. A comparison
of their inference FLOPs and KV cache size is provided in the right part of Figure 1.
. Muon Optimizer
We employ the Muon (Jordan et al., 2024; Liu et al., 2025) optimizer for the majority of modules
in DeepSeek-V4 series due to its faster convergence and improved training stability. The full
algorithm of our Muon optimization is summarized in Algorithm 1.
Basic Configurations. We maintain the AdamW (Loshchilov and Hutter, 2017) optimizer for
the embedding module, the prediction head module, the static biases and gating factors of
mHC modules, and the weights of all RMSNorm modules. All other modules are updated with
Muon. Following Liu et al. (2025), we also apply weight decay to Muon parameters, use the
Nesterov (Jordan et al., 2024; Nesterov, 1983) trick, and rescale the Root Mean Square (RMS) of
the update matrix for reutilization of our AdamW hyper-parameters. Different from them, we
use hybrid Newton-Schulz iterations for orthogonalization.
Hybrid Newton-Schulz Iterations. For a given matrix 𝑀, let its Singular Value Decomposition
(SVD) be 𝑀 = 𝑈Σ𝑉𝑇 . The Newton-Schulz iterations aim to approximately orthogonalize 𝑀 to be
𝑈𝑉𝑇 . Usually, 𝑀 will be first normalized as 𝑀0 = 𝑀/| |𝑀 | |𝐹 to ensure its maximum singular value
does not exceed 1. Then, each Newton-Schulz iteration performs the following operation:
𝑀𝑘 = 𝑎𝑀𝑘−1 + 𝑏(𝑀𝑘−1𝑀𝑇𝑘−1)𝑀𝑘−1 + 𝑐(𝑀𝑘−1𝑀
𝑇
𝑘−1)
2
𝑀𝑘−1. (28)
Our hybrid Newton-Schulz performs 10 iterations over two distinct stages. During the first 8
steps, we use coefficients (𝑎, 𝑏, 𝑐) = (,−, ) to drive rapid convergence, bringing
the singular values close to 1. In the final 2 steps, we switch to coefficients (𝑎, 𝑏, 𝑐) = (2,−, ),
which stabilize the singular values precisely at 1.
Avoiding Exploding Attention Logits. The attention architecture of DeepSeek-V4 series al-
lows us to directly apply RMSNorm on the attention queries and KV entries, which effectively
prevents attention logits from exploding. Consequently, we do not employ the QK-Clip tech-
nique (Liu et al., 2025) in our Muon optimizer.
14
3. General Infrastructures
. Fine-Grained Communication-Computation Overlap in Expert Parallelism
Mixture-of-Experts (MoE) can be accelerated via Expert Parallelism (EP). However, EP re-
quires complex inter-node communication and imposes substantial demands on interconnect
bandwidth and latency. To alleviate the communication bottleneck in EP and achieve higher
end-to-end performance under lower interconnection bandwidth requirements, we propose
a fine-grained EP scheme that fuses communication and computation into a single pipelined
kernel for communication-computation overlapping.
Communication Latency Can Be Hidden. The key insight of our EP scheme is that the
communication latency can be effectively hidden beneath computation in MoE layers. As shown
in Figure 5, in DeepSeek-V4 series, each MoE layer can be decomposed mainly into four stages:
two communication-bound stages, Dispatch and Combine, and two computation-bound stages,
Linear-1 and Linear-2. Our profiling reveals that within a single MoE layer, the total time of
communication is less than that of the computation. Therefore, after fusing communication and
computation into a unified pipeline, computation remains the dominant bottleneck, implying
that the system can tolerate lower interconnect bandwidth without degrading end-to-end
performance.
L1 Act L2
(a) Naive Solution
Communication
Computation L1 Act L2
Theoretical speedup: ×
(b) Comet
Dispatch
Computation
Activation
& Combine
L1 L2 L1 L2 L1 L2
Act Act Act
Expert Wave 1 Expert Wave 2 Expert Wave 3
Theoretical speedup: ×
(c) Ours
Dispatch All-to-All
Linear 1 GEMM
SwiGLU + FP8 Cast
Combine All-to-All
Linear 2 GEMM
Figure 5 | Illustration of our EP scheme with related works. Comet (Zhang et al., 2025b) overlaps
Dispatch with Linear-1, and Linear-2 with Combine, separately. Our EP scheme achieves a finer-
grained overlapping by splitting and scheduling experts into waves. The theoretical speedup is
evaluated in the configuration of the DeepSeek-V4-Flash architecture.
Fine-Grained EP Scheme. To further lower the interconnect bandwidth requirement and
amplify the benefits of overlapping, we introduce a finer-grained expert partitioning scheme.
Inspired by many related works (Aimuyo et al., 2025; Zhang et al., 2025b), we split and schedule
the experts into waves. Each wave consists of a small portion of experts. As soon as all experts
within the wave have completed their communication, computation can commence immediately
without waiting for other experts. In steady state, computation of current wave, token transfer for
the next wave, and result sending of completed experts all proceed concurrently, as demonstrated
in Figure 5. This forms a fine-grained pipeline among experts, keeping both computation and
communication continuous throughout the wave. The wave-based scheduling speeds up the
15
performance on extreme cases such as Reinforcement Learning (RL) rollout, which usually
encounters long-tail small batches.
Performance and Open-Sourced Mega-Kernel. We validated the fine-grained EP scheme
on both NVIDIA GPUs and HUAWEI Ascend NPUs platforms. Compared against strong
non-fused baselines, it achieves ∼ × speedup for general inference workloads, and
up to × for latency-sensitive scenarios such as RL rollouts and high-speed agent serving.
We have open-sourced the CUDA-based mega-kernel implementation named MegaMoE2 as a
component of DeepGEMM.
Observations and Proposals. We share observations and lessons from kernel development
and offer some proposals to hardware vendors, in the hope of aiding efficient hardware design
and achieving better software-hardware co-design:
• Computation-Communication Ratio. Full communication-computation overlap hinges
on the computation-communication ratio, rather than the bandwidth solely. Denoting peak
compute throughput as 𝐶 and interconnect bandwidth as 𝐵, communication can be fully
hidden when 𝐶/𝐵 ⩽ 𝑉comp/𝑉comm, where 𝑉comp denotes the computation volume and 𝑉comm
refers to the communication volume. For DeepSeek-V4-Pro, where each token-expert
pair requires 6ℎ𝑑 FLOPs (SwiGLU gate, up, and down projections) but only 3ℎ bytes of
communication (FP8 Dispatch + BF16 Combine), this simplifies to:
𝐶
𝐵
⩽ 2𝑑 = 6144 FLOPs/Byte.
That is, each GBps of interconnect bandwidth suffices to hide the communication for
TFLOP/s of compute. Once bandwidth meets this threshold, it ceases to be the bottleneck,
and devoting additional silicon area to further bandwidth brings diminishing returns.
We encourage future hardware designs to target such balance points rather than scale
bandwidth unconditionally.
• Power Budget. Extreme kernel fusion drives compute, memory, and network to high
load simultaneously, making power throttling a key performance limiter. We suggest that
future hardware designs provide sufficient power headroom for such fully concurrent
workloads.
• Communication Primitives. We adopt a pull-based approach where each GPU actively
reads data from remote GPUs, avoiding the high notification latency that fine-grained
push entails. Future hardware with lower-latency cross-GPU signaling would make push
viable and enable more natural communication patterns.
• Activation Function. We propose replacing SwiGLU with a low-cost element-wise activa-
tion that involves no exponential or division operations. This lightens the post-GEMM
processing directly, and under the same parameter budget, removing the gate projection
enlarges the intermediate dimension 𝑑, further relaxing the bandwidth requirement.
. Flexible and Efficient Kernel Development with TileLang
In practice, our elaborate model architecture would have resulted in hundreds of fine-grained
Torch ATen operators. We adopt TileLang (Wang et al., 2026) to develop a set of fused kernels
to replace the vast majority of them, delivering optimal performance with minimal effort. It
2
16
also allows us to quickly prototype operators like attention variants during validation. These
kernels play critical roles in model architecture development, large-scale training, and ultimately
production deployment of inference services. As a Domain-Specific Language (DSL), TileLang
balances development productivity with runtime efficiency, enabling rapid development while
supporting deep, iterative optimizations within the same codebase. Additionally, we collab-
orate closely with the TileLang community to foster a more agile, efficient, and stable kernel
development workflow.
Reducing Invocation Overhead with Host Codegen. As accelerators continue to grow in
performance, CPU-side orchestration overhead becomes increasingly prominent. For small,
highly optimized kernels, such fixed host overhead can easily cap utilization and throughput.
A common source of this overhead is that host-side logic, such as runtime contract checks, is
typically written in Python for flexibility and thus incurs a fixed per-invocation cost.
We mitigate this overhead with Host Codegen, which moves most host-side logic into gen-
erated host code. Specifically, we first co-generate the device kernel and a lightweight host
launcher at the IR (Intermediate Representation) level, embedding the necessary metadata—such
as data types, rank/shape constraints, and stride/layout assumptions—parsed from the lan-
guage frontend. The launcher is then lowered to the host source code built on top of the
TVM-FFI (Chen et al., 2018) framework, whose compact calling convention and zero-copy tensor
interop together minimize host-side overhead. At runtime, this generated host code performs
validation and argument marshaling, shifting all per-invocation checks out of the Python exe-
cution path. Our measurements show that CPU-side validation overhead drops from tens or
hundreds of microseconds to less than one microsecond per invocation.
SMT-Solver-Assisted Formal Integer Analysis. TileLang kernels involve complex tensor
index arithmetic that requires strong formal integer analysis. During compilation passes such
as layout inference, memory hazard detection, and bound analysis, the compiler must verify
whether integer expressions satisfy specific properties to enable the corresponding optimiza-
tions. Therefore, stronger formal analysis capabilities can unlock more advanced and complex
optimization opportunities.
To this end, we integrate the Z3 SMT solver (De Moura and Bjørner, 2008) into TileLang’s
algebraic system, providing formal analysis capability for most integer expressions in tensor
programs. We strike a balance between computational overhead and formal expressiveness by
translating TileLang’s integer expressions into Z3’s quantifier-free non-linear integer arithmetic
(QF_NIA). Based on Integer Linear Programming (ILP) solvers, QF_NIA seamlessly resolves
standard linear integer expressions common in kernels. Furthermore, its inherent non-linear
reasoning capacity effectively addresses advanced challenges like vectorization over variable
tensor shapes. Under reasonable resource limits, Z3 elevates overall optimization performance
while restricting compilation time overhead to just a few seconds. The impact is substantial
across multiple passes, including vectorization, barrier insertion, and code simplification.
Numerical Precision and Bitwise Reproducibility. In production settings, numerical correct-
ness and reproducibility are as critical as raw throughput. We therefore prioritize accuracy by
default: fast-math optimizations are disabled at the compiler level, and precision-affecting ap-
proximations are provided only as explicit, opt-in frontend operators (., T.__exp, T.__log,
and T.__sin). Conversely, when strict IEEE-754 semantics are required, TileLang provides
17
IEEE-compliant intrinsics with explicit rounding modes (., _fsqrt, _fdiv,
and _add), enabling developers to precisely specify numerical behavior.
We also target bitwise reproducibility for validating kernels against hand-written CUDA
baselines. We align TileLang’s algebraic simplification and lowering rules with mainstream
CUDA toolchains (., NVCC) to avoid transformations that introduce unintended bit-level
differences. Layout annotations (., _layout) further allow users to pin down
layout-dependent lowering decisions, keeping evaluation and accumulation order consistent
with the reference CUDA implementation and thus enabling bit-identical outputs when desired.
Our evaluation shows that these accuracy- and reproducibility-oriented design choices do
not sacrifice performance: under conservative defaults, TileLang kernels remain competitive,
while exposing knobs to selectively relax numerical constraints for higher speed.
. High-Performance Batch-Invariant and Deterministic Kernel Libraries
To enable efficient training and inference, we develop a comprehensive set of high-performance
computational kernels. Beyond basic functionalities and maximizing hardware utilization,
another pivotal design goal is to ensure training reproducibility and bitwise alignment among
pre-training, post-training, and inference pipelines. Therefore, we implement end-to-end,
bitwise batch-invariant, and deterministic kernels with minimal performance overhead. These
kernels are helpful for debugging, stability analysis, and consistent post-training behavior.
Batch Invariance. Batch invariance ensures that the output of any given token remains bitwise
identical, regardless of its position within a batch. To implement batch invariance, the primary
challenges are listed as follows:
• Attention. To achieve batch invariance, we cannot use the split-KV method (Dao et al.,
2023), which distributes the attention computation for a single sequence across multiple
Stream Multiprocessors (SMs) to balance the load of SMs. However, abandoning this
technique will lead to severe wave-quantization problems3, which can adversely affect
GPU utilization. To address this, we develop a dual-kernel strategy for batch-invariant
decoding. The first kernel computes the attention output for an entire sequence within
a single SM, ensuring high throughput for fully occupied waves. The second kernel, to
minimize the latency of the final partially-filled wave and thus alleviate wave-quantization,
uses multiple SMs for a single sequence. For the bitwise identity of these two kernels,
we carefully design the calculation path of the second kernel to ensure its accumulation
order is the same as that of the first kernel. Additionally, the second kernel utilizes dis-
tributed shared memory4 within thread-block clusters, enabling high-speed data exchange
across SMs. This dual-kernel method effectively confines the overhead of batch-invariant
decoding to be negligible.
• Matrix Multiplication. Traditional cuBLAS library (NVIDIA Corporation, 2024) cannot
achieve batch invariance. Therefore, we replace it end-to-end with DeepGEMM (Zhao
et al., 2025). Furthermore, for very small batch sizes, conventional implementation usually
employs split-k (Osama et al., 2023) techniques to improve performance. Unfortunately,
split-k techniques cannot guarantee batch invariance, a pivotal feature in DeepSeek-V4.
3
ion/
4
.html#distributed-shared-memory
18
Therefore, we abandon split-k in most scenarios, which, however, may cause performance
degradation. To address this, we introduce a set of optimizations that enable our imple-
mentation of matrix multiplication to match or even surpass the performance of standard
split-k in most major scenarios.
Determinism. Deterministic training is highly beneficial for debugging hardware or software
issues. Moreover, when training exhibits anomalies such as loss spikes, determinism enables
researchers to more easily pinpoint numerical causes and further refine the model design. Non-
determinism in training typically stems from non-deterministic accumulation order, often due
to the use of atomic addition instructions. This issue primarily occurs during the backward pass,
notably at the following parts:
• Attention Backward. In conventional implementations of backward propagation for
sparse attention, we use atomicAdd to accumulate gradients for the KV tokens. This
introduces non-determinism due to the non-associativity of floating-point addition. To
address this problem, we allocate separate accumulation buffers for each SM, followed by
a global deterministic summation across all buffers.
• MoE Backward. When multiple SMs from different ranks concurrently write data to
the same buffer on a receiving rank, negotiating writing positions also introduces non-
determinism. To resolve this, we design a token order pre-processing mechanism within
each single rank, combined with buffer isolation across multiple ranks. This strategy
ensures determinism of both the send results of expert parallelism and the accumulation
order in the MoE backward pass.
• Matrix Multiplication in mHC. mHC involves a matrix multiplication with an output di-
mension of only 24. For very small batch sizes, we are compelled to use the split-k (Osama
et al., 2023) algorithm, whose naive implementation will cause non-determinism. To
overcome this, we output each split part separately and perform a deterministic reduction
in a subsequent kernel, thereby preserving both performance and determinism.
. FP4 Quantization-Aware Training
To achieve inference acceleration and memory savings at deployment, we introduce Quantization-
Aware Training (QAT) (Jacob et al., 2018) during the post-training stage, enabling the model
to adapt to the precision degradation introduced by quantization. We apply FP4 (MXFP4)
quantization (Rouhani et al., 2023) to two components: (1) MoE expert weights, which are a
major source of GPU memory occupancy (OpenAI, 2025), and (2) the Query-Key (QK) path
in the indexer of CSA, where QK activations are cached, loaded, and multiplied entirely in
FP4, accelerating attention score computation in long-context scenarios. In addition, we further
quantize the index scores 𝐼:,: from FP32 to BF16 during this QAT process. This optimization
achieves a 2× speedup for the top-k selector, while preserving a % recall rate of KV entries.
For MoE expert weights, following the common practice of QAT, the FP32 master weights
maintained by the optimizer are first quantized to FP4, then dequantized back to FP8 for
computation. Notably, our FP4-to-FP8 dequantization is lossless. This is because FP8 (E4M3)
has 2 additional exponent bits compared with FP4 (E2M1), offering a larger dynamic range.
Consequently, as long as the ratio between the maximum and minimum scale factors of the FP4
sub-blocks (1 × 32 tiles) within each FP8 quantization block (128 × 128 tiles) does not exceed
a certain threshold, the fine-grained scale information can be fully absorbed by the extended
dynamic range of FP8. We empirically verify that current weights satisfy this condition. This
allows the entire QAT pipeline to fully reuse the existing FP8 training framework without
19
any modification. In the backward pass, gradients are computed with respect to the same FP8
weights in the forward pass and directly propagated back to the FP32 master weights, equivalent
to applying the Straight-Through Estimator (STE) through the quantization operation. This also
avoids the need to re-quantize transposed weights.
During the inference and rollout phases of RL training, which do not involve backward
passes, we directly use real FP4 quantized weights instead of simulated quantization. This
ensures that model behavior during sampling is fully consistent with online deployment, while
also reducing kernel memory loading for actual speedup and significantly lowering memory
consumption. We process the QK path in the indexer of CSA similarly.
. Training Framework
Our training framework is built upon the scalable and efficient infrastructure developed for
DeepSeek-V3 (DeepSeek-AI, 2024). In training DeepSeek-V4, we inherit this robust foundation
while introducing several key innovations to accommodate its novel architectural components —
specifically the Muon optimizer, mHC, and the hybrid attention mechanism — while maintaining
high training efficiency and stability.
. Efficient Implementation of Muon
The Muon optimizer requires the full gradient matrix to compute parameter updates, which
presents a challenge when combined with the Zero Redundancy Optimizer (ZeRO) (Rajbhandari
et al., 2020). Traditional ZeRO is designed for element-wise optimizers like AdamW, where a
single parameter matrix can be partitioned and updated across multiple ranks. To address this
conflict, we design a hybrid strategy of ZeRO bucket assignment for Muon.
For dense parameters, we limit the maximum size of ZeRO parallelism and employ a
knapsack algorithm to assign parameter matrices to these ranks, ensuring each rank manages a
roughly balanced load. The bucket on each rank is padded to match the size of the largest bucket
across ranks, facilitating efficient reduce-scatter operations. This padding typically incurs less
than 10% memory overhead in our setup, where each rank manages no more than five parameter
matrices. When the overall size of data parallelism exceeds the limit for ZeRO, we compute
the Muon update redundantly across the extra data-parallel groups, trading computation for
reduced total bucket memory.
For MoE parameters, we optimize each expert independently. We first flatten all down
projection matrices in SwiGLU (Shazeer, 2020) of all experts across all layers, followed by
flattened up projection matrices and gate matrices. Then, we pad the flattened vector to ensure
we can evenly distribute this vector across all ranks without splitting any logically independent
matrix. Given the large number of experts, we do not impose a limit of ZeRO parallelism for
MoE parameters, and the padding overhead is also negligible.
Additionally, on each rank, consecutive parameters of identical shape will be automatically
merged, enabling batched execution of the Newton-Schulz iterations for better hardware utiliza-
tion. Furthermore, we observe that the Newton-Schulz iterations in Muon remain stable when
computed with BF16 matrix multiplications. Leveraging this, we further quantize, in a stochastic
rounding manner, the MoE gradients to be synchronized across data-parallel ranks to the BF16
precision, halving the communication volume. To avoid accumulation errors introduced by
low-precision adders, we replace conventional tree- or ring-based reduce-scatter collectives with
a two-phase approach. First, an all-to-all operation exchanges local gradients across ranks, and
then each rank performs a local sum in FP32. This design maintains numerical robustness.
20
. Cost-Effective and Memory-Efficient Implementation of mHC
The introduction of mHC increases both activation memory consumption and communication
volume between pipeline stages, compared with conventional residual connections. To mitigate
these costs, we implement several optimization strategies.
Firstly, we carefully design and implement fused kernels of mHC for both training and
inference. Secondly, we introduce a recomputation strategy that selectively checkpoints interme-
diate tensors. Specifically, we recompute most hidden states between layers and all normalized
layer inputs, while avoiding recomputation of compute-intensive operations. This achieves a
balance between memory saving and computational overhead. Thirdly, we adjust the DualPipe
1F1B overlapping scheme to accommodate the increased pipeline communication and enable
concurrent execution of some operations in mHC.
Collectively, these optimizations constrain the wall-time overhead of mHC to only % of
the overlapped 1F1B pipeline stage. More details of the engineering optimization can be found
in the dedicated mHC paper (Xie et al., 2026).
. Contextual Parallelism for Long-Context Attention
Conventional Context Parallelism (CP) partitions the sequence dimension, with each rank
maintaining contiguous 𝑠 tokens. This introduces two challenges to our compressed attention
mechanisms (., CSA and HCA). On the one hand, training samples are packed from multiple
sequences, and each sequence is compressed independently by a factor of 𝑚 (or 𝑚′), with any
trailing tokens fewer than 𝑚 being discarded. Consequently, the compressed KV lengths are
typically less than 𝑠
𝑚
and vary across ranks. On the other hand, the compression requires 𝑚
consecutive KV entries, which may straddle the boundary between two neighboring CP ranks.
To address these challenges, we design a two-stage communication approach. In the first
stage, each rank 𝑖 sends its last 𝑚 uncompressed KV entries to rank 𝑖 + 1. Then, rank 𝑖 + 1
compresses some of these received entries together with its local 𝑠 uncompressed KV entries,
producing a fixed length of 𝑠
𝑚
+ 1 compressed entries, in which exist some padding entries. In
the second stage, an all-gather operation across all CP ranks collects the locally compressed KV
entries. Then, a fused select-and-pad operator reorganizes them into the full set of compressed
KV entries with a total length of cp_size · 𝑠
𝑚
. Any padding entries are placed at the tail. For
HCA and the indexer in CSA, the visible range of compressed KV entries for each query token
can be precomputed by rules. For the sparse attention in CSA, the top-𝑘 selector explicitly
specifies the indices of visible compressed KV entries for each query.
. Extended Automatic Differentiation for Flexible Activation Checkpointing
Conventional activation checkpointing implementations operate at the granularity of an entire
module, deciding whether to retain or recompute its output activations during the backward
pass. This coarse granularity often leads to suboptimal trade-offs between recomputation cost
and activation memory footprint. An alternative approach is to manually implement the forward
and backward logic of an entire layer, explicitly managing tensor checkpointing states. While
enabling fine-grained control, this method loses the convenience of the automatic differentiation
framework, substantially increasing development complexity.
To achieve fine-grained control without sacrificing programming efficiency, we implement a
tensor-level activation checkpointing mechanism with automatic differentiation support. With
this mechanism, developers only need to implement the forward pass and selectively annotate
21
individual tensors for automatic checkpointing and recomputation. Our framework leverages
TorchFX (Reed et al., 2022) to trace the full computation graph. For each annotated tensor, it
performs a backward traversal to identify the minimal subgraph required for its recomputation.
We define these minimal subgraphs as recomputation graphs and insert them into the backward
logic just before the corresponding gradient computation.
Compared with the manual implementation, this design introduces no additional overhead
during training. Recomputation in this framework is implemented by directly freeing the
GPU memory of the annotated tensor and reusing the storage pointer from the recomputed
tensor, without any GPU memory copy. Furthermore, since graph tracing executes the model
concretely, we can track the underlying storage pointer of each tensor, which enables automatic
deduplication of recomputation for tensors that share storage (., the input and output of a
reshape operation). This relieves developers from reasoning about low-level memory details
when annotating recomputation.
. Inference Framework
Our inference framework largely inherits from that of DeepSeek-V3, with some differences in
KV Cache management.
. KV Cache Structure and Management
To efficiently manage the heterogeneous KV caches arising from the hybrid attention mechanism
in DeepSeek-V4, we design a customized KV cache layout. The layout is illustrated in Figure 6,
and we will elaborate on it in detail as follows.
Heterogeneous KV Entries in DeepSeek-V4. The hybrid attention mechanism in DeepSeek-
V4 series introduces multiple types of KV entries with different Key-Value (KV) cache sizes
and update rules. The lightning indexer for sparse selection introduces additional dimensions
into the KV cache that possess embedding sizes distinct from those in the primary attention.
The compression techniques employed in CSA and HCA reduce the sequence length by factors
of 1
𝑚
and 1
𝑚′ , respectively, thereby decreasing the overall KV cache size. As a result, KV cache
sizes vary across different layers. Furthermore, Sliding Window Attention (SWA) layers also
operate with distinct KV cache sizes, as well as separate cache hit and eviction policies. In
the compression branch, one KV entry is generated for every 𝑚 tokens. When the number
of remaining tokens is insufficient for compression, all pending tokens and their associated
hidden states must be retained in a buffer until the compression operation can be executed.
These buffered tokens represent a sequence state determined by positional context and are also
managed within the KV cache framework.
Challenges in Managing Hybrid Attention KV Cache. The hybrid attention mechanism
violates fundamental assumptions behind PagedAttention and its variants. Although recent
hybrid KV cache managing algorithms (., Jenga (Zhang et al., 2025a), Hymba (Dong et al.,
2025)) target general hybrid attention models or specific structures, two principal obstacles
prevent consolidating KV caches across all layers under the PagedAttention framework:
• Diverse cache policies, such as those used in Sliding Window Attention.
• Constraints imposed by high-performance attention kernels, including alignment require-
ments.
22
State Cache
SWA KV
KV Cache
Block 0
Block 1
Block 2
Block N
SWA KV
SWA KV
SWA KV
SWA KV
Uncompressed
KV State
Uncompressed
KV State
Uncompressed
KV State
Uncompressed
KV State
Uncompressed
KV State
Layer-2 CSA State
Layer-3 HCA State
…
Layer-0 SWA KV
…
Layer-n SWA KV
Request 1
Request R
Request 2
Request 3
…
…
CSA KV
HCA KV
CSA KV
HCA KV
CSA KV
HCA KV
CSA KV
HCA KV
CSA KV
HCA KV
CSA Indexer KV
of k1 tokens
CSA Main KV
of k1 tokens
HCA KV of k2 tokens
CSA Indexer KV
of k1 tokens
CSA Main KV
of k1 tokens
HCA KV of k2 tokens
Layer-2
Layer-3
Layer-5
Layer-4
......
…
…
Figure 6 | Illustration of the KV cache Layout for DeepSeek-V4. The KV cache is organized into
two primary components: a classical KV cache for CSA/HCA, and a state cache for SWA and
unready-for-compression tokens in CSA/HCA. In the state cache, each request is assigned a
fixed-size cache block. Within this block, the SWA segment stores the KV entries corresponding
to the most recent 𝑛win tokens, while the CSA/HCA segment stores uncompressed tail states
that are not yet ready for compression. In the classical KV cache, we allocate multiple blocks
per request. Each cache block covers lcm(𝑚,𝑚′) original tokens, producing 𝑘1 =
lcm(𝑚,𝑚′ )
𝑚
CSA
compressed tokens and 𝑘2 =
lcm(𝑚,𝑚′ )
𝑚′ HCA compressed tokens.
For efficient KV cache management of DeepSeek-V4, we design corresponding strategies to
overcome these two challenges.
State Cache for SWA and Uncompressed Tail Tokens. To address the first obstacle, we adopt
an alternative cache management mechanism. Since SWA is designed to enhance performance
under a limited KV cache size, it is reasonable to treat it, along with the uncompressed tail tokens
from the compression branch, as a state-space model. The corresponding KV cache can thus be
regarded as a sequence-specific state that depends solely on the current position. Accordingly,
we pre-allocate a fixed- and limited-size pool of state caches, and dynamically assign it to each
sequence.
Sparse Attention Kernel Co-Design. Regarding the second obstacle, conventional high-
performance attention kernels typically assume a fixed number 𝐵 of tokens per block to optimize
performance, corresponding to 𝐵 ·𝑚 original tokens in CSA and 𝐵 ·𝑚′ in HCA. Through em-
ploying a high-performance sparse-attention kernel, different layers can accommodate variable
tokens per block without performance degradation. Achieving this requires co-designing the
KV cache layout and the sparse attention kernel. For instance, padding blocks to align with
cache lines can improve performance. Thus, for CSA with compression ratio 𝑚 and HCA with
ratio 𝑚′, the number of original tokens per block can be any multiple of lcm(𝑚,𝑚′), the least
common multiple of these two compression ratios.
. On-Disk KV Cache Storage
When serving DeepSeek-V4, we leverage an on-disk KV cache storage mechanism to eliminate
repeated prefilling for shared-prefix requests. For the compressed KV entries in CSA/HCA and
the uncompressed KV entries in Sliding Window Attention (SWA), we design separate solutions
for storage management.
23
For CSA and HCA, we simply store all of the compressed KV entries to the disk. When
a request hits a stored prefix, we read and reuse the compressed KV entries corresponding
to the prefix, until the last complete compression block. Specially, for prefix tokens in the tail
incomplete block, we still need to recompute them to restore the uncompressed KV entries, as
uncompressed KV entries in CSA and HCA are not stored.
For the SWA KV entries, since they are not compressed and exist in every layer, their volume
is approximately 8 times larger than the compressed CSA and HCA KV entries. To handle
these large SWA KV entries efficiently, we propose and implement three distinct strategies for
managing on-disk SWA KV entries, each offering a different trade-off between storage overhead
and computational redundancy:
• Full SWA Caching. This strategy stores the complete SWA KV entries for all tokens,
ensuring computational zero-redundancy. Under this strategy, the SWA KV entries of the
hitting prefix can be reconstructed by just reading the on-disk cache of the last 𝑛win tokens
within that prefix. Despite computational zero-redundancy, this strategy is inefficient for
modern SSD-based storage systems — only a small subset of the stored SWA KV cache
will be accessed for each hitting request, which leads to an unbalanced write-intensive
access pattern.
• Periodic Checkpointing. This strategy checkpoints SWA KV entries of the last 𝑛win tokens
within every 𝑝 tokens, where 𝑝 is a tunable parameter. For a hitting prefix, we load the
most recent checkpointed state, and then recompute the remaining tail tokens. Through
tuning 𝑝, this strategy enables an on-demand trade-off between storage and computation.
• Zero SWA Caching. This strategy does not store any SWA KV entries. For a hitting prefix,
we need to perform more recomputation to restore the SWA KV entries. To be specific, in
each attention layer, the SWA KV entry of each token depends on the SWA KV entries of
only the most recent 𝑛win tokens from the previous layer. Therefore, leveraging cached
CSA and HCA KV entries, recomputing the last 𝑛win · 𝐿 tokens is enough to restore the last
𝑛win SWA KV entries for an 𝐿-layer model.
Depending on specific deployment scenarios, we select the most suitable strategy to achieve the
desired trade-off between storage and computation.
4. Pre-Training
. Data Construction
On top of the pre-training data of DeepSeek-V3, we endeavor to construct a more diverse and
higher-quality training corpus with longer effective contexts. We continually refine our data con-
struction pipelines. For web-sourced data, we implement filtering strategies to remove batched
auto-generated and templated content, thereby mitigating the risk of model collapse (Zhu et al.,
2024). Mathematical and programming corpora still remain core components of our training
data, and we further enhance the coding capabilities of DeepSeek-V4 series by incorporating
agentic data during the mid-training phase. For multilingual data, we build a larger corpus
for DeepSeek-V4, improving its capture of long-tail knowledge across different cultures. For
DeepSeek-V4, we place a particular emphasis on long-document data curation, prioritizing
scientific papers, technical reports, and other materials that reflect unique academic values.
Combining all the above, our pre-training corpus comprises more than 32T tokens, containing
mathematical contents, codes, web pages, long documents, and other high-quality categories.
For pre-training data, we largely follow the same pre-processing strategies of DeepSeek-
24
V3. For tokenization, on top of the DeepSeek-V3 tokenizer, we introduce a few special tokens
for context construction, and still remain the vocabulary size to be 128K. We also inherit the
token-splitting (DeepSeek-AI, 2024) and Fill-in-Middle (FIM) (DeepSeek-AI, 2024) strategies
from DeepSeek-V3. Inspired by Ding et al. (2024), we pack documents from different sources
into appropriate sequences to minimize sample truncation. Different from DeepSeek-V3, we
employ sample-level attention masking during pre-training.
. Pre-Training Setups
. Model Setups
DeepSeek-V4-Flash. We set the number of Transformer layers to 43 and the hidden dimension
𝑑 to 4096. For the first two layers, we use pure sliding window attention. For the subsequent
layers, CSA and HCA are used in an interleaved manner. For CSA, we set the compression rate
𝑚 to 4, the number of indexer query heads 𝑛𝐼
ℎ
to 64, the indexer head dimension 𝑐𝐼 to 128, and
the number of KV entries selected for sparse attention (., attention top-k) to 512. For HCA,
we set the compression rate 𝑚′ to 128. For both CSA and HCA, we set the number of query
heads 𝑛ℎ to 64, the head dimension 𝑐 to 512, and the query compression dimension 𝑑𝑐 to 1024.
The number of output projection groups 𝑔 is set to 8, and the dimension of each intermediate
attention output 𝑑𝑔 is set to 1024. For the additional branch of sliding window attention, the
window size 𝑛win is set to 128. We employ MoE layers in all Transformer blocks, but use the
Hash routing strategy for the first 3 MoE layers. Each MoE layer consists of 1 shared expert and
256 routed experts, where the intermediate hidden dimension of each expert is 2048. Among the
routed experts, 6 experts will be activated for each token. The multi-token prediction depth is
set to 1. As for mHC, the expansion factor 𝑛hc is set to 4, and the number of Sinkhorn-Knopp
iterations 𝑡max is set to 20. Under this configuration, DeepSeek-V4-Flash comprises 284B total
parameters, of which 13B are activated for each token.
DeepSeek-V4-Pro. We set the number of Transformer layers to 61 and the hidden dimension
𝑑 to 7168. For the first two layers, we use HCA. For the subsequent layers, CSA and HCA are
used in an interleaved manner. For CSA, we set the compression rate 𝑚 to 4, the number of
indexer query heads 𝑛𝐼
ℎ
to 64, the indexer head dimension 𝑐𝐼 to 128, and the number of KV entries
selected for sparse attention (., attention top-k) to 1024. For HCA, we set the compression
rate 𝑚′ to 128. For both CSA and HCA, we set the number of query heads 𝑛ℎ to 128, the head
dimension 𝑐 to 512, and the query compression dimension 𝑑𝑐 to 1536. The number of output
projection groups 𝑔 is set to 16, and the dimension of each intermediate attention output 𝑑𝑔 is set
to 1024. For the additional branch of sliding window attention, the window size 𝑛win is set to
128. We employ MoE layers in all Transformer blocks, but use the Hash routing strategy for the
first 3 MoE layers. Each MoE layer consists of 1 shared expert and 384 routed experts, where
the intermediate hidden dimension of each expert is 3072. Among the routed experts, 6 experts
will be activated for each token. The multi-token prediction depth is set to 1. As for mHC, the
expansion factor 𝑛hc is set to 4, and the number of Sinkhorn-Knopp iterations 𝑡max is set to 20.
Under this configuration, DeepSeek-V4-Flash comprises total parameters, of which 49B are
activated for each token.
. Training Setups
DeepSeek-V4-Flash. We employ the Muon optimizer (Jordan et al., 2024; Liu et al., 2025) for
the majority of parameters, but use the AdamW optimizer (Loshchilov and Hutter, 2017) for the
25
embedding module, the prediction head module, and the weights of all RMSNorm modules. For
AdamW, we set its hyper-parameters to 𝛽1 = , 𝛽2 = , 𝜀 = 10−20, and weight_decay = .
For Muon, we set the momentum to and the weight decay to , and rescale the RMS of each
update matrix to for reutilization of the AdamW learning rate. We train DeepSeek-V4-Flash
on 32T tokens, and as in DeepSeek-V3, we also employ a batch size scheduling strategy that
increases the batch size (in tokens) from a small size to and then keeps it at during
most of the training. The learning rate is linearly warmed up in the first 2000 steps, maintained
at × 10−4 for most of the training. Near the end of the training, we finally decay the learning
rate to × 10−5 following a cosine schedule. The training starts with a sequence length of 4K,
and we gradually extend the training sequence length to 16K, 64K, and 1M. As for the setups of
sparse attention, we first warmup the model with dense attention for the first 1T tokens, and
introduce sparse attention at the sequence length of 64K and keep sparse attention during the
rest of the training. When introducing attention sparsity, we first set a short stage to warm up
the lightning indexer in CSA, and then train the model with sparse attention for most of the
training. For auxiliary-loss-free load balancing, we set the bias update speed to . For the
balance loss, we set its loss weight to to avoid extreme imbalance within single sequences.
The MTP loss weight is set to for most of the training, and to upon the start of learning
rate decay.
DeepSeek-V4-Pro. Except for specific values of hyper-parameters, the training setup of
DeepSeek-V4-Pro is largely consistent with that of DeepSeek-V4-Flash. We employ the Muon op-
timizer for the majority of parameters, but use the AdamW optimizer for the embedding module,
the prediction head module, and the weights of all RMSNorm modules. The hyper-parameters
of AdamW and Muon are the same as those of DeepSeek-V4-Flash. We train DeepSeek-V4-Pro
on 33T tokens, and also employ a batch size scheduling strategy, with the maximum batch
size being tokens. The learning rate scheduling strategy is largely the same as that of
DeepSeek-V4-Flash, but the peak learning rate is set to × 10−4 and the end learning rate is set
to × 10−5. The training also starts with a sequence length of 4K, and the length is gradually
extended to 16K, 64K, and 1M. Compared with DeepSeek-V4-Flash, DeepSeek-V4-Pro starts
with a longer stage of dense attention, and the strategy of introducing sparse attention is the
same as DeepSeek-V4-Flash, following a two-stage training method. For auxiliary-loss-free load
balancing, we set the bias update speed to . For the balance loss, we set its loss weight to
to avoid extreme imbalance within single sequences. The MTP loss weight is set to for
most of the training, and to upon the start of learning rate decay.
. Mitigating Training Instability
Training trillion-parameter MoE models presents significant stability challenges, and DeepSeek-
V4 series are no exception. We encountered notable instability challenges during training.
While simple rollbacks could temporarily restore the training state, they proved inadequate as a
long-term solution because they do not prevent the recurrence of loss spikes. Empirically, we
identified that the occurrence of spikes is consistently tied to outliers in the MoE layers, and the
routing mechanism itself appears to exacerbate the emergence of these outliers. Therefore, we
sought to tackle this issue from two dimensions: breaking the vicious cycle induced by routing,
and directly suppressing anomalous values. Fortunately, we discovered two practical techniques
that effectively maintain training stability. Although a comprehensive theoretical understanding
of their underlying mechanisms remains an open question for now, we are sharing them openly
to foster further exploration by the community.
26
Anticipatory Routing. We found that decoupling the synchronous updates of the backbone
network and the routing network significantly improves training stability. Consequently, at step
𝑡, we use the current network parameters 𝜃𝑡 for feature computation, but the routing indices are
computed and applied using the historical network parameters 𝜃𝑡−Δ𝑡. In practice, to circumvent
the overhead of loading model parameters twice, we fetch the data for step 𝑡 in advance at
step 𝑡 − Δ𝑡. We "anticipatorily" compute and cache the routing indices to be used later at step
𝑡, which is why we name this approach Anticipatory Routing. We also heavily optimized this
at the infrastructure level. First, given that pre-computing the routing indices only requires
a single forward pass over the data, we carefully orchestrated the pipeline execution and the
overlapping of computation with Expert Parallelism (EP) communication, successfully bounding
the additional wall-clock time overhead of Anticipatory Routing to approximately 20%. Second,
we introduced an automatic detection mechanism that triggers a short rollback and activates
Anticipatory Routing exclusively when a loss spike occurs; after operating in this mode for a
certain period, the system reverts to standard training. Ultimately, this dynamic application
allows us to avert loss spikes with negligible overall additional training overhead, all without
compromising model performance.
SwiGLU Clamping. In previous literature (Bello et al., 2017; Riviere et al., 2024), clamping
has been explicitly utilized to constrain numerical ranges, thereby enhancing training stability.
In our actual training runs, we empirically found that applying SwiGLU clamping (OpenAI,
2025) effectively eliminates outliers and substantially aids in stabilizing the training process,
without compromising performance. Throughout the training of both DeepSeek-V4-Flash and
DeepSeek-V4-Pro, we clamped the linear component of SwiGLU to the range of [−10, 10], while
capping the upper bound of the gate component at 10.
. Evaluations
. Evaluation Benchmarks
For the evaluation of the base models, we consider benchmarks spanning four key dimensions:
world knowledge, language understanding and reasoning, coding and mathematics, and long-
context processing.
World knowledge benchmarks include AGIEval (Zhong et al., 2023), C-Eval (Huang et al.,
2023), CMMLU (Li et al., 2023) MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al.,
2024), MMLU-Pro (Wang et al., 2024b), MMMLU (OpenAI, 2024a), MultiLoKo (Hupkes and
Bogoychev, 2025), Simple-QA verified (Haas et al., 2025), SuperGPQA (Du et al., 2025), FACTS
Parametric (Cheng et al., 2025), and TriviaQA (Joshi et al., 2017).
Language understanding and reasoning benchmarks include BigBench Hard (BBH) (Suzgun
et al., 2022), DROP (Dua et al., 2019), HellaSwag (Zellers et al., 2019), CLUEWSC (Xu et al., 2020),
and WinoGrande (Sakaguchi et al., 2019).
Coding and mathematical benchmarks include BigCodeBench (Zhuo et al., 2025), Hu-
manEval (Chen et al., 2021), GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), MGSM
(Shi et al., 2023), and CMath (Wei et al., 2023).
Long context benchmarks include LongBench-V2 (Bai et al., 2025b).
27
Table 1 | Comparison among -Base, DeepSeek-V4-Flash-Base, and DeepSeek-V4-
Pro-Base. All models are evaluated in our internal framework and share the same evaluation
setting. Scores with a gap not exceeding are considered to be at the same level. The highest
score in each row is in bold font, and the second is underlined.
Benchmark (Metric) # Shots
DeepSeek-V4-Flash DeepSeek-V4-Pro
Base Base Base
Architecture - MoE MoE MoE
# Activated Params - 37B 13B 49B
# Total Params - 671B 284B
World Knowl.
AGIEval (EM) 0-shot
MMLU (EM) 5-shot
MMLU-Redux (EM) 5-shot
MMLU-Pro (EM) 5-shot
MMMLU (EM) 5-shot
C-Eval (EM) 5-shot
CMMLU (EM) 5-shot
MultiLoKo (EM) 5-shot
Simple-QA verified (EM) 25-shot
SuperGPQA (EM) 5-shot
FACTS Parametric (EM) 25-shot
TriviaQA (EM) 5-shot
Lang. & Reas.
BBH (EM) 3-shot
DROP (F1) 1-shot
HellaSwag (EM) 0-shot
WinoGrande (EM) 0-shot
CLUEWSC (EM) 5-shot
Code & Math
BigCodeBench (Pass@1) 3-shot
HumanEval (Pass@1) 0-shot
GSM8K (EM) 8-shot
MATH (EM) 4-shot
MGSM (EM) 8-shot
CMath (EM) 3-shot
Long Context LongBench-V2 (EM) 1-shot
. Evaluation Results
In Table 1, we provide a detailed comparison of the base models for , DeepSeek-
V4-Flash, and DeepSeek-V4-Pro, all evaluated under a unified internal framework with strictly
consistent settings.
Comparing DeepSeek-V4-Flash-Base with -Base reveals a compelling ef-
ficiency story. Despite utilizing a substantially smaller number of both activated and total
parameters, DeepSeek-V4-Flash-Base outperforms -Base across a wide array of
benchmarks. This advantage is especially evident in world knowledge tasks and challenging
long-context scenarios. These results underscore that architectural improvements, refined data
quality, and training optimizations in DeepSeek-V4-Flash-Base yield superior performance even
with a more compact parameter budget, effectively surpassing the larger -Base
on the majority of evaluations.
Furthermore, DeepSeek-V4-Pro-Base demonstrates a further, decisive leap in capability,
establishing near-universal dominance over both -Base and DeepSeek-V4-Flash-
Base. With improvements across almost all categories, DeepSeek-V4-Pro-Base reaches new
28
performance highs among DeepSeek base models on the most demanding benchmarks. On
knowledge-intensive evaluations, it delivers dramatic gains, while also substantially advancing
long-context understanding. On most reasoning and code benchmarks, DeepSeek-V4-Pro-Base
also exceeds both previous models. This comprehensive uplift confirms DeepSeek-V4-Pro-Base
as the strongest foundation model in the DeepSeek series, outperforming its predecessors across
the spectrum of knowledge, reasoning, coding, and long-context capabilities.
5. Post-Training
. Post-Training Pipeline
Following pre-training, we conducted a post-training phase to yield the final models of DeepSeek-
V4 series. Although the training pipeline largely mirrored that of , a critical
methodological substitution was made: the mixed Reinforcement Learning (RL) stage was
entirely replaced by On-Policy Distillation (OPD).
. Specialist Training
The development of domain specialists was conducted by adapting the training
pipeline. Specifically, each model was sequentially optimized through an initial fine-tuning
phase and subsequent Reinforcement Learning (RL) guided by domain-specific prompts and re-
ward signals. For the RL stage, we implemented the Group Relative Policy Optimization (GRPO)
algorithm, maintaining hyper-parameters closely aligned with our prior research (DeepSeek-AI,
2025; DeepSeek-AI, 2025).
Reasoning Efforts. It is widely recognized that a model’s performance on reasoning tasks
is fundamentally governed by the computational effort expended. Consequently, we trained
distinct specialist models under divergent RL configurations to facilitate the development of
models optimized for varying reasoning capacities. As detailed in Table 2, DeepSeek-V4-Pro and
DeepSeek-V4-Flash both support three specific reasoning effort modes. For each mode, we apply
distinct length penalties and context windows during RL training, which results in varying
output token lengths for reasoning. To integrate these distinct reasoning modes, we utilize
specialized response formats demarcated by the <think> and </think> tokens. Furthermore,
for the "Think Max" mode, we prepend a specific instruction to the beginning of the system
prompt to guide the model’s reasoning process, as shown in Table 3.
Generative Reward Model. Typically, easy-to-verify tasks can be effectively optimized using
simple rule-based verifiers or test cases. In contrast, hard-to-verify tasks traditionally rely on
Reinforcement Learning from Human Feedback (RLHF), which necessitates extensive human
annotation to train a scalar reward model. In the post-training phase of DeepSeek-V4 series,
however, we dispense with these conventional scalar-based reward models. Instead, to address
hard-to-verify tasks, we curate rubric-guided RL data and employ a Generative Reward Model
(GRM) to evaluate policy trajectories. Crucially, we apply RL optimization directly to the GRM
itself. In this paradigm, the actor network natively functions as the GRM, enabling the joint
optimization of the model’s evaluative (judging) proficiency alongside its standard generative
capabilities. By unifying these roles, the model’s internal reasoning capabilities are inherently
fused into its evaluative process, resulting in highly robust scoring. Furthermore, this approach
achieves superior performance with only a minimal set of diverse human annotations, as the
29
Table 2 | Comparison of three reasoning modes
Reasoning
Mode
Characteristics Typical Use Cases Response Format
Non-think Fast, intuitive re-
sponses based on
habits or simple
rules.
Routine daily tasks,
emergency reactions,
low-risk decisions.
</think> summary
Think High Conscious logical
analysis, slower but
more accurate.
Complex problem-
solving, planning,
medium-risk deci-
sions.
<think> thinking
tokens </think>
summary
Think Max Push reasoning to its
fullest extent. Slow
but powerful.
Exploring the bound-
ary of model reason-
ing capability.
1. A special system
prompt at the begin-
ning.
2. <think> thinking
tokens </think>
summary
Table 3 | Instruction injected into the system prompt for the "Think Max" mode.
Injected Instruction
Reasoning Effort: Absolute maximum with no shortcuts permitted.
You MUST be very thorough in your thinking and comprehensively decompose the
problem to resolve the root cause, rigorously stress-testing your logic against all potential
paths, edge cases, and adversarial scenarios.
Explicitly write out your entire deliberation process, documenting every intermediate
step, considered alternative, and rejected hypothesis to ensure absolutely no assumption
is left unchecked.
model leverages its own logic to generalize across complex tasks.
Tool-Call Schema and Special Token. Consistent with our previous version, we utilize a
dedicated <think></think> tag to delineate the reasoning path. In DeepSeek-V4 series, we
introduce a new tool-call schema that employs a special "|DSML|" token and utilizes an XML-
based format for tool invocations, as demonstrated in Table 4. Our experiments demonstrate that
the XML format effectively mitigates escaping failures and reduces tool-call errors, providing a
more robust interface for model-tool interactions.
Interleaved Thinking. introduced a context management strategy that retains
reasoning traces across tool-result rounds but discards them upon the arrival of new user mes-
sages. While effective, this still caused unnecessary token waste in complex agentic workflows
— each new user turn would flush all accumulated reasoning content, forcing the model to
reconstruct its problem-solving state from scratch. Leveraging the expanded 1M-token context
30
Table 4 | Tool-call schema for DeepSeek-V4 series.
Tool Call Schema
## Tools
You have access to a set of tools to help answer the user’s question. You can
invoke tools by writing a "<|DSML|tool_calls>" block like the following:
<|DSML|tool_calls>
<|DSML|invoke name="$TOOL_NAME">
<|DSML|parameter name="$PARAMETER_NAME" string="true|false">$PARAMETER_VALUE
</|DSML|parameter>
...
</|DSML|invoke>
<|DSML|invoke name="$TOOL_NAME2">
...
</|DSML|invoke>
</|DSML|tool_calls>
String parameters should be specified as is and set ‘string="true"‘. For all
other types (numbers, booleans, arrays, objects), pass the value in JSON
format and set ‘string="false"‘.
If thinking_mode is enabled (triggered by <think>), you MUST output your
complete reasoning inside <think>...</think> BEFORE any tool calls or
final response.
Otherwise, output directly after </think> with tool calls or final response.
### Available Tool Schemas
{Tool Definition...}
You MUST strictly follow the above definedtool name and parameter schemas to
invoke tool calls.
window of DeepSeek-V4 series, we further refine this mechanism to maximize the effectiveness
of interleaved thinking in agentic environments:
• Tool-Calling Scenarios. As illustrated in Figure 7(a), all reasoning content is fully pre-
served throughout the entire conversation. Unlike , which discarded
thinking traces upon each new user turn, DeepSeek-V4 series retain the complete reason-
ing history across all rounds, including across user message boundaries. This allows the
model to maintain a coherent, cumulative chain of thought over long-horizon agent tasks.
• General Conversational Scenarios. As illustrated in Figure 7(b), the original strategy is
preserved: reasoning content from previous turns is discarded when a new user message
arrives, keeping the context concise for settings where persistent reasoning traces provide
limited benefit.
As with , agent frameworks that simulate tool interactions via user messages (.,
Terminus) may not trigger the tool-calling context path and thus may not benefit from enhanced
reasoning persistence. We continue to recommend non-think models for such architectures.
31
a) Thinking with tools
b) Thinking without tools
Figure 7 | Thinking management of DeepSeek-V4 series.
Quick Instruction. In chatbot scenarios, a number of auxiliary tasks (., determining whether
to trigger a web search, intent recognition, etc.) must be executed before generating the response.
Conventionally, these tasks are handled by a separate small model, requiring redundant prefill-
ing since it cannot reuse the existing KV cache. To overcome this limitation, we introduce Quick
Instruction. We append a set of dedicated special tokens directly to the input sequence, where
each token corresponds to a specific auxiliary task. By directly reusing the already-computed
KV cache, this mechanism completely avoids redundant prefilling and allows certain tasks, such
as generating search queries and determining authority and domain, to be executed in parallel.
Consequently, this approach significantly reduces the user-perceived time-to-first-token (TTFT)
and eliminates the engineering overhead of maintaining and iterating an extra small model. The
supported Quick Instruction tokens are summarized in Table 5.
. On-Policy Distillation
After training multiple domain-specific experts via specialized fine-tuning and reinforcement
learning, we employ multi-teacher On-Policy Distillation (OPD) as the primary technique for
merging expert capabilities into the final model. OPD has emerged as an effective post-training
paradigm for efficiently transferring the knowledge and capabilities of domain experts to a
s