DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

DeepSeek-AI
research@

Abstract

We present a preview version of the DeepSeek-V4 series, which includes two strong Mixture-of-Experts (MoE) language models: DeepSeek-V4-Pro with […] parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Both support a context length of one million tokens. The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; and (3) the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state of the art for open models, outperforming its predecessors in core tasks. Meanwhile, the DeepSeek-V4 series is highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared with […]. This enables us to routinely support one-million-token contexts, making long-horizon tasks and further test-time scaling more feasible.
The model checkpoints are available at […].

[Figure 1: left panel, Accuracy / Pass@1 (%) on SimpleQA Verified, HLE, Apex Shortlist, Codeforces (Rating), SWE Verified (Resolved), Terminal Bench (Acc), and Toolathlon, grouped into "Knowledge & Reasoning" and "Agentic Capabilities"; right panels, Single-Token FLOPs (T) versus Token Position (K) and Accumulated KV Cache (GB) versus Sequence Length (K) for DeepSeek-V4-Pro and DeepSeek-V4-Flash.]

Figure 1 | Left: benchmark performance of DeepSeek-V4-Pro-Max and its counterparts. Right: inference FLOPs and KV cache size of DeepSeek-V4 series and […].

Contents

1 Introduction
2 Architecture
    Designs Inherited from DeepSeek-V3
    Manifold-Constrained Hyper-Connections
    Hybrid Attention with CSA and HCA
        Compressed Sparse Attention
        Heavily Compressed Attention
        Other Details
        Efficiency Discussion
    Muon Optimizer
3 General Infrastructures
    Fine-Grained Communication-Computation Overlap in Expert Parallelism
    Flexible and Efficient Kernel Development with TileLang
    High-Performance Batch-Invariant and Deterministic Kernel Libraries
    FP4 Quantization-Aware Training
    Training Framework
        Efficient Implementation of Muon
        Cost-Effective and Memory-Efficient Implementation of mHC
        Contextual Parallelism for Long-Context Attention
        Extended Automatic Differentiation for Flexible Activation Checkpointing
    Inference Framework
        KV Cache Structure and Management
        On-Disk KV Cache Storage
4 Pre-Training
    Data Construction
    Pre-Training Setups
        Model Setups
        Training Setups
        Mitigating Training Instability
    Evaluations
        Evaluation Benchmarks
        Evaluation Results
5 Post-Training
    Post-Training Pipeline
        Specialist Training
        On-Policy Distillation
    RL and OPD Infrastructures
        FP4 Quantization Integration
        Efficient Teacher Scheduling for Full-Vocabulary OPD
        Preemptible and Fault-Tolerant Rollout Service
        Scaling RL Framework for Million-Token Context
        Sandbox Infrastructure for Agentic AI
    Standard Benchmark Evaluation
        Evaluation Setup
        Evaluation Results
    Performance on Real-World Tasks
        Chinese Writing
        Search
        White-Collar Task
        Code Agent
6 Conclusion, Limitations, and Future Directions
A Author List and Acknowledgment
    Author List
    Acknowledgment
B Evaluation Details

1. Introduction

The emergence of reasoning models (DeepSeek-AI, 2025; OpenAI, 2024c) has established a new paradigm of test-time scaling, driving substantial performance gains for Large Language Models (LLMs). However, this scaling paradigm is fundamentally constrained by the quadratic computational complexity of the vanilla attention mechanism (Vaswani et al., 2017), which creates a prohibitive bottleneck for ultra-long contexts and reasoning processes. Concurrently, the emergence of long-horizon scenarios and tasks, from complex agentic workflows to massive cross-document analysis, has also made efficient support for ultra-long contexts critical for future progress.
While recent open-source efforts (Bai et al., 2025a; DeepSeek-AI, 2024; MiniMax, 2025; Qwen, 2025) have advanced general capabilities, this core architectural inefficiency in handling ultra-long sequences remains a key impediment, limiting further gains from test-time scaling and hindering further exploration of long-horizon scenarios and tasks.

To break the efficiency barrier in ultra-long contexts, we develop the DeepSeek-V4 series, including the preview versions of DeepSeek-V4-Pro with […] parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Through architectural innovations, the DeepSeek-V4 series achieves a dramatic leap in computational efficiency for processing ultra-long sequences. This breakthrough enables efficient support for a context length of one million tokens, ushering in a new era of million-length contexts for next-generation LLMs. We believe our capability to efficiently handle ultra-long sequences unlocks the next frontier of test-time scaling, paves the way for deeper research into long-horizon tasks, and establishes a necessary foundation for exploring future paradigms like online learning.

Compared with the DeepSeek-V3 architecture (DeepSeek-AI, 2024), the DeepSeek-V4 series retains the DeepSeekMoE framework (Dai et al., 2024) and the Multi-Token Prediction (MTP) strategy, while introducing several key innovations in architecture and optimization. To enhance long-context efficiency, we design a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA compresses the KV caches along the sequence dimension and then performs DeepSeek Sparse Attention (DSA) (DeepSeek-AI, 2025), whereas HCA applies more aggressive compression to the KV caches but keeps dense attention. To strengthen modeling capability, we incorporate Manifold-Constrained Hyper-Connections (mHC) (Xie et al., 2026) that upgrade conventional residual connections.
Additionally, we introduce the Muon optimizer (Jordan et al., 2024; Liu et al., 2025) to the training of the DeepSeek-V4 series, leading to faster convergence and improved training stability.

To enable efficient training and inference for the DeepSeek-V4 series, as well as productive development, we introduce several infrastructure optimizations. First, we design and implement a single fused kernel for MoE modules that fully overlaps computation, communication, and memory access. Second, we employ TileLang (Wang et al., 2026), a Domain-Specific Language (DSL), to balance development productivity and runtime efficiency. Third, we provide efficient batch-invariant and deterministic kernel libraries to ensure bitwise reproducibility across training and inference. Fourth, we incorporate FP4 quantization-aware training for MoE expert weights and the indexer QK path to reduce memory and computation. Fifth, for the training framework, we extend the autograd framework with tensor-level checkpointing for fine-grained recomputation control, and we enhance training efficiency with a hybrid ZeRO strategy for the Muon optimizer, cost-effective mHC implementations via recomputation and fused kernels, and two-stage contextual parallelism to manage compressed attention. Finally, for the inference framework, we design a heterogeneous KV cache structure with on-disk storage strategies to enable efficient shared-prefix reuse.

By employing hybrid CSA and HCA, along with precision optimizations on computation and storage, the DeepSeek-V4 series achieves significantly lower inference FLOPs and a substantially reduced KV cache size compared with […], especially in long-context settings. The right part of Figure 1 shows the estimated single-token inference FLOPs and accumulated KV cache size of […] and the DeepSeek-V4 series.
In the 1M-token context scenario, even DeepSeek-V4-Pro, which has a larger number of activated parameters, requires only 27% of the single-token FLOPs (measured in equivalent FP8 FLOPs) and 10% of the KV cache size relative to […]. DeepSeek-V4-Flash, with its smaller number of activated parameters, pushes efficiency even further: in the 1M-token context setting, it requires only 10% of the single-token FLOPs and 7% of the KV cache size compared with […]. Additionally, for the DeepSeek-V4 series, the routed expert parameters use FP4 precision. While the peak FLOPs for FP4 × FP8 operations are currently the same as for FP8 × FP8 on existing hardware, they can theoretically be implemented to be 1/3 more efficient on future hardware, which will further enhance the efficiency of the DeepSeek-V4 series.

During pre-training, we train DeepSeek-V4-Flash on 32T tokens and DeepSeek-V4-Pro on 33T tokens. After pre-training, both models natively and efficiently support 1M-length contexts. In our internal evaluations, DeepSeek-V4-Flash-Base already surpasses […]-Base across a majority of benchmarks with its more parameter-efficient design. DeepSeek-V4-Pro-Base further extends this advantage to set a new performance standard among DeepSeek foundation models, achieving comprehensive superiority across reasoning, coding, long-context, and world knowledge tasks.

The post-training pipeline of the DeepSeek-V4 series features a two-stage paradigm: the independent cultivation of domain-specific experts, followed by unified model consolidation via on-policy distillation (Lu and Lab, 2025). Initially, for each target domain (such as mathematics, coding, agent, and instruction following), a separate expert model is trained independently. The base model first undergoes Supervised Fine-Tuning (SFT) on high-quality, domain-specific data to establish foundational capabilities.
Subsequently, Reinforcement Learning (RL) is applied using Group Relative Policy Optimization (GRPO) (DeepSeek-AI, 2025), which further optimizes the model for domain-aligned behaviors guided by reward models tailored to specific success criteria. This phase yields a diverse set of specialized experts, each excelling in its respective field. Finally, to integrate these distinct proficiencies, a single unified model is trained through on-policy distillation, wherein the unified model acts as the student and learns to optimize the reverse KL loss with the teacher models.

Summary of Core Evaluation Results

• Knowledge: In assessments of broad world knowledge, DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, significantly outperforms leading open-source models on the SimpleQA (OpenAI, 2024d) and Chinese-SimpleQA (He et al., 2024) benchmarks. Regarding educational knowledge, evaluated via MMLU-Pro (Wang et al., 2024b), HLE (Phan et al., 2025), and GPQA (Rein et al., 2023), DeepSeek-V4-Pro-Max shows a marginal lead over its open-source counterparts. DeepSeek-V4-Pro-Max has significantly closed the gap with the leading proprietary model, […]-Pro, despite still trailing it in these knowledge-based evaluations.
• Reasoning: Through the expansion of reasoning tokens, DeepSeek-V4-Pro-Max demonstrates superior performance relative to […] and […]-Pro on standard reasoning benchmarks. Nevertheless, its performance falls marginally short of […] and Gemini-[…]-Pro, suggesting a developmental trajectory that trails state-of-the-art frontier models by approximately 3 to 6 months. Furthermore, DeepSeek-V4-Flash-Max achieves comparable performance to […] and […]-Pro, establishing itself as a highly cost-effective architecture for complex reasoning tasks.
• Agent: On public benchmarks, DeepSeek-V4-Pro-Max is on par with leading open-source models, such as […] and […], but slightly worse than frontier closed models. In our internal evaluation, DeepSeek-V4-Pro-Max outperforms Claude Sonnet […] and approaches the level of Opus […].
• Long-Context: DeepSeek-V4-Pro-Max delivers strong results on synthetic and real use cases with a 1-million-token context window, surpassing even […]-Pro on academic benchmarks.
• DeepSeek-V4-Pro vs. DeepSeek-V4-Flash: DeepSeek-V4-Flash-Max exhibits lower performance in knowledge evaluations due to its smaller parameter scale. However, it achieves comparable results on reasoning tasks when allocated a larger thinking budget. In agent evaluations, while DeepSeek-V4-Flash-Max matches the performance of DeepSeek-V4-Pro-Max on several benchmarks, it still trails its larger counterpart on more complex, high-difficulty tasks.

[Figure 2: architecture diagram with the labels Input Tokens, Embedding, Transformer Block ×L (CSA / HCA and DeepSeekMoE, each with Pre-Block Mixing, Post-Block Mixing, and Residual Mixing), Prediction Head, MTP Modules, LM Loss, and MTP Loss.]

Figure 2 | Overall architecture of DeepSeek-V4 series. We use hybrid CSA (Compressed Sparse Attention) and HCA (Heavily Compressed Attention) for attention layers, DeepSeekMoE for feed-forward layers, and strengthen conventional residual connections with mHC.

2. Architecture

Overall, the DeepSeek-V4 series retains the Transformer (Vaswani et al., 2017) architecture and Multi-Token Prediction (MTP) modules (DeepSeek-AI, 2024; Gloeckle et al., 2024), while introducing several key upgrades over DeepSeek-V3: (1) we introduce Manifold-Constrained Hyper-Connections (mHC) (Xie et al., 2026) to strengthen conventional residual connections; (2) we design a hybrid attention architecture, which greatly improves long-context efficiency through Compressed Sparse Attention and Heavily Compressed Attention; and (3) we employ Muon (Jordan et al., 2024; Liu et al., 2025) as the optimizer.
For the Mixture-of-Experts (MoE) components, we still adopt the DeepSeekMoE (Dai et al., 2024) architecture, with only minor adjustments from DeepSeek-V3. The Multi-Token Prediction (MTP) (DeepSeek-AI, 2024; Gloeckle et al., 2024; Li et al., 2024; Qi et al., 2020) configuration remains identical to that of DeepSeek-V3. All other unspecified details follow the settings established in DeepSeek-V3 (DeepSeek-AI, 2024). Figure 2 illustrates the overall architecture of DeepSeek-V4, and the details are described below.

Designs Inherited from DeepSeek-V3

Mixture-of-Experts. Like previous DeepSeek-series models (DeepSeek-AI, 2024; DeepSeek-AI, 2024), the DeepSeek-V4 series adopts the DeepSeekMoE paradigm (Dai et al., 2024) for Feed-Forward Networks (FFNs), which uses fine-grained routed experts together with shared experts. Unlike DeepSeek-V3, we change the activation function that computes the affinity scores from Sigmoid(·) to Sqrt(Softplus(·)). For load balancing, we also employ the auxiliary-loss-free strategy (DeepSeek-AI, 2024; Wang et al., 2024a), augmented by a slight sequence-wise balance loss that prevents extreme imbalance within individual sequences. For DeepSeek-V4, we remove the constraint on the number of routing target nodes and carefully redesign the parallelism strategy to maintain training efficiency. Furthermore, compared with DeepSeek-V3, we replace the dense FFN layers in the first several Transformer blocks with MoE layers that employ Hash routing (Roller et al., 2021). The Hash routing strategy determines the target experts of each token according to a predefined hash function of the input token ID.

Multi-Token Prediction. Like DeepSeek-V3, the DeepSeek-V4 series also uses MTP modules and objectives. Given that the MTP strategy has been validated in DeepSeek-V3, we adopt the same strategy for the DeepSeek-V4 series without modification.
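The two routing changes described above can be illustrated with a minimal sketch. It assumes a scalar router logit per expert and uses a hypothetical multiplicative hash (the constant `2654435761` and the names `sqrt_softplus` and `hash_route` are illustrative choices, since the text only states that the hash function is predefined and depends on the input token ID).

```python
import math
from typing import List


def sqrt_softplus(logit: float) -> float:
    """Affinity score Sqrt(Softplus(x)): non-negative and unbounded above."""
    # Stable softplus: log(1 + e^x) = max(x, 0) + log1p(e^(-|x|)).
    softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return math.sqrt(softplus)


def hash_route(token_id: int, num_experts: int, top_k: int) -> List[int]:
    """Choose top_k distinct experts from the token ID alone (no learned router).

    The multiplicative hash is a hypothetical stand-in; the source only says
    that the hash function is predefined.
    """
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be in 1..num_experts")
    start = (token_id * 2654435761) % num_experts
    # Consecutive slots modulo num_experts are distinct because top_k <= num_experts.
    return [(start + offset) % num_experts for offset in range(top_k)]


if __name__ == "__main__":
    for x in (-4.0, 0.0, 4.0, 16.0):
        print(f"logit={x:6.1f}  sqrt(softplus)={sqrt_softplus(x):.4f}")
    print(hash_route(token_id=12345, num_experts=16, top_k=4))
```

Unlike a sigmoid, which saturates at 1, the square-root softplus score keeps growing (roughly like the square root of the logit) for large positive inputs. The hash router is deterministic: the same token ID always maps to the same experts, regardless of context.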
Manifold-Constrained Hyper-Connections

As shown in Figure 2, the DeepSeek-V4 series incorporates Manifold-Constrained Hyper-Connections (mHC) (Xie et al., 2026) to strengthen the conventional residual connections between adjacent Transformer blocks. Compared with naive Hyper-Connections (HC) (Zhu et al., 2025), the core idea of mHC is to constrain the residual mapping onto a specific manifold, and thus enhance the stability of signal propagation across layers while preserving model expressivity. This subsection briefly introduces the standard HC and describes how we design mHC for stable training.

Standard Hyper-Connections. The standard HC expands the width of the residual stream by a factor of $n_{\mathrm{hc}}$. Specifically, the shape of the residual stream is expanded from $\mathbb{R}^{d}$ to $\mathbb{R}^{n_{\mathrm{hc}} \times d}$, where $d$ is the hidden size of the actual layer input. Let $X_l = [\mathbf{x}_{l,1}; \ldots; \mathbf{x}_{l,n_{\mathrm{hc}}}]^{T} \in \mathbb{R}^{n_{\mathrm{hc}} \times d}$ be the residual state before the $l$-th layer. HC introduces three linear mappings: an input mapping $A_l \in \mathbb{R}^{1 \times n_{\mathrm{hc}}}$, a residual transformation $B_l \in \mathbb{R}^{n_{\mathrm{hc}} \times n_{\mathrm{hc}}}$, and an output mapping $C_l \in \mathbb{R}^{n_{\mathrm{hc}} \times 1}$. The update of the residual state is then formulated as:

\[ X_{l+1} = B_l X_l + C_l \mathcal{F}_l(A_l X_l), \tag{1} \]

where $\mathcal{F}_l$ denotes the $l$-th layer (e.g., an MoE layer), whose input and output shapes are both $\mathbb{R}^{d}$. Note that the actual layer input $A_l X_l \in \mathbb{R}^{d}$ is also $d$-dimensional, so the expanded residual width does not influence the design of the inner layers. HC decouples the residual width from the actual hidden size, offering a complementary scaling axis with minimal computational overhead, as $n_{\mathrm{hc}}$ is typically much smaller than the hidden size $d$. However, even though HC has demonstrated potential in improving model performance, we find that training frequently exhibits numerical instability when multiple layers are stacked, which hinders the scaling of HC.

Manifold-Constrained Residual Mapping. The core innovation of mHC is to constrain the residual mapping matrix $B_l$ to the manifold of doubly stochastic matrices (the Birkhoff polytope) $\mathcal{M}$, and thus enhance the stability of signal propagation across layers:

\[ B_l \in \mathcal{M} := \{ M \in \mathbb{R}^{n \times n} \mid M \mathbf{1}_n = \mathbf{1}_n,\ \mathbf{1}_n^{T} M = \mathbf{1}_n^{T},\ M \geqslant 0 \}. \tag{2} \]

This constraint ensures that the spectral norm of the mapping matrix $\lVert B_l \rVert_2$ is bounded by 1, so the residual transformation is non-expansive, which increases numerical stability during both the forward pass and backpropagation. Moreover, the set $\mathcal{M}$ is closed under multiplication, which guarantees stability for deep stacks of mHC. In addition, the input transformation $A_l$ and output transformation $C_l$ are constrained to be non-negative and bounded via a Sigmoid function to avoid the risk of signal cancellation.

Dynamic Parameterization. The parameters of the three linear mappings are dynamically generated and are decomposed into a dynamic (input-dependent) component and a static (input-independent) component. Given the input $X_l \in \mathbb{R}^{n_{\mathrm{hc}} \times d}$, it is first flattened and normalized: $\hat{X}_l = \mathrm{RMSNorm}(\mathrm{vec}(X_l)) \in \mathbb{R}^{1 \times n_{\mathrm{hc}} d}$. Then, we follow the conventional HC to generate the unconstrained raw parameters $\tilde{A}_l \in \mathbb{R}^{1 \times n_{\mathrm{hc}}}$, $\tilde{B}_l \in \mathbb{R}^{n_{\mathrm{hc}} \times n_{\mathrm{hc}}}$, and $\tilde{C}_l \in \mathbb{R}^{n_{\mathrm{hc}} \times 1}$:

\[ \tilde{A}_l = \alpha_l^{\mathrm{pre}} \cdot (\hat{X}_l W_l^{\mathrm{pre}}) + S_l^{\mathrm{pre}}, \tag{3} \]
\[ \tilde{B}_l = \alpha_l^{\mathrm{res}} \cdot \mathrm{Mat}(\hat{X}_l W_l^{\mathrm{res}}) + S_l^{\mathrm{res}}, \tag{4} \]
\[ \tilde{C}_l = \alpha_l^{\mathrm{post}} \cdot (\hat{X}_l W_l^{\mathrm{post}})^{T} + S_l^{\mathrm{post}}, \tag{5} \]

where $W_l^{\mathrm{pre}}, W_l^{\mathrm{post}} \in \mathbb{R}^{n_{\mathrm{hc}} d \times n_{\mathrm{hc}}}$ and $W_l^{\mathrm{res}} \in \mathbb{R}^{n_{\mathrm{hc}} d \times n_{\mathrm{hc}}^{2}}$ are learnable parameters for generating the dynamic components; $\mathrm{Mat}(\cdot)$ reshapes a vector of size $1 \times n_{\mathrm{hc}}^{2}$ into a matrix of size $n_{\mathrm{hc}} \times n_{\mathrm{hc}}$; $S_l^{\mathrm{pre}} \in \mathbb{R}^{1 \times n_{\mathrm{hc}}}$, $S_l^{\mathrm{post}} \in \mathbb{R}^{n_{\mathrm{hc}} \times 1}$, and $S_l^{\mathrm{res}} \in \mathbb{R}^{n_{\mathrm{hc}} \times n_{\mathrm{hc}}}$ are learnable static biases; and $\alpha_l^{\mathrm{pre}}, \alpha_l^{\mathrm{res}}, \alpha_l^{\mathrm{post}} \in \mathbb{R}$ are learnable gating factors initialized to small values.

Applying Parameter Constraints. After obtaining the unconstrained raw parameters $\tilde{A}_l$, $\tilde{B}_l$, and $\tilde{C}_l$, we apply the constraints described earlier to enhance numerical stability.
To be specific, for the input and output mappings, we employ a Sigmoid function $\sigma(\cdot)$ to ensure their non-negativity and boundedness:

\[ A_l = \sigma(\tilde{A}_l), \tag{6} \]
\[ C_l = 2\sigma(\tilde{C}_l). \tag{7} \]

As for the residual mapping $\tilde{B}_l$, we project it onto the manifold of doubly stochastic matrices $\mathcal{M}$. This is achieved by the Sinkhorn-Knopp algorithm, which first applies an exponential function to $\tilde{B}_l$ to ensure positivity, giving $M^{(0)} = \exp(\tilde{B}_l)$, and then iteratively performs column and row normalization:

\[ M^{(t)} = \mathcal{T}_r(\mathcal{T}_c(M^{(t-1)})), \tag{8} \]

where $\mathcal{T}_r$ and $\mathcal{T}_c$ denote row and column normalization, respectively. This iteration converges to a constrained doubly stochastic matrix $B_l = M^{(t_{\max})}$. We choose $t_{\max} = 20$ as a practical value.

[Figure 3: diagram with the labels Hidden States of KV Tokens, Token-Level Compressor, Compressed KV Entries, Compressed Indexer Keys, Hidden State of Query Token, Indexer Queries, Lightning Indexer, Index Scores, Top-k Selector, Selected Compressed KV Entries, Sliding Window KV Entries, Concatenation, Queries, and Shared Key-Value Multi-Query Attention.]

Figure 3 | Core architectures of CSA. It compresses the number of KV entries to $\frac{1}{m}$ times, and then applies DeepSeek Sparse Attention for further acceleration. Additionally, a small set of sliding window KV entries is combined with the selected compressed KV entries to enhance local fine-grained dependencies.

Hybrid Attention with CSA and HCA

As the context length reaches extreme scales, the attention mechanism emerges as the dominant computational bottleneck in a model. For DeepSeek-V4, we design two efficient attention architectures, Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), and employ their interleaved hybrid configuration, which substantially reduces the computational cost of attention in long-text scenarios.
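As an aside on the mHC constraint step in Equations (6) to (8) and the residual update in Equation (1), the following is a minimal pure-Python sketch for a single token. It takes the raw parameters as given inputs (the dynamic parameterization of Equations (3) to (5) is omitted), and all function names (`sinkhorn_knopp`, `constrain`, `hc_update`) are illustrative rather than taken from the authors' implementation. It also assumes raw values small enough that `math.exp` does not overflow.

```python
import math
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sinkhorn_knopp(raw: Matrix, t_max: int = 20) -> Matrix:
    """Eq. (8): exponentiate, then alternate column and row normalization."""
    n = len(raw)
    m = [[math.exp(v) for v in row] for row in raw]  # M^(0), all entries > 0
    for _ in range(t_max):
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]  # T_c
        m = [[v / sum(row) for v in row] for row in m]                      # T_r
    return m


def constrain(a_raw: Vector, b_raw: Matrix, c_raw: Vector):
    """Eqs. (6)-(8): map raw parameters to constrained A_l, B_l, C_l."""
    a = [sigmoid(v) for v in a_raw]        # entries in (0, 1)
    c = [2.0 * sigmoid(v) for v in c_raw]  # entries in (0, 2)
    b = sinkhorn_knopp(b_raw)              # approximately doubly stochastic
    return a, b, c


def hc_update(x: Matrix, a: Vector, b: Matrix, c: Vector,
              layer: Callable[[Vector], Vector]) -> Matrix:
    """Eq. (1): X_{l+1} = B_l X_l + C_l F_l(A_l X_l) for one token.

    x has shape n_hc x d; the inner layer only ever sees a d-dimensional input.
    """
    n, d = len(x), len(x[0])
    layer_in = [sum(a[i] * x[i][k] for i in range(n)) for k in range(d)]
    layer_out = layer(layer_in)
    return [[sum(b[i][j] * x[j][k] for j in range(n)) + c[i] * layer_out[k]
             for k in range(d)] for i in range(n)]
```

Because each iteration ends with a row normalization, the rows sum to 1 up to rounding after any number of iterations, while the column sums approach 1 as the iteration proceeds.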
CSA integrates both compression and sparse attention strategies: it first compresses the Key-Value (KV) cache of every $m$ tokens into one entry, and then applies DeepSeek Sparse Attention (DSA) (DeepSeek-AI, 2025), where each query token attends to only $k$ compressed KV entries. HCA aims for extreme compression by consolidating the KV cache of every $m'$ ($\gg m$) tokens into a single entry. The hybrid architecture of CSA and HCA markedly improves the long-context efficiency of the DeepSeek-V4 series, making one-million-token contexts feasible in practice. This subsection describes the core techniques of our hybrid attention architecture; we also provide an open-source implementation that specifies the details unambiguously.

Compressed Sparse Attention

The core architecture of CSA is illustrated in Figure 3. CSA first compresses the KV cache of every $m$ tokens into one entry, and then applies DeepSeek Sparse Attention for further acceleration.

Compressed Key-Value Entries. Let $H \in \mathbb{R}^{n \times d}$ be a sequence of input hidden states, where $n$ is the sequence length and $d$ is the hidden size. CSA first computes two series of KV entries $C^{a}, C^{b} \in \mathbb{R}^{n \times c}$ and their corresponding compression weights $Z^{a}, Z^{b} \in \mathbb{R}^{n \times c}$, where $c$ is the head dimension:

\[ C^{a} = H \cdot W^{aKV}, \quad C^{b} = H \cdot W^{bKV}, \tag{9} \]
\[ Z^{a} = H \cdot W^{aZ}, \quad Z^{b} = H \cdot W^{bZ}, \tag{10} \]

where $W^{aKV}, W^{bKV}, W^{aZ}, W^{bZ} \in \mathbb{R}^{d \times c}$ are trainable parameters. Next, every $m$ KV entries in $C^{a}$ and $C^{b}$ are compressed into one entry according to their compression weights and learnable positional biases $B^{a}, B^{b} \in \mathbb{R}^{m \times c}$, producing $C^{\mathrm{Comp}} \in \mathbb{R}^{\frac{n}{m} \times c}$. Each compressed entry $C^{\mathrm{Comp}}_{i} \in \mathbb{R}^{c}$ is computed by

\[ [S^{a}_{mi:m(i+1)-1};\ S^{b}_{m(i-1):mi-1}] = \mathrm{Softmax}_{\mathrm{row}}([Z^{a}_{mi:m(i+1)-1} + B^{a};\ Z^{b}_{m(i-1):mi-1} + B^{b}]), \tag{11} \]
\[ C^{\mathrm{Comp}}_{i} = \sum_{j=mi}^{m(i+1)-1} S^{a}_{j} \odot C^{a}_{j} + \sum_{j=m(i-1)}^{mi-1} S^{b}_{j} \odot C^{b}_{j}, \tag{12} \]

where $\odot$ denotes the Hadamard product, and $\mathrm{Softmax}_{\mathrm{row}}(\cdot)$ denotes the softmax operation along the row dimension, which normalizes across the total of $2m$ elements from both $Z^{a}$ and $Z^{b}$.
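Equations (11) and (12) can be sketched in pure Python as follows. The sketch takes $C^{a}$, $C^{b}$, $Z^{a}$, and $Z^{b}$ as precomputed inputs (the projections of Equations (9) and (10) are omitted), assumes the sequence length is a multiple of $m$ (any tail is dropped), and applies the padding rule the text states for $i = 0$. The function name `csa_compress` is illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def csa_compress(ca: Matrix, cb: Matrix, za: Matrix, zb: Matrix,
                 ba: Matrix, bb: Matrix, m: int) -> Matrix:
    """Overlapped compression of Eqs. (11)-(12).

    ca, cb, za, zb have shape n x c; ba, bb have shape m x c.
    Block i mixes tokens m*i .. m*(i+1)-1 from the `a` branch with tokens
    m*(i-1) .. m*i-1 from the `b` branch, one softmax per channel over 2m logits.
    """
    n, c = len(ca), len(ca[0])
    out: Matrix = []
    for i in range(n // m):
        entry = []
        for k in range(c):
            logits, values = [], []
            for r in range(m):  # current block, `a` branch
                j = m * i + r
                logits.append(za[j][k] + ba[r][k])
                values.append(ca[j][k])
            for r in range(m):  # previous block, `b` branch
                j = m * (i - 1) + r
                if j < 0:  # i == 0: pad logits with -inf and values with 0
                    logits.append(float("-inf"))
                    values.append(0.0)
                else:
                    logits.append(zb[j][k] + bb[r][k])
                    values.append(cb[j][k])
            peak = max(logits)  # finite, because the `a` branch is never padded
            weights = [math.exp(v - peak) for v in logits]
            total = sum(weights)
            entry.append(sum(w * v for w, v in zip(weights, values)) / total)
        out.append(entry)
    return out


if __name__ == "__main__":
    zeros = [[0.0] for _ in range(4)]
    bias = [[0.0], [0.0]]
    ca = [[1.0], [3.0], [5.0], [7.0]]
    cb = [[10.0], [20.0], [30.0], [40.0]]
    print(csa_compress(ca, cb, zeros, zeros, bias, bias, m=2))
```

With all logits and biases set to zero, every softmax is uniform, so the first block averages only its own two `a` entries while the second block averages two `a` entries and two `b` entries. The output has $n/m$ rows even though each row draws on $2m$ inputs.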
When $i = 0$, $Z^{b}_{m(i-1):mi-1}$ is padded with negative infinity and $C^{b}_{m(i-1):mi-1}$ is padded with zeros. Note that each $C^{\mathrm{Comp}}_{i}$ is derived from $2m$ KV entries, but the indexes of $C^{b}$ used for $C^{\mathrm{Comp}}_{i}$ and the indexes of $C^{a}$ used for $C^{\mathrm{Comp}}_{i-1}$ overlap. Therefore, CSA in effect compresses the sequence length to $\frac{1}{m}$ of the original.

Lightning Indexer for Sparse Selection. After obtaining the compressed KV entries $C^{\mathrm{Comp}}$, CSA applies the DSA strategy to select the top-k compressed KV entries for core attention. First, CSA performs the same compression operation used for $C^{\mathrm{Comp}}$ to get compressed indexer keys $K^{\mathrm{IComp}} \in \mathbb{R}^{\frac{n}{m} \times c^{I}}$, where $c^{I}$ is the indexer head dimension. Then, for a query token $t$, we produce the indexer queries $\{\mathbf{q}^{I}_{t,1}; \mathbf{q}^{I}_{t,2}; \ldots; \mathbf{q}^{I}_{t,n^{I}_{h}}\}$ in a low-rank manner:

\[ \mathbf{c}^{Q}_{t} = \mathbf{h}_{t} \cdot W^{DQ}, \tag{13} \]
\[ [\mathbf{q}^{I}_{t,1}; \mathbf{q}^{I}_{t,2}; \ldots; \mathbf{q}^{I}_{t,n^{I}_{h}}] = \mathbf{q}^{I}_{t} = \mathbf{c}^{Q}_{t} \cdot W^{IUQ}, \tag{14} \]

where $\mathbf{h}_{t} \in \mathbb{R}^{d}$ is the input hidden state of the query token $t$; $\mathbf{c}^{Q}_{t} \in \mathbb{R}^{d_{c}}$ is the compressed latent vector for queries; $d_{c}$ denotes the query compression dimension; $n^{I}_{h}$ denotes the number of indexer query heads; and $W^{DQ} \in \mathbb{R}^{d \times d_{c}}$ and $W^{IUQ} \in \mathbb{R}^{d_{c} \times c^{I} n^{I}_{h}}$ are the down-projection and up-projection matrices for indexer queries, respectively. Next, the index score $I_{t,s} \in \mathbb{R}$ between the query token $t$ and a preceding compressed block $s$ ($s < \mathrm{Floor}(\frac{t}{m})$) is computed by

\[ [w^{I}_{t,1}; w^{I}_{t,2}; \ldots; w^{I}_{t,n^{I}_{h}}] = \mathbf{w}^{I}_{t} = \mathbf{h}_{t} \cdot W^{w}, \tag{15} \]
\[ I_{t,s} = \sum_{h=1}^{n^{I}_{h}} w^{I}_{t,h} \cdot \mathrm{ReLU}\left( \mathbf{q}^{I}_{t,h} \cdot K^{\mathrm{IComp}}_{s} \right), \tag{16} \]

where $W^{w} \in \mathbb{R}^{d \times n^{I}_{h}}$ is a learnable matrix and $w^{I}_{t,h} \in \mathbb{R}$ is the weight of the $h$-th indexer head. For a query token $t$, given its index scores $I_{t,:}$, we employ a top-k selector to retain a subset of compressed KV entries $C^{\mathrm{SprsComp}}_{t}$ for subsequent core attention:

\[ C^{\mathrm{SprsComp}}_{t} = \left\{ C^{\mathrm{Comp}}_{s} \;\middle|\; I_{t,s} \in \text{Top-k}(I_{t,:}) \right\}. \tag{17} \]

[Figure 4: diagram with the labels Hidden States of KV Tokens, Token-Level Compressor, Heavily Compressed KV Entries, Hidden State of Query Token, Queries, Sliding Window KV Entries, Concatenation, and Shared Key-Value Multi-Query Attention.]

Figure 4 | Core architectures of HCA. It performs heavier compression, where the KV entries of $m'$ ($\gg m$) tokens are consolidated into one. We additionally introduce a small set of sliding window KV entries to enhance local fine-grained dependencies.

Shared Key-Value MQA. After selecting the sparse KV entries, CSA performs core attention in a Multi-Query Attention (MQA) (Shazeer, 2019) manner, where each compressed KV entry in $C^{\mathrm{SprsComp}}_{t}$ serves as both attention key and value. To be specific, for a query token $t$, we first produce attention queries $\{\mathbf{q}_{t,1}; \mathbf{q}_{t,2}; \ldots; \mathbf{q}_{t,n_{h}}\}$ from the compressed latent vector $\mathbf{c}^{Q}_{t}$:

\[ [\mathbf{q}_{t,1}; \mathbf{q}_{t,2}; \ldots; \mathbf{q}_{t,n_{h}}] = \mathbf{q}_{t} = \mathbf{c}^{Q}_{t} \cdot W^{UQ}, \tag{18} \]

where $n_{h}$ denotes the number of query heads and $W^{UQ} \in \mathbb{R}^{d_{c} \times c n_{h}}$ is the up-projection matrix for queries. Note that the latent query vector $\mathbf{c}^{Q}_{t}$ is shared with the one used for the indexer queries. Next, we perform MQA on $\{\mathbf{q}_{t,i}\}$ and $C^{\mathrm{SprsComp}}_{t}$:

\[ \mathbf{o}_{t,i} = \mathrm{CoreAttn}\left( \mathrm{query} = \mathbf{q}_{t,i},\ \mathrm{key} = C^{\mathrm{SprsComp}}_{t},\ \mathrm{value} = C^{\mathrm{SprsComp}}_{t} \right), \tag{19} \]

where $\mathbf{o}_{t,i} \in \mathbb{R}^{c}$ is the core attention output of the $i$-th head at the $t$-th token, and $\mathrm{CoreAttn}(\cdot)$ denotes the core attention operation.

Grouped Output Projection. In the configuration of DeepSeek-V4, $c n_{h}$ is quite large. Therefore, directly projecting the outputs of the core attention operation $[\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_{h}}] = \mathbf{o}_{t} \in \mathbb{R}^{c n_{h}}$ to a $d$-dimensional hidden state would impose a substantial computational burden. To mitigate this cost, we design a grouped output projection strategy. To be specific, we first split the $n_{h}$ outputs into $g$ groups, and then for each group of outputs $\mathbf{o}^{G}_{t,i} \in \mathbb{R}^{c \frac{n_{h}}{g}}$, we project it to a $d_{g}$-dimensional intermediate output $\mathbf{o}^{G'}_{t,i} \in \mathbb{R}^{d_{g}}$, where $d_{g} < c \frac{n_{h}}{g}$. Finally, we project the intermediate output $[\mathbf{o}^{G'}_{t,1}; \mathbf{o}^{G'}_{t,2}; \ldots; \mathbf{o}^{G'}_{t,g}] \in \mathbb{R}^{d_{g} g}$ to the final attention output $\hat{\mathbf{o}}_{t} \in \mathbb{R}^{d}$.

Heavily Compressed Attention

The core architecture of HCA is illustrated in Figure 4. HCA compresses the KV cache more heavily, but does not employ sparse attention.

Compressed Key-Value Entries. By and large, the compression strategy of HCA is similar to that of CSA, but it employs a larger compression rate $m'$ ($\gg m$) and does not perform overlapped compression. Let $H \in \mathbb{R}^{n \times d}$ be a sequence of input hidden states. HCA first computes the original KV entries $C \in \mathbb{R}^{n \times c}$ and their corresponding compression weights $Z \in \mathbb{R}^{n \times c}$:

\[ C = H \cdot W^{KV}, \tag{20} \]
\[ Z = H \cdot W^{Z}, \tag{21} \]

where $W^{KV}, W^{Z} \in \mathbb{R}^{d \times c}$ are trainable parameters. Next, every $m'$ KV entries in $C$ are compressed into one according to the compression weights and learnable positional biases $B \in \mathbb{R}^{m' \times c}$, producing $C^{\mathrm{Comp}} \in \mathbb{R}^{\frac{n}{m'} \times c}$. Each compressed entry $C^{\mathrm{Comp}}_{i} \in \mathbb{R}^{c}$ is computed by

\[ S_{m'i:m'(i+1)-1} = \mathrm{Softmax}_{\mathrm{row}}(Z_{m'i:m'(i+1)-1} + B), \tag{22} \]
\[ C^{\mathrm{Comp}}_{i} = \sum_{j=m'i}^{m'(i+1)-1} S_{j} \odot C_{j}. \tag{23} \]

Through this compression operation, HCA compresses the sequence length to $\frac{1}{m'}$ of the original.

Shared Key-Value MQA and Grouped Output Projection. HCA employs the same shared KV MQA and grouped output projection strategies as CSA. After the KV compression, for a query token $t$, HCA first produces attention queries $\{\mathbf{q}_{t,1}; \mathbf{q}_{t,2}; \ldots; \mathbf{q}_{t,n_{h}}\}$ in a low-rank manner:

\[ \mathbf{c}^{Q}_{t} = \mathbf{h}_{t} \cdot W^{DQ}, \tag{24} \]
\[ [\mathbf{q}_{t,1}; \mathbf{q}_{t,2}; \ldots; \mathbf{q}_{t,n_{h}}] = \mathbf{q}_{t} = \mathbf{c}^{Q}_{t} \cdot W^{UQ}, \tag{25} \]

where $\mathbf{h}_{t} \in \mathbb{R}^{d}$ is the input hidden state of the query token $t$; $n_{h}$ denotes the number of query heads; and $W^{DQ} \in \mathbb{R}^{d \times d_{c}}$ and $W^{UQ} \in \mathbb{R}^{d_{c} \times c n_{h}}$ are the down-projection and up-projection matrices for queries, respectively. Next, we perform MQA on $\{\mathbf{q}_{t,i}\}$ and $C^{\mathrm{Comp}}$:

\[ \mathbf{o}_{t,i} = \mathrm{CoreAttn}\left( \mathrm{query} = \mathbf{q}_{t,i},\ \mathrm{key} = C^{\mathrm{Comp}},\ \mathrm{value} = C^{\mathrm{Comp}} \right), \tag{26} \]

where $\mathbf{o}_{t,i} \in \mathbb{R}^{c}$ is the core attention output of the $i$-th head at the $t$-th token. Next, as in CSA, HCA splits the $n_{h}$ outputs into $g$ groups, and for each group of outputs $\mathbf{o}^{G}_{t,i} \in \mathbb{R}^{c \frac{n_{h}}{g}}$, HCA projects it to a $d_{g}$-dimensional intermediate output $\mathbf{o}^{G'}_{t,i} \in \mathbb{R}^{d_{g}}$, where $d_{g} < c \frac{n_{h}}{g}$. Finally, HCA projects the intermediate output $[\mathbf{o}^{G'}_{t,1}; \mathbf{o}^{G'}_{t,2}; \ldots; \mathbf{o}^{G'}_{t,g}] \in \mathbb{R}^{d_{g} g}$ to the final attention output $\hat{\mathbf{o}}_{t} \in \mathbb{R}^{d}$.
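To make the saving from the grouped output projection concrete, the following sketch counts projection weights (equivalently, multiply-adds per token) for the direct and grouped schemes. The sizes used in the example are hypothetical and chosen only for round numbers; this section does not state the actual values of $c$, $n_h$, $d$, $g$, or $d_g$. The sketch also assumes each group has its own projection matrix, which the text does not specify.

```python
def direct_projection_cost(c: int, n_h: int, d: int) -> int:
    """Weights in a single (c * n_h) x d output projection."""
    return c * n_h * d


def grouped_projection_cost(c: int, n_h: int, d: int, g: int, d_g: int) -> int:
    """Weights in the two-stage grouped projection.

    Stage one: g separate (c * n_h / g) x d_g matrices (assumed unshared).
    Stage two: one (g * d_g) x d matrix.
    """
    if n_h % g != 0:
        raise ValueError("n_h must be divisible by g")
    per_group_in = c * n_h // g
    if d_g >= per_group_in:
        raise ValueError("expected d_g < c * n_h / g")
    stage_one = g * per_group_in * d_g
    stage_two = g * d_g * d
    return stage_one + stage_two


if __name__ == "__main__":
    # Hypothetical sizes, not taken from the source.
    c, n_h, d, g, d_g = 512, 64, 4096, 8, 1024
    direct = direct_projection_cost(c, n_h, d)
    grouped = grouped_projection_cost(c, n_h, d, g, d_g)
    print(f"direct : {direct:,}")
    print(f"grouped: {grouped:,}  ({grouped / direct:.2%} of direct)")
```

Stage one costs $c\,n_h\,d_g$ in total regardless of $g$, so the saving comes from choosing $d_g$ and $g$ such that $c\,n_h\,d_g + g\,d_g\,d$ stays below $c\,n_h\,d$.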
Other Details

In addition to the core architectures of CSA and HCA described above, our hybrid attention incorporates several other techniques. For clarity, we omitted these techniques from the description above and briefly describe them in this subsection. This subsection focuses only on their core ideas and may omit minor details for simplicity. We encourage readers to refer to our open-source implementation for unambiguous details.

Query and Key-Value Entry Normalization. For both CSA and HCA, we perform an additional RMSNorm operation on each head of the queries and on the single head of the compressed KV entries, just before the core attention operation. This normalization avoids exploding attention logits and may improve training stability.

Partial Rotary Positional Embedding. For both CSA and HCA, we partially apply the Rotary Positional Embedding (RoPE) (Su et al., 2024) to the attention queries, KV entries, and the core attention outputs. Specifically, for each query vector and KV entry vector used in CSA and HCA, we apply RoPE to its last 64 dimensions. Since the KV entries serve as both attention keys and values, the naive core attention outputs $\{\mathbf{o}_{t,i}\}$ would carry absolute position embeddings, derived from the weighted sum of KV entries. As a countermeasure, we also apply RoPE with position $-i$ to the last 64 dimensions of each $\mathbf{o}_{t,i}$. In this way, the output of the core attention also carries relative position embeddings: the contribution of each KV entry to the core attention outputs is related to the distance between the query and the KV entry.

Additional Branch of Sliding Window Attention. In order to strictly preserve causality in CSA and HCA, each query attends only to preceding compressed KV blocks. Consequently, a query cannot access information from other tokens within its own compressed block.
Meanwhile, recent tokens usually possess greater relevance to the query token in language modeling. For these reasons, we introduce a supplementary attention branch to both CSA and HCA in a sliding window manner, for better modeling of local dependencies. Specifically, for each query token, we additionally produce $n_{\text{win}}$ uncompressed KV entries corresponding to the most recent $n_{\text{win}}$ tokens. In the core attention of CSA and HCA, these sliding-window KV entries are used along with the compressed KV entries.

Attention Sink. In the core attention of CSA and HCA, we employ the attention sink trick (OpenAI, 2025; Xiao et al., 2024). Specifically, we set a series of learnable sink logits $\{z'_1, z'_2, \ldots, z'_{n_h}\}$. For the $h$-th attention head, $\exp(z'_h)$ is added to the denominator of the attention score:

$$s_{h,i,j} = \frac{\exp(z_{h,i,j})}{\sum_{k} \exp(z_{h,i,k}) + \exp(z'_h)}, \tag{27}$$

where $s_{h,i,j}, z_{h,i,j} \in \mathbb{R}$ denote the attention score and attention logit of the $h$-th attention head between the $i$-th query token and the $j$-th preceding token or compressed block. This technique allows the attention scores of each query head to sum to less than 1, and even to be near 0.

Efficiency Discussion

Due to the employment of hybrid CSA and HCA, together with low-precision computation and storage, the attention module of DeepSeek-V4 series achieves remarkable efficiency in both attention FLOPs and KV cache size, especially in long-context scenarios. First, we adopt a mixed storage format for KV entries: BF16 precision is used for the rotary positional embedding (RoPE) dimensions, while FP8 precision is applied to the remaining dimensions. This hybrid representation reduces the KV cache size by nearly half compared with pure BF16 storage. Second, attention computation within the lightning indexer is performed in FP4 precision, which accelerates the attention operation under extremely long contexts.
Third, relative to , a smaller attention top-k is chosen in DeepSeek-V4 series, thereby improving model efficiency on short- and medium-length texts. Finally, and most importantly, compressed attention and hybrid attention techniques substantially reduce both the KV cache size and the computational FLOPs. Taking BF16 GQA8 (Ainslie et al., 2023) with a head dimension of 128 as the baseline (one of the common configurations of LLM attention), the KV cache size of DeepSeek-V4 series is reduced to approximately 2% of that baseline in the 1M-context setting. Moreover, even when compared with (DeepSeek-AI, 2025), which is already an efficient baseline, DeepSeek-V4 series still exhibits substantial advantages in efficiency. A comparison of their inference FLOPs and KV cache size is provided in the right part of Figure 1.

Muon Optimizer

We employ the Muon (Jordan et al., 2024; Liu et al., 2025) optimizer for the majority of modules in DeepSeek-V4 series due to its faster convergence and improved training stability. The full algorithm of our Muon optimization is summarized in Algorithm 1.

Algorithm 1 Muon Optimizer for DeepSeek-V4
Require: Learning rate $\eta$, momentum $\mu$, weight decay $\lambda$, update rescaling factor $\gamma$
1: for each training step $t$ do
2:   for each logically independent weight $W \in \mathbb{R}^{n \times m}$ do
3:     $G_t = \nabla_W \mathcal{L}_t(W_{t-1})$   ▷ Compute gradients
4:     $M_t = \mu M_{t-1} + G_t$   ▷ Accumulate momentum buffer
5:     $O'_t = \text{HybridNewtonSchulz}(\mu M_t + G_t)$   ▷ Nesterov trick and hybrid Newton-Schulz
6:     $O_t = O'_t \cdot \sqrt{\max(n, m)} \cdot \gamma$   ▷ Rescale the update RMS
7:     $W_t = W_{t-1} \cdot (1 - \eta\lambda) - \eta O_t$   ▷ Perform weight decay and update
8:   end for
9: end for

Basic Configurations. We maintain the AdamW (Loshchilov and Hutter, 2017) optimizer for the embedding module, the prediction head module, the static biases and gating factors of mHC modules, and the weights of all RMSNorm modules. All other modules are updated with Muon. Following Liu et al.
(2025), we also apply weight decay to Muon parameters, use the Nesterov (Jordan et al., 2024; Nesterov, 1983) trick, and rescale the Root Mean Square (RMS) of the update matrix so that our AdamW hyper-parameters can be reused. Unlike them, we use hybrid Newton-Schulz iterations for orthogonalization.

Hybrid Newton-Schulz Iterations. For a given matrix $M$, let its Singular Value Decomposition (SVD) be $M = U \Sigma V^{T}$. The Newton-Schulz iterations aim to approximately orthogonalize $M$ into $U V^{T}$. Usually, $M$ is first normalized as $M_0 = M / \|M\|_F$ to ensure that its maximum singular value does not exceed 1. Then, each Newton-Schulz iteration performs the following operation:

$$M_k = a M_{k-1} + b \left(M_{k-1} M_{k-1}^{T}\right) M_{k-1} + c \left(M_{k-1} M_{k-1}^{T}\right)^{2} M_{k-1}. \tag{28}$$

Our hybrid Newton-Schulz performs 10 iterations over two distinct stages. During the first 8 steps, we use coefficients (𝑎, 𝑏, 𝑐) = (,−, ) to drive rapid convergence, bringing the singular values close to 1. In the final 2 steps, we switch to coefficients (𝑎, 𝑏, 𝑐) = (2,−, ), which stabilize the singular values precisely at 1.

Avoiding Exploding Attention Logits. The attention architecture of DeepSeek-V4 series allows us to directly apply RMSNorm on the attention queries and KV entries, which effectively prevents attention logits from exploding. Consequently, we do not employ the QK-Clip technique (Liu et al., 2025) in our Muon optimizer.

3. General Infrastructures

Fine-Grained Communication-Computation Overlap in Expert Parallelism

Mixture-of-Experts (MoE) can be accelerated via Expert Parallelism (EP). However, EP requires complex inter-node communication and imposes substantial demands on interconnect bandwidth and latency. To alleviate the communication bottleneck in EP and achieve higher end-to-end performance under lower interconnect bandwidth requirements, we propose a fine-grained EP scheme that fuses communication and computation into a single pipelined kernel for communication-computation overlapping.
Communication Latency Can Be Hidden. The key insight of our EP scheme is that the communication latency can be effectively hidden beneath computation in MoE layers. As shown in Figure 5, in DeepSeek-V4 series, each MoE layer can be decomposed mainly into four stages: two communication-bound stages, Dispatch and Combine, and two computation-bound stages, Linear-1 and Linear-2. Our profiling reveals that within a single MoE layer, the total time of communication is less than that of the computation. Therefore, after fusing communication and computation into a unified pipeline, computation remains the dominant bottleneck, implying that the system can tolerate lower interconnect bandwidth without degrading end-to-end performance.

[Figure 5 panels: (a) Naive Solution; (b) Comet (theoretical speedup: ×); (c) Ours, with Expert Wave 1, Expert Wave 2, and Expert Wave 3 (theoretical speedup: ×). Legend: Dispatch All-to-All, Linear 1 GEMM, SwiGLU + FP8 Cast, Linear 2 GEMM, Combine All-to-All.]

Figure 5 | Illustration of our EP scheme with related works. Comet (Zhang et al., 2025b) overlaps Dispatch with Linear-1, and Linear-2 with Combine, separately. Our EP scheme achieves a finer-grained overlapping by splitting and scheduling experts into waves. The theoretical speedup is evaluated in the configuration of the DeepSeek-V4-Flash architecture.

Fine-Grained EP Scheme. To further lower the interconnect bandwidth requirement and amplify the benefits of overlapping, we introduce a finer-grained expert partitioning scheme. Inspired by many related works (Aimuyo et al., 2025; Zhang et al., 2025b), we split and schedule the experts into waves. Each wave consists of a small portion of experts. As soon as all experts within the wave have completed their communication, computation can commence immediately without waiting for other experts.
In steady state, computation of the current wave, token transfer for the next wave, and result sending of completed experts all proceed concurrently, as demonstrated in Figure 5. This forms a fine-grained pipeline among experts, keeping both computation and communication continuous throughout the wave. The wave-based scheduling improves performance in extreme cases such as Reinforcement Learning (RL) rollout, which usually encounters long-tail small batches.

Performance and Open-Sourced Mega-Kernel. We validated the fine-grained EP scheme on both NVIDIA GPU and HUAWEI Ascend NPU platforms. Compared against strong non-fused baselines, it achieves ∼ × speedup for general inference workloads, and up to × for latency-sensitive scenarios such as RL rollouts and high-speed agent serving. We have open-sourced the CUDA-based mega-kernel implementation, named MegaMoE, as a component of DeepGEMM.

Observations and Proposals. We share observations and lessons from kernel development and offer some proposals to hardware vendors, in the hope of aiding efficient hardware design and achieving better software-hardware co-design:

• Computation-Communication Ratio. Full communication-computation overlap hinges on the computation-communication ratio, rather than on bandwidth alone. Denoting peak compute throughput as $C$ and interconnect bandwidth as $B$, communication can be fully hidden when $C/B \leqslant V_{\text{comp}}/V_{\text{comm}}$, where $V_{\text{comp}}$ denotes the computation volume and $V_{\text{comm}}$ the communication volume. For DeepSeek-V4-Pro, where each token-expert pair requires $6hd$ FLOPs (SwiGLU gate, up, and down projections) but only $3h$ bytes of communication (FP8 Dispatch + BF16 Combine), this simplifies to $\frac{C}{B} \leqslant \frac{6hd}{3h} = 2d = 6144$ FLOPs/Byte. That is, each GBps of interconnect bandwidth suffices to hide the communication for approximately 6.1 TFLOP/s of compute (6144 FLOPs/Byte × 1 GB/s). Once bandwidth meets this threshold, it ceases to be the bottleneck, and devoting additional silicon area to further bandwidth brings diminishing returns.
We encourage future hardware designs to target such balance points rather than scale bandwidth unconditionally.
• Power Budget. Extreme kernel fusion drives compute, memory, and network to high load simultaneously, making power throttling a key performance limiter. We suggest that future hardware designs provide sufficient power headroom for such fully concurrent workloads.
• Communication Primitives. We adopt a pull-based approach where each GPU actively reads data from remote GPUs, avoiding the high notification latency that fine-grained push entails. Future hardware with lower-latency cross-GPU signaling would make push viable and enable more natural communication patterns.
• Activation Function. We propose replacing SwiGLU with a low-cost element-wise activation that involves no exponential or division operations. This lightens the post-GEMM processing directly, and under the same parameter budget, removing the gate projection enlarges the intermediate dimension $d$, further relaxing the bandwidth requirement.

Flexible and Efficient Kernel Development with TileLang

In practice, our elaborate model architecture would have resulted in hundreds of fine-grained Torch ATen operators. We adopt TileLang (Wang et al., 2026) to develop a set of fused kernels to replace the vast majority of them, delivering optimal performance with minimal effort. It also allows us to quickly prototype operators like attention variants during validation. These kernels play critical roles in model architecture development, large-scale training, and ultimately production deployment of inference services. As a Domain-Specific Language (DSL), TileLang balances development productivity with runtime efficiency, enabling rapid development while supporting deep, iterative optimizations within the same codebase. Additionally, we collaborate closely with the TileLang community to foster a more agile, efficient, and stable kernel development workflow.
Reducing Invocation Overhead with Host Codegen. As accelerators continue to grow in performance, CPU-side orchestration overhead becomes increasingly prominent. For small, highly optimized kernels, such fixed host overhead can easily cap utilization and throughput. A common source of this overhead is that host-side logic, such as runtime contract checks, is typically written in Python for flexibility and thus incurs a fixed per-invocation cost.

We mitigate this overhead with Host Codegen, which moves most host-side logic into generated host code. Specifically, we first co-generate the device kernel and a lightweight host launcher at the IR (Intermediate Representation) level, embedding the necessary metadata (such as data types, rank/shape constraints, and stride/layout assumptions) parsed from the language frontend. The launcher is then lowered to host source code built on top of the TVM-FFI (Chen et al., 2018) framework, whose compact calling convention and zero-copy tensor interop together minimize host-side overhead. At runtime, this generated host code performs validation and argument marshaling, shifting all per-invocation checks out of the Python execution path. Our measurements show that CPU-side validation overhead drops from tens or hundreds of microseconds to less than one microsecond per invocation.

SMT-Solver-Assisted Formal Integer Analysis. TileLang kernels involve complex tensor index arithmetic that requires strong formal integer analysis. During compilation passes such as layout inference, memory hazard detection, and bound analysis, the compiler must verify whether integer expressions satisfy specific properties to enable the corresponding optimizations. Therefore, stronger formal analysis capabilities can unlock more advanced and complex optimization opportunities.
To this end, we integrate the Z3 SMT solver (De Moura and Bjørner, 2008) into TileLang’s algebraic system, providing formal analysis capability for most integer expressions in tensor programs. We strike a balance between computational overhead and formal expressiveness by translating TileLang’s integer expressions into Z3’s quantifier-free non-linear integer arithmetic (QF_NIA). Based on Integer Linear Programming (ILP) solvers, QF_NIA seamlessly resolves standard linear integer expressions common in kernels. Furthermore, its inherent non-linear reasoning capacity effectively addresses advanced challenges like vectorization over variable tensor shapes. Under reasonable resource limits, Z3 elevates overall optimization performance while restricting compilation time overhead to just a few seconds. The impact is substantial across multiple passes, including vectorization, barrier insertion, and code simplification.

Numerical Precision and Bitwise Reproducibility. In production settings, numerical correctness and reproducibility are as critical as raw throughput. We therefore prioritize accuracy by default: fast-math optimizations are disabled at the compiler level, and precision-affecting approximations are provided only as explicit, opt-in frontend operators (e.g., T.__exp, T.__log, and T.__sin). Conversely, when strict IEEE-754 semantics are required, TileLang provides IEEE-compliant intrinsics with explicit rounding modes (e.g., _fsqrt, _fdiv, and _add), enabling developers to precisely specify numerical behavior.

We also target bitwise reproducibility for validating kernels against hand-written CUDA baselines. We align TileLang’s algebraic simplification and lowering rules with mainstream CUDA toolchains (e.g., NVCC) to avoid transformations that introduce unintended bit-level differences.
Layout annotations (e.g., _layout) further allow users to pin down layout-dependent lowering decisions, keeping evaluation and accumulation order consistent with the reference CUDA implementation and thus enabling bit-identical outputs when desired.

Our evaluation shows that these accuracy- and reproducibility-oriented design choices do not sacrifice performance: under conservative defaults, TileLang kernels remain competitive, while exposing knobs to selectively relax numerical constraints for higher speed.

High-Performance Batch-Invariant and Deterministic Kernel Libraries

To enable efficient training and inference, we develop a comprehensive set of high-performance computational kernels. Beyond basic functionality and maximizing hardware utilization, another pivotal design goal is to ensure training reproducibility and bitwise alignment among the pre-training, post-training, and inference pipelines. Therefore, we implement end-to-end, bitwise batch-invariant, and deterministic kernels with minimal performance overhead. These kernels are helpful for debugging, stability analysis, and consistent post-training behavior.

Batch Invariance. Batch invariance ensures that the output of any given token remains bitwise identical, regardless of its position within a batch. The primary challenges in implementing batch invariance are as follows:

• Attention. To achieve batch invariance, we cannot use the split-KV method (Dao et al., 2023), which distributes the attention computation for a single sequence across multiple Streaming Multiprocessors (SMs) to balance their load. However, abandoning this technique leads to severe wave-quantization problems, which can adversely affect GPU utilization. To address this, we develop a dual-kernel strategy for batch-invariant decoding. The first kernel computes the attention output for an entire sequence within a single SM, ensuring high throughput for fully occupied waves.
The second kernel, to minimize the latency of the final partially-filled wave and thus alleviate wave-quantization, uses multiple SMs for a single sequence. For the bitwise identity of these two kernels, we carefully design the calculation path of the second kernel to ensure its accumulation order is the same as that of the first kernel. Additionally, the second kernel utilizes distributed shared memory within thread-block clusters, enabling high-speed data exchange across SMs. This dual-kernel method effectively confines the overhead of batch-invariant decoding to be negligible.
• Matrix Multiplication. The traditional cuBLAS library (NVIDIA Corporation, 2024) cannot achieve batch invariance. Therefore, we replace it end-to-end with DeepGEMM (Zhao et al., 2025). Furthermore, for very small batch sizes, conventional implementations usually employ split-k (Osama et al., 2023) techniques to improve performance. Unfortunately, split-k techniques cannot guarantee batch invariance, a pivotal feature in DeepSeek-V4. Therefore, we abandon split-k in most scenarios, which, however, may cause performance degradation. To address this, we introduce a set of optimizations that enable our implementation of matrix multiplication to match or even surpass the performance of standard split-k in most major scenarios.

Determinism. Deterministic training is highly beneficial for debugging hardware or software issues. Moreover, when training exhibits anomalies such as loss spikes, determinism enables researchers to more easily pinpoint numerical causes and further refine the model design. Non-determinism in training typically stems from non-deterministic accumulation order, often due to the use of atomic addition instructions. This issue primarily occurs during the backward pass, notably in the following parts:

• Attention Backward.
In conventional implementations of backward propagation for sparse attention, atomicAdd is used to accumulate gradients for the KV tokens. This introduces non-determinism due to the non-associativity of floating-point addition. To address this problem, we allocate separate accumulation buffers for each SM, followed by a global deterministic summation across all buffers.
• MoE Backward. When multiple SMs from different ranks concurrently write data to the same buffer on a receiving rank, negotiating writing positions also introduces non-determinism. To resolve this, we design a token order pre-processing mechanism within each single rank, combined with buffer isolation across multiple ranks. This strategy ensures determinism of both the send results of expert parallelism and the accumulation order in the MoE backward pass.
• Matrix Multiplication in mHC. mHC involves a matrix multiplication with an output dimension of only 24. For very small batch sizes, we are compelled to use the split-k (Osama et al., 2023) algorithm, whose naive implementation will cause non-determinism. To overcome this, we output each split part separately and perform a deterministic reduction in a subsequent kernel, thereby preserving both performance and determinism.

FP4 Quantization-Aware Training

To achieve inference acceleration and memory savings at deployment, we introduce Quantization-Aware Training (QAT) (Jacob et al., 2018) during the post-training stage, enabling the model to adapt to the precision degradation introduced by quantization. We apply FP4 (MXFP4) quantization (Rouhani et al., 2023) to two components: (1) MoE expert weights, which are a major source of GPU memory occupancy (OpenAI, 2025), and (2) the Query-Key (QK) path in the indexer of CSA, where QK activations are cached, loaded, and multiplied entirely in FP4, accelerating attention score computation in long-context scenarios.
In addition, we further quantize the index scores $I_{:,:}$ from FP32 to BF16 during this QAT process. This optimization achieves a 2× speedup for the top-k selector, while preserving a % recall rate of KV entries.

For MoE expert weights, following the common practice of QAT, the FP32 master weights maintained by the optimizer are first quantized to FP4, then dequantized back to FP8 for computation. Notably, our FP4-to-FP8 dequantization is lossless. This is because FP8 (E4M3) has 2 additional exponent bits compared with FP4 (E2M1), offering a larger dynamic range. Consequently, as long as the ratio between the maximum and minimum scale factors of the FP4 sub-blocks (1 × 32 tiles) within each FP8 quantization block (128 × 128 tiles) does not exceed a certain threshold, the fine-grained scale information can be fully absorbed by the extended dynamic range of FP8. We empirically verify that current weights satisfy this condition. This allows the entire QAT pipeline to fully reuse the existing FP8 training framework without any modification. In the backward pass, gradients are computed with respect to the same FP8 weights used in the forward pass and directly propagated back to the FP32 master weights, which is equivalent to applying the Straight-Through Estimator (STE) through the quantization operation. This also avoids the need to re-quantize transposed weights.

During the inference and rollout phases of RL training, which do not involve backward passes, we directly use real FP4 quantized weights instead of simulated quantization. This ensures that model behavior during sampling is fully consistent with online deployment, while also reducing kernel memory loading for actual speedup and significantly lowering memory consumption. We process the QK path in the indexer of CSA similarly.

Training Framework

Our training framework is built upon the scalable and efficient infrastructure developed for DeepSeek-V3 (DeepSeek-AI, 2024).
In training DeepSeek-V4, we inherit this robust foundation while introducing several key innovations to accommodate its novel architectural components (specifically the Muon optimizer, mHC, and the hybrid attention mechanism) while maintaining high training efficiency and stability.

Efficient Implementation of Muon

The Muon optimizer requires the full gradient matrix to compute parameter updates, which presents a challenge when combined with the Zero Redundancy Optimizer (ZeRO) (Rajbhandari et al., 2020). Traditional ZeRO is designed for element-wise optimizers like AdamW, where a single parameter matrix can be partitioned and updated across multiple ranks. To address this conflict, we design a hybrid strategy of ZeRO bucket assignment for Muon.

For dense parameters, we limit the maximum size of ZeRO parallelism and employ a knapsack algorithm to assign parameter matrices to these ranks, ensuring each rank manages a roughly balanced load. The bucket on each rank is padded to match the size of the largest bucket across ranks, facilitating efficient reduce-scatter operations. This padding typically incurs less than 10% memory overhead in our setup, where each rank manages no more than five parameter matrices. When the overall size of data parallelism exceeds the limit for ZeRO, we compute the Muon update redundantly across the extra data-parallel groups, trading computation for reduced total bucket memory.

For MoE parameters, we optimize each expert independently. We first flatten all down projection matrices in SwiGLU (Shazeer, 2020) of all experts across all layers, followed by flattened up projection matrices and gate matrices. Then, we pad the flattened vector to ensure we can evenly distribute this vector across all ranks without splitting any logically independent matrix. Given the large number of experts, we do not impose a limit on ZeRO parallelism for MoE parameters, and the padding overhead is also negligible.
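To make the bucket assignment for dense parameters concrete, the following is a minimal sketch under stated assumptions. The text does not specify the knapsack algorithm, so a simple greedy largest-first heuristic is substituted here; the function names `assign_matrices` and `padding_overhead` and the example sizes are hypothetical and are not taken from the training framework.

```python
import heapq


def assign_matrices(sizes, num_ranks):
    """Assign whole matrices to ranks with a greedy largest-first heuristic.

    This stands in for the knapsack algorithm mentioned in the text; no matrix
    is ever split, so each rank can run Muon on complete gradients.
    """
    heap = [(0, rank) for rank in range(num_ranks)]  # (current load, rank)
    heapq.heapify(heap)
    buckets = [[] for _ in range(num_ranks)]
    for idx in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        load, rank = heapq.heappop(heap)  # least-loaded rank so far
        buckets[rank].append(idx)
        heapq.heappush(heap, (load + sizes[idx], rank))
    loads = [sum(sizes[i] for i in bucket) for bucket in buckets]
    return buckets, loads


def padding_overhead(loads):
    """Extra memory fraction when every bucket is padded to the largest one."""
    padded_total = max(loads) * len(loads)
    return (padded_total - sum(loads)) / sum(loads)


if __name__ == "__main__":
    # Hypothetical element counts of five dense parameter matrices.
    buckets, loads = assign_matrices([8, 7, 6, 5, 4], num_ranks=2)
    print(buckets, loads, round(padding_overhead(loads), 3))
```

Padding every bucket to the largest load keeps the reduce-scatter message sizes uniform across ranks, which is the reason the text gives for accepting a small memory overhead.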
Additionally, on each rank, consecutive parameters of identical shape are automatically merged, enabling batched execution of the Newton-Schulz iterations for better hardware utilization. Furthermore, we observe that the Newton-Schulz iterations in Muon remain stable when computed with BF16 matrix multiplications. Leveraging this, we further quantize, with stochastic rounding, the MoE gradients to be synchronized across data-parallel ranks to BF16 precision, halving the communication volume. To avoid accumulation errors introduced by low-precision adders, we replace conventional tree- or ring-based reduce-scatter collectives with a two-phase approach. First, an all-to-all operation exchanges local gradients across ranks, and then each rank performs a local sum in FP32. This design maintains numerical robustness.

Cost-Effective and Memory-Efficient Implementation of mHC

The introduction of mHC increases both activation memory consumption and communication volume between pipeline stages, compared with conventional residual connections. To mitigate these costs, we implement several optimization strategies.

Firstly, we carefully design and implement fused kernels of mHC for both training and inference. Secondly, we introduce a recomputation strategy that selectively checkpoints intermediate tensors. Specifically, we recompute most hidden states between layers and all normalized layer inputs, while avoiding recomputation of compute-intensive operations. This achieves a balance between memory saving and computational overhead. Thirdly, we adjust the DualPipe 1F1B overlapping scheme to accommodate the increased pipeline communication and enable concurrent execution of some operations in mHC.

Collectively, these optimizations constrain the wall-time overhead of mHC to only % of the overlapped 1F1B pipeline stage. More details of the engineering optimization can be found in the dedicated mHC paper (Xie et al., 2026).
Contextual Parallelism for Long-Context Attention

Conventional Context Parallelism (CP) partitions the sequence dimension, with each rank maintaining $s$ contiguous tokens. This introduces two challenges to our compressed attention mechanisms (i.e., CSA and HCA). On the one hand, training samples are packed from multiple sequences, and each sequence is compressed independently by a factor of $m$ (or $m'$), with any trailing tokens fewer than $m$ being discarded. Consequently, the compressed KV lengths are typically less than $\frac{s}{m}$ and vary across ranks. On the other hand, the compression requires $m$ consecutive KV entries, which may straddle the boundary between two neighboring CP ranks.

To address these challenges, we design a two-stage communication approach. In the first stage, each rank $i$ sends its last $m$ uncompressed KV entries to rank $i + 1$. Then, rank $i + 1$ compresses some of these received entries together with its local $s$ uncompressed KV entries, producing a fixed length of $\frac{s}{m} + 1$ compressed entries, some of which are padding entries. In the second stage, an all-gather operation across all CP ranks collects the locally compressed KV entries. Then, a fused select-and-pad operator reorganizes them into the full set of compressed KV entries with a total length of $\text{cp\_size} \cdot \frac{s}{m}$. Any padding entries are placed at the tail.

For HCA and the indexer in CSA, the visible range of compressed KV entries for each query token can be precomputed by rules. For the sparse attention in CSA, the top-$k$ selector explicitly specifies the indices of visible compressed KV entries for each query.

Extended Automatic Differentiation for Flexible Activation Checkpointing

Conventional activation checkpointing implementations operate at the granularity of an entire module, deciding whether to retain or recompute its output activations during the backward pass. This coarse granularity often leads to suboptimal trade-offs between recomputation cost and activation memory footprint.
An alternative approach is to manually implement the forward and backward logic of an entire layer, explicitly managing tensor checkpointing states. While enabling fine-grained control, this method loses the convenience of the automatic differentiation framework, substantially increasing development complexity.

To achieve fine-grained control without sacrificing programming efficiency, we implement a tensor-level activation checkpointing mechanism with automatic differentiation support. With this mechanism, developers only need to implement the forward pass and selectively annotate individual tensors for automatic checkpointing and recomputation. Our framework leverages TorchFX (Reed et al., 2022) to trace the full computation graph. For each annotated tensor, it performs a backward traversal to identify the minimal subgraph required for its recomputation. We define these minimal subgraphs as recomputation graphs and insert them into the backward logic just before the corresponding gradient computation.

Compared with the manual implementation, this design introduces no additional overhead during training. Recomputation in this framework is implemented by directly freeing the GPU memory of the annotated tensor and reusing the storage pointer from the recomputed tensor, without any GPU memory copy. Furthermore, since graph tracing executes the model concretely, we can track the underlying storage pointer of each tensor, which enables automatic deduplication of recomputation for tensors that share storage (e.g., the input and output of a reshape operation). This relieves developers from reasoning about low-level memory details when annotating recomputation.

Inference Framework

Our inference framework largely inherits from that of DeepSeek-V3, with some differences in KV cache management.
KV Cache Structure and Management

To efficiently manage the heterogeneous KV caches arising from the hybrid attention mechanism in DeepSeek-V4, we design a customized KV cache layout. The layout is illustrated in Figure 6 and described in detail below.

Heterogeneous KV Entries in DeepSeek-V4. The hybrid attention mechanism in the DeepSeek-V4 series introduces multiple types of KV entries with different cache sizes and update rules. The lightning indexer for sparse selection introduces additional dimensions into the KV cache whose embedding sizes differ from those in the primary attention. The compression techniques employed in CSA and HCA reduce the sequence length to 1/𝑚 and 1/𝑚′ of the original, respectively, thereby decreasing the overall KV cache size. As a result, KV cache sizes vary across layers. Furthermore, Sliding Window Attention (SWA) layers also operate with distinct KV cache sizes, as well as separate cache hit and eviction policies. In the compression branch, one KV entry is generated for every 𝑚 tokens. When the number of remaining tokens is insufficient for compression, all pending tokens and their associated hidden states must be retained in a buffer until the compression operation can be executed. These buffered tokens represent a sequence state determined by positional context and are also managed within the KV cache framework.

Challenges in Managing Hybrid Attention KV Cache. The hybrid attention mechanism violates fundamental assumptions behind PagedAttention and its variants. Although recent hybrid KV cache management algorithms (e.g., Jenga (Zhang et al., 2025a), Hymba (Dong et al., 2025)) target general hybrid attention models or specific structures, two principal obstacles prevent consolidating KV caches across all layers under the PagedAttention framework:
• Diverse cache policies, such as those used in Sliding Window Attention.
• Constraints imposed by high-performance attention kernels, including alignment requirements.

Figure 6 | Illustration of the KV cache layout for DeepSeek-V4. The KV cache is organized into two primary components: a classical KV cache for CSA/HCA, and a state cache for SWA and for CSA/HCA tokens that are not yet ready for compression. In the state cache, each request is assigned a fixed-size cache block. Within this block, the SWA segment stores the KV entries corresponding to the most recent 𝑛_win tokens, while the CSA/HCA segment stores uncompressed tail states that are not yet ready for compression. In the classical KV cache, we allocate multiple blocks per request. Each cache block covers lcm(𝑚, 𝑚′) original tokens, producing 𝑘_1 = lcm(𝑚, 𝑚′)/𝑚 CSA compressed tokens and 𝑘_2 = lcm(𝑚, 𝑚′)/𝑚′ HCA compressed tokens.

For efficient KV cache management of DeepSeek-V4, we design corresponding strategies to overcome these two challenges.

State Cache for SWA and Uncompressed Tail Tokens. To address the first obstacle, we adopt an alternative cache management mechanism. Since SWA is designed to enhance performance under a limited KV cache size, it is reasonable to treat it, along with the uncompressed tail tokens from the compression branch, as a state-space model. The corresponding KV cache can thus be regarded as a sequence-specific state that depends solely on the current position.
Accordingly, we pre-allocate a fixed, limited-size pool of state caches and dynamically assign a block from it to each sequence.

Sparse Attention Kernel Co-Design. Regarding the second obstacle, conventional high-performance attention kernels typically assume a fixed number 𝐵 of tokens per block to optimize performance, corresponding to 𝐵 · 𝑚 original tokens in CSA and 𝐵 · 𝑚′ in HCA. By employing a high-performance sparse-attention kernel, different layers can accommodate a variable number of tokens per block without performance degradation. Achieving this requires co-designing the KV cache layout and the sparse attention kernel. For instance, padding blocks to align with cache lines can improve performance. Thus, for CSA with compression ratio 𝑚 and HCA with ratio 𝑚′, the number of original tokens per block can be any multiple of lcm(𝑚, 𝑚′), the least common multiple of the two compression ratios.

On-Disk KV Cache Storage

When serving DeepSeek-V4, we leverage an on-disk KV cache storage mechanism to eliminate repeated prefilling for shared-prefix requests. For the compressed KV entries in CSA/HCA and the uncompressed KV entries in Sliding Window Attention (SWA), we design separate solutions for storage management.

For CSA and HCA, we simply store all of the compressed KV entries on disk. When a request hits a stored prefix, we read and reuse the compressed KV entries corresponding to the prefix, up to the last complete compression block. Specifically, prefix tokens in the incomplete tail block must still be recomputed to restore their uncompressed KV entries, as uncompressed KV entries in CSA and HCA are not stored.

For the SWA KV entries, since they are not compressed and exist in every layer, their volume is approximately 8 times larger than the compressed CSA and HCA KV entries.
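As a worked example of the block layout and the CSA/HCA prefix-reuse rule described above, the sketch below computes how many compressed entries one cache block holds and how many prefix tokens fall into the incomplete tail. The function names are hypothetical, the values 𝑚 = 4 and 𝑚′ = 128 are the ones reported in the model setups later in the paper, and treating a "complete compression block" as a group of 𝑚 (or 𝑚′) consecutive tokens is an assumption.

```python
import math
from typing import Dict, Tuple


def block_layout(m: int, m_prime: int, multiple: int = 1) -> Dict[str, int]:
    """Original tokens and compressed entries covered by one KV cache block."""
    tokens_per_block = multiple * math.lcm(m, m_prime)
    return {
        "original_tokens": tokens_per_block,
        "csa_entries": tokens_per_block // m,        # k_1 in Figure 6
        "hca_entries": tokens_per_block // m_prime,  # k_2 in Figure 6
    }


def prefix_reuse(prefix_len: int, ratio: int) -> Tuple[int, int]:
    """Split a hit prefix into (reusable compressed entries, tail tokens to recompute)."""
    return prefix_len // ratio, prefix_len % ratio


layout = block_layout(m=4, m_prime=128)
csa_entries, csa_tail = prefix_reuse(prefix_len=1000, ratio=4)
hca_entries, hca_tail = prefix_reuse(prefix_len=1000, ratio=128)
```

Under these assumptions one block covers 128 original tokens, which yields 32 CSA entries and 1 HCA entry. For a 1000-token prefix, CSA reuses 250 entries with no tail, while HCA reuses 7 entries and leaves 104 tail tokens to recompute.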
To handle the large SWA KV entries efficiently, we propose and implement three distinct strategies for managing on-disk SWA KV entries, each offering a different trade-off between storage overhead and computational redundancy:
• Full SWA Caching. This strategy stores the complete SWA KV entries for all tokens, ensuring zero computational redundancy. Under this strategy, the SWA KV entries of the hit prefix can be reconstructed by reading only the on-disk cache of the last 𝑛_win tokens within that prefix. Despite avoiding redundant computation, this strategy is inefficient for modern SSD-based storage systems: only a small subset of the stored SWA KV cache is accessed for each hitting request, which leads to an unbalanced, write-intensive access pattern.
• Periodic Checkpointing. This strategy checkpoints the SWA KV entries of the last 𝑛_win tokens within every 𝑝 tokens, where 𝑝 is a tunable parameter. For a hit prefix, we load the most recent checkpointed state and then recompute the remaining tail tokens. By tuning 𝑝, this strategy enables an on-demand trade-off between storage and computation.
• Zero SWA Caching. This strategy does not store any SWA KV entries. For a hit prefix, we need to perform more recomputation to restore the SWA KV entries. Specifically, in each attention layer, the SWA KV entry of each token depends on the SWA KV entries of only the most recent 𝑛_win tokens from the previous layer. Therefore, leveraging cached CSA and HCA KV entries, recomputing the last 𝑛_win · 𝐿 tokens is enough to restore the last 𝑛_win SWA KV entries for an 𝐿-layer model.

Depending on the specific deployment scenario, we select the most suitable strategy to achieve the desired trade-off between storage and computation.

4. Pre-Training

Data Construction

On top of the pre-training data of DeepSeek-V3, we endeavor to construct a more diverse and higher-quality training corpus with longer effective contexts.
We continually refine our data construction pipelines. For web-sourced data, we implement filtering strategies to remove batched auto-generated and templated content, thereby mitigating the risk of model collapse (Zhu et al., 2024). Mathematical and programming corpora remain core components of our training data, and we further enhance the coding capabilities of the DeepSeek-V4 series by incorporating agentic data during the mid-training phase. For multilingual data, we build a larger corpus for DeepSeek-V4, improving its capture of long-tail knowledge across different cultures. For DeepSeek-V4, we place particular emphasis on long-document data curation, prioritizing scientific papers, technical reports, and other materials that reflect unique academic values. Combining all the above, our pre-training corpus comprises more than 32T tokens, containing mathematical content, code, web pages, long documents, and other high-quality categories.

For pre-training data, we largely follow the same pre-processing strategies as DeepSeek-V3. For tokenization, on top of the DeepSeek-V3 tokenizer, we introduce a few special tokens for context construction and keep the vocabulary size at 128K. We also inherit the token-splitting (DeepSeek-AI, 2024) and Fill-in-Middle (FIM) (DeepSeek-AI, 2024) strategies from DeepSeek-V3. Inspired by Ding et al. (2024), we pack documents from different sources into appropriate sequences to minimize sample truncation. Unlike DeepSeek-V3, we employ sample-level attention masking during pre-training.

Pre-Training Setups

Model Setups

DeepSeek-V4-Flash. We set the number of Transformer layers to 43 and the hidden dimension 𝑑 to 4096. For the first two layers, we use pure sliding window attention. For the subsequent layers, CSA and HCA are used in an interleaved manner.
For CSA, we set the compression rate 𝑚 to 4, the number of indexer query heads 𝑛_ℎ^𝐼 to 64, the indexer head dimension 𝑐^𝐼 to 128, and the number of KV entries selected for sparse attention (i.e., the attention top-k) to 512. For HCA, we set the compression rate 𝑚′ to 128. For both CSA and HCA, we set the number of query heads 𝑛_ℎ to 64, the head dimension 𝑐 to 512, and the query compression dimension 𝑑_𝑐 to 1024. The number of output projection groups 𝑔 is set to 8, and the dimension of each intermediate attention output 𝑑_𝑔 is set to 1024. For the additional branch of sliding window attention, the window size 𝑛_win is set to 128. We employ MoE layers in all Transformer blocks, but use the Hash routing strategy for the first 3 MoE layers. Each MoE layer consists of 1 shared expert and 256 routed experts, where the intermediate hidden dimension of each expert is 2048. Among the routed experts, 6 experts are activated for each token. The multi-token prediction depth is set to 1. As for mHC, the expansion factor 𝑛_hc is set to 4, and the number of Sinkhorn-Knopp iterations 𝑡_max is set to 20. Under this configuration, DeepSeek-V4-Flash comprises 284B total parameters, of which 13B are activated for each token.

DeepSeek-V4-Pro. We set the number of Transformer layers to 61 and the hidden dimension 𝑑 to 7168. For the first two layers, we use HCA. For the subsequent layers, CSA and HCA are used in an interleaved manner. For CSA, we set the compression rate 𝑚 to 4, the number of indexer query heads 𝑛_ℎ^𝐼 to 64, the indexer head dimension 𝑐^𝐼 to 128, and the number of KV entries selected for sparse attention (i.e., the attention top-k) to 1024. For HCA, we set the compression rate 𝑚′ to 128. For both CSA and HCA, we set the number of query heads 𝑛_ℎ to 128, the head dimension 𝑐 to 512, and the query compression dimension 𝑑_𝑐 to 1536. The number of output projection groups 𝑔 is set to 16, and the dimension of each intermediate attention output 𝑑_𝑔 is set to 1024.
For the additional branch of sliding window attention, the window size 𝑛_win is set to 128. We employ MoE layers in all Transformer blocks, but use the Hash routing strategy for the first 3 MoE layers. Each MoE layer consists of 1 shared expert and 384 routed experts, where the intermediate hidden dimension of each expert is 3072. Among the routed experts, 6 experts are activated for each token. The multi-token prediction depth is set to 1. As for mHC, the expansion factor 𝑛_hc is set to 4, and the number of Sinkhorn-Knopp iterations 𝑡_max is set to 20. Under this configuration, DeepSeek-V4-Pro comprises total parameters, of which 49B are activated for each token.

Training Setups

DeepSeek-V4-Flash. We employ the Muon optimizer (Jordan et al., 2024; Liu et al., 2025) for the majority of parameters, but use the AdamW optimizer (Loshchilov and Hutter, 2017) for the embedding module, the prediction head module, and the weights of all RMSNorm modules. For AdamW, we set its hyper-parameters to 𝛽_1 = , 𝛽_2 = , 𝜀 = 10⁻²⁰, and weight_decay = . For Muon, we set the momentum to and the weight decay to , and rescale the RMS of each update matrix to for reutilization of the AdamW learning rate. We train DeepSeek-V4-Flash on 32T tokens, and as in DeepSeek-V3, we also employ a batch size scheduling strategy that increases the batch size (in tokens) from a small size to and then keeps it at during most of the training. The learning rate is linearly warmed up over the first 2000 steps and then maintained at × 10⁻⁴ for most of the training. Near the end of the training, we decay the learning rate to × 10⁻⁵ following a cosine schedule. The training starts with a sequence length of 4K, and we gradually extend the training sequence length to 16K, 64K, and 1M.
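The learning-rate schedule described above (linear warmup, a long constant phase, then cosine decay to an end value) can be sketched as follows. The peak rate, end rate, decay start step, and total step count are left as arguments because their values are not recoverable from the text here; the numbers in the example are placeholders in normalized units, not the paper's settings, and the function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(
    step: int,
    peak_lr: float,
    end_lr: float,
    warmup_steps: int,
    decay_start: int,
    total_steps: int,
) -> float:
    """Warmup -> constant -> cosine decay, evaluated at a 0-based step."""
    if step < warmup_steps:
        # Linear warmup that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Constant phase covering most of the training.
        return peak_lr
    # Cosine decay from peak_lr (at decay_start) to end_lr (at total_steps).
    progress = min(1.0, (step - decay_start) / max(1, total_steps - decay_start))
    return end_lr + 0.5 * (peak_lr - end_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder values for illustration only.
schedule = [
    learning_rate(s, peak_lr=1.0, end_lr=0.1, warmup_steps=2000,
                  decay_start=9000, total_steps=10000)
    for s in (0, 1999, 5000, 9000, 9500, 10000)
]
```

Only the 2000-step warmup length is taken from the text. The decay start is a free parameter because the source says only that the decay happens near the end of training.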
As for the setups of sparse attention, we first warm up the model with dense attention for the first 1T tokens, then introduce sparse attention at the sequence length of 64K and keep it for the rest of the training. When introducing attention sparsity, we first run a short stage to warm up the lightning indexer in CSA, and then train the model with sparse attention for most of the training. For auxiliary-loss-free load balancing, we set the bias update speed to . For the balance loss, we set its loss weight to to avoid extreme imbalance within single sequences. The MTP loss weight is set to for most of the training, and to upon the start of learning rate decay.

DeepSeek-V4-Pro. Except for specific values of hyper-parameters, the training setup of DeepSeek-V4-Pro is largely consistent with that of DeepSeek-V4-Flash. We employ the Muon optimizer for the majority of parameters, but use the AdamW optimizer for the embedding module, the prediction head module, and the weights of all RMSNorm modules. The hyper-parameters of AdamW and Muon are the same as those of DeepSeek-V4-Flash. We train DeepSeek-V4-Pro on 33T tokens, and also employ a batch size scheduling strategy, with the maximum batch size being tokens. The learning rate scheduling strategy is largely the same as that of DeepSeek-V4-Flash, but the peak learning rate is set to × 10⁻⁴ and the end learning rate is set to × 10⁻⁵. The training also starts with a sequence length of 4K, and the length is gradually extended to 16K, 64K, and 1M. Compared with DeepSeek-V4-Flash, DeepSeek-V4-Pro starts with a longer stage of dense attention; the strategy for introducing sparse attention is the same as for DeepSeek-V4-Flash, following a two-stage training method. For auxiliary-loss-free load balancing, we set the bias update speed to . For the balance loss, we set its loss weight to to avoid extreme imbalance within single sequences.
The MTP loss weight is set to for most of the training, and to upon the start of learning rate decay.

Mitigating Training Instability

Training trillion-parameter MoE models presents significant stability challenges, and the DeepSeek-V4 series is no exception. We encountered notable instability during training. While simple rollbacks could temporarily restore the training state, they proved inadequate as a long-term solution because they do not prevent the recurrence of loss spikes. Empirically, we identified that the occurrence of spikes is consistently tied to outliers in the MoE layers, and the routing mechanism itself appears to exacerbate the emergence of these outliers. Therefore, we sought to tackle this issue along two dimensions: breaking the vicious cycle induced by routing, and directly suppressing anomalous values. We discovered two practical techniques that effectively maintain training stability. Although a comprehensive theoretical understanding of their underlying mechanisms remains an open question for now, we are sharing them openly to foster further exploration by the community.

Anticipatory Routing. We found that decoupling the synchronous updates of the backbone network and the routing network significantly improves training stability. Consequently, at step 𝑡, we use the current network parameters 𝜃_𝑡 for feature computation, but the routing indices are computed and applied using the historical network parameters 𝜃_{𝑡−Δ𝑡}. In practice, to circumvent the overhead of loading model parameters twice, we fetch the data for step 𝑡 in advance at step 𝑡 − Δ𝑡. We "anticipatorily" compute and cache the routing indices to be used later at step 𝑡, which is why we name this approach Anticipatory Routing. We also heavily optimized this at the infrastructure level.
First, given that pre-computing the routing indices only requires a single forward pass over the data, we carefully orchestrated the pipeline execution and the overlapping of computation with Expert Parallelism (EP) communication, bounding the additional wall-clock time overhead of Anticipatory Routing to approximately 20%. Second, we introduced an automatic detection mechanism that triggers a short rollback and activates Anticipatory Routing only when a loss spike occurs; after operating in this mode for a certain period, the system reverts to standard training. This dynamic application allows us to avert loss spikes with negligible overall additional training overhead and without compromising model performance.

SwiGLU Clamping. In previous literature (Bello et al., 2017; Riviere et al., 2024), clamping has been explicitly utilized to constrain numerical ranges, thereby enhancing training stability. In our actual training runs, we empirically found that applying SwiGLU clamping (OpenAI, 2025) effectively eliminates outliers and substantially aids in stabilizing the training process, without compromising performance. Throughout the training of both DeepSeek-V4-Flash and DeepSeek-V4-Pro, we clamped the linear component of SwiGLU to the range [−10, 10], while capping the upper bound of the gate component at 10.

Evaluations

Evaluation Benchmarks

For the evaluation of the base models, we consider benchmarks spanning four key dimensions: world knowledge, language understanding and reasoning, coding and mathematics, and long-context processing.
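Returning briefly to the SwiGLU clamping described in the training-stability discussion above, the scalar sketch below shows one way the two bounds could be applied. It is a sketch under stated assumptions, not the training kernel: the function names are hypothetical, and applying the gate cap to the pre-activation value (before SiLU) is an assumption, since the text does not say where in the computation the cap is applied.

```python
import math

CLAMP_LIMIT = 10.0  # bound reported in the text for both branches


def silu(x: float) -> float:
    """Numerically stable SiLU: x * sigmoid(x)."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def clamped_swiglu(gate: float, linear: float, limit: float = CLAMP_LIMIT) -> float:
    """SwiGLU on one (gate, linear) pair with the clamping described above."""
    gate = min(gate, limit)                   # cap only the upper bound of the gate
    linear = max(-limit, min(linear, limit))  # clamp the linear branch to [-limit, limit]
    return silu(gate) * linear


inside = clamped_swiglu(1.0, 2.0)       # both values already within range
outlier = clamped_swiglu(100.0, -50.0)  # both values hit their bounds
```

With these bounds the magnitude of the output cannot exceed roughly `limit * limit`, which is how clamping suppresses the outliers that the text associates with loss spikes.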
World knowledge benchmarks include AGIEval (Zhong et al., 2023), C-Eval (Huang et al., 2023), CMMLU (Li et al., 2023), MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024b), MMMLU (OpenAI, 2024a), MultiLoKo (Hupkes and Bogoychev, 2025), Simple-QA verified (Haas et al., 2025), SuperGPQA (Du et al., 2025), FACTS Parametric (Cheng et al., 2025), and TriviaQA (Joshi et al., 2017).

Language understanding and reasoning benchmarks include BigBench Hard (BBH) (Suzgun et al., 2022), DROP (Dua et al., 2019), HellaSwag (Zellers et al., 2019), CLUEWSC (Xu et al., 2020), and WinoGrande (Sakaguchi et al., 2019).

Coding and mathematical benchmarks include BigCodeBench (Zhuo et al., 2025), HumanEval (Chen et al., 2021), GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), MGSM (Shi et al., 2023), and CMath (Wei et al., 2023).

Long context benchmarks include LongBench-V2 (Bai et al., 2025b).

Table 1 | Comparison among -Base, DeepSeek-V4-Flash-Base, and DeepSeek-V4-Pro-Base. All models are evaluated in our internal framework and share the same evaluation setting. Scores with a gap not exceeding are considered to be at the same level. The highest score in each row is in bold font, and the second is underlined.

Benchmark (Metric) | # Shots | -Base | DeepSeek-V4-Flash-Base | DeepSeek-V4-Pro-Base
Architecture | - | MoE | MoE | MoE
# Activated Params | - | 37B | 13B | 49B
# Total Params | - | 671B | 284B |
World Knowledge
AGIEval (EM) | 0-shot
MMLU (EM) | 5-shot
MMLU-Redux (EM) | 5-shot
MMLU-Pro (EM) | 5-shot
MMMLU (EM) | 5-shot
C-Eval (EM) | 5-shot
CMMLU (EM) | 5-shot
MultiLoKo (EM) | 5-shot
Simple-QA verified (EM) | 25-shot
SuperGPQA (EM) | 5-shot
FACTS Parametric (EM) | 25-shot
TriviaQA (EM) | 5-shot
Language & Reasoning
BBH (EM) | 3-shot
DROP (F1) | 1-shot
HellaSwag (EM) | 0-shot
WinoGrande (EM) | 0-shot
CLUEWSC (EM) | 5-shot
Code & Math
BigCodeBench (Pass@1) | 3-shot
HumanEval (Pass@1) | 0-shot
GSM8K (EM) | 8-shot
MATH (EM) | 4-shot
MGSM (EM) | 8-shot
CMath (EM) | 3-shot
Long Context
LongBench-V2 (EM) | 1-shot
Evaluation Results

In Table 1, we provide a detailed comparison of the base models for , DeepSeek-V4-Flash, and DeepSeek-V4-Pro, all evaluated under a unified internal framework with strictly consistent settings.

Comparing DeepSeek-V4-Flash-Base with -Base shows a clear efficiency gain. Despite using substantially fewer activated and total parameters, DeepSeek-V4-Flash-Base outperforms -Base across a wide array of benchmarks. This advantage is especially evident in world knowledge tasks and challenging long-context scenarios. These results indicate that the architectural improvements, refined data quality, and training optimizations in DeepSeek-V4-Flash-Base yield superior performance even with a more compact parameter budget, surpassing the larger -Base on the majority of evaluations.

DeepSeek-V4-Pro-Base delivers a further leap in capability, outperforming both -Base and DeepSeek-V4-Flash-Base on nearly all benchmarks. With improvements across almost all categories, DeepSeek-V4-Pro-Base reaches new performance highs among DeepSeek base models on the most demanding benchmarks. On knowledge-intensive evaluations, it delivers large gains, while also substantially advancing long-context understanding. On most reasoning and code benchmarks, DeepSeek-V4-Pro-Base also exceeds both previous models. This comprehensive uplift confirms DeepSeek-V4-Pro-Base as the strongest foundation model in the DeepSeek series, outperforming its predecessors across knowledge, reasoning, coding, and long-context capabilities.

5. Post-Training

Post-Training Pipeline

Following pre-training, we conducted a post-training phase to yield the final models of the DeepSeek-V4 series. Although the training pipeline largely mirrored that of , a critical methodological substitution was made: the mixed Reinforcement Learning (RL) stage was entirely replaced by On-Policy Distillation (OPD).
Specialist Training

The development of domain specialists was conducted by adapting the training pipeline. Specifically, each model was sequentially optimized through an initial fine-tuning phase and subsequent Reinforcement Learning (RL) guided by domain-specific prompts and reward signals. For the RL stage, we implemented the Group Relative Policy Optimization (GRPO) algorithm, maintaining hyper-parameters closely aligned with our prior research (DeepSeek-AI, 2025; DeepSeek-AI, 2025).

Reasoning Efforts. It is widely recognized that a model's performance on reasoning tasks is fundamentally governed by the computational effort expended. Consequently, we trained distinct specialist models under different RL configurations to obtain models optimized for varying reasoning capacities. As detailed in Table 2, DeepSeek-V4-Pro and DeepSeek-V4-Flash both support three specific reasoning effort modes. For each mode, we apply distinct length penalties and context windows during RL training, which results in varying output token lengths for reasoning. To integrate these distinct reasoning modes, we utilize specialized response formats demarcated by the <think> and </think> tokens. Furthermore, for the "Think Max" mode, we prepend a specific instruction to the beginning of the system prompt to guide the model's reasoning process, as shown in Table 3.

Generative Reward Model. Typically, easy-to-verify tasks can be effectively optimized using simple rule-based verifiers or test cases. In contrast, hard-to-verify tasks traditionally rely on Reinforcement Learning from Human Feedback (RLHF), which necessitates extensive human annotation to train a scalar reward model. In the post-training phase of the DeepSeek-V4 series, however, we dispense with these conventional scalar-based reward models. Instead, to address hard-to-verify tasks, we curate rubric-guided RL data and employ a Generative Reward Model (GRM) to evaluate policy trajectories.
Crucially, we apply RL optimization directly to the GRM itself. In this paradigm, the actor network natively functions as the GRM, enabling the joint optimization of the model's evaluative (judging) proficiency alongside its standard generative capabilities. By unifying these roles, the model's internal reasoning capabilities are inherently fused into its evaluative process, resulting in highly robust scoring. Furthermore, this approach achieves superior performance with only a minimal set of diverse human annotations, as the model leverages its own logic to generalize across complex tasks.

Table 2 | Comparison of three reasoning modes.

Reasoning Mode | Characteristics | Typical Use Cases | Response Format
Non-think | Fast, intuitive responses based on habits or simple rules. | Routine daily tasks, emergency reactions, low-risk decisions. | </think> summary
Think High | Conscious logical analysis, slower but more accurate. | Complex problem-solving, planning, medium-risk decisions. | <think> thinking tokens </think> summary
Think Max | Push reasoning to its fullest extent. Slow but powerful. | Exploring the boundary of model reasoning capability. | 1. A special system prompt at the beginning. 2. <think> thinking tokens </think> summary

Table 3 | Instruction injected into the system prompt for the "Think Max" mode.

Injected Instruction:
Reasoning Effort: Absolute maximum with no shortcuts permitted. You MUST be very thorough in your thinking and comprehensively decompose the problem to resolve the root cause, rigorously stress-testing your logic against all potential paths, edge cases, and adversarial scenarios. Explicitly write out your entire deliberation process, documenting every intermediate step, considered alternative, and rejected hypothesis to ensure absolutely no assumption is left unchecked.

Tool-Call Schema and Special Token. Consistent with our previous version, we utilize a dedicated <think></think> tag to delineate the reasoning path.
In the DeepSeek-V4 series, we introduce a new tool-call schema that employs a special "|DSML|" token and utilizes an XML-based format for tool invocations, as demonstrated in Table 4. Our experiments demonstrate that the XML format effectively mitigates escaping failures and reduces tool-call errors, providing a more robust interface for model-tool interactions.

Table 4 | Tool-call schema for DeepSeek-V4 series.

Tool Call Schema:
## Tools
You have access to a set of tools to help answer the user's question. You can invoke tools by writing a "<|DSML|tool_calls>" block like the following:
<|DSML|tool_calls>
<|DSML|invoke name="$TOOL_NAME">
<|DSML|parameter name="$PARAMETER_NAME" string="true|false">$PARAMETER_VALUE</|DSML|parameter>
...
</|DSML|invoke>
<|DSML|invoke name="$TOOL_NAME2">
...
</|DSML|invoke>
</|DSML|tool_calls>
String parameters should be specified as is and set `string="true"`. For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set `string="false"`.
If thinking_mode is enabled (triggered by <think>), you MUST output your complete reasoning inside <think>...</think> BEFORE any tool calls or final response. Otherwise, output directly after </think> with tool calls or final response.
### Available Tool Schemas
{Tool Definition...}
You MUST strictly follow the above defined tool name and parameter schemas to invoke tool calls.

Interleaved Thinking. introduced a context management strategy that retains reasoning traces across tool-result rounds but discards them upon the arrival of new user messages. While effective, this still caused unnecessary token waste in complex agentic workflows: each new user turn would flush all accumulated reasoning content, forcing the model to reconstruct its problem-solving state from scratch. Leveraging the expanded 1M-token context window of the DeepSeek-V4 series, we further refine this mechanism to maximize the effectiveness of interleaved thinking in agentic environments:
• Tool-Calling Scenarios. As illustrated in Figure 7(a), all reasoning content is fully preserved throughout the entire conversation. Unlike , which discarded thinking traces upon each new user turn, the DeepSeek-V4 series retains the complete reasoning history across all rounds, including across user message boundaries. This allows the model to maintain a coherent, cumulative chain of thought over long-horizon agent tasks.
• General Conversational Scenarios. As illustrated in Figure 7(b), the original strategy is preserved: reasoning content from previous turns is discarded when a new user message arrives, keeping the context concise for settings where persistent reasoning traces provide limited benefit.

As with , agent frameworks that simulate tool interactions via user messages (e.g., Terminus) may not trigger the tool-calling context path and thus may not benefit from enhanced reasoning persistence. We continue to recommend non-think models for such architectures.

Figure 7 | Thinking management of DeepSeek-V4 series: (a) thinking with tools; (b) thinking without tools.

Quick Instruction. In chatbot scenarios, a number of auxiliary tasks (e.g., determining whether to trigger a web search, intent recognition, etc.) must be executed before generating the response. Conventionally, these tasks are handled by a separate small model, which requires redundant prefilling since it cannot reuse the existing KV cache. To overcome this limitation, we introduce Quick Instruction. We append a set of dedicated special tokens directly to the input sequence, where each token corresponds to a specific auxiliary task. By directly reusing the already-computed KV cache, this mechanism completely avoids redundant prefilling and allows certain tasks, such as generating search queries and determining authority and domain, to be executed in parallel.
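To make the tool-call schema of Table 4 concrete, the following sketch renders tool invocations in that format. The helper name `render_tool_calls` and the example tool `get_weather` are hypothetical. Only the tag layout and the `string="true|false"` rule are taken from the schema, and the sketch performs no escaping of parameter values.

```python
import json
from typing import Any, Dict, List, Tuple


def render_tool_calls(calls: List[Tuple[str, Dict[str, Any]]]) -> str:
    """Render (tool_name, parameters) pairs as a DSML tool_calls block."""
    lines = ["<|DSML|tool_calls>"]
    for tool_name, params in calls:
        lines.append(f'<|DSML|invoke name="{tool_name}">')
        for key, value in params.items():
            # Strings are passed as is; every other type is JSON-encoded.
            is_string = isinstance(value, str)
            text = value if is_string else json.dumps(value)
            flag = "true" if is_string else "false"
            lines.append(
                f'<|DSML|parameter name="{key}" string="{flag}">'
                f"{text}</|DSML|parameter>"
            )
        lines.append("</|DSML|invoke>")
    lines.append("</|DSML|tool_calls>")
    return "\n".join(lines)


# Hypothetical tool with one string parameter and one numeric parameter.
block = render_tool_calls([("get_weather", {"city": "Hangzhou", "days": 3})])
```

Because string values are written verbatim rather than JSON-quoted, a string parameter needs no JSON escaping, which matches the motivation given for the XML-based format.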
As a result, Quick Instruction significantly reduces the user-perceived time-to-first-token (TTFT) and eliminates the engineering overhead of maintaining and iterating on an extra small model. The supported Quick Instruction tokens are summarized in Table 5.

On-Policy Distillation

After training multiple domain-specific experts via specialized fine-tuning and reinforcement learning, we employ multi-teacher On-Policy Distillation (OPD) as the primary technique for merging expert capabilities into the final model. OPD has emerged as an effective post-training paradigm for efficiently transferring the knowledge and capabilities of domain experts to a s
