WED, 03 JUN 2026 · 18:34:02 UTC

DeepSeek V3

Open weights

by DeepSeek · China · Released

671B MoE (37B active) — frontier-class quality at a fraction of competitor pricing.

text · code · chat · reasoning · tools · long-context

About this model

DeepSeek V3 (December 2024) drew wide attention on release. It is a 671B-parameter MoE with 37B parameters active per token, reportedly trained for about $5.6M of compute, far less than the budgets commonly attributed to Western frontier labs. The weights are open under a custom permissive license that allows commercial use, and the model is competitive with Claude 3.5 Sonnet and GPT-4o on most benchmarks.
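To make "37B active per token" concrete, here is a minimal sketch of top-k expert routing in a mixture-of-experts layer. It is a generic softmax top-k router written for illustration, not DeepSeek's actual routing code; the names `route_top_k` and `moe_forward`, the toy experts, and the scores are all hypothetical.

```python
import math

def route_top_k(scores, k):
    """Pick the k highest-scoring experts and normalize their gate weights."""
    # Indices of the k largest router scores (ties keep their original order).
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only, so the gate weights sum to 1.
    peak = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: exps[i] / total for i in top}

def moe_forward(x, experts, scores, k):
    """Weighted sum of the outputs of the selected experts only."""
    gates = route_top_k(scores, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())

# Toy setup (assumption): 8 experts, each a simple scaling function.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
scores = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.0]

gates = route_top_k(scores, k=2)
output = moe_forward(3.0, experts, scores, k=2)

# Share of parameters used per token, from the figures quoted above.
active_fraction = 37 / 671
print(gates, output, f"{active_fraction:.1%}")
```

Only the selected experts run for a given token, which is why per-token compute tracks the active parameter count (roughly 5.5% of the total here) rather than the full 671B.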

Beyond the cost story, DeepSeek V3 combines several notable architectural and training techniques: Multi-Head Latent Attention (a memory-efficient attention variant first introduced in DeepSeek-V2), FP8 mixed-precision training, and a Multi-Token Prediction objective during pretraining. These techniques have since been widely studied by other labs.

The model is served via the official DeepSeek API at very low list prices ($0.27/M input tokens, $1.10/M output tokens) and by many open-weights inference providers.
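As a rough illustration of what those list prices mean in practice, the sketch below estimates the bill for a workload. The prices are the ones quoted above; the function name `estimate_cost_usd` and the workload numbers are hypothetical, and real invoices may differ (for example through caching discounts or price changes).

```python
INPUT_PRICE_PER_M = 0.27   # USD per million input tokens (quoted above)
OUTPUT_PRICE_PER_M = 1.10  # USD per million output tokens (quoted above)

def estimate_cost_usd(input_tokens, output_tokens):
    """Estimate API cost in USD from token counts and per-million prices."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost

# Hypothetical workload: 10,000 requests, each with 2,000 input
# tokens and 500 output tokens.
requests = 10_000
cost = estimate_cost_usd(requests * 2_000, requests * 500)
print(f"${cost:.2f}")
```

For this assumed workload the input side (20M tokens) and the output side (5M tokens) cost about the same, so output length matters roughly four times as much per token as prompt length.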

Strengths

  • Among the cheapest frontier-class models: $1.10/M output via the official API
  • Open weights under permissive license — no MAU restrictions
  • Genuine architectural research (MLA, FP8 training, MTP)
  • Trained on a tiny budget compared to Western labs
  • Competitive with Claude 3.5 Sonnet on most general benchmarks

Limitations

  • SWE-bench Verified (42%) trails Claude Sonnet 4 substantially
  • Less mature tool-use ecosystem than Western labs
  • Some safety/alignment gaps vs RLHF-heavy Western models
  • US enterprise procurement friction (Chinese origin)
  • 128K context window (the official API initially capped it at 64K), smaller than top frontier models

When to use it

  • Cost-sensitive frontier-class workloads
  • Chinese-language enterprise deployments
  • Self-hosted deployments needing permissive license
  • Research applications studying MoE architectures and FP8 training

Architecture & training

Trained on 14.8T tokens using FP8 mixed-precision training, which DeepSeek reports validating at this scale for the first time and which reduces compute and memory cost relative to BF16. The model uses Multi-Head Latent Attention to shrink the KV cache and a Multi-Token Prediction objective during pretraining to improve data efficiency. The reported $5.6M figure covers only the final training run (about 2.788M H800 GPU-hours priced at an assumed $2 per GPU-hour) and excludes prior research and ablation experiments; total R&D cost is higher but still believed to be much lower than that of Western competitors. The MoE has 671B total parameters, of which 37B are activated per token.
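The KV-cache saving from latent attention can be approximated with simple arithmetic. The sketch below compares the number of cached values per token under standard multi-head attention with a compressed-latent scheme. All dimensions are illustrative assumptions chosen to be in the range of the published configuration, and the helper names are hypothetical; this is a back-of-the-envelope estimate, not a description of DeepSeek's implementation.

```python
# Illustrative dimensions (assumptions, not taken from this page).
N_HEADS, HEAD_DIM = 128, 128     # attention heads and per-head width
LATENT_DIM, ROPE_DIM = 512, 64   # compressed KV latent plus positional part
N_LAYERS = 61                    # transformer layers
BYTES_PER_ELEMENT = 2            # 16-bit cache entries

def kv_elements_mha(n_heads, head_dim):
    """Standard attention caches a key and a value vector for every head."""
    return 2 * n_heads * head_dim

def kv_elements_latent(latent_dim, rope_dim):
    """Latent attention caches one shared compressed vector per token."""
    return latent_dim + rope_dim

def cache_gib(elements_per_token, n_tokens):
    """Total cache size in GiB across all layers for one sequence."""
    total_bytes = elements_per_token * N_LAYERS * n_tokens * BYTES_PER_ELEMENT
    return total_bytes / 2**30

mha = kv_elements_mha(N_HEADS, HEAD_DIM)
latent = kv_elements_latent(LATENT_DIM, ROPE_DIM)
context = 128 * 1024  # a 128K-token sequence

print(f"MHA:    {mha} values/token/layer, {cache_gib(mha, context):.1f} GiB")
print(f"Latent: {latent} values/token/layer, {cache_gib(latent, context):.1f} GiB")
print(f"Reduction: about {mha / latent:.0f}x")
```

Under these assumptions the cache for a single long sequence drops from hundreds of GiB to under 10 GiB, which is the kind of saving that makes long-context serving practical.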

Benchmarks

Benchmark             Score
MATH                  90.2
MMLU                  88.5
HumanEval             82.6
SWE-bench Verified    42.0
