
Phi-4

Open weights

by Microsoft Research · USA · Released

14B-parameter small language model that scores competitively with much larger models, a result attributed to curated synthetic training data.

text · chat · reasoning · math

About this model

Phi-4 (December 2024) is Microsoft Research's continued bet on the 'small model trained on perfect data' thesis. At 14B parameters Phi-4 scores competitively with much larger models on reasoning and coding benchmarks — 80.4% on MATH, 82.6% on HumanEval, 56.1% on GPQA Diamond.

The Phi team has built a substantial body of research around the idea that data quality matters far more than data quantity for the small-model regime. Phi-4's training corpus is described as 'textbook quality' — heavily curated educational content, code with explanations, and synthetic data generated by larger models.

Released under the MIT license. Phi-4 is a strong option for on-device or edge inference where a 14B model is the largest that fits.

Strengths

  • Best reasoning-per-parameter ratio in the small-model regime
  • Designed for on-device, edge, and browser inference
  • MIT license, among the most permissive open-source licenses
  • Strong synthetic-data methodology, documented in Microsoft's papers

Limitations

  • 14B is too small for the hardest reasoning tasks
  • 16K context window — much smaller than frontier models
  • Less suitable for creative writing than larger models
  • Limited tool-use ecosystem vs Claude / GPT
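The 16K context limit above can be turned into a simple budget check before a request is sent. The sketch below is illustrative only: it assumes "16K" means 16,384 tokens, that the prompt and the generated output share one window, and that token counts come from whatever tokenizer the deployment uses. The names CONTEXT_WINDOW, remaining_budget, and fits_context are hypothetical.

```python
# Assumption: "16K" is read as 16 * 1024 = 16,384 tokens.
CONTEXT_WINDOW = 16 * 1024


def remaining_budget(prompt_tokens: int, max_new_tokens: int,
                     window: int = CONTEXT_WINDOW) -> int:
    """Tokens left after the prompt and the reserved output.

    A negative result means the request would overflow the window.
    """
    return window - prompt_tokens - max_new_tokens


def fits_context(prompt_tokens: int, max_new_tokens: int,
                 window: int = CONTEXT_WINDOW) -> bool:
    """True if the prompt plus the reserved output fit in the window."""
    return remaining_budget(prompt_tokens, max_new_tokens, window) >= 0


if __name__ == "__main__":
    # A 12,000-token prompt with 2,000 tokens reserved for the answer fits.
    print(fits_context(12_000, 2_000))
    # A 15,000-token prompt with the same reservation does not.
    print(fits_context(15_000, 2_000))
```

Reserving the output tokens up front, rather than checking the prompt alone, avoids generations that are cut off partway through.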

When to use it

  • On-device AI assistants (Copilot+ PCs, mobile apps)
  • Browser-resident inference via WebGPU / WebLLM
  • Edge deployments without cloud connectivity
  • Privacy-first applications where data never leaves the device
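Whether a 14B model fits on a given device depends largely on the precision of the stored weights. The arithmetic below is a rough sketch, not a sizing guide: it counts weights only, takes the nominal 14 billion parameters at face value, and ignores the KV cache, activations, runtime overhead, and the extra metadata that real quantization formats carry. The helper name weight_memory_gb is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Nominal parameter count from the description above (assumption: exactly 14e9).
PHI4_PARAMS = 14e9

if __name__ == "__main__":
    for label, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
        print(f"{label:>9}: {weight_memory_gb(PHI4_PARAMS, bits):.1f} GB")
    # fp16/bf16: 28.0 GB
    #      int8: 14.0 GB
    #     4-bit: 7.0 GB
```

Under these assumptions, full 16-bit weights need about 28 GB, so on-device and browser deployments would typically rely on a quantized variant.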

Architecture & training

14B-parameter dense transformer trained on a heavily curated 'textbook quality' corpus. Microsoft Research has explicitly de-emphasised raw web crawl in favour of educational content, code with explanations, and synthetic data generated by larger models (notably GPT-4). The Phi technical reports present this as repeated evidence for one hypothesis: at the 14B scale, data quality dominates data quantity for downstream benchmark performance.

Benchmarks

Benchmark       Score (%)
GPQA Diamond    56.1
MATH            80.4
MMLU            84.8
HumanEval       82.6
