Phi-4

Open weights

by Microsoft Research·USA·Released Dec 12, 2024

14B-parameter small language model that scores competitively with much larger models, a result Microsoft attributes to curated and synthetic training data.

text · chat · reasoning · math

About this model

Phi-4 (December 2024) is Microsoft Research's continued bet on the 'small model trained on perfect data' thesis. At 14B parameters Phi-4 scores competitively with much larger models on reasoning and coding benchmarks — 80.4% on MATH, 82.6% on HumanEval, 56.1% on GPQA Diamond.

The Phi team has built a substantial body of research around the idea that, in the small-model regime, data quality matters far more than data quantity. Phi-4's training corpus is described as 'textbook quality': heavily curated educational content, code with explanations, and synthetic data generated by larger models.

Released under the MIT license. Phi-4 is a strong option for on-device or edge inference when a 14B model is the largest that fits in memory.
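As a rough sketch of what "fits" means for a model of this size, the weights-only memory footprint can be estimated from the parameter count and the number of bits stored per weight. The function name and the list of precisions below are illustrative assumptions, the 14B figure is the nominal size quoted above rather than an exact parameter count, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 2**30


# Nominal "14B" figure from the text; the exact parameter count may differ.
PHI4_PARAMS = 14e9

# Hypothetical precision choices, for illustration only.
PRECISIONS = [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]

if __name__ == "__main__":
    for label, bits in PRECISIONS:
        gib = weight_memory_gib(PHI4_PARAMS, bits)
        print(f"{label:>9}: {gib:.1f} GiB")
```

Halving the bits per weight halves this estimate, which is why quantized variants are the usual route to running a 14B model on consumer hardware. Actual memory use will be higher than the weights-only number.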

Strengths

•Best reasoning-per-parameter ratio in the small-model regime
•Designed for on-device, edge, and browser inference
•MIT license, one of the most permissive open-source licenses
•Strong synthetic-data methodology, documented in Microsoft's papers

Limitations

•14B is too small for the hardest reasoning tasks
•16K context window — much smaller than frontier models
•Less suitable for creative writing than larger models
•Limited tool-use ecosystem vs Claude / GPT
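To make the context-window limitation concrete, the sketch below computes how many tokens remain for generation once a prompt is loaded. It assumes "16K" means 16,384 tokens, and the function name and `reserve` parameter are hypothetical; token counts would come from whatever tokenizer the deployment uses.

```python
# Assumption: "16K" is taken to mean 16 * 1024 = 16,384 tokens.
CONTEXT_WINDOW = 16 * 1024


def max_new_tokens(prompt_tokens: int, reserve: int = 0,
                   window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for generation after the prompt and an optional reserve.

    Returns 0 when the prompt alone already fills or exceeds the window.
    """
    if prompt_tokens < 0 or reserve < 0:
        raise ValueError("token counts must be non-negative")
    return max(0, window - prompt_tokens - reserve)


if __name__ == "__main__":
    print(max_new_tokens(12_000))               # room left after a long prompt
    print(max_new_tokens(12_000, reserve=512))  # keep some slack, e.g. for a system prompt
    print(max_new_tokens(20_000))               # prompt does not fit at all
```

A check like this is most useful before long-document or multi-turn workloads, where a prompt that would fit comfortably in a frontier model's window may need truncation or chunking here.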

When to use it

→On-device AI assistants (Copilot+ PCs, mobile apps)
→Browser-resident inference via WebGPU / WebLLM
→Edge deployments without cloud connectivity
→Privacy-first applications where data never leaves the device

Architecture & training

14B-parameter dense transformer trained on a heavily curated 'textbook quality' corpus. Microsoft Research has explicitly de-emphasised raw web crawl in favour of educational content, code with explanations, and synthetic data generated by larger models (notably GPT-4). The Phi technical reports repeatedly support the underlying hypothesis: at this scale, data quality matters more than data quantity for downstream benchmark performance.

Benchmarks

Benchmark	Score (%)
GPQA Diamond	56.1
MATH	80.4
MMLU	84.8
HumanEval	82.6

Compare against

GLM-4.5

Qwen3-Coder

Kimi K2

MiniMax-M1
