GPT-4.1
Superseded · by OpenAI · USA · Released
OpenAI's coding-focused refresh with a full 1M-token context window.
About this model
GPT-4.1 (April 2025) was OpenAI's coding-focused refresh. The headline feature is a full 1M-token context window, up from 128K in GPT-4o, which matches Gemini 2.5 Pro and unlocks workflows like whole-codebase analysis. GPT-4.1 is also priced below GPT-4o, at $2 per million input tokens and $8 per million output tokens, despite the larger context.
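To make the pricing concrete, here is a minimal sketch of a per-request cost estimate. It uses only the two prices quoted above; the function name and the example token counts are hypothetical, and it ignores any discounts (caching, batch) that may apply in practice.

```python
# Prices in USD per million tokens, as quoted in the text above.
INPUT_PRICE_PER_M = 2.00
OUTPUT_PRICE_PER_M = 8.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at the listed per-million-token prices."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical request: fill the whole 1M-token window, get a 2,000-token answer.
full_window = estimate_cost_usd(1_000_000, 2_000)
print(f"${full_window:.3f}")  # prints $2.016
```

At these rates a request that fills the entire window costs about two dollars in input alone, so whole-codebase prompts are worth budgeting per call rather than per month.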
On SWE-bench Verified, GPT-4.1 scored 54.6% at launch, a meaningful jump over GPT-4o's mid-30s. That put it ahead of DeepSeek V3 (42%) but still well behind Claude Sonnet 4 (72.7%). OpenAI positioned GPT-4.1 explicitly as a workhorse for production coding workflows, with GPT-4.1 mini and nano variants for cheaper tiers.
Strengths
- 1M-token context window, available across the existing GPT-4-era multimodal API
- Cheaper than GPT-4o at $2/M input despite the larger context
- Strong on instruction-following and structured outputs
- GPT-4.1 mini and nano variants extend the family downward
Limitations
- SWE-bench score (54.6%) trails Claude Sonnet 4 and Opus 4 substantially
- No native audio (use GPT-4o or Whisper for voice)
- Closed weights, no fine-tuning access for the full 1M context
When to use it
- Whole-codebase analysis and refactor planning
- Long-document Q&A (200K+ token inputs)
- Structured-output pipelines (JSON schema generation)
- Production coding agents where Claude isn't an option
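For the structured-output use case, a pipeline would typically validate whatever JSON the model returns before passing it downstream. The sketch below shows one way to do that with the standard library only. The field names (`file`, `line`, `severity`) and the function name are hypothetical, chosen for illustration, and no OpenAI API call is made.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"file": str, "line": int, "severity": str}


def parse_structured_reply(raw: str) -> dict:
    """Parse a model reply expected to be a JSON object and check its fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # bool is a subclass of int in Python; no field here is boolean,
        # so booleans are rejected outright.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"field {name!r} should be {expected_type.__name__}")
    return data


reply = '{"file": "app.py", "line": 42, "severity": "low"}'
print(parse_structured_reply(reply)["line"])  # prints 42
```

Raising a single exception type keeps the retry logic simple: the caller can catch `ValueError` and re-prompt without caring which check failed.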
Architecture & training
OpenAI has not disclosed the architecture beyond noting that GPT-4.1 was trained 'to better serve developers' with explicit weighting toward coding tasks in the RLHF stage. The 1M-context capability uses positional encoding extensions similar to those documented in earlier OpenAI research on long-context inference; OpenAI cautions in the model card that quality degrades somewhat past the 200K mark.
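The two thresholds mentioned above (the 1M-token window and the 200K mark past which quality reportedly degrades) suggest a simple pre-flight check before sending a long input. The following is a sketch under stated assumptions: the four-characters-per-token ratio is a crude heuristic rather than a real tokenizer, and the function names and labels are hypothetical.

```python
CONTEXT_WINDOW = 1_000_000   # tokens, per the text above
DEGRADATION_MARK = 200_000   # tokens, past which quality reportedly drops
CHARS_PER_TOKEN = 4          # assumption: rough heuristic, not a tokenizer


def estimate_tokens(text: str) -> int:
    """Roughly estimate the token count from character length (rounds up)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def classify_input(token_count: int) -> str:
    """Label an input by where it falls relative to the two thresholds."""
    if token_count > CONTEXT_WINDOW:
        return "too_large"
    if token_count > DEGRADATION_MARK:
        return "fits_with_caution"
    return "fits"


print(classify_input(150_000))    # prints fits
print(classify_input(600_000))    # prints fits_with_caution
print(classify_input(1_200_000))  # prints too_large
```

For inputs in the middle band, a caller might chunk or summarize the document first instead of relying on the full window; for a precise count, a real tokenizer should replace the heuristic.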
Benchmarks
| Benchmark | Score |
|---|---|
| MMLU | 88.7 |
| HumanEval | 91.5 |