WED, 03 JUN 2026 · 18:34:56 UTC

GPT-4o

by OpenAI·USA·Released

OpenAI's multimodal flagship — text, vision, audio, and image in one model.

textvisionaudiocodechattoolslong-contextvoice
Vendor site
· 0 reviews

About this model

GPT-4o (released May 2024) was OpenAI's first model with native voice, vision, and text in a single network. The 'o' stands for 'omni' — it can accept any combination of those modalities as input and respond in voice or text in real time. ChatGPT's Advanced Voice Mode is powered by GPT-4o.

On text tasks GPT-4o is broadly comparable to Claude 3.5 Sonnet — 88.7% MMLU, 90.2% HumanEval — with the multimodal capability as the main differentiator. Pricing has dropped multiple times since launch and currently sits at $2.50/M input, $10/M output as of the most recent revision.

For text-only workloads, newer specialist models (GPT-4.1 for long context, o1/o3-mini for reasoning) often outperform GPT-4o. But for any workflow that needs vision + voice + text in one model, GPT-4o remains the default OpenAI choice.

Strengths

  • Native multimodal: voice, vision, text, image generation in one model
  • Real-time Advanced Voice Mode in ChatGPT
  • Aggressive price drops since launch — now $2.50/M input
  • Mature ecosystem of fine-tuning and tool support

Limitations

  • Beaten by GPT-4.1 on coding and long-context tasks
  • Beaten by o1/o3-mini on reasoning-heavy tasks
  • 128K context — smaller than GPT-4.1 (1M) or Gemini 2.5 Pro (2M)
  • No native video generation (Sora is a separate model)

When to use it

  • Voice-enabled assistants (ChatGPT Advanced Voice Mode)
  • Multimodal chat with image upload and analysis
  • Customer support combining voice + vision
  • General-purpose default when one model needs all modalities

Architecture & training

OpenAI has not disclosed parameter count for GPT-4o. The technical innovation is the unified end-to-end architecture: voice input is tokenised directly into the same model rather than routed through a separate ASR system, which is why response latency is much lower than the original GPT-4-with-Whisper voice pipeline. Post-training uses RLHF and follows OpenAI's standard model spec.

Benchmarks

BenchmarkScoreBar
MATH76.6
MMLU88.7
HumanEval90.2

Reviews · 0

Sign in to leave a rating.

Stories about GPT-4o

More →

Compare against

All models →