OpenAI Blog
A shared playbook for trustworthy third party evaluations
OpenAI published recommendations for designing trustworthy third-party evaluations of frontier AI models. The post emphasizes that the surrounding "harness" (the environment, tools, and setup that enable agentic execution) fundamentally shapes both the capabilities an evaluation measures and how robust a model's safeguards appear. It categorizes evaluation claims and urges evaluators to report their setup, budget, and validity checks transparently, so that results are not distorted by under-elicitation or miscalibration.