Tau2 Benchmark: How a Prompt Rewrite Boosted GPT-5-mini by 22%
This is a good reminder that eval improvements do not always come from a larger model. Sometimes the gain comes from rewriting the task boundary so the model can actually see the job.
The caveat is obvious but important: benchmark gains need translation into product behavior. Still, a 22 percent lift from prompt structure is worth pinning next to any agent eval work.