80.9% vs 80.0%.
That's the SWE-bench gap between Claude Opus 4.5 and GPT-5.2 Codex. Every developer on the internet is fighting over this 0.9% like it's the difference between shipping and not shipping. My brother in Christ, your users don't care which model wrote the auth middleware.

But there's something nobody wants to talk about. The tool you prefer reveals something uncomfortable about how you see yourself.
the control tax
In one head-to-head comparison on the same Figma cloning task, Claude Code consumed 6.2 million tokens. Codex used 1.5 million. Same task. Same quality. Four times the compute burned so the developer could stay in the loop.
I call this the Control Tax: the price you pay in tokens, time, and money to feel like you're still the one coding.
Claude has CLAUDE.md files. Skills. Agents. MCP integrations. Slash commands. Plan Mode. Permission controls. Every configuration option is another hour not spent shipping. And developers LOVE configuring their environments. I've lost full days of my life testing editor extensions that make me 0.05% more productive. You have too. Don't lie.
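If you've never opened one, a CLAUDE.md is just freeform project instructions that Claude Code reads at the start of a session. A minimal, made-up sketch, purely illustrative and not from any real repo:

```markdown
# CLAUDE.md (hypothetical example)

## Conventions
- TypeScript strict mode everywhere; no `any`.
- Run `npm test` before calling a task done.

## Boundaries
- Never touch the migrations/ directory without asking first.
```

Harmless on its own. But it's the first of many files you'll end up tuning instead of shipping.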
Codex developers write one detailed prompt, send it off for 15 to 20 minutes, and go design something in Figma or write their newsletter. When they come back, there's a day's or a week's worth of code waiting. Less control. Same results. A fraction of the cost.
Claude makes you feel like an engineer. Codex forces you to admit you're a project manager. And developers would rather burn 4x the tokens than accept that.
where the 0.9% stops mattering
The benchmarks converge at the top and diverge where it actually counts.
Terminal-Bench measures command-line DevOps workflows. Deployments, migrations, the stuff that wakes you up at 3 AM. Claude hits 59.3%. Codex gets 47.6%. That's not a rounding error. That's an 11.7-point gap on the work that matters most when everything breaks.
Security is even more lopsided. Claude's offensive exploit success rate sits at 57.5%. Codex manages 32.5%. Nearly twice as good at finding holes in your code. On defense, Codex edges ahead with 90% patch success vs Claude's 87.5%. One attacks better. One defends better. Pick based on your actual threat model, not your Twitter timeline.
Claude's performance degrades in the last 20% of its context window, though. Files get forgotten. Corrected mistakes reappear. GPT-5.2 holds steady during marathon sessions without drifting. If your work demands sustained autonomous operation over hours, Claude will frustrate you in ways the benchmarks don't capture.
the uncomfortable split
Here's what nobody in this debate will say out loud.
Claude Code is the preferred tool of developers who are emotionally attached to the act of coding. Codex is the preferred tool of developers who are emotionally attached to the act of shipping.
Neither is wrong. But only one scales to running multiple agents in parallel across different parts of your codebase while you do literally anything else with your afternoon.
The developers winning with Codex changed how they work. Creative energy goes into context engineering. Detailed specs. The right prompt architecture. They front-load the thinking so the model can execute autonomously. The developers winning with Claude treat it like a fast pair programmer. Steering in real time. Iterating. Feeling the rhythm. Both approaches produce good code today. The question is which one produces good code AND gives you your time back.
For less mainstream ecosystems like Elixir, Phoenix, Rust, or Svelte, Claude still has a meaningful edge. The deeper reasoning compensates for thinner training data. If you're building in Django or React, where both models have seen everything, the Control Tax is harder to justify.
pick the job, not the side
The Codex desktop app dropped this week. Multiple updates per day. A visual command center for parallel agents across workspaces. OpenAI is iterating like their runway depends on it. Meanwhile, Anthropic still leads on MCP maturity, developer experience depth, and the raw versatility of Claude Code plus Cowork plus Opus 4.5.
The gap is closing fast. And the developers who picked a side six months ago and stopped evaluating are the ones who will be complaining about productivity in six more months.
Your favorite coding agent isn't your personality. Use Claude for infrastructure, security, and complex refactoring where mid-process judgment changes outcomes. Use Codex for well-specified feature work at scale where reviewing the output beats steering the process. Use both when you care about results more than identity.
closes laptop. touches grass.
