AI is already having a major impact on how software is written, and much of the heavy lifting of programming is now performed by swarms of agents and subagents. But as developers experiment with new interfaces and form factors for human-AI collaboration, even the most advanced AI labs are finding it difficult to keep up.
The current trend is agentic software development (systems that allow AI agents to work on coding tasks independently), as exemplified by the Claude Code and Cowork apps. Meanwhile, OpenAI has been gradually building out its Codex tool, which launched as a command-line tool last April and expanded to a web interface a month later.
Now, OpenAI is taking big steps to catch up. On Monday, the company released a new macOS app for Codex that integrates many of the agentic practices that have become popular over the past year. The new app is designed to run multiple agents in parallel and to integrate agent skills and other cutting-edge workflows. The release also comes less than two months after the launch of OpenAI’s most powerful coding model, GPT-5.2-Codex, which the company hopes will be enough to attract Claude Code users.
“If you really want to do sophisticated work on complex things, 5.2 is the most powerful model we’ve ever had,” CEO Sam Altman told reporters at a press conference. “But it’s getting harder to use, so we think it’s going to be pretty important to build that level of model functionality into a more flexible interface.”
Altman’s confidence in GPT-5.2 is understandable, but the coding benchmarks tell a more complicated story. At the time of writing, GPT-5.2 holds the top spot in Terminal Bench, a test that measures how well an AI handles command-line programming tasks. However, agents running Gemini 3 and Claude Opus posted nearly identical scores: lower, but within the benchmark’s margin of error. Results from SWE Bench, another coding benchmark that tests an AI’s ability to fix bugs in real-world software, are similar, showing no clear advantage for GPT-5.2. That said, agentic use cases are difficult to benchmark effectively, and state-of-the-art models can offer very different user experiences.
The Codex app also comes with a variety of new features, and OpenAI says it can match, and in some cases outperform, the various Claude apps. Users can set up automations that run in the background on a schedule, with the results queued for review when they return. They can also choose different personalities for their agents, from down-to-earth to empathetic, depending on their working style.
But the biggest selling point for the company is the speed of development made possible by AI. “You can use this from a clean sheet of paper to create very sophisticated software in a matter of hours,” Altman said. “The ability to input new ideas as quickly as possible is the limit of what you can build.”
