ARC-AGI-3 Shows Three Reasoning Errors in GPT-5.5 and Opus 4.7
The ARC Prize Foundation examined 160 game runs from OpenAI's GPT-5.5 and Anthropic's Opus 4.7 on the ARC-AGI-3 benchmark. The review identifies three consistent reasoning mistakes that keep both models below 1 percent.
AI benchmarks typically indicate pass or fail outcomes. The ARC Prize Foundation went deeper by studying 160 replays and reasoning traces from GPT-5.5 and Opus 4.7 in ARC-AGI-3's interactive setups.
Released in late March 2026, ARC-AGI-3 challenges AI systems through turn-based game environments. Agents must explore on their own, develop hypotheses, and execute plans without guidance. This differs from earlier versions focused on static patterns.
All tested frontier models score below 1 percent, while humans solve the tasks without prior exposure. The latest results follow suit: GPT-5.5 tops the ARC-AGI-3 leaderboard at 0.43 percent, at a cost of about $10,000, and Opus 4.7 reaches 0.18 percent.
Errors Traced Through Reasoning Records
Benchmark creators find the failure reasons more revealing than scores. Reasoning traces capture how models form, reject, or cling to hypotheses.
Models Spot Details, Ignore Overall Mechanics
Both models show three shared error types, though expressed differently. First, they detect local actions but fail to build a complete world model. A model might note an action rotates an object yet overlook how rotation affects value placement or alignment needs.
In game cd82, Opus 4.7 knew by step 4 that ACTION3 rotates a container. By step 6, it saw that ACTION5 pours paint. Still, it never combined the two into a plan: align the bucket, then dip it to match the target image in the top left.
A similar issue appeared in cn04: Opus 4.7 found the rotate-then-place mechanic at step 23 but then chased a false target and an imaginary progress bar.
Training Games Create Wrong Assumptions
Second, models mix up novel setups with known training games like Tetris, Frogger, Sokoban, Breakout, Pong, or Boulder Dash. Visual hints lead to misguided theories and wasted moves.
GPT-5.5 treated ls20 as Breakout, though the game actually involves key combinations. Its trace read: "Then again, it could be more like 'Breakout,' with bricks at the top and a paddle. The central object might be the ball." This unfounded idea blocked progress in a way it would not for a human who knows Breakout.
Success Without Real Insight
Third, solving a level does not build true comprehension, as models skip verifying their methods. In ka59, Opus 4.7 cleared level 1 in 37 actions via a false teleport-on-click idea. The actual rules demand shape-matching and pushing, but level 1's simplicity masked the error.
Treating that success as proof, it stuck to "click each target to fill it" on level 2 and got caught in a loop.
In ar25, Opus 4.7 solved level 1 after grasping the mirrored-motion mechanic and noticed that level 2 has a movable axis. Yet it then shifted to invented rules like punching holes or mirroring objects, burying the correct path.
These examples show that unverified wins carry errors forward into later levels.
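The verification step the models skip can be illustrated with a toy sketch. Everything in it is a hypothetical stand-in rather than part of ARC-AGI-3 or the foundation's tooling: states are plain integers, and the two rules ("teleport on click" and "push one step") are made-up analogues of the ka59 confusion. It shows only the general idea that a hypothesis should survive every observed transition, not just the ones that happened to end in a win.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical toy types: a state is an integer position, an action is a string.
State = int
Transition = Tuple[State, str, State]  # (state, action, observed next state)


@dataclass
class Hypothesis:
    name: str
    predict: Callable[[State, str], State]  # what the agent believes an action does


def surviving(hypotheses: List[Hypothesis],
              transitions: List[Transition]) -> List[Hypothesis]:
    """Keep only hypotheses that correctly predict every observed transition."""
    return [
        h for h in hypotheses
        if all(h.predict(s, a) == s_next for s, a, s_next in transitions)
    ]


# Two competing (made-up) rules: clicking jumps to position 5, or moves one step.
teleport = Hypothesis("teleport_on_click", lambda s, a: 5 if a == "click" else s)
push = Hypothesis("push_one_step", lambda s, a: s + 1 if a == "click" else s)

# Level 1 is too simple to tell the rules apart: both predict 4 -> 5.
level1 = [(4, "click", 5)]
# Level 2 separates them: only the push rule predicts 2 -> 3.
level2 = [(2, "click", 3)]

after_level1 = surviving([teleport, push], level1)
after_level2 = surviving(after_level1, level1 + level2)

print([h.name for h in after_level1])  # both rules still fit
print([h.name for h in after_level2])  # only the push rule remains
```

In this sketch, an agent that stops checking after level 1 has no reason to prefer either rule. Only the contradicting observation on level 2 rules out the teleport idea.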
Opus Clings Wrong, GPT-5.5 Wanders
Opus 4.7 spots mechanics sooner, like ar25's mirror, solving level 1 fast. But it grips false rules tightly. In cn04, it crafted a progress-conversion idea and clicked pointlessly early on.
GPT-5.5 generates a wider range of hypotheses and hits the right ones more often, but it fails to turn them into action plans. On ar25, it saw the mirror yet cycled through Tetris, Frogger, Pong, and Tower of Hanoi without settling on any of them.
"The difference comes down to compression. Opus compressed its observations into a confident but wrong theory. GPT-5.5 had difficulty compressing at all," writes Greg Kamradt of the ARC Prize Foundation.
Patterns Apply to Real Agents
The foundation links these flaws to practical AI agents. Humans solved all 135 environments without training. Models struggle to navigate unfamiliar environments, test theories, and update them, the same skills an agent needs when it meets new websites, tools, or APIs.
"Scores tell you what a model achieved. Replays tell you whether or not the reasoning is likely to generalize," Kamradt writes. The foundation plans to keep auditing major model releases with ARC-AGI-3.
Supporting Research
This work supports long-time critics who view language models as pattern matchers without understanding. GPT-5.5's Breakout label on ls20 suggests interpolation over abstraction. Opus 4.7's lucky ka59 win and subsequent theory lock fit a picture of chasing correlations rather than building causal models.
Other findings align: Apple research showed that reasoning performance drops as puzzle complexity rises. A cognitive study of over 171,000 traces found that models fall back on defaults instead of reasoning on tough tasks. A medical paper noted that DeepSeek-R1 and o3-mini fail on reworded questions, hinting at reliance on patterns.
The ARC Prize Foundation, tied to François Chollet's push for AGI benchmarks beyond memorization, highlights persistent reasoning gaps in top models from OpenAI and Anthropic, leaders in frontier AI since their foundings in 2015 and 2021.