ARC-AGI-3 is the benchmark built to separate pattern-matchers from actual reasoners. It is a suite of grid-based puzzles and interactive games where the rules are visible but never spelled out, and the model has to figure them out from two or three examples. Humans solve them casually. Frontier LLMs fail at most of them.
We spent a month pointing a local Cognithor agent at ARC-AGI-3 — not because we expect to beat the state of the art, but because it is the cleanest test of whether a local agent stack can reason at all beyond what its training data contains. Here is where we are.
Two levels in
The current best run: 2 level wins on game ft09 — level 1 in 17 clicks (matching the optimal baseline), level 2 in 9 clicks. Across the full 25-game test set we have 0 wins, 2 level wins, and a lot of instructive failures.
Two levels is not much. But two levels from a local 27B model on consumer hardware against a benchmark that frontier LLMs fail is a data point worth writing down.
The architecture: dual-mode, not one model
The agent lives under src/jarvis/arc/ and has two distinct solving modes because ARC-AGI-3 is actually two problem classes in one suite:
Interactive games (keyboard + click) use a real-time agent in agent.py. The agent observes the screen, picks one of seven actions (reset, four directional, interact, click-at-xy), and iterates. For these games we lean on a multimodal vision model (qwen3-vl:32b) to describe the screen, plus a cluster-based solver (cluster_solver.py) that uses scipy.ndimage.label() to find pixel clusters and a subset search to pick the winning combination.
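To make the cluster step concrete, here is a minimal sketch of what ndimage.label() buys you. The grid is a toy, not a real ARC board, and find_clusters is an illustrative helper, not the repo's actual function:

```python
import numpy as np
from scipy import ndimage

# Toy board: 0 is background, other values are colors.
grid = np.array([
    [3, 3, 0, 0, 5],
    [3, 0, 0, 5, 5],
    [0, 0, 0, 0, 0],
    [7, 7, 0, 0, 0],
])

def find_clusters(grid):
    """Return (color, pixel_coords) for every 4-connected cluster."""
    clusters = []
    for color in np.unique(grid):
        if color == 0:
            continue
        labeled, n = ndimage.label(grid == color)  # default 4-connectivity
        for i in range(1, n + 1):
            clusters.append((int(color), np.argwhere(labeled == i)))
    return clusters

clusters = find_clusters(grid)  # three clusters: colors 3, 5, 7
```

Each cluster comes back as a color plus the pixel coordinates that belong to it, which is exactly the unit a click-based solver wants to reason over.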
Static grid puzzles use a DSL solver in classic/ — 25 programmatic primitives (rotate, reflect, tile, fill, pattern match) composed at search time. No learning required; just systematic enumeration.
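A toy version of that idea, with three illustrative primitives standing in for the repo's 25 — the enumeration loop is the real point:

```python
import numpy as np
from itertools import product

# Hypothetical primitives for illustration; the real DSL lives in classic/.
PRIMS = {
    "rot90": lambda g: np.rot90(g),
    "flip_h": lambda g: np.fliplr(g),
    "flip_v": lambda g: np.flipud(g),
}

def search_program(inp, out, depth=2):
    """Enumerate primitive sequences up to `depth`, shortest first,
    and return the first one that maps inp to out."""
    for d in range(1, depth + 1):
        for names in product(PRIMS, repeat=d):
            g = inp
            for name in names:
                g = PRIMS[name](g)
            if np.array_equal(g, out):
                return names
    return None

inp = np.array([[1, 2], [3, 4]])
out = np.fliplr(inp)
program = search_program(inp, out)  # finds ("flip_h",)
```

With 25 primitives and shallow depths the search space stays small enough that systematic enumeration is genuinely all you need.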
A hybrid solver (hybrid_solver.py) routes each game to whichever mode fits, with one GPU vision call setting up the problem and a fast CPU solver doing the hard work.
What works
The cluster solver is the hero. For games like ft09 the board is full of colored squares, and the winning move is "click these three specific ones." ndimage.label() finds every connected cluster in milliseconds, and a subset search tries small combinations until it simulates a winning state. That got us the ft09 wins.
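The subset search itself is only a few lines. In this sketch the simulate callback is a stand-in for the real win check against the game state:

```python
from itertools import combinations

def subset_search(n_clusters, simulate, max_k=4):
    """Try every small combination of cluster clicks, smallest first,
    until `simulate` reports a winning state."""
    for k in range(1, max_k + 1):
        for combo in combinations(range(n_clusters), k):
            if simulate(combo):
                return combo
    return None

# Toy win condition: the right clusters to click are 1 and 3.
winning = frozenset({1, 3})
found = subset_search(6, lambda combo: frozenset(combo) == winning)
# found == (1, 3)
```

Searching smallest subsets first means the first hit is also a minimal click sequence, which is how the level-1 win matched the optimal baseline.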
Vision is also working better than expected. qwen3-vl:32b correctly reads rules from screen state on three games we tested:
- ft09: "Click on checkered centers to turn them white."
- wa30: "Push green square to cyan bar."
- tu93: "Snake game, food is adjacent, move left."
When the vision model reads the rule correctly and the cluster solver can enumerate the action space, we solve the game. When either piece fails, we don't.
What does not work
Three big gaps, stated plainly:
Subset search explodes. Beyond 20 clusters the combinatorics blow up. C(36, 6) ≈ 1.9 million combinations, each requiring an arcade.make() simulation to check. At our current one-sim-per-combination speed that's a budget we cannot afford. The fix is simulating toggles in numpy instead of going through arcade — next iteration.
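A sketch of what that numpy path could look like, under an invented assumption: clicking cluster i flips a fixed set of cells, encoded as a boolean effect vector, so checking a combination is a vectorized XOR instead of a full environment rollout. None of these shapes come from the real game:

```python
import numpy as np
from itertools import combinations

# Invented setup: each cluster click flips a fixed cell mask, and a
# combination wins when the XOR of its effects equals the needed diff.
rng = np.random.default_rng(0)
n_clusters, n_cells = 12, 30
effects = rng.integers(0, 2, size=(n_clusters, n_cells), dtype=np.uint8)
target = np.bitwise_xor.reduce(effects[[2, 5, 9]], axis=0)  # planted answer

def solve(effects, target, max_k=4):
    """Same smallest-first search, but each check is an in-memory XOR."""
    for k in range(1, max_k + 1):
        for combo in combinations(range(len(effects)), k):
            flips = np.bitwise_xor.reduce(effects[list(combo)], axis=0)
            if np.array_equal(flips, target):
                return combo
    return None

found = solve(effects, target)  # a combo whose combined flips hit target
```

Checking a combination in memory is orders of magnitude cheaper than an arcade.make() call, which is what makes the C(36, 6) regime tractable.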
Keyboard games need real RL. Random exploration never stumbles into a win on keyboard-driven games: the action sequences are too long and the reward too sparse. We need a trained CNN predictor that can shortlist promising moves. Training one online while playing is Phase 2.
Vision JSON parsing is flaky. Roughly half of the vision model's responses come back empty or in the wrong JSON shape. num_predict had to be bumped to 8192 because the model spends 2,000–4,000 tokens "thinking" inside <think> blocks before emitting JSON. Even then, robustness is a problem.
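A defensive parsing sketch along those lines — not the exact code in the repo, just the shape of the workaround: strip the <think> block, then grab the first brace-delimited span and try to parse it:

```python
import json
import re

def extract_json(raw: str):
    """Best-effort parse of a model reply: drop <think>...</think>
    reasoning, then parse the first {...} span. Returns None on failure
    so the caller can retry instead of crashing."""
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    match = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group())
    except json.JSONDecodeError:
        return None

reply = '<think>long reasoning...</think>\nHere: {"rule": "click centers"}'
parsed = extract_json(reply)  # {"rule": "click centers"}
```

Returning None rather than raising lets the agent loop treat a bad reply as a retry, which matters when half the responses are malformed.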
An infrastructure surprise
The single biggest performance win during development was unrelated to the agent itself. We were running with OLLAMA_CONTEXT_LENGTH=262144 because a larger context seemed obviously better. It was not — the KV cache for that context size against a 32B vision model was consuming 64 GB of RAM and spilling to CPU, dropping throughput by 10x. Setting OLLAMA_CONTEXT_LENGTH=8192 put the model back on the GPU and multiplied our step rate by an order of magnitude.
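The arithmetic is easy to sanity-check. The layer and head counts below are generic assumptions for a ~32B-class model, not qwen3-vl's actual config, but they show how fast the KV cache grows with context length:

```python
def kv_cache_bytes(ctx, layers=64, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Rough KV-cache size: 2 tensors (K and V) per layer, one entry per
    context position per KV head, fp16. Config values are assumptions."""
    return 2 * layers * ctx * kv_heads * head_dim * dtype_bytes

big = kv_cache_bytes(262_144) / 2**30   # 64.0 GiB: spills to CPU
small = kv_cache_bytes(8_192) / 2**30   # 2.0 GiB: fits on GPU
```

Under these assumptions the 262144-token cache is exactly 32x the 8192-token one, which is the whole story: the cache, not the weights, is what pushed the model off the GPU.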
Every local AI stack has a lesson like this hiding in it. Write them down.
Why this matters
ARC-AGI-3 is a benchmark nobody is going to solve with a local model today. We know that. The reason to do the work anyway is that ARC is the cleanest test of reasoning outside the training distribution, and if a local-first agent stack like Cognithor cannot even engage with it, the architecture has a real limit we should know about.
Two level wins say the architecture can engage. The subset search explosion says we need better algorithms. The keyboard game failures say we need training, not just prompting.
All of those are things we can work on. That is the whole point of doing research on your own machine — the next iteration does not need a datacenter. It needs a plan and a weekend.
Follow the work
The ARC module is src/jarvis/arc/ in the main repository. If you want to see the cluster solver or the vision integration, start at agent.py and cluster_solver.py. PRs that speed up the subset search or improve vision JSON parsing are genuinely welcome.