Search Can't Be Learned: Study Reveals Chain-of-Thought Fails for Backtracking Tasks

June 23, 2026 · 314 views · chain-of-thought fine-tuning NVIDIA Nemotron backtracking search cryptarithm

The Illusion of Imitation

When researchers want to give a language model a new reasoning skill, the go-to strategy is straightforward: write out step-by-step solutions, then fine-tune. The rarely questioned assumption is that any procedure short enough to fit in a prompt can be transferred this way. A paper posted to Hugging Face’s daily research feed now argues that this assumption fails for an entire class of tasks, and that the failure is not a quirk of model size or training technique. Harsh Patel’s study, “A Verifiable Search Is Not a Learnable Chain-of-Thought,” reports that when a solution demands backtracking search (moving forward, hitting a dead end, and revising earlier choices), none of the chain-of-thought distillation methods it tested succeeds. The finding draws a hard boundary around what imitation learning can hope to capture.

The paper tests nine reasoning tasks generated by deterministic solvers. Tasks that require only forward computation, such as arithmetic lookups and 8-bit Boolean evaluations, transfer smoothly, with hold-out accuracy reaching 0.99 and 0.68 respectively. Cryptarithm, a classic letter-digit substitution puzzle solvable via backtracking, does not. Even though the underlying solver achieves 71% on unseen instances, the fine-tuned model never climbs above 0.07 accuracy across eleven chain-of-thought formats, reinforcement learning from verifiable rewards, and self-training. This is not a capability gap: when asked to perform the individual arithmetic steps that make up the search, the model does so correctly on 97–100% of lines and places the correct cipher among its top eight candidates 71% of the time. The problem is that it cannot carry the search forward as a coherent left-to-right derivation.
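To make "solvable via backtracking" concrete, here is a minimal sketch of a depth-first cryptarithm solver. It is an illustration only: the function name `solve_cryptarithm`, the addition-puzzle format, and the `TO + GO = OUT` example are assumptions, not the paper's generator or solver. The step to notice is the undo after a dead end, which is exactly what a left-to-right trace cannot do.

```python
from typing import Dict, List, Optional


def solve_cryptarithm(addends: List[str], total: str) -> Optional[Dict[str, int]]:
    """Find a letter-to-digit mapping such that sum(addends) == total, or None."""
    words = addends + [total]
    leading = {w[0] for w in words if len(w) > 1}  # these letters cannot be 0
    letters = sorted(set("".join(words)))
    assignment: Dict[str, int] = {}
    used = set()

    def value(word: str) -> int:
        return int("".join(str(assignment[ch]) for ch in word))

    def backtrack(i: int) -> bool:
        if i == len(letters):
            # All letters assigned: check the arithmetic.
            return sum(value(w) for w in addends) == value(total)
        letter = letters[i]
        for digit in range(10):
            if digit in used or (digit == 0 and letter in leading):
                continue
            assignment[letter] = digit
            used.add(digit)
            if backtrack(i + 1):
                return True
            # Dead end: undo this choice and try the next digit.
            del assignment[letter]
            used.discard(digit)
        return False

    return dict(assignment) if backtrack(0) else None


print(solve_cryptarithm(["TO", "GO"], "OUT"))
# {'G': 8, 'O': 1, 'T': 2, 'U': 0}   (21 + 81 = 102)
```

This sketch prunes only on digit reuse and leading zeros, so it is slow on larger puzzles; a practical solver would also check columns as it goes.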

Experimental Rigor and the Verdict-as-Token Trap

The study anchors its claims on a controlled setup. A 30-billion-parameter Nemotron model with 3.5 billion active parameters is fine-tuned using rank-32 LoRA adapters. The cryptarithm task is drawn from a generator that creates fresh instances for the public and hidden splits, so that held-out data provides a genuine out-of-distribution check. The fine-tuned model’s failure to exceed 0.07 accuracy after extensive training sets the stage for a deeper diagnosis.

Patel finds that what the model actually learns is the surface shape of a verifiable elimination step, not the logic behind it. The model emits verdicts such as “this mapping is impossible” that look like legitimate reasoning but function as unconditional templates, correct only 16–57% of the time. The paper coins the term “verdict-as-token” to describe this phenomenon: the model memorizes when to say a step fails, without conditionally verifying it. When an intervention reveals the underlying cipher key, turning the backtracking derivation into a forward-computable trace, accuracy on the very same instances leaps from 0.03 to 0.57. That dramatic lift isolates the root cause: the absence of a faithful forward chain-of-thought to imitate. When the only path to a solution involves searching over information-free structure, there is simply no sequence of next-token predictions that captures the exploration.
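The contrast can be sketched in code. Once the key is given, checking a cryptarithm is a single right-to-left pass in which every line, including its verdict, is computed from the line before it. The snippet below is an illustrative assumption about what such a forward trace could look like; `forward_trace` and its line format are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple


def forward_trace(addends: List[str], total: str,
                  key: Dict[str, int]) -> Tuple[List[str], bool]:
    """Check a column addition under a known key, emitting one line per column."""
    if any(len(w) > len(total) for w in addends):
        return [], False  # an addend longer than the total can never fit
    lines: List[str] = []
    carry = 0
    for col in range(len(total)):  # col 0 is the rightmost column
        digits = [key[w[-1 - col]] for w in addends if col < len(w)]
        column_sum = sum(digits) + carry
        expected = key[total[-1 - col]]
        # The verdict is conditional: it is derived from the arithmetic above.
        verdict = "ok" if column_sum % 10 == expected else "FAIL"
        terms = " + ".join(str(d) for d in digits + [carry])
        lines.append(
            f"column {col}: {terms} = {column_sum}, "
            f"digit {column_sum % 10}, expected {expected}: {verdict}"
        )
        if verdict == "FAIL":
            return lines, False
        carry = column_sum // 10
    return lines, carry == 0


trace, ok = forward_trace(["TO", "GO"], "OUT", {"T": 2, "O": 1, "G": 8, "U": 0})
print("\n".join(trace), ok)
```

No step here requires revisiting an earlier choice, which is the property the key-revealing intervention restores.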

The Ceiling Is Independent of Scale

One might suspect that larger models could smash through the barrier, but the study tests backbones ranging from 3 billion to 671 billion parameters and observes the same hard ceiling across both fine-tuning and prompting. The failure is not due to insufficient capacity; it is architectural, tied to the autoregressive nature of chain-of-thought generation. The model cannot backtrack mid-generation because each token is committed once produced. True search requires the ability to undo earlier decisions, a capability absent from standard left-to-right generation.

The paper’s supplementary experiments reinforce this point. Even reinforcement learning from verifiable rewards, which in theory could teach the model to explore, yields no improvement. Self-training on the solver’s own outputs also stalls. The data are unambiguous: whenever the procedure’s only solver is backtracking search, the model does not learn to search. Instead, it latches onto patterns in the training traces that do not generalize.

A competition backdrop adds practical confirmation. The paper notes that the first-place solution for a related private-leaderboard task circumvented the learnability problem by precomputing the combinatorial core of the search into a catalog. That approach reduced the trace to recall plus verification, leaving the model to recognize which precomputed answer matched the instance. The solution achieved a private leaderboard score of 0.92, dramatically outperforming any pure chain-of-thought approach. This is not a triumph of better training; it is an engineering workaround that supports the paper’s thesis: what distills is memorization and verification, not search.
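The article does not describe the winning catalog's format, so the following is only a hypothetical sketch of the recall-plus-verification idea on a toy family of puzzles (two two-digit addends). The names `canonical`, `build_catalog`, and `recall_and_verify` are invented for illustration. The catalog is built by forward enumeration of sums, and a query involves one lookup and one arithmetic check, with no search.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple


def canonical(words: Tuple[str, ...]) -> tuple:
    """Rename symbols by order of first appearance, so 'TO','GO','OUT'
    and '21','81','102' share the same pattern."""
    ids: Dict[str, int] = {}
    return tuple(tuple(ids.setdefault(ch, len(ids)) for ch in w) for w in words)


def build_catalog() -> Dict[tuple, list]:
    """Precompute pattern -> list of (a, b, a+b) digit strings."""
    catalog = defaultdict(list)
    for a in range(10, 100):
        for b in range(10, 100):
            triple = (str(a), str(b), str(a + b))
            catalog[canonical(triple)].append(triple)
    return catalog


def recall_and_verify(catalog, addend1: str, addend2: str,
                      total: str) -> Optional[Dict[str, int]]:
    words = (addend1, addend2, total)
    for numbers in catalog.get(canonical(words), []):  # recall
        key = {ch: int(d) for w, n in zip(words, numbers) for ch, d in zip(w, n)}
        a, b, c = (int(n) for n in numbers)
        if a + b == c:  # verification (cheap, forward-only)
            return key
    return None


catalog = build_catalog()
print(recall_and_verify(catalog, "TO", "GO", "OUT"))
```

Because matching patterns already imply a consistent one-to-one mapping, the verification step here is nearly redundant; it is kept to mirror the recall-then-verify shape described above.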

Implications for the Next Wave of AI Reasoning

The work arrives at a moment when the industry is betting heavily on scaling chain-of-thought and reinforcement learning to build more capable reasoning agents. Systems like OpenAI’s o‑series and DeepSeek‑R1 rely on extended chains of thought to tackle math and logic puzzles. Patel’s results suggest that for tasks with a non-trivial backtracking component, a description that fits many real-world planning, scheduling, and theorem-proving tasks, more reasoning tokens will not suffice. You cannot amplify a capability that was never genuinely installed.

The paper’s controlled intervention hints at a path forward. By revealing the cipher key, the task becomes forward-computable and learnable. In practice, that suggests hybrid architectures: language models that delegate backtracking search to external symbolic solvers or that use retrieval-augmented generation to fetch precomputed search results. Rather than forcing the model to simulate exploration one token at a time, systems might be designed to call out to dedicated search components when needed. The catalog-based competition winner provides a concrete template for such designs.
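One way to picture the delegation described above is a thin interpreter that scans model output for a search request and hands it to an ordinary solver. Everything in this sketch is hypothetical: the `<search>…</search>` tag, the `run_with_tools` function, and the brute-force `external_solver` are illustrative stand-ins, not an interface from the paper or from any real tool-use API.

```python
import re
from itertools import permutations
from typing import Dict, Optional

# Hypothetical control syntax: <search>WORD+WORD=WORD</search>
CALL = re.compile(r"<search>(\w+)\+(\w+)=(\w+)</search>")


def external_solver(a: str, b: str, c: str) -> Optional[Dict[str, int]]:
    """Exhaustive search outside the model; fine for small puzzles."""
    letters = sorted(set(a + b + c))

    for digits in permutations(range(10), len(letters)):
        key = dict(zip(letters, digits))
        if any(len(w) > 1 and key[w[0]] == 0 for w in (a, b, c)):
            continue  # no leading zeros on multi-letter words

        def num(word: str) -> int:
            return int("".join(str(key[ch]) for ch in word))

        if num(a) + num(b) == num(c):
            return key
    return None


def run_with_tools(model_output: str) -> str:
    """Replace each search request with the solver's result."""
    def replace(match: re.Match) -> str:
        key = external_solver(*match.groups())
        if key is None:
            return "no solution"
        return ", ".join(f"{k}={v}" for k, v in sorted(key.items()))

    return CALL.sub(replace, model_output)


print(run_with_tools("Answer: <search>TO+GO=OUT</search>"))
# Answer: G=8, O=1, T=2, U=0
```

The model's job shrinks to recognizing when a search is needed and formatting the request; the exploration itself happens where undoing a choice is cheap.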

For researchers and engineers building on the paper’s findings, the immediate takeaway is diagnostic: if a task’s solution space requires exploring multiple branches before committing, chain-of-thought fine-tuning alone will not deliver. The paper provides a crisp litmus test for whether imitation is viable: can the task be solved with a forward-only trace? It also underscores the importance of curating fine-tuning data that mirrors the decision-making process the model will actually execute, not just the steps a human or oracle solver might write down after the fact.

What to Watch

The study will likely accelerate work on integrating search into neural models in ways that respect the token-by-token generation constraint. Expect a renewed focus on training models to emit search-control tokens that trigger backtracking in an external interpreter, similar to how tool-use APIs allow models to call calculators. The open-source release of code and models on GitHub under the project “search-not-learnable” also invites community replication and extension. As language models are pressed into ever more ambitious reasoning roles, understanding the limits of demonstration-based training will be critical to building systems that genuinely reason, rather than just echo the form of reasoning.

Source: HuggingFace Papers

345tool Editorial Team

We are a team of AI technology enthusiasts and researchers dedicated to discovering, testing, and reviewing the latest AI tools to help users find the right solutions for their needs.

