VibeThinker: 3B Model Outperforms Claude Opus 4.5 on Reasoning with SFT+GRPO

June 23, 2026 · 306 views · VibeThinker, SFT+GRPO, small language models, reasoning, arXiv

A research preprint quietly posted to arXiv this week is sending ripples through the machine learning community. The paper introduces VibeThinker, a 3‑billion‑parameter language model that the authors claim surpasses Anthropic's Claude Opus 4.5—a model believed to be at least 100 times larger—on several demanding reasoning benchmarks. Using a combination of supervised fine‑tuning and a reinforcement learning variant called group relative policy optimization, VibeThinker demonstrates that raw scale may no longer be the only path to advanced chain‑of‑thought reasoning. Within hours, the paper rocketed to over 300 points on Hacker News, where developers and researchers are now parsing its implications for the future of AI efficiency.

A New Recipe: SFT Meets GRPO

At the heart of VibeThinker is a two‑stage training recipe that pairs supervised fine‑tuning (SFT) with group relative policy optimization (GRPO). In the first stage, a base model is fine‑tuned on a large corpus of human‑written reasoning traces: step‑by‑step solutions to math problems, logical deductions, and multi‑hop question answering. This supervised stage gives the model a solid foundation in structured thinking. The second stage applies GRPO, a relatively new reinforcement learning method that drops the separate value (critic) network that PPO‑based RLHF pipelines typically train alongside the policy. Instead, GRPO samples a group of candidate responses for each prompt, scores each one (here, based on correctness or self‑consistency), and uses the group's average score as the baseline, so the update reinforces better‑than‑average answers and discourages worse‑than‑average ones. Because the scores described come from checking answers rather than from a learned reward model, the model iteratively refines its own reasoning without relying on a potentially expensive and brittle learned critic.
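The group‑relative scoring step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `group_relative_advantages`, the `eps` constant, and the 0/1 reward values are all assumptions made for the example, and the choice of population standard deviation is one common convention rather than something the article specifies. The policy update that consumes these advantages is omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to its own group.

    The group mean acts as the baseline, so no separate value network
    is needed. `eps` (an assumed constant) avoids division by zero when
    every answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four sampled answers to one prompt,
# scored 1.0 if the final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Correct answers get a positive advantage, incorrect ones a negative one.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")

# If every answer scores the same, the group carries no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

One consequence of this design is visible in the last line: a prompt that the model always gets right (or always gets wrong) yields zero advantage for every sample, so only prompts with mixed outcomes contribute to the update.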

According to the paper, the GRPO step was crucial for pushing the 3B model past the reasoning ceiling typically observed after supervised fine‑tuning alone. The approach inherently encourages on‑policy exploration, allowing VibeThinker to discover novel solution strategies that were not explicitly present in the training data. The technique is computationally modest: the final model was trained on a dozen A100 GPUs for less than a week, a stark contrast to the thousands of accelerators used for frontier models.
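To put that compute budget in perspective, the arithmetic below turns "a dozen A100 GPUs for less than a week" into an upper bound on GPU‑hours. The figures of 12 GPUs and 7 days are read straight from that sentence; the actual schedule is not given here, so the result is a ceiling under those assumptions, not a measurement. The function name is hypothetical.

```python
def gpu_hours_upper_bound(num_gpus: int, max_days: float) -> float:
    """Upper bound on GPU-hours if every GPU ran for the whole period."""
    return num_gpus * max_days * 24

# Assumed from the text: 12 GPUs, strictly less than 7 days.
bound = gpu_hours_upper_bound(num_gpus=12, max_days=7)
print(f"At most {bound:.0f} A100 GPU-hours")
```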

Benchmark Showdown: Matching Billion‑Parameter Giants

The preprint reports VibeThinker’s performance on standard reasoning evaluations such as GSM8K, MATH, and several logical deduction datasets. The specific numbers have not yet been peer reviewed, but the title itself asserts that the 3B model surpasses Opus 4.5, a model that, if identified correctly, is Anthropic's most performant Claude variant and ranks among the top closed‑source reasoning systems. On GSM8K, a dataset of grade‑school math word problems, VibeThinker reportedly scores within a few percentage points of the much larger model, and on a subset of MATH problems it even edges ahead. The paper also reports strong results on new benchmarks designed to probe multi‑step reasoning and instruction following, which the authors present as evidence that the performance is not the result of overfitting to a narrow test set.

The comparison has drawn scrutiny from the Hacker News community, where commenters questioned whether Opus 4.5 was evaluated under the same zero‑shot or few‑shot prompt conditions. Some suspect the Opus baseline may have been suboptimally prompted, or that the specific version of the model used is not the latest release. Still, the mere fact that a model small enough to run on a single consumer GPU is in the same conversation as a multi‑billion‑dollar cloud‑hosted counterpart is a signal of how far small‑model reasoning has advanced.
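The commenters' concern about prompt conditions comes down to whether both models saw the same template and were graded by the same rule. A minimal sketch of such a like‑for‑like harness follows. Everything in it is hypothetical: the prompt template, the `Answer:` extraction rule, the two‑item dataset, and the stub functions standing in for the two models. It does not reflect how the paper ran its evaluation.

```python
import re
from typing import Callable, List, Optional, Tuple

def build_prompt(question: str) -> str:
    # One zero-shot template shared by every model under test.
    return (
        "Solve the problem step by step, then give the final answer "
        "as 'Answer: <number>'.\n\n" + question
    )

def extract_answer(response: str) -> Optional[str]:
    # Same grading rule for every model: take the number after 'Answer:'.
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", response)
    return match.group(1) if match else None

def accuracy(model: Callable[[str], str], dataset: List[Tuple[str, str]]) -> float:
    correct = sum(
        extract_answer(model(build_prompt(question))) == gold
        for question, gold in dataset
    )
    return correct / len(dataset)

# Toy data and stub "models" so the sketch runs without any API access.
dataset = [("What is 2 + 3?", "5"), ("What is 10 - 4?", "6")]

def stub_small(prompt: str) -> str:
    return "2 + 3 = 5. Answer: 5" if "2 + 3" in prompt else "Answer: 7"

def stub_large(prompt: str) -> str:
    return "Answer: 5" if "2 + 3" in prompt else "Answer: 6"

small_acc = accuracy(stub_small, dataset)
large_acc = accuracy(stub_large, dataset)
print(f"small={small_acc:.2f} large={large_acc:.2f}")
```

Because `build_prompt` and `extract_answer` are shared, any gap between the two scores cannot be attributed to one model receiving a better template or a more lenient grader.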

Why a 3B Model That Reasons Matters

VibeThinker’s emergence highlights a growing trend in the AI world: the drive to pack maximum capability into the smallest feasible footprint. Models with 3 billion parameters can be quantized to 4‑bit and run in as little as 2 GB of VRAM, opening the door to private, offline reasoning on laptops, edge devices, and even smartphones. For developers, this means the ability to integrate advanced chain‑of‑thought processing into applications without sending sensitive data to a cloud API, addressing both latency and privacy concerns.
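The 2 GB figure is easy to sanity‑check with back‑of‑the‑envelope arithmetic. The sketch below counts weight storage only; the function name is hypothetical, and it deliberately ignores the KV cache, activations, and per‑block quantization metadata, which is roughly what the gap between the computed weight size and 2 GB has to cover.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9

params = 3e9  # 3 billion parameters, as stated in the text

fp16_gb = weight_memory_gb(params, 16)
int4_gb = weight_memory_gb(params, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
```

At 4 bits per weight the parameters occupy about 1.5 GB, compared with about 6 GB at 16 bits, which is why quantization is what makes the laptop and phone scenarios plausible.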

Furthermore, the training technique itself should be reproducible by small research labs and startups. Unlike the proprietary RLHF pipelines used by large corporations, GRPO does not require a separate value (critic) network, and when answers can be checked automatically, as with math problems, it needs neither a learned reward model nor massive human‑feedback datasets; it scores groups of the policy's own outputs directly. This democratization of advanced training could accelerate innovation, enabling a broader set of contributors to improve reasoning capabilities without being locked out by compute budgets.

VibeThinker also adds weight to the argument that reasoning—once seen as the exclusive domain of hundred‑billion‑parameter models—can be learned effectively through clever optimization objectives. If 3B models can handle multi‑step math, what about code generation, theorem proving, or scientific analysis? The paper opens a new line of inquiry into which cognitive tasks genuinely require scale and which can be solved with smarter data and algorithms.

Caution and Community Reaction

Despite the optimism, the Hacker News thread reveals a healthy dose of skepticism. Several commenters pointed out that the paper’s title is almost too bold, and that beating a known large model on a limited set of benchmarks does not automatically translate to real‑world reliability. Others argued that the Opus model name is ambiguous, saying they could find no official “Opus 4.5” listed by Anthropic, which raises the possibility that the comparison is against an internal or deprecated variant. The paper's methodology for selecting the Opus baseline and its prompt format will need to withstand close inspection during peer review.

Still, the conversation underscores a broader shift in the AI landscape. Small, open‑weight reasoning models are no longer academic curiosities; they are becoming practical tools that can be embedded in production pipelines. The VibeThinker authors have reportedly open‑sourced their training code and model weights, inviting the community to replicate and challenge the results. That transparency itself is a departure from the opaque benchmarking typical of closed models.

What’s Next for Small Reasoning Models

VibeThinker is likely just the beginning of a wave of research that combines supervised and reinforcement‑learning techniques to squeeze reasoning out of compact architectures. The same GRPO method could be applied to other base models, such as Phi‑3 or Qwen2.5, potentially lifting their reasoning scores without requiring larger scale. Additionally, the paper’s group‑based optimization approach may be extended to self‑play scenarios where the model generates multiple reasoning chains and learns to critique its own outputs.

For the broader AI ecosystem, the VibeThinker preprint reinforces the message that inference efficiency and local deployment are attainable goals even for tasks once reserved for massive clusters. Whether the model can truly stand toe‑to‑toe with GPT‑4 class reasoning in diverse, real‑world settings remains an open question, but its existence lowers the barrier to entry and reshapes the debate over how much scale is strictly necessary. Developers and enterprise architects now have one more reason to watch the small‑model space closely, and perhaps to experiment with running a tiny reasoning powerhouse on their own machines.

Source: Hacker News

345tool Editorial Team

We are a team of AI technology enthusiasts and researchers dedicated to discovering, testing, and reviewing the latest AI tools to help users find the right solutions for their needs.
