Technology@lemmy.ml · English · 7 hours ago

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

arxiv.org

☆ Yσɠƚԋσʂ ☆@lemmy.ml to Technology@lemmy.ml · English · 7 hours ago

This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building upon the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline that includes curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations demonstrate that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it attains a score of 94.3 on AIME26 (improving to 97.1 with claim-level test-time scaling), an 80.2 Pass@1 on LiveCodeBench v6, and exhibits strong out-of-distribution generalization with a 96.1% acceptance rate on recent unseen LeetCode contests. This effectively places it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes, but a complementary path toward frontier-level performance in parameter-dense capability regimes.

What we have here is a reality check for the current obsession with blindly scaling up parameter counts: the report suggests you can squeeze frontier-level verifiable reasoning into a tiny 3B-parameter model. It scored 94.3 on the extremely difficult AIME26 math benchmark and 80.2 Pass@1 on LiveCodeBench v6, which puts this very small model in the same weight class as massive flagship models like Gemini 3 Pro.
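For anyone unsure what the Pass@1 number means: a common way to compute pass@k is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. Below is a minimal sketch of that estimator. Whether the report computes its LiveCodeBench score exactly this way is an assumption on my part, and the function names are just for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed, k: budget being scored.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 this reduces to the plain fraction of samples that pass, averaged over problems, so 80.2 Pass@1 reads as roughly four out of five single attempts succeeding.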

They pulled it off with an optimized post-training pipeline built on their Spectrum-to-Signal paradigm. It starts with curriculum-based supervised fine-tuning, which teaches the model broad concepts before forcing it to focus on extremely hard, long reasoning problems. After that they ran multi-domain reinforcement learning with a huge 64K context window so the model could actually finish its long chains of thought without being artificially truncated. Another trick was a Long2Short reinforcement learning stage designed to make the model more token-efficient in its math reasoning without losing accuracy. Finally, they tied it all together with offline self-distillation to bake the advanced reasoning skills into the base model.
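To give a feel for what a Long2Short-style objective could look like, here is a toy reward that only pays out for correct answers and shaves off a little for every token used. This is not the paper's reward function, which isn't spelled out in this post. The name `long2short_reward`, the 0.2 penalty weight, and treating the 64K window as a 65536-token budget are all my own assumptions.

```python
def long2short_reward(
    is_correct: bool,
    num_tokens: int,
    budget: int = 65536,          # assumed: 64K context treated as the token budget
    penalty_weight: float = 0.2,  # hypothetical strength of the length penalty
) -> float:
    """Toy length-aware reward: correctness first, brevity second."""
    if not is_correct:
        # Wrong answers get nothing, so being short never beats being right.
        return 0.0
    # Fraction of the budget consumed, clamped to [0, 1].
    used = min(max(num_tokens, 0), budget) / budget
    # Correct answers earn between 1 - penalty_weight and 1.
    return 1.0 - penalty_weight * used
```

The design point is that the penalty is capped below the correctness reward, so the policy is nudged toward shorter reasoning only among answers that are already right.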

The authors argue that the industry has been conflating two different types of artificial intelligence capabilities. Memorizing world knowledge and random facts naturally requires a large number of parameters. Pure verifiable reasoning like math and code, however, is parameter-dense: it is mostly search, constraint satisfaction, and error correction, so it compresses well. That means you can pack a world-class reasoning engine into a tiny model without needing hundreds of billions of parameters to store random trivia. A big takeaway here is that small models aren’t just cheap fallbacks for when you cannot afford massive compute; they can legitimately be used to build top-tier reasoning systems.
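For a sense of scale, here is the back-of-the-envelope arithmetic on weight storage alone (parameters times bytes per parameter). The 300B figure is just an illustrative stand-in for "hundreds of billions", not the size of any specific model, and this ignores KV cache and activation memory.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "bf16") -> float:
    """Approximate memory for model weights only, in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


small = weight_memory_gb(3e9)    # 3B parameters, as in the post
large = weight_memory_gb(300e9)  # hypothetical 300B-parameter model
print(f"3B @ bf16:   {small:.0f} GB")
print(f"300B @ bf16: {large:.0f} GB")
```

At bf16 that works out to about 6 GB versus about 600 GB of weights, which is the difference between a single consumer GPU and a multi-GPU server.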

https://huggingface.co/WeiboAI/VibeThinker-3B

Even better, there is a version fine-tuned for tool calling: https://huggingface.co/Shadow0482/mythos_fast

geneva_convenience@lemmy.ml · 4 upvotes · 4 hours ago
10/10 would model again

whatiswrongwithyou@lemmy.ml · 2 upvotes · 3 hours ago
Thinkin’ bout those vibes
