Weekly Digest

Feb 16-22, 2026

Gemini 3.1 Pro doubles down on real reasoning

Stories80

Unverified10

Read time5 min read

80 Stories10 unverified5 min read

Listen as a podcast

Listen as podcast

0:00/5:40

The Big Picture

For years, “reasoning” in AI has meant impressive answers that sometimes collapse the moment the problem changes shape. Last week, a new Gemini 3.1 Pro result put a hard number on progress: 77.1% on ARC-AGI-2, more than double Gemini 3 Pro, on puzzles designed to punish memorization.

The rest of the week filled in the supporting cast. Researchers showed a counterintuitive training trick: under fixed compute, repeating a small set of high-quality step-by-step examples can beat simply scaling data. On the safety front, Labelbox found that many benchmarks still miss “intent laundering,” where users remove obvious trigger phrases and slip past filters, while a large “prefill attack” study showed reliable bypasses on open models.

Underneath it all, the infrastructure race kept accelerating. Meta’s multiyear partnership with NVIDIA points to “millions” of Blackwell GPUs headed into hyperscale data centers, and NVIDIA’s own GB300 NVL72 numbers claimed up to 50× better performance per watt and 35× lower cost per token for agentic inference. That combination pushes AI from chat into always-on tools that plan, execute, and pay for actions.

Next up: watch whether labs respond by publishing stronger real-world agent evaluations, and whether open models can harden against prompt-layer bypasses without sacrificing the new wave of reasoning gains.

AGI Probability Assessment

View TrackerTracker

63.5%+1.5%

Est. 18 months to AGI

Chance of production-ready AGI within 3 years, assessed by AI analysis of this week's developments

Last week’s momentum toward stronger generalization and long-horizon competence was reinforced by Google’s reported 77.1% on ARC-AGI-2 for Gemini 3.1 Pro, a benchmark explicitly designed to punish memorization and reward novel pattern adaptation. Efficiency signals also strengthened: the “repeat high-quality reasoning traces” result suggests algorithmic/data gains can still unlock capability under fixed compute, while NVIDIA’s GB300 NVL72 claims (50× perf/W, 35× lower cost per token) and Meta’s “millions of GPUs” buildout make always-on agentic inference more economically plausible. Offsetting this, the Labelbox “intent laundering” finding and the broad prefill-attack study emphasize that safety and robustness for production deployment remain behind capability progress, limiting how fast these systems can be trusted as autonomous general workers.

Last Week in Numbers

77.1%

Gemini 3.1 Pro score on ARC-AGI-2

50×

Claimed performance-per-watt gain for GB300 NVL72 agentic inference

235GB

Alleged size of leaked LLaMA 4 model files

77.1%

Gemini 3.1 Pro score on ARC-AGI-2

50×

Claimed performance-per-watt gain for GB300 NVL72 agentic inference

235GB

Alleged size of leaked LLaMA 4 model files

Key Developments

Major|x.com

Gemini 3.1 Pro posts 77.1% ARC-AGI-2

This is significant because ARC-style tests reward adapting to novel logic patterns, not just recalling familiar formats. Previously Gemini 3 Pro scored far lower; now Google is claiming more than a 2× jump, suggesting real improvements in generalization on tricky reasoning tasks.

For instance

More weeklies

AI Starts Writing Science, While Context Hits 1M TokensOlder $110B for compute, and agents that jailbreak themselvesNewer

Weekly Digest

Terminal

Weekly Digest

Weekly Digest

Weekly Digest

Weekly Digest

Gemini 3.1 Pro doubles down on real reasoning

Gemini 3.1 Pro posts 77.1% ARC-AGI-2

Repeated reasoning examples outperform scaling, under fixed compute

Meta and NVIDIA deepen buildout for millions of GPUs

Safety benchmarks fail against “intent laundering,” Labelbox finds

Large study finds “prefill” attacks bypass open LLM safeguards

NVIDIA’s GB300 NVL72 claims 50× perf/W, 35× lower cost

Alleged LLaMA 4 leak claim surfaces on dark web

Circle launches USDC “nanopayments” for AI agents