For years, “reasoning” in AI has meant impressive answers that sometimes collapse the moment the problem changes shape. Last week, a new Gemini 3.1 Pro result put a hard number on progress: 77.1% on ARC-AGI-2, more than double Gemini 3 Pro, on puzzles designed to punish memorization.
The rest of the week filled in the supporting cast. Researchers showed a counterintuitive training trick: under fixed compute, repeating a small set of high-quality step-by-step examples can beat simply scaling data. On the safety front, Labelbox found that many benchmarks still miss “intent laundering,” where users remove obvious trigger phrases and slip past filters, while a large “prefill attack” study showed reliable bypasses on open models.
Underneath it all, the infrastructure race kept accelerating. Meta’s multiyear partnership with NVIDIA points to “millions” of Blackwell GPUs headed into hyperscale data centers, and NVIDIA’s own GB300 NVL72 numbers claimed up to 50× better performance per watt and 35× lower cost per token for agentic inference. That combination pushes AI from chat into always-on tools that plan, execute, and pay for actions.
Next up: watch whether labs respond by publishing stronger real-world agent evaluations, and whether open models can harden against prompt-layer bypasses without sacrificing the new wave of reasoning gains.
Last week’s momentum toward stronger generalization and long-horizon competence was reinforced by Google’s reported 77.1% on ARC-AGI-2 for Gemini 3.1 Pro, a benchmark explicitly designed to punish memorization and reward novel pattern adaptation. Efficiency signals also strengthened: the “repeat high-quality reasoning traces” result suggests algorithmic/data gains can still unlock capability under fixed compute, while NVIDIA’s GB300 NVL72 claims (50× perf/W, 35× lower cost per token) and Meta’s “millions of GPUs” buildout make always-on agentic inference more economically plausible. Offsetting this, the Labelbox “intent laundering” finding and the broad prefill-attack study emphasize that safety and robustness for production deployment remain behind capability progress, limiting how fast these systems can be trusted as autonomous general workers.
This is significant because ARC-style tests reward adapting to novel logic patterns, not just recalling familiar formats. Previously Gemini 3 Pro scored far lower; now Google is claiming more than a 2× jump, suggesting real improvements in generalization on tricky reasoning tasks.
Gemini 3.1 Pro’s 77.1% ARC-AGI-2 result is a concrete, benchmarked jump on tasks aimed at measuring flexible rule induction, extending last week’s theme of models producing more genuinely structured problem-solving (e.g., research-grade outputs). The training finding that repeating high-quality reasoning exemplars can beat naive scaling under fixed compute also supports continued near-term reasoning gains.
ARC-AGI-2 is a salient addition to the evidence base because it targets distribution shift more directly than many exam-style leaderboards; the 77.1% report materially improves the benchmark picture relative to last week. However, it’s still a narrow benchmark family and doesn’t yet demonstrate full cross-domain expert reliability.
NVIDIA’s GB300 NVL72 claims (up to 50× perf/W and 35× lower cost per token) directly target the key bottleneck for agentic systems: long-running inference with tool use. Even allowing for marketing-optimistic figures, the direction is consistent with accelerating feasibility of continuous agents at scale.
No major new vision/audio/video/robotics capability was highlighted beyond broader infrastructure enabling more deployment. Multimodal progress therefore largely tracks last week’s level rather than showing a fresh step-change.
Cheaper inference plus scalable capacity (Meta/NVIDIA buildout) improves the practicality of always-on tool-using agents, and USDC nanopayments are a small enabler for machine-to-machine action loops. But the prefill-attack study and “intent laundering” result indicate current agent deployments remain fragile to prompt-layer bypass and misuse, constraining real autonomy in production.
A multiyear Meta–NVIDIA partnership pointing to “millions” of Blackwell GPUs is strong evidence that compute supply constraints are easing and long-horizon scaling continues. Combined with GB300-focused inference economics, it supports sustained scaling of both training and serving for agentic workloads.
A product engineer can prototype an internal “data analyst” agent that handles unfamiliar spreadsheet logic and edge-case rules with fewer manual patches than before, reducing the back-and-forth from hours of prompt tweaking to a first pass that often works.
This is significant because it argues that training efficiency is still improving through data strategy, not only bigger clusters. Previously, the default assumption was “more diverse data and more scale”; the study suggests carefully repeating high-quality reasoning traces can yield better results for the same compute budget.
A startup training a domain model for customer support can recycle a small, carefully curated set of step-by-step “hard case” conversations to get stronger policy-following than a much larger messy dataset, cutting training iterations compared to last year’s scale-first approach.
This is significant because it signals sustained, long-horizon capacity for training and serving larger models and agent workloads. Previously, GPU supply and short-term закупки limited rollouts; a multiyear partnership and “millions” of Blackwell GPUs implies more predictable scaling for production AI.
A consumer app team inside Meta can ship always-on multimodal features (search, assistive camera, translation) to far more users without rate limits, compared to earlier launches that had to throttle access due to inference capacity.
This is significant because it shows many evaluations are still tuned to obvious red-flag phrasing, not the underlying intent. Previously, passing a benchmark looked like meaningful safety; now it looks easier to “paraphrase around” defenses in realistic misuse attempts.
A platform safety team can update red-team tests so a user rewriting “how to make a weapon” into innocuous-sounding steps is still caught, instead of shipping a model that only blocks the blunt version and misses the laundered one.
This is significant because it documents repeatable prompt-layer bypass strategies across major open-weight families, not one-off jailbreak anecdotes. Previously, many teams treated jailbreaks as whack-a-mole; systematic evidence raises the bar for how open models should be evaluated and deployed.
An enterprise deploying an open model for internal IT helpdesk can add automated pre-deployment testing against multiple prefill strategies, instead of discovering after launch that employees can reliably coax the model into leaking secrets or unsafe commands.
This matters because agentic AI is often bottlenecked by inference cost and power, not just model quality. Previously, running long-context, tool-using agents at scale was expensive; NVIDIA is positioning GB300 as a step toward economically viable always-on agents.
A customer-support vendor can keep an agent running continuous “monitor, decide, act” loops for more users at the same power budget, compared to earlier GPU generations where long tool chains were too slow or costly to run broadly.
If true, it would accelerate uncontrolled distribution of high-end open(-ish) model weights and derivatives, forcing faster responses in model security and release strategy. Previously, major labs could stage releases; leaks collapse that timeline and shift risk onto downstream deployers.
A model host could suddenly face a flood of unofficial LLaMA-4-derived checkpoints to moderate and secure, instead of onboarding a single vetted release with known safety docs and provenance.
This is notable because machine-to-machine commerce needs payment rails that can handle tiny amounts and high frequency. Previously, paying agents per action was awkward or fee-heavy; Circle is pitching gas-free transfers down to $0.000001.
A developer can run an agent that buys a few cents of data from multiple APIs and pays per call automatically, instead of prepaying subscriptions or batching payments manually.