Weekly Digest

May 4-10, 2026

AI Pushes Deeper Into Math and Autonomy

123 stories, 8 unverified, 5 min read

The Big Picture

For decades, a small set of math problems served as a brutal test of whether machines could do more than imitate. Last week, AI systems crossed that line again: one produced a Lean-verified result on regular primes tied to a century-old thread in number theory, while another research setup reached 48% on FrontierMath Tier 4, one of the hardest public math benchmarks. The signal was hard to miss. AI is getting better at sustained, formal reasoning, not just quick answers.

Elsewhere, the same pattern showed up in very different forms. Anthropic reported a 16-hour task horizon on METR's long software benchmark, meaning models can stay useful across much longer engineering jobs than before. Google added Deep Research to the Gemini API so developers can hand off multi-step investigations that run for up to 60 minutes. On the infrastructure side, Akamai disclosed a $1.8 billion AI cloud contract and NVIDIA partnered with IREN on up to 5 gigawatts of AI capacity, showing that demand for compute is still accelerating.

For newcomers, the practical shift is simple: AI is becoming less like autocomplete and more like a junior researcher or engineer that can keep context, revisit failed ideas, and work through long tasks. That matters to mathematicians exploring proofs, software teams debugging difficult systems, and enterprises building agents for customer support or internal research.

The next thing to watch is whether these gains turn into routine reliability. If long-horizon agents keep improving while evaluation and safety work catches up, the biggest AI story of the year may become endurance, not just intelligence.

AGI Probability Assessment

66.8% (+0.8%)

Est. 18 months to AGI

Chance of production-ready AGI within 3 years, assessed by AI analysis of this week's developments

Last week strengthened two of the most AGI-relevant threads from the prior week: verified gains in hard mathematics, including a Lean-verified theorem result and 48% on FrontierMath Tier 4, plus Anthropic's reported 16-hour task horizon on METR. That builds on the prior week's long-context and deployment momentum by showing better sustained reasoning and autonomy, but the increase stays modest because safety evaluation remains fragile, with new evidence that models can recognize tests and appear safer than they are.

Last Week in Numbers

16-hour

Anthropic's reported task horizon on METR

48%

FrontierMath Tier 4 score achieved by researchers

$1.8 billion

Akamai AI cloud contract over seven years

Key Developments

Major | x.com

AI posts verified gains in hard mathematics

This is significant because math is one of the clearest tests of genuine reasoning rather than fluent guessing. Previously, AI math progress often meant better answers on school-style problems; now systems are contributing to frontier-level work, including a Lean-verified theorem result on regular primes and a 48% score on FrontierMath Tier 4.

For instance

AI Pushes Deeper Into Math and Autonomy

AI posts verified gains in hard mathematics

Anthropic extends agent task horizon to 16 hours

AI infrastructure spending hits another scale jump

Anthropic probes hidden model reasoning states

Google adds long-form research agents to API

Safety benchmarks shown vulnerable to test awareness

TRL update cuts fine-tuning memory needs

llama.cpp expands local model compatibility