For decades, an Erdős problem sat in the category of “nice to dream about” for automation: you can search for patterns, but proofs demand real invention. Last week, a Google DeepMind team posted an arXiv proof resolving a previously unsolved Erdős question, then watched the community independently verify it. It is a clean reminder that AI progress is not only bigger models; it is models doing work that used to define human expertise.
Meanwhile, the agent story split into two realities. In a new “WoW” benchmark that drops agents into a real ServiceNow instance with 4,000+ business rules, frontier models cleared only about 20% of the hard tasks. At the same time, NASA JPL used Claude to plan a roughly 400-meter route for the Perseverance rover, the first AI-planned drive on Mars. Agents look brilliant in structured niches and brittle in messy enterprise software.
Under the hood, the infrastructure race kept accelerating. Researchers showed brain-inspired “single-spike” neuromorphic hardware running AI workloads with up to 38× less energy and 6.4× lower latency than conventional approaches, while NVIDIA-backed work coordinated multi-model agent toolchains for higher task success. Safety also tightened: OpenAI documented “quiet leak” link-exfiltration defenses, and Anthropic mapped disempowerment behaviors across 1.5M real conversations.
Next up: watch whether agent benchmarks like WoW become standard procurement tests, and whether labs can raise agent success rates without widening the emerging tradeoff between capability and security vulnerability.
Last week delivered a genuine reasoning win (DeepMind’s independently verified proof of an unsolved Erdős problem), which modestly strengthens the case that models can contribute at research-grade rigor beyond competition-style math. However, the new WoW benchmark showing only ~20% success on hard, real ServiceNow enterprise tasks is a sharper reliability reality check than last week’s APEX-Agents <25% result, and it directly hits the “production-ready” requirement for AGI. The net effect is slightly lower near-term confidence: reasoning is advancing, but robust autonomous performance in messy tool environments remains the pacing item.
This is significant because it shows AI systems can contribute to frontier mathematics where correctness, not vibes, is the product. Previously, AI math wins were often bounded to competition-style problems or assisted search; now the output is a research-grade proof that drew independent verification and discussion.
Last week’s DeepMind Erdős-proof result is a concrete step toward research-grade mathematical invention and verification, extending beyond contest benchmarks and reinforcing the prior week’s FrontierMath progress.
Last week added a strong new real-world agent benchmark (WoW in a real ServiceNow instance) that is more diagnostic than flattering, showing current systems are still far from dependable workplace automation at scale.
Last week’s single-spike neuromorphic results (up to 38× energy reduction and 6.4× lower latency) and orchestration via ToolOrchestra both point to cheaper, faster deployment paths, though these gains are not yet broadly proven across frontier workloads.
Last week’s JPL/Claude-assisted Perseverance route planning is a meaningful example of operational planning under real constraints, but it is still human-validated and doesn’t yet demonstrate general, robust perception-action autonomy across domains.
Last week provided mixed signals: ToolOrchestra improves multi-model toolchains and JPL shows a high-stakes niche success, but WoW’s ~20% success rate on hard tasks in a real ServiceNow instance suggests general enterprise agents remain brittle and below production reliability.
Last week didn’t materially change the scaling picture (no major new model scale or context-window jump); progress came more from methods, orchestration, and deployment/efficiency improvements than raw size.
A math researcher can use AI-assisted exploration to generate plausible lemma chains and proof sketches in days instead of spending weeks manually searching dead ends, then focus their time on verification and refinement.
This is significant because it replaces toy agent demos with an enterprise environment full of brittle rules, permissions, and hidden dependencies. Previously, many agents looked “production-ready” in simplified benchmarks; now we have a reality check showing how far reliability must climb before automation is safe for core business workflows.
An IT ops team evaluating an AI agent for ServiceNow can run WoW-style tasks and discover that, compared to a human admin who closes most tickets correctly, the agent completes only about 1 in 5 hard cases, guiding rollout toward low-risk automations first.
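To make that evaluation concrete, here is a minimal sketch of such a harness in Python. The `Task` shape, the `run_agent` callable, and the difficulty tiers are hypothetical stand-ins for illustration, not the actual WoW benchmark API:

```python
# Minimal sketch of a WoW-style evaluation loop. The task format and
# run_agent callable are hypothetical stand-ins, not the real harness.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    difficulty: str                 # e.g. "easy" or "hard"
    check: Callable[[dict], bool]   # validates the instance's end state

def success_rate(tasks: list[Task],
                 run_agent: Callable[[Task], dict]) -> dict[str, float]:
    """Run the agent on every task and report pass rate per difficulty tier."""
    passed: dict[str, int] = {}
    total: dict[str, int] = {}
    for task in tasks:
        total[task.difficulty] = total.get(task.difficulty, 0) + 1
        end_state = run_agent(task)      # agent acts in a sandboxed instance
        if task.check(end_state):        # did the workflow complete correctly?
            passed[task.difficulty] = passed.get(task.difficulty, 0) + 1
    return {tier: passed.get(tier, 0) / n for tier, n in total.items()}
```

Scoring on the environment’s end state, rather than the agent’s transcript, is what makes this kind of benchmark diagnostic: the ticket either closed correctly or it didn’t.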
This is significant because it puts an LLM into an operational loop where plans meet physics, uncertainty, and limited bandwidth. Previously, LLMs were mostly used for text or offline analysis; now they are assisting mission planning for a real rover route, with humans validating the output.
A JPL engineer can draft a safe rover traverse plan faster by having Claude propose candidate routes and contingencies, compared to doing the first-pass route planning entirely by hand before peer review.
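As a toy illustration of that “propose, then validate” loop: an LLM generates candidate routes, while deterministic checks and human review gate them. The slope and distance limits below are invented for illustration and are not JPL flight rules:

```python
# Toy sketch of validating LLM-proposed rover routes against hard constraints.
# The limits and route format are invented; real flight-rule checks are far
# more involved and always end in human review.
MAX_SEGMENT_SLOPE_DEG = 15.0   # hypothetical tilt limit
MAX_TRAVERSE_M = 450.0         # hypothetical per-plan distance budget

def route_is_safe(segments: list[dict]) -> bool:
    """Each segment: {'length_m': float, 'slope_deg': float}."""
    total = sum(s["length_m"] for s in segments)
    if total > MAX_TRAVERSE_M:
        return False
    return all(abs(s["slope_deg"]) <= MAX_SEGMENT_SLOPE_DEG for s in segments)

candidates = [
    [{"length_m": 180, "slope_deg": 4.0}, {"length_m": 210, "slope_deg": 9.5}],
    [{"length_m": 390, "slope_deg": 18.0}],  # rejected: segment too steep
]
safe = [route for route in candidates if route_is_safe(route)]
```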
This is significant because inference cost and responsiveness are becoming the limiting factors for always-on AI. Previously, many edge deployments had to choose between power draw and capability; now brain-inspired single-spike coding reports up to 38× less energy use and 6.4× lower latency on AI workloads than conventional approaches.
A robotics startup can run on-device perception and control with tighter battery budgets, compared to needing a larger battery or offloading to the cloud to hit latency targets.
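The headline factors translate directly into deployment budgets. A back-of-envelope sketch, where the baseline per-inference cost and battery capacity are hypothetical and only the 38× factor comes from the reported result:

```python
# Back-of-envelope energy budget. Baseline cost and battery size are
# hypothetical; only the 38x reduction factor comes from the reported result.
baseline_mj = 38.0             # assumed conventional cost per inference, millijoules
spike_mj = baseline_mj / 38.0  # reported up-to-38x energy reduction

battery_j = 10_000.0           # assumed 10 kJ usable on-device energy budget
conventional_runs = battery_j * 1_000 / baseline_mj   # convert J to mJ
spike_runs = battery_j * 1_000 / spike_mj
print(f"~{conventional_runs:,.0f} vs ~{spike_runs:,.0f} inferences per charge")
```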
This is significant because “one big model” is giving way to orchestrated systems that pick the right tool for each subtask. Previously, developers had to hand-wire complex agent graphs; now a controller can route work between small, fast models and larger general models and recombine the results, with reported accuracy gains.
An enterprise developer can build a customer-support agent that routes field extraction to a small, fast model and policy reasoning to a larger model, compared to running every step through one expensive model and paying more for slower responses.
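A minimal sketch of that routing pattern, assuming a generic `call_model` client; the model names and prompts are placeholders, and this is not ToolOrchestra’s actual API:

```python
# Cost-aware routing sketch: cheap structured extraction, expensive reasoning
# only where it pays off. call_model, model names, and prompts are placeholders.
def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("stand-in for your LLM client of choice")

ROUTES = {
    "extract_fields": "small-fast-model",       # high volume, low cost
    "policy_reasoning": "large-general-model",  # low volume, high capability
}

def handle_ticket(ticket_text: str) -> str:
    fields = call_model(
        ROUTES["extract_fields"],
        f"Extract customer, account, and issue as JSON:\n{ticket_text}")
    return call_model(
        ROUTES["policy_reasoning"],
        f"Given the support policy and these fields {fields}, draft a reply.")
```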
This is significant because agentic browsing creates a new data leak path: hidden links can log sensitive information when an agent fetches them. Previously, many teams treated prompt injection as mostly a bad-answer problem; now OpenAI documents concrete safeguards for link fetching and logging risks.
A fintech company deploying an agent that reads internal docs can block background URL fetching and reduce the chance that a malicious page causes the agent to ping an attacker-controlled server, compared to naive “browse and summarize” setups.
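One concrete defense is a pre-fetch allowlist gate, sketched below. The hook point and hostnames are hypothetical; this illustrates the general mitigation, not OpenAI’s specific implementation:

```python
# Allowlist gate for agent link fetching: any URL whose host is not explicitly
# trusted is refused, so a hidden attacker link embedded in a document is
# never dereferenced. The hook point and hostnames are hypothetical.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"docs.internal.example.com", "wiki.internal.example.com"}

def may_fetch(url: str) -> bool:
    host = urlparse(url).hostname or ""
    return host in ALLOWED_HOSTS

assert may_fetch("https://docs.internal.example.com/runbook")
assert not may_fetch("https://attacker.example.net/log?leak=account-id")
```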
This is significant because it quantifies a tradeoff managers and educators are already feeling. Previously, “AI makes devs faster” was mostly anecdotal; now a randomized trial reports juniors finished faster but scored 17% lower on a conceptual quiz, suggesting productivity gains can come with skill atrophy.
An engineering manager can pair AI copilots with mandatory post-task explain-backs for junior hires, compared to letting copilots carry the reasoning and discovering later that on-call debugging skills did not develop.