For decades, an Erdős problem sat in the category of “nice to dream about” for automation: you can search for patterns, but proofs demand real invention. Last week, a Google DeepMind team posted an arXiv proof resolving a previously unsolved Erdős question, then watched the community independently verify it. It is a clean reminder that AI progress is not only bigger models; it is models doing work that used to define human expertise.
Meanwhile, the agent story split into two realities. In a new “WoW” benchmark that drops agents into a real ServiceNow instance with 4,000+ business rules, frontier models cleared only about 20% of the hard tasks. At the same time, NASA JPL used Claude to plan a roughly 400-meter route for the Perseverance rover, the first AI-planned drive on Mars. Agents look brilliant in structured niches and brittle in messy enterprise software.
Under the hood, the infrastructure race kept accelerating. Researchers showed brain-inspired “single-spike” neuromorphic hardware running AI workloads with up to 38× less energy and 6.4× lower latency than conventional approaches, while NVIDIA-backed work coordinated multi-model agent toolchains for higher task success. Safety also tightened: OpenAI documented “quiet leak” link-exfiltration defenses, and Anthropic mapped disempowerment behaviors across 1.5M real conversations.
Next up: watch whether agent benchmarks like WoW become standard procurement tests, and whether labs can raise agent success rates without widening the emerging tradeoff between capability and security vulnerability.
Last week delivered a genuine reasoning win (DeepMind’s independently verified proof of an unsolved Erdős problem), which modestly strengthens the case that models can contribute at research-grade rigor beyond competition-style math. However, the new WoW benchmark showing only ~20% success on hard, real ServiceNow enterprise tasks is a sharper reliability reality check than last week’s APEX-Agents <25% result, and it directly hits the “production-ready” requirement for AGI. The net effect is slightly lower near-term confidence: reasoning is advancing, but robust autonomous performance in messy tool environments remains the pacing item.
This is significant because it shows AI systems can contribute to frontier mathematics where correctness, not vibes, is the product. Previously, AI math wins were often bounded to competition-style problems or assisted search; now the output is a research-grade proof that drew independent verification and discussion.
Last week’s DeepMind Erdős-proof result is a concrete step toward research-grade mathematical invention and verification, extending beyond contest benchmarks and reinforcing the prior week’s FrontierMath progress.
Last week added a strong new real-world agent benchmark (WoW in a real ServiceNow instance) that is more diagnostic than flattering, showing current systems are still far from dependable workplace automation at scale.
Last week’s single-spike neuromorphic results (up to 38× energy reduction and 6.4× lower latency) and orchestration via ToolOrchestra both point to cheaper, faster deployment paths, though these gains are not yet broadly proven across frontier workloads.
Last week’s JPL/Claude-assisted Perseverance route planning is a meaningful example of operational planning under real constraints, but it is still human-validated and doesn’t yet demonstrate general, robust perception-action autonomy across domains.
Last week provided mixed signals: ToolOrchestra improves multi-model toolchains and JPL shows a high-stakes niche success, but WoW’s ~20% success rate on hard tasks in a real ServiceNow instance suggests general enterprise agents remain brittle and below production reliability.
Last week didn’t materially change the scaling picture (no major new model scale or context-window jump); progress came more from methods, orchestration, and deployment/efficiency improvements than raw size.
A math researcher can use AI-assisted exploration to generate plausible lemma chains and proof sketches in days instead of spending weeks manually searching dead ends, then focus their time on verification and refinement.
This is significant because it replaces toy agent demos with an enterprise environment full of brittle rules, permissions, and hidden dependencies. Previously, many agents looked “production-ready” in simplified benchmarks; now we have a reality check showing how far reliability must climb before automation is safe for core business workflows.
An IT ops team evaluating an AI agent for ServiceNow can run WoW-style tasks and discover that, compared to a human admin who closes most tickets correctly, the agent completes only about 1 in 5 hard cases, guiding rollout toward low-risk automations first.
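To make that evaluation concrete, here is a minimal sketch of such a harness in Python. The `Task` shape, the `run_agent` callable, and the difficulty tiers are hypothetical stand-ins for illustration, not the actual WoW benchmark API:

```python
# Minimal sketch of a WoW-style evaluation loop. The task format and
# run_agent callable are hypothetical stand-ins, not the real harness.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    difficulty: str                 # e.g. "easy" or "hard"
    check: Callable[[dict], bool]   # validates the instance's end state

def success_rate(tasks: list[Task],
                 run_agent: Callable[[Task], dict]) -> dict[str, float]:
    """Run the agent on every task and report pass rate per difficulty tier."""
    passed: dict[str, int] = {}
    total: dict[str, int] = {}
    for task in tasks:
        total[task.difficulty] = total.get(task.difficulty, 0) + 1
        end_state = run_agent(task)      # agent acts in a sandboxed instance
        if task.check(end_state):        # did the workflow complete correctly?
            passed[task.difficulty] = passed.get(task.difficulty, 0) + 1
    return {tier: passed.get(tier, 0) / n for tier, n in total.items()}
```

Scoring on the environment’s end state, rather than the agent’s transcript, is what makes this kind of benchmark diagnostic: the ticket either closed correctly or it didn’t.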
This is significant because it puts an LLM into an operational loop where plans meet physics, uncertainty, and limited bandwidth. Previously, LLMs were mostly used for text or offline analysis; now they are assisting mission planning for a real rover route, with humans validating the output.
A JPL engineer can draft a safe rover traverse plan faster by having Claude propose candidate routes and contingencies, compared to doing the first-pass route planning entirely by hand before peer review.
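As a toy illustration of that “propose, then validate” loop: an LLM generates candidate routes, while deterministic checks and human review gate them. The slope and distance limits below are invented for illustration and are not JPL flight rules:

```python
# Toy sketch of validating LLM-proposed rover routes against hard constraints.
# The limits and route format are invented; real flight-rule checks are far
# more involved and always end in human review.
MAX_SEGMENT_SLOPE_DEG = 15.0   # hypothetical tilt limit
MAX_TRAVERSE_M = 450.0         # hypothetical per-plan distance budget

def route_is_safe(segments: list[dict]) -> bool:
    """Each segment: {'length_m': float, 'slope_deg': float}."""
    total = sum(s["length_m"] for s in segments)
    if total > MAX_TRAVERSE_M:
        return False
    return all(abs(s["slope_deg"]) <= MAX_SEGMENT_SLOPE_DEG for s in segments)

candidates = [
    [{"length_m": 180, "slope_deg": 4.0}, {"length_m": 210, "slope_deg": 9.5}],
    [{"length_m": 390, "slope_deg": 18.0}],  # rejected: segment too steep
]
safe = [route for route in candidates if route_is_safe(route)]
```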
This is significant because inference cost and responsiveness are becoming the limiting factors for always-on AI. Previously, many edge deployments had to choose between power draw and capability; now brain-inspired single-spike coding reports up to 38× less energy use and 6.4× lower latency on AI workloads than conventional approaches.
A robotics startup can run on-device perception and control with tighter battery budgets, compared to needing a larger battery or offloading to the cloud to hit latency targets.
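The headline factors translate directly into deployment budgets. A back-of-envelope sketch, where the baseline per-inference cost and battery capacity are hypothetical and only the 38× factor comes from the reported result:

```python
# Back-of-envelope energy budget. Baseline cost and battery size are
# hypothetical; only the 38x reduction factor comes from the reported result.
baseline_mj = 38.0             # assumed conventional cost per inference, millijoules
spike_mj = baseline_mj / 38.0  # reported up-to-38x energy reduction

battery_j = 10_000.0           # assumed 10 kJ usable on-device energy budget
conventional_runs = battery_j * 1_000 / baseline_mj   # convert J to mJ
spike_runs = battery_j * 1_000 / spike_mj
print(f"~{conventional_runs:,.0f} vs ~{spike_runs:,.0f} inferences per charge")
```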
This is significant because “one big model” is giving way to orchestrated systems that pick the right tool for each subtask. Previously, developers had to hand-wire complex agent graphs; now a controller can route work between small, fast models and larger general models and recombine the results, with reported accuracy gains.
An enterprise developer can build a customer-support agent that routes field extraction to a small, fast model and policy reasoning to a larger model, compared to running every step through one expensive model and paying more for slower responses.
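A minimal sketch of that routing pattern, assuming a generic `call_model` client; the model names and prompts are placeholders, and this is not ToolOrchestra’s actual API:

```python
# Cost-aware routing sketch: cheap structured extraction, expensive reasoning
# only where it pays off. call_model, model names, and prompts are placeholders.
def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("stand-in for your LLM client of choice")

ROUTES = {
    "extract_fields": "small-fast-model",       # high volume, low cost
    "policy_reasoning": "large-general-model",  # low volume, high capability
}

def handle_ticket(ticket_text: str) -> str:
    fields = call_model(
        ROUTES["extract_fields"],
        f"Extract customer, account, and issue as JSON:\n{ticket_text}")
    return call_model(
        ROUTES["policy_reasoning"],
        f"Given the support policy and these fields {fields}, draft a reply.")
```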
This is significant because agentic browsing creates a new data leak path: hidden links can log sensitive information when an agent fetches them. Previously, many teams treated prompt injection as mostly a bad-answer problem; now OpenAI documents concrete safeguards for link fetching and logging risks.
A fintech company deploying an agent that reads internal docs can block background URL fetching and reduce the chance that a malicious page causes the agent to ping an attacker-controlled server, compared to naive “browse and summarize” setups.
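One concrete defense is a pre-fetch allowlist gate, sketched below. The hook point and hostnames are hypothetical; this illustrates the general mitigation, not OpenAI’s specific implementation:

```python
# Allowlist gate for agent link fetching: any URL whose host is not explicitly
# trusted is refused, so a hidden attacker link embedded in a document is
# never dereferenced. The hook point and hostnames are hypothetical.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"docs.internal.example.com", "wiki.internal.example.com"}

def may_fetch(url: str) -> bool:
    host = urlparse(url).hostname or ""
    return host in ALLOWED_HOSTS

assert may_fetch("https://docs.internal.example.com/runbook")
assert not may_fetch("https://attacker.example.net/log?leak=account-id")
```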
This is significant because it quantifies a tradeoff managers and educators are already feeling. Previously, “AI makes devs faster” was mostly anecdotal; now a randomized trial reports juniors finished faster but scored 17% lower on a conceptual quiz, suggesting productivity gains can come with skill atrophy.
An engineering manager can pair AI copilots with mandatory post-task explain-backs for junior hires, compared to letting copilots carry the reasoning and discovering later that on-call debugging skills did not develop.