For 80 years, a famous Erdős geometry conjecture stood as the kind of problem mathematicians chip away at for decades. Last week, AI systems didn’t just assist with the paperwork of research: one found stronger constructions for a 1946 problem, and another helped produce a counterexample that appears to kill the conjecture entirely. That is a sharp change from AI as a writing aid to AI as a source of new mathematical ideas.
The rest of the week pointed in the same direction. Google introduced Co-Scientist and ERA to help generate hypotheses and test scientific claims, while FutureHouse reported a drug hit for dry age-related macular degeneration (dry AMD) from its lab-in-the-loop workflow. On the product side, Alibaba pushed Qwen3.7-Max toward long-running coding agents, including a 35-hour kernel optimization run, and Anthropic expanded Claude distribution through KPMG to 276,000 employees.
For newcomers, the pattern is simple: AI is getting better at doing multi-step work that used to require sustained human attention. A researcher can use these systems to search papers, design experiments, and refine ideas faster; a software team can let an agent debug and optimize code for hours instead of asking for one-off snippets; a large firm can put that capability in front of nearly its entire workforce.
Now the question is less whether AI can help with serious work and more how far institutions will let it go. Watch the next wave of agent releases, scientific replications, and policy moves around model access and pre-release government testing.
Last week built directly on the prior reasoning momentum. After the reported Olympiad-gold result, AI systems were credited with finding new geometry constructions and helping produce a counterexample to an 80-year-old Erdős conjecture, which is stronger evidence of idea generation than benchmark performance alone. The spread of AI scientist workflows and a reported drug hit for the eye disease dry AMD also nudged the outlook upward, though weak video understanding and the lack of a broad reliability breakthrough still keep production-ready AGI far from certain.
This is significant because AI moved beyond checking proofs or summarizing papers and contributed to new geometry results. Previously, systems mostly helped mathematicians search literature or verify steps; now they are being credited with finding constructions and counterexamples in a classic Erdős problem area.
Last week extended the prior math-reasoning surge from contest performance into apparent real mathematical discovery, with AI contributing new constructions and a conjecture counterexample. AI scientist systems from multiple groups also suggest stronger cross-domain hypothesis generation, though reliability and replication remain open questions.
Last week did not add a major new standardized benchmark result comparable to the prior Olympiad-gold-style claim. The strongest evidence came from research outcomes rather than benchmark leaderboards, so this category stayed roughly flat.
Last week offered little direct evidence of major new training or inference efficiency gains. Open long-context weights are useful for access and deployment, but they do not by themselves demonstrate a step change in AGI-relevant cost curves.
Last week slightly weakened the multimodal picture because a paper showed video models still miss basic motion direction, highlighting persistent perception gaps. That undercuts confidence that flashy multimodal demos are translating into robust real-world understanding.
Last week improved the agent picture through longer-endurance coding agents, including a reported 35-hour kernel optimization run, and through end-to-end AI scientist workflows that chain literature search, hypothesis generation, and experiment interpretation. That is a continuation of the previous week's computer-use and cyber-defense trend, with more sustained autonomy.
Last week modestly reinforced scale through wider deployment and larger practical operating envelopes, including a 262,000-token open-weight context window and Claude distribution to 276,000 KPMG staff. This is more about diffusion and usable scale than a fresh scaling-law breakthrough, so the score moved little.
A combinatorics researcher can now ask an AI system to explore huge families of geometric constructions overnight instead of spending weeks manually testing candidate patterns and dead ends.
This is significant because several groups showed AI systems that can generate hypotheses, search literature, propose experiments, and interpret results in one workflow. Previously, researchers used separate tools for each step; now labs are starting to test end-to-end scientific copilots.
A biomedical lab can have an AI review papers, suggest the next assay, and interpret incoming results in one loop, cutting days of literature review and experiment planning down to a single working session.
This is significant because autonomous coding is shifting from short prompt-response exchanges to sustained tool-using work. Previously, models often lost coherence over long sessions; now Alibaba is explicitly targeting agents that can work for hours and handle hundreds of tool calls.
A performance engineer can let an agent compile, benchmark, and revise a kernel through the night instead of manually running each profiling cycle and applying each patch during the workday.
This is significant because broad deployment matters as much as model quality for real-world AI adoption. Previously, many firms limited AI to small pilot groups; now KPMG plans to integrate Claude across core operations and more than 276,000 employees.
A tax advisor at KPMG can use Claude to draft analyses, summarize regulations, and prepare client materials as part of their normal workflow instead of relying on a small internal AI pilot or a consumer chatbot.
This is significant because it ties AI-driven scientific workflow software to a concrete drug-discovery outcome. Previously, many AI-for-science claims stopped at benchmark scores or literature suggestions; now the claim is a hit for dry AMD, a major eye disease, generated through a lab-in-the-loop process.
A small biotech team can use an AI system to generate candidate ideas and prioritize experiments faster, potentially reaching a promising hit before spending months on broad manual screening.
This is significant because policy is moving closer to the release pipeline itself. Previously, regulators mostly reacted after public launches; now the Commerce Department has agreements to test unreleased models before they reach users.
A model developer preparing a frontier release may now need to plan internal timelines around government evaluation, instead of shipping immediately after internal testing is complete.
This matters because open-weight vendors are still expanding the practical feature set available outside closed APIs. Previously, very long context windows were concentrated in proprietary systems; now 0G Labs is offering Apache 2.0 weights with a stated 262,000-token context window.
A startup building a document-analysis tool can load a large legal archive into one open model context instead of chopping files into many smaller pieces and stitching answers back together.
This matters because flashy multimodal demos can hide simple perception failures. Previously, many assumed video-language models understood movement naturally; this paper suggests many still struggle to tell whether an object is moving left, right, up, or down.
A robotics team using a video model for warehouse monitoring could still need classical vision checks for motion direction, instead of trusting the language model alone to interpret camera feeds correctly.