For decades, hard combinatorics problems on Paul Erdős’s lists resisted brute force and human intuition alike. Last week, an AI system paired with the Lean proof checker reportedly solved 9 open Erdős problems and formally proved 44 OEIS conjectures, a striking sign that language models are starting to contribute in domains where being almost right is useless.
Elsewhere, the mood was split between acceleration and alarm. The Financial Times reported that safety protections on some Meta and Google models could be stripped within minutes, which turns alignment into a distribution problem rather than a one-time training fix. At the same time, AI coding tools kept getting more agentic: Claude Code added dynamic workflows, local plugins, and broader cloud support, while vLLM pushed faster decoding and Biohub released a substantial open protein-design stack.
The pattern is becoming clearer. AI is getting better at doing serious work in strict environments, from formal math to software engineering and biology, while the scaffolding around it matters more than ever. A researcher can now imagine using AI to explore proof strategies, a developer can delegate larger coding jobs to coordinated agents, and a biotech team can draw on more open tools for protein design instead of relying only on closed labs.
Watch the next wave closely: model capability gains now arrive alongside infrastructure, deployment, and safety stress tests. The frontier is moving forward, but so is the pressure to make these systems robust in the wild.
Last week extended the prior math-and-science momentum with a stronger formal reasoning signal: the reported Lean-backed solutions to 9 open Erdős problems and proofs of 44 OEIS conjectures are more consequential than ordinary benchmark gains because they come in a domain where near-correct answers do not count. The increase is modest rather than large because most of the week's other news was incremental agent and infrastructure progress, and the report that safeguards on some accessible models can be stripped in minutes highlights deployment fragility rather than a core capability breakthrough.
This is significant because formal mathematics is one of the hardest places for AI to fake competence. Previously, language models could suggest plausible proof ideas but often failed on rigor; now a system combining an LLM with Lean reportedly produced machine-checkable proofs for 9 open Erdős problems and 44 OEIS conjectures.
Last week built directly on the previous week's geometry result with a stronger formal-math claim: machine-checkable proofs for 9 open Erdős problems and 44 OEIS conjectures. GPT-5.2 reportedly outperforming human reviewers on usable comments also adds evidence of high-context analytical competence, though both results still fall short of broad, fully reliable expert generalization.
Last week did not center on standard benchmark suites, but the reported 60% usable comment rate for GPT-5.2 serves as an applied evaluation signal on a difficult expert task. This is a mild upward move rather than a major one because the result is narrower and less standardized than classic frontier benchmarks.
Last week added modest efficiency progress through vLLM's improved speculative decoding under messy real-world conditions and through capable 1B-class open models like MiniCPM5. These are useful deployment gains, but they are incremental rather than the kind of order-of-magnitude shift that would materially change AGI timelines by themselves.
Last week offered little direct movement in multimodal capability: the open image dataset helps future training, and the Biohub protein stack matters more for science tooling than general cross-modal intelligence. Relative to the previous week, this category was largely flat.
Last week continued the agentic coding trend from the previous week, with Claude Code adding dynamic workflows, background agents, local plugins, and broader cloud support. That is meaningful evidence of better long-horizon task execution in production environments, but it still looks like steady extension of existing patterns rather than a decisive autonomy breakthrough.
Last week's main scale-adjacent developments were broader open infrastructure in biology and stronger small-model/open deployment options rather than a new frontier scaling leap. This keeps the category near its already high level from the previous week, with little reason for a major score change.
A combinatorics researcher can now use an AI assistant to generate and formally verify many proof attempts in Lean, instead of spending weeks hand-checking dead ends before finding one promising route.
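For readers who have not seen Lean, the toy snippet below shows what "machine-checkable" means in practice. It is only an illustration in Lean 4 and has nothing to do with the reported Erdős or OEIS proofs; the theorem name `add_comm_toy` is made up for this example. The point is that the checker accepts a proof only if every step type-checks, so an almost-right argument is simply rejected.

```lean
-- Toy statement: addition on natural numbers is commutative.
-- The proof term reuses the core library lemma `Nat.add_comm`.
theorem add_comm_toy (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A concrete arithmetic fact, checked by computation.
example : 2 + 2 = 4 := rfl

-- A false claim such as `example : 2 + 2 = 5 := rfl` would fail to compile,
-- which is why a verified proof cannot be "nearly correct".
```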
This is significant because it suggests some model safety controls may be easier to strip away than many users assume. Previously, companies could treat safety tuning as a built-in property of a released model; now the concern is that once weights are accessible, protections may be separable from capability.
A company evaluating open models for customer support now has to assume a downstream actor could repackage a model with the same core abilities but far weaker refusal behavior, rather than trusting the original vendor's safeguards to persist.
This is significant because coding assistants are shifting from single-response chatbots to systems that can coordinate larger jobs. Previously, developers had to manually break work into steps and wire in extensions; now Claude Code can spin up many background agents, load local plugins automatically, and run auto mode across more enterprise clouds.
A startup engineer can ask Claude Code to refactor a service, run review fixes, and coordinate parallel subtasks across a large repo, instead of supervising each file-by-file change by hand.
This is significant because biology AI remains constrained by access to strong models and large reference data. Previously, many teams depended on fragmented tools or closed systems; now Biohub is offering ESMC for generation, ESMFold2 for structure prediction, and an atlas spanning 6.8 billion sequences with 1.1 billion predicted structures.
A small biotech team can prototype new protein candidates and check likely structures with open tools, instead of waiting for external partners or paying for access to proprietary platforms before early screening.
This is significant because peer review is a high-context task where vague summaries are less useful than targeted criticism. Previously, AI review tools were often seen as rough drafting aids; now a reported 60% usable comment rate suggests top models may already help triage papers at a meaningful level.
A conference area chair could use an AI reviewer to flag missing baselines, weak ablations, or unclear claims across dozens of submissions in hours, rather than relying only on overloaded human reviewers to catch every issue.
This matters because inference speedups often vanish in messy real deployments with different prompt formats and sampling settings. Previously, speculative decoding gains were less reliable outside narrow conditions; now EAGLE 3.1 is aimed at preserving acceleration in the cases production teams actually hit.
A developer serving a chat app can push faster responses under mixed workloads and prompt styles, instead of seeing decoding optimizations collapse once real user traffic differs from benchmark settings.
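To make the idea of speculative decoding concrete, here is a minimal greedy sketch in Python. It is not vLLM's or EAGLE's implementation: the function names, the toy "models", and the parameter `k` are all assumptions made for illustration. A cheap draft model guesses several tokens ahead, the expensive target model checks them, and only the agreeing prefix is kept, so the output matches what the target alone would produce.

```python
from typing import Callable, List, Tuple

# A "model" here is just a greedy next-token function (hypothetical stand-in).
NextTokenFn = Callable[[List[int]], int]


def greedy_decode(target: NextTokenFn, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextTokenFn, draft: NextTokenFn,
                       prompt: List[int], max_new: int,
                       k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding sketch; returns (tokens, verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        # 1. The draft model proposes k tokens autoregressively.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)

        # 2. The target verifies the proposal. A real system scores all k
        #    positions in one batched forward pass; this toy calls the target
        #    per position but counts the whole check as one round.
        rounds += 1
        ctx = list(out)
        accepted = []
        for tok in proposal:
            expected = target(ctx)
            if expected != tok:
                accepted.append(expected)  # replace the first wrong guess
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Every guess matched, so the target contributes one extra token.
            accepted.append(target(ctx))
        out.extend(accepted)
    return out[:len(prompt) + max_new], rounds


# Toy models (assumptions): the target counts upward modulo 10; the draft
# agrees except right after token 4, where it guesses wrongly.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Calling `speculative_decode(toy_target, toy_draft, [0], max_new=8)` returns the same tokens as `greedy_decode(toy_target, [0], 8)` while using fewer verification rounds. When the draft guesses badly, each round accepts only one token and the advantage disappears, which is the failure mode the paragraph above describes for traffic that differs from benchmark settings.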
This matters because image-model progress is increasingly limited by what teams are legally allowed to train on. Previously, many large datasets carried restrictive or unclear licensing; now GPIC is pitched as a permissively licensed large-scale dataset built for training and evaluation.
A startup building an image generator can train and benchmark on a dataset designed for commercial use, instead of stitching together smaller sources with uncertain licensing risk.
This matters because strong small models widen who can deploy useful AI locally and cheaply. Previously, competitive performance often required much larger models or paid APIs; now a 1-billion-parameter, Apache 2.0-licensed model with a 128K context window gives developers another practical open option.
A student developer can build an offline note-search or coding assistant on consumer hardware with a long context window, instead of renting cloud inference for a much larger model to get acceptable quality.