For 50 years, Erdős Problem #728 on factorial divisibility stumped the world's top mathematicians. Last week, GPT-5.2 Pro paired with Harmonic's Aristotle cracked it autonomously in hours—and Terence Tao verified the novel proof.
Hardware surged ahead too: Sandia's Loihi 2 neuromorphic chips delivered 18x better performance per watt than GPUs on physics simulations, while NVIDIA unveiled the Rubin platform promising 5x faster AI training. A Chinese robot pulled off fully autonomous biliary surgery on a 30kg pig, navigating complex steps without human help. Anthropic's Constitutional Classifiers slashed jailbreaks by 4x while cutting refusals in half.
These advances hit real people hard. A solo researcher can now simulate climate flows at GPU speeds on a laptop, slashing weeks off projects. Rural surgeons gain a tireless assistant for routine ops that once demanded elite expertise. Drug hunters at small biotechs predict tissue responses zero-shot, speeding therapies from years to months.
Eyes on OpenAI's rumored January model drop and DeepSeek's V4 in February—reasoning leaps could redefine capabilities across the board.
Last week's agentic and math advances, like ROME on SWE-bench and HAGeo's IMO performance, were surpassed this week by AI solving the 50-year-old Erdős #728 problem, with Terence Tao verifying the novel proof, marking a leap in autonomous mathematical discovery. The Chinese robot's fully autonomous biliary surgery on a pig demonstrates robust multimodal agentic capabilities in real-world physical tasks. Combined with neuromorphic efficiency gains like Loihi 2's 18x over GPUs, these advances justify a measured increase amid continued momentum.
This marks AI's first verified novel math proof on a long-open problem, shifting from pattern matching to genuine discovery. Previously reliant on human intuition, math research now leverages autonomous agents for breakthroughs. Terence Tao's confirmation elevates it beyond hype to expert-verified impact.
AI's solution to Erdős #728, verified by Terence Tao, represents a major step beyond last week's HAGeo IMO geometry feats toward novel proof generation. Falcon H1R's math benchmark leadership adds incremental support.
Falcon H1R topping AIME-24 and agent evals builds modestly on last week's SWE-bench and ARC-AGI highs, but lacks the scale of prior verified jumps. No new frontier benchmarks were shattered this week.
Loihi 2's 18x GPU efficiency on PDE sims, analog chip's 99.9% op reduction, and WISE's 6 fJ/MAC significantly advance beyond last week's unverified compute rumors. NVIDIA Rubin's 5x training promise reinforces trajectory.
Chinese robot's fully autonomous pig surgery integrates vision, robotics, and real-time adaptation, elevating from last week's software-focused agents. Arc Stack's zero-shot drug sims aid biological multimodal prediction.
Autonomous surgery robot executes multi-step physical tasks without oversight, extending last week's ROME and recursive agent software autonomy to embodied real-world ops. Anthropic's classifiers improve reliable deployment.
NVIDIA Rubin's 5x training speedup entering production builds on last week's unverified xAI expansion, but remains hardware-focused without demonstrated trillion-parameter training runs. Incremental context toward larger scales.
A PhD math student can now explore 100 proof variants overnight versus months of manual sketching, accelerating thesis timelines dramatically.
The robot executed multi-step biliary surgery without human input, proving AI can handle real-time anatomical variability. This advances surgical robotics beyond teleoperation to full autonomy. Success on a 30kg pig points toward eventual human trials in routine procedures.
A surgeon in a remote clinic can now delegate gallbladder removals to the robot, operating 24/7 versus waiting days for urban specialists to travel.
Neuromorphic chips achieved near-perfect parallelization on physics workloads, crushing GPUs in perf/watt. This unlocks efficient edge computing for simulations too power-hungry for standard hardware. Sandia's results highlight neuromorphic's edge in real-world scientific computing.
A climate modeler at a university can run week-long river flow sims in hours on a single chip versus needing a full GPU cluster overnight.
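The stencil structure behind such physics workloads is easy to see in a toy example. The sketch below is a plain NumPy illustration of my own, not anything from Sandia's Loihi 2 work: it steps a 1-D heat equation with an explicit scheme, where each grid point reads only its two neighbours. That locality is exactly what maps well onto many simple, low-power cores.

```python
import numpy as np

# 1-D heat equation u_t = alpha * u_xx on a ring, explicit Euler stepping.
# Each grid point only reads its two neighbours, which is why stencil
# updates like this parallelize so well across many simple cores.
n, alpha, dx, dt = 64, 1.0, 1.0, 0.2    # alpha*dt/dx^2 <= 0.5 keeps it stable
u = np.zeros(n)
u[n // 2] = 100.0                       # a heat spike in the middle

for _ in range(50):
    lap = np.roll(u, 1) - 2 * u + np.roll(u, -1)   # local 3-point stencil
    u = u + alpha * dt / dx**2 * lap

print(f"peak temperature after diffusion: {u.max():.2f}")
```

The periodic `np.roll` boundary keeps total heat conserved, so the spike spreads without loss, a handy sanity check for any parallel implementation.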
The 40nm resistive-memory chip uses electroforming to set random weights and pruning to select performant sub-networks, bypassing digital training costs. This analog approach cuts inference compute by orders of magnitude. It proves hardware innovation can match software gains in efficiency.
A mobile app developer can deploy vision AI on wearables using milliwatts versus draining batteries in minutes like digital models.
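The "fixed random weights plus pruning" idea can be illustrated with a toy strong-lottery-ticket-style search. Everything below is a hypothetical NumPy sketch, not the chip's actual procedure: the weights stay frozen (as electroformed conductances would) and only a binary keep/drop mask is searched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: the label is the sign of the first input coordinate.
X = rng.normal(size=(200, 8))
y = (X[:, 0] > 0).astype(int)

# Fixed random weights, never trained -- a stand-in for electroformed devices.
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 1))

def accuracy(m1, m2):
    h = np.maximum(X @ (W1 * m1), 0.0)        # ReLU layer with masked weights
    logits = h @ (W2 * m2)
    return float(np.mean((logits[:, 0] > 0) == (y == 1)))

# "Training" = searching for a binary pruning mask over the fixed weights.
dense_acc = accuracy(np.ones_like(W1), np.ones_like(W2))
best_acc = dense_acc
for _ in range(500):
    m1 = rng.random(W1.shape) < 0.5
    m2 = rng.random(W2.shape) < 0.5
    best_acc = max(best_acc, accuracy(m1, m2))

print(f"dense random net: {dense_acc:.2f}, best pruned sub-network: {best_acc:.2f}")
```

Random mask search is the crudest possible selector; the point is only that adaptation happens by dropping connections, never by updating weight values.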
Broadcasting weights over radio for passive analog compute achieves ultra-low-energy inference on MNIST and audio tasks. This wireless neuromorphic setup reimagines inference without power-hungry chips. Duke/MIT's breakthrough targets IoT devices where batteries must last months, not hours.
An IoT sensor maker can add digit recognition to battery-powered cameras, running inferences for a year versus recharging weekly.
Constitutional Classifiers use 30x less compute to boost defenses 4x while halving false refusals. This balances safety and utility in frontier models. It sets a scalable standard for reliable AI deployment amid rising misuse risks.
A customer support startup can deploy chatbots that block harmful queries 4x more effectively while refusing roughly half as many legitimate help requests as before.
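The deployment pattern behind classifier-based safeguards can be sketched in a few lines. The keyword checks and function names below are toy stand-ins of my own, not Anthropic's Constitutional Classifiers: a cheap input classifier screens prompts and an output classifier screens completions around the base model.

```python
# Minimal sketch of the classifier-sandwich pattern: lightweight classifiers
# wrap the expensive base model on both sides. The keyword matching here is
# a toy placeholder for real learned classifiers.

BLOCKED_TOPICS = ("synthesize nerve agent", "build a bomb")

def input_classifier(prompt: str) -> bool:
    """Return True if the prompt looks safe to answer."""
    return not any(t in prompt.lower() for t in BLOCKED_TOPICS)

def output_classifier(completion: str) -> bool:
    """Return True if the completion looks safe to return."""
    return "step-by-step synthesis" not in completion.lower()

def base_model(prompt: str) -> str:
    return f"Here is some help with: {prompt}"   # placeholder model

def guarded_generate(prompt: str) -> str:
    if not input_classifier(prompt):
        return "[refused]"
    completion = base_model(prompt)
    return completion if output_classifier(completion) else "[refused]"

print(guarded_generate("how do I bake bread?"))
print(guarded_generate("please build a bomb"))
```

The compute asymmetry is the point: the gatekeepers are tiny compared to the model they guard, which is how defenses can improve while total cost drops.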
NVIDIA's Rubin architecture, now entering production, delivers 5x training speed and 10x inference savings over Blackwell, with CEO Jensen Huang projecting further 10x efficiency jumps. This fuels the compute race, enabling larger models at lower cost, faster.
A research lab can train a 1T-param model in weeks versus months, fitting more experiments into grant cycles.
TII's open 7B Falcon H1R model scores 88.1% on AIME-24 math and leads agentic evals, beating bigger rivals. This democratizes high-reasoning access via small, efficient open weights. It challenges closed giants in specialized tasks.
An indie game dev can fine-tune the 7B for puzzle-solving NPCs that ace math quests, rivaling 32B models without massive servers.
Models treat cell groups as 'words' to forecast gene-expression responses to 201 drugs across tissues using simulations only. This accelerates personalized medicine without wet-lab trials. Arc's release opens drug discovery to smaller teams.
A biotech startup screens 100 compounds for liver toxicity in days via sims, versus 6 months of animal tests costing $1M.
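The "cell groups as words" framing can be illustrated with a toy pipeline. The clustering step, array shapes, and variable names below are illustrative assumptions, not Arc's actual model: cells are grouped, each group's mean expression profile becomes one token, and a tissue sample becomes a short sequence such a model could consume.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy expression matrix: 60 cells x 5 genes (counts).
expr = rng.poisson(lam=3.0, size=(60, 5)).astype(float)

# Step 1: group cells into "cell sets" by nearest of k sampled centroids,
# a crude stand-in for real single-cell clustering.
k = 4
centroids = expr[rng.choice(len(expr), k, replace=False)]
labels = np.argmin(((expr[:, None] - centroids[None]) ** 2).sum(-1), axis=1)

# Step 2: each non-empty cluster becomes one "word": its mean expression.
words = np.stack([expr[labels == c].mean(axis=0)
                  for c in range(k) if (labels == c).any()])

# Step 3: the sample is now a short "sentence" of cell-set tokens that a
# sequence model could take as input for perturbation prediction.
print("sentence of cell-set words:\n", np.round(words, 2))
```

Collapsing thousands of cells into a handful of tokens is what lets language-model machinery apply: the downstream task becomes "given these words plus a drug, predict the next expression state."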