For years, the bottleneck in science was not just ideas but labor: reading papers, writing code, running experiments, and drafting results all took months. Last week, that workflow compressed sharply when Sakana AI published an AI Scientist paper in Nature describing a system that automates much of the research loop, while Epoch AI said frontier models solved an open 2019 math conjecture from a benchmark built from real unsolved problems.
Elsewhere, the supporting machinery got stronger and cheaper. Google said its TurboQuant method compresses large language model weights from 16-bit to 3-bit precision, cutting memory use by 6x and speeding up H100 inference, while Meta lifted its Texas AI data center plan to $10 billion and Nebius raised $4.34 billion to expand AI infrastructure. At the product layer, Google also shipped Gemini 3.1 Flash Live, pushing voice AI closer to natural, low-latency conversation.
The pattern last week was clear: AI is becoming more useful at both ends of the stack. Researchers may get software that helps generate and test ideas faster, developers may be able to run stronger models on the same hardware budget, and everyday users are getting assistants that can talk, listen, and respond with less delay. At the same time, California passed a state AI safety law and new papers showed guardrails can still fail in surprising ways, including prompts written in Classical Chinese.
Watch the next few weeks for the collision of these trends: more capable agents, cheaper inference, bigger compute buildouts, and tougher safety scrutiny. AI progress is accelerating, but so is the pressure to prove it can be deployed responsibly.
Last week extended the same deployment-and-agents trajectory from the prior week, but with a somewhat stronger capability signal: Sakana AI's Nature paper on an autonomous research loop and Epoch AI's report that frontier models solved an open math conjecture both point to incremental progress in higher-end reasoning and scientific work. The shift stays modest because Google's TurboQuant and the Meta/Nebius buildout mainly strengthen the path to cheaper scaling rather than proving AGI-grade reliability, while new jailbreak results and California's safety law reinforce that dependable autonomous deployment remains a real bottleneck.
This is significant because Sakana AI's system moves AI from assisting with isolated tasks to handling much of the research workflow end to end. Previously, scientists had to stitch together idea generation, coding, experiments, analysis, and paper drafting manually; now one system is being presented as capable of automating large parts of that loop.
Last week modestly improved the reasoning outlook through the autonomous research paper in Nature and Epoch AI's open-conjecture result, both of which are more capability-relevant than routine benchmark gains. This builds on the prior week's strong reasoning baseline, but the evidence is still narrow rather than a broad demonstration of expert-level general reasoning.
Last week added a meaningful benchmark-adjacent signal because FrontierMath open problems are closer to live research than standard exam datasets. That said, the digest did not show broad new benchmark domination across domains, so the category only edges up from the previous week.
Google's claimed 6x memory reduction from TurboQuant is a substantial efficiency improvement and continues last week's trend of strong models becoming cheaper to deploy. If the gains hold in production, this meaningfully lowers the hardware barrier for widespread high-capability inference.
Gemini 3.1 Flash Live improves real-time voice interaction and suggests better audio reasoning and lower-latency multimodal use. This is a continuation of practical interface progress rather than a major leap in general multimodal world modeling.
The strongest agent signal last week was the Nature paper describing automation of large parts of the research workflow, which is a real extension from tool-using assistants toward end-to-end task execution. However, the new safety-bypass and escape research tempers confidence that such agents are yet reliable enough for AGI-like autonomous operation.
Meta's expanded $10 billion data center plan and Nebius's $4.34 billion raise continue the prior week's infrastructure momentum and reduce the chance that compute scarcity alone slows progress. This is supportive of faster iteration, though it does not by itself demonstrate new cognitive capability.
A materials science researcher can use an AI system to generate hypotheses, write experiment code, run ablations, and draft a paper outline in days instead of coordinating those steps manually over several weeks.
This is significant because memory limits are one of the biggest constraints on serving large models cheaply. Previously, many deployments needed far more GPU memory at 16-bit precision; now Google says compression to 3 bits can sharply reduce memory use while increasing H100 inference speed.
A startup running a customer-support model can fit the same system on far fewer GPUs, lowering serving costs and potentially moving from an expensive multi-server setup to a much smaller cluster.
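To make the memory claim concrete, here is a back-of-the-envelope sketch of how bit-width drives weight memory. The 70B parameter count, group-scale overhead, and single-GPU framing are illustrative assumptions, not details of TurboQuant itself; the headline 6x figure will also depend on overhead choices and on whether other tensors, such as the KV cache, are compressed too.

```python
# Illustrative arithmetic only: not Google's TurboQuant scheme.
def weight_memory_gb(n_params: float, bits_per_weight: float,
                     scale_overhead_bits: float = 0.0) -> float:
    """Approximate weight memory in GB, including per-group scale overhead."""
    total_bits = n_params * (bits_per_weight + scale_overhead_bits)
    return total_bits / 8 / 1e9

N = 70e9  # assume a 70B-parameter model for illustration

fp16 = weight_memory_gb(N, 16)
q3 = weight_memory_gb(N, 3, scale_overhead_bits=0.25)  # ~fp16 scale per 64-weight group

print(f"16-bit weights: {fp16:.0f} GB")            # ~140 GB: needs several 80 GB GPUs
print(f" 3-bit weights: {q3:.0f} GB")              # ~28 GB: weights fit on one 80 GB H100
print(f"weight-memory reduction: {fp16 / q3:.1f}x")
```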
This is significant because U.S. AI governance is starting to move from debate to enforceable rules. Previously, most safety commitments were voluntary or company-specific; now California is setting accountability requirements that could become a template for other states and federal policymakers.
A company shipping an AI tool in California may now need clearer documentation, testing, and accountability processes before launch, instead of treating safety reviews as an internal best effort.
This matters because benchmarks built from unsolved research questions are much closer to real scientific reasoning than standard exam-style tests. Previously, frontier models mostly showed progress on known problem sets; now Epoch AI says they have solved an open 2019 conjecture from FrontierMath: Open Problems.
A mathematician can ask a frontier model to explore a niche conjecture space and surface promising proof directions overnight, instead of spending weeks manually checking dead ends before finding a useful angle.
This matters because AI progress still depends heavily on access to enormous amounts of compute. Previously, it was easy to assume model gains might slow due to infrastructure bottlenecks; now Meta's $10 billion data center expansion and Nebius's $4.34 billion raise show capital is still pouring into capacity.
An enterprise customer waiting months for cloud GPU access may see more supply come online over time, making it easier to train or serve large internal models instead of postponing projects due to shortages.
This matters because lower-latency, better audio reasoning makes AI voice systems feel more usable in daily life. Previously, many voice assistants still felt laggy or brittle in live conversation; now Google is positioning Gemini 3.1 Flash Live as a more natural real-time interface.
A delivery driver can ask a voice assistant to summarize messages, answer route questions, and handle interruptions in real time, instead of waiting through awkward pauses that break the conversation flow.
This matters because model capability is advancing faster than defenses in some areas. Previously, many safety evaluations focused on common jailbreak phrasing; now researchers are showing that Classical Chinese prompts can bypass guardrails and that new benchmarks are needed to test whether agents can escape restricted environments.
A safety team at an AI startup can no longer rely on English-only red teaming and basic sandbox tests; they may need multilingual adversarial evaluations and stronger environment isolation before deploying agentic tools.
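For teams acting on that point, a minimal sketch of a multilingual adversarial pass is below. The function and helper names (multilingual_red_team, query_model, violates_policy) are hypothetical placeholders rather than any existing API; the idea is simply to run the same adversarial intents across languages and see where refusals diverge.

```python
from typing import Callable

def multilingual_red_team(
    attack_intents: list[str],                # harmful intents, described in English
    translations: dict[str, dict[str, str]],  # intent -> {language: translated attack prompt}
    query_model: Callable[[str], str],        # hypothetical: sends a prompt, returns the reply
    violates_policy: Callable[[str], bool],   # hypothetical: True if the reply breaks policy
) -> list[tuple[str, str]]:
    """Return (intent, language) pairs where the guardrail failed."""
    failures = []
    for intent in attack_intents:
        for lang, prompt in translations.get(intent, {}).items():
            reply = query_model(prompt)
            if violates_policy(reply):
                failures.append((intent, lang))
    return failures

# If failures cluster in, say, Classical Chinese or other low-resource phrasings while
# English prompts are refused, the guardrail is overfit to the languages it was tuned on.
```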
This matters because open models keep expanding beyond text into multimodal and agent-style use cases. Previously, many teams depended on a handful of major open releases; now Meituan's MIT-licensed LongCat suite adds another large family that developers can inspect, adapt, and deploy.
A university lab can experiment with open text-audio-video models and agent tooling without negotiating commercial API access, giving students more room to prototype than with closed systems alone.