Problems that used to demand expert intuition are starting to yield to machines, while the cost of using those machines is falling fast. Last week, GPT-5.5 was reported to have solved research-level math questions that stumped specialists online, and OpenAI and NVIDIA said the same model family can now run at far lower cost on long, multi-step jobs. Smarter models are arriving at the same moment they are getting cheaper to deploy.
Elsewhere, the race shifted from models alone to the full stack around them. Google agreed to back Anthropic with up to $40 billion, while Amazon expanded Anthropic’s access to Trainium compute to as much as 5 gigawatts. DeepSeek also released V4 Preview open-weight models with context windows up to 1 million tokens, giving developers another serious option outside the closed-model leaders.
For newcomers, this matters because AI progress is no longer just about chatbots sounding better. A researcher can use stronger reasoning models for advanced math and coding, a large company can justify wider rollouts when token costs drop, and governments and enterprises now have more incentive to demand sovereign AI stacks they can control locally, as seen in Cohere’s new partnership with Aleph Alpha.
Watch the next few weeks for two signals: whether open-weight challengers like DeepSeek keep narrowing the gap, and whether lower inference costs turn today’s impressive demos into everyday software that can reason, code, and operate for hours at a time.
Last week extended the same acceleration story from a different angle: GPT-5.5 reportedly handled MathOverflow-level problems, while OpenAI and NVIDIA claimed a 35x cost reduction on long-horizon reasoning workloads, which makes stronger cognition more deployable rather than merely more impressive. Compared with the prior week’s trust bottleneck theme, the newest evidence shifts momentum back toward capability and adoption, though reliability concerns still keep the 3-year AGI estimate well below certainty.
This is significant because better AI is most disruptive when people can afford to use it repeatedly, not just admire benchmark results. Previously, long multi-step reasoning runs were expensive enough to limit production use; now OpenAI and NVIDIA say GPT-5.5 workloads on GB200 NVL72 systems can cut token costs by a factor of 35 for jobs that require many steps and longer execution.
Last week’s strongest capability signal was GPT-5.5 being credited with solving research-level math questions, a meaningful continuation of the prior week’s longer-horizon expert-task progress. That is not full general expert reasoning, but it pushes the frontier higher in a domain that has historically been a good stress test for abstraction and planning.
Last week did not bring a major new standardized benchmark win, but the Stanford AI Index agent-task improvement and OpenAI’s release of reasoning-trace safety evaluations add useful evidence and measurement infrastructure. This modestly offsets the prior week’s concerns that many existing evaluations are gameable, but it does not fully resolve them.
Last week’s clearest shift was the reported 35x reduction in token cost for GPT-5.5 reasoning workloads on GB200 systems, building directly on the prior week’s hyperscale performance-per-dollar gains. Cost is increasingly less of a blocker for production deployment of long, multi-step cognitive work.
Last week offered little direct multimodal progress, so this category is essentially flat relative to the prior week. Longer-context open models and stronger agents may indirectly help cross-modal systems later, but there was no major new vision, audio, video, or robotics advance in the digest.
Last week reinforced agent progress through the AI Index report showing real computer-task success rising from 12% to 66%, plus cheaper reasoning that should let agents run deeper workflows economically. Compared with the prior week’s 15-hour task-horizon signal, the new evidence is more about practical execution and deployability than pure endurance.
Last week continued the scale story with Anthropic securing vast backing and access to up to 5 gigawatts of compute, alongside DeepSeek’s large open-weight MoE models and 1-million-token context support. This is a direct continuation of the prior week’s infrastructure expansion, showing no real slowdown in frontier compute buildup.
A software startup building an autonomous customer-support agent can let the model investigate a complex billing case across dozens of steps instead of forcing a quick, shallow reply, because the same kind of long-running workload now costs far less than before.
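As a minimal sketch of what that looks like in practice: a bounded tool-use loop where a step budget, rather than cost alone, decides when the agent stops. Everything here is hypothetical, including the `call_model` stub and the billing tools; it illustrates the pattern, not any vendor’s API.

```python
# Hypothetical billing tools; a real deployment would query the billing
# database and payment processor instead of returning canned strings.
def lookup_invoice(case_id: str) -> str:
    return f"invoice records for {case_id}"

def lookup_payments(case_id: str) -> str:
    return f"payment history for {case_id}"

TOOLS = {"lookup_invoice": lookup_invoice, "lookup_payments": lookup_payments}

def call_model(transcript: list[str]) -> str:
    """Stub for an inference call: a real agent would send the transcript
    to a reasoning model and parse the next action from its reply."""
    if not any(t.startswith("lookup_invoice") for t in transcript):
        return "CALL lookup_invoice"
    if not any(t.startswith("lookup_payments") for t in transcript):
        return "CALL lookup_payments"
    return "FINAL customer was double-charged; refund one payment"

def investigate(case_id: str, max_steps: int = 40) -> str:
    """Bounded multi-step loop: cheaper long-horizon inference is what makes
    a budget like max_steps=40 economical for routine support cases."""
    transcript = [f"case: {case_id}"]
    for _ in range(max_steps):
        action = call_model(transcript)
        if action.startswith("FINAL"):
            return action.removeprefix("FINAL ").strip()
        tool = TOOLS[action.removeprefix("CALL ").strip()]
        transcript.append(f"{tool.__name__} -> {tool(case_id)}")
    return "escalate to a human: step budget exhausted"

print(investigate("case-4821"))
```

The key knob is `max_steps`: when long-horizon inference gets cheaper, raising that budget from a handful of steps to dozens becomes an economic decision rather than a prohibitive one.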
This is significant because it suggests frontier models are getting useful on problems closer to real mathematical research, not just textbook exercises. Previously, AI math demos often centered on contest-style answers; now GPT-5.5 is being credited with solving MathOverflow-level ring theory and Galois-group questions discussed by mathematicians.
A graduate student in algebra can use the model to explore candidate lemmas and check unusual cases overnight, instead of spending days manually testing promising directions before even knowing which path is worth pursuing.
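A rough sketch of that overnight sweep, with every name hypothetical: `check_case` stands in for a model query, and the hard-coded candidate list stands in for cases generated from the student’s actual conjecture.

```python
import json

# Hypothetical edge cases for a conjectured lemma; in practice these would
# be generated from the student's actual problem, not hard-coded.
CANDIDATES = [{"modulus": n} for n in (4, 8, 9, 12, 16, 25, 27)]

def check_case(case: dict) -> dict:
    """Stub for a model query that tries to verify or break the lemma on
    one concrete case; the verdict below is a stand-in, not mathematics."""
    n = case["modulus"]
    return {"case": case, "holds": n % 4 != 0, "notes": f"checked modulus {n}"}

def overnight_sweep(cases: list[dict]) -> list[dict]:
    """Run every candidate unattended and keep only apparent counterexamples,
    so the morning starts with the handful of cases worth a human look."""
    return [r for r in map(check_case, cases) if not r["holds"]]

if __name__ == "__main__":
    print(json.dumps(overnight_sweep(CANDIDATES), indent=2))
```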
This is significant because frontier AI progress increasingly depends on financing and access to enormous training infrastructure, not just model ideas. Previously, labs had to piece together capital and compute separately; now Google is committing up to $40 billion while Amazon is offering Anthropic up to 5 gigawatts of Trainium capacity, including a cluster of more than 1 million chips.
An enterprise customer evaluating Claude for internal tools can expect faster model iteration and greater capacity for large deployments than a smaller lab could support a year ago, because Anthropic now has unusually deep backing on both cash and hardware.
This is significant because strong open-weight models give developers and governments more control over where AI runs and how it is customized. Previously, many top-tier capabilities were concentrated in closed APIs; now DeepSeek’s V4 Preview includes very large mixture-of-experts models, and NVIDIA says DeepSeek-V4-Pro can handle up to 1 million tokens of context through its NIM API.
A legal-tech company can analyze a massive merger archive or long regulatory history in one pass instead of chopping documents into many fragments, because the new model class supports far longer context windows than earlier open alternatives.
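A sketch of that single-pass pattern under stated assumptions: a placeholder `answer` function in place of a real long-context endpoint, a crude 4-characters-per-token size estimate, and the 1-million-token budget reported for DeepSeek-V4-Pro.

```python
from pathlib import Path

CONTEXT_BUDGET_TOKENS = 1_000_000  # window reported for DeepSeek-V4-Pro
CHARS_PER_TOKEN = 4                # crude sizing heuristic, not a tokenizer

def estimate_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN

def answer(prompt: str) -> str:
    """Placeholder for a long-context model call; a real system would send
    the whole prompt to the model API in a single request."""
    return f"analysis over ~{estimate_tokens(prompt):,} tokens of material"

def analyze_archive(doc_dir: str, question: str) -> str:
    """Concatenate an entire filing archive and ask one question over it,
    rather than chunking documents and stitching partial answers together."""
    corpus = "\n\n".join(
        p.read_text(encoding="utf-8", errors="ignore")
        for p in sorted(Path(doc_dir).glob("*.txt"))
    )
    prompt = f"{corpus}\n\nQuestion: {question}"
    if estimate_tokens(prompt) > CONTEXT_BUDGET_TOKENS:
        raise ValueError("archive exceeds the window; fall back to chunking")
    return answer(prompt)
```

The size check before the call is the point of the design: single-pass analysis only beats chunking while the whole archive actually fits, so the fallback path never fully disappears.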
This is significant because many countries and regulated industries want advanced AI without handing core systems and data to a single foreign platform. Previously, sovereign AI was more slogan than product stack; now Cohere’s $600 million raise and partnership with Aleph Alpha point to a concrete attempt to offer locally controlled models, infrastructure, and enterprise tooling.
A European public-sector IT team can plan an AI rollout that keeps sensitive records under regional control instead of sending every workflow to a U.S.-hosted frontier API, which was often the practical default before.
This is significant because it tracks progress on practical computer use instead of polished demo chats. Previously, agents succeeded on only a small share of real computer tasks; now Stanford’s AI Index says success rose from 12% to 66% in one year, suggesting the field is moving from novelty toward dependable automation.
An operations analyst can ask an agent to gather figures across dashboards, update a spreadsheet, and prepare a first-pass report with a much better chance of completion than last year, when these workflows often failed midway through.
This matters because coding assistants become more useful when they fit directly into existing software pipelines. Previously, teams often used AI code help interactively and manually; now Claude Code v2.1.120 adds a non-interactive CI review command that makes automated code review easier to wire into development processes.
A small engineering team can have pull requests automatically reviewed in CI before a human touches them, instead of relying only on ad hoc chat sessions to catch issues after code is already under discussion.
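The digest names the feature but not its exact syntax, so the invocation below is an assumption modeled on Claude Code’s existing non-interactive print mode (`claude -p`); treat it as the shape of the CI hook, not the documented command.

```python
import subprocess
import sys

# Assumption: the digest reports a non-interactive CI review command in
# Claude Code v2.1.120 but not its syntax, so this invocation is modeled
# on the CLI's existing print mode (`claude -p <prompt>`).
REVIEW_CMD = ["claude", "-p", "Review the staged diff for bugs and risky changes"]

def run_ci_review() -> int:
    """Run the review inside CI and propagate a non-zero exit code so the
    job fails before a human reviewer ever opens the pull request."""
    result = subprocess.run(REVIEW_CMD, capture_output=True, text=True)
    print(result.stdout)
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
    return result.returncode

if __name__ == "__main__":
    sys.exit(run_ci_review())
```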
This matters because more capable reasoning models create new oversight problems alongside new capabilities. Previously, developers had fewer shared tools for checking whether a model’s internal reasoning text hinted at cheating or reward hacking; now OpenAI has released open-source evaluations and datasets aimed at making those failure modes easier to detect.