For decades, Olympiad gold in math meant a level of reasoning that clearly separated top human problem-solvers from machines. Last week, a 30B model reportedly hit the gold-medal threshold on both IMO 2025 and USAMO 2026, while Google said it detected the first known case of attackers using AI to build a working zero-day exploit in an active campaign. In one week, AI looked more capable at abstract reasoning and more dangerous in the hands of attackers.
Elsewhere, enterprise adoption kept accelerating. PwC expanded Claude across its global workforce and plans to train 30,000 professionals, while OpenAI launched a new deployment unit backed by a strategic partnership that includes a $500 million Brookfield investment and up to $4 billion for customer buildouts. Anthropic also introduced Glasswing, a program and toolset aimed at using frontier models to find software flaws before criminals do.
The pattern is getting hard to miss: AI is moving from demos into infrastructure. A consultant can now use AI as part of daily client work, a large bank can hire OpenAI engineers to wire models directly into operations, and security teams may soon rely on AI both to discover bugs and to defend against AI-assisted exploitation. At the same time, researchers showed how fragile some safeguards remain, including a result suggesting that altering a single neuron can disable refusal behavior in several models.
Watch the next few weeks for two fronts: whether math-level reasoning results hold up under broader scrutiny, and whether labs can harden models quickly enough as AI capability spills into cybersecurity and high-stakes enterprise systems.
Last week extended the prior momentum in reasoning: a 30B model reportedly reached Olympiad gold thresholds. That result still needs broader validation, however, and does not by itself demonstrate reliable, general expert-level autonomy. The most concrete shifts were outside core AGI capability: Google’s report of an AI-assisted zero-day in the wild and OpenAI’s deeper enterprise deployment show real-world usefulness and risk rising fast, yet they do not substantially narrow the remaining gaps in robustness, breadth, and minimal-oversight operation.
This is significant because Olympiad-level math is a strong test of multi-step reasoning, not just memorization. Previously, top benchmark gains often came from narrow test optimization; now a relatively compact 30B model is being claimed to clear gold-medal thresholds on two elite contests.
Last week built directly on the prior week’s hard-math gains with a stronger reported result at Olympiad level, which is highly relevant to abstract multi-step reasoning. The score only inches up because the claim is reported rather than independently replicated, and math strength still does not imply fully general expert cognition.
Last week added a stronger signal on elite-problem benchmarks through the claimed IMO 2025 and USAMO 2026 gold-threshold performance. That continues the previous week’s FrontierMath momentum, though the category remains below the frontier because broader standardized validation was limited and there was little progress toward closing benchmarks across other domains.
The standout efficiency signal was that a relatively compact 30B model reportedly achieved extremely high math performance, suggesting more capability per parameter. Still, last week brought little direct evidence on training or inference cost reductions, so the score stays essentially unchanged near the prior high level.
Last week did not materially advance vision, audio, video, or robotics integration. Copilot Studio’s computer-use feature touches interface interaction, but it is more an agentic workflow development than a clear multimodal capability jump, so this category stays flat.
Last week showed continued agent progress through Copilot Studio computer use, OpenAI’s deployment push into enterprise workflows, and Anthropic’s Glasswing cyber-defense effort. That builds on the prior week’s 16-hour task-horizon result by showing more real operational embedding, but not a major new autonomy breakthrough.
OpenAI’s new deployment unit, strategic partnership, and Brookfield-backed investment reinforce the prior week’s infrastructure and commercialization momentum. However, last week brought no distinct model-scale or compute-scaling breakthrough, only continued evidence that deployment capacity and demand are still expanding.
A math researcher or contest coach could use a 30B reasoning model to generate and compare full proof strategies for hard geometry or combinatorics problems in minutes, instead of spending an evening exploring only a handful of approaches by hand.
This is significant because it suggests AI has crossed from assisting with common malware tasks into helping produce novel exploit code that worked against a previously unknown software flaw. Previously, zero-day creation was largely treated as an elite human capability; now defenders have a concrete case that AI can contribute to that pipeline.
A corporate security team defending a widely used VPN or browser now has to assume attackers may discover and weaponize new vulnerabilities faster than before, shrinking the time available to patch from weeks to potentially days.
This is significant because OpenAI is moving beyond selling model access and into embedding engineers inside customer organizations to build production systems. Previously, many enterprises experimented with pilots on their own; now a major lab is offering hands-on deployment support backed by large capital commitments.
A Fortune 500 insurer can bring in OpenAI's deployment team to connect models to claims, document, and call-center workflows directly, instead of spending months stitching together vendors and internal prototypes that never leave pilot mode.
This matters because it shows AI adoption is shifting from isolated power users to organization-wide tooling with formal training and change management. Previously, many firms gave small teams chatbot access; now one of the world's largest professional-services networks is standardizing usage across hundreds of thousands of employees.
A PwC tax advisor can use Claude to summarize regulations, draft client-ready memos, and prepare first-pass analysis during a normal workday, instead of bouncing between search, spreadsheets, and manual drafting for each request.
This matters because frontier labs are starting to treat model-enabled cybersecurity as an operational domain, not just a red-team exercise. Previously, bug hunting relied heavily on human researchers and scattered automation; now Anthropic is positioning an advanced model to autonomously surface serious flaws at scale.
A software company with a lean security team could use Glasswing-style tooling to scan codebases and identify high-severity bugs continuously, instead of waiting for annual audits or expensive external penetration tests.
This matters because it suggests some alignment behaviors may depend on surprisingly fragile internal mechanisms. Previously, safety features were often discussed as if they were distributed and robust; now researchers report that altering one feed-forward neuron can disable refusal behavior across several aligned models.
A model safety team can no longer assume that standard fine-tuning alone makes refusal behavior durable, and may need to add stronger monitoring and evaluation before releasing weights that an attacker could modify in minutes.
This matters because AI agents become much more useful when they can operate ordinary business software through interfaces people already use. Previously, many automations required APIs or custom integrations; now agents can click through desktop apps and websites to complete workflows directly.
An operations manager can deploy an agent that copies order data between a legacy desktop system and a web portal overnight, instead of assigning staff to repetitive screen-by-screen data entry every morning.
This matters because governance is catching up in areas where AI affects jobs, loans, and admissions. Previously, people could be judged by automated systems with little transparency; now Colorado is requiring disclosure in high-stakes decision settings.
A job applicant in Colorado may be told when AI helps screen their application, giving them more visibility than before into a process that might otherwise have felt completely opaque.