AI Newsletter

🤖AI Agents Weekly: GPT-5.2, Devstral 2, Measuring Agents in Production, Gemini Deep Research API, Deep RL Course

Dec 13, 2025

In today’s issue:

  • GPT-5.2 scores 100% on AIME

  • Study shows 68% of agents need human intervention

  • OpenAI releases enterprise AI report

  • Mistral launches Devstral 2

  • Gemini Deep Research API launches

  • MCP joins Linux Foundation

  • OpenAI co-founds AAIF

  • ARC Prize 2025 results announced

  • Scaling agent systems study released

  • Agentic AI adaptation survey released

  • Stanford releases deep RL course

  • And all the top AI dev news, papers, and tools.



Top Stories

GPT-5.2

OpenAI releases GPT-5.2, its most capable model series yet, just four weeks after GPT-5.1, responding to competition from Google’s Gemini 3 and Anthropic’s Claude Opus 4.5. The release includes three variants: GPT-5.2 Instant for quick tasks, GPT-5.2 Thinking for complex work, and GPT-5.2 Pro for the most demanding queries. OpenAI claims 30% fewer hallucinations than GPT-5.1, with results delivered at 11x the speed and less than 1% of the cost of human experts.

  • Professional knowledge work: On GDPval, which measures tasks across 44 occupations, GPT-5.2 Thinking beats or ties human experts on 70.9% of comparisons, compared with Claude Opus 4.5 at 59.6% and Gemini 3 Pro at 53.3%.

  • Perfect math score: GPT-5.2 achieves 100% on AIME 2025, the first major model to reach this milestone. On FrontierMath (Tier 1-3), GPT-5.2 Thinking solves 40.3% of expert-level problems, setting a new state of the art.

  • Coding advances: GPT-5.2 Thinking hits 55.6% on SWE-Bench Pro, setting a new state of the art for software engineering tasks.

  • Science and reasoning: GPT-5.2 Pro achieves 93.2% on GPQA Diamond (graduate-level Q&A) and 86.2% on ARC-AGI 1, ahead of Gemini 3 Pro’s 75.0%.
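
For readers who want to try the variants, here is a minimal sketch of routing requests to the appropriate tier through the OpenAI Python SDK. The model identifiers below are assumptions based on the announced names (Instant / Thinking / Pro), not confirmed API strings; check OpenAI's model list before relying on them.

```python
# Hypothetical sketch: routing requests to GPT-5.2 variants via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed model IDs, mirroring the announced variant names.
VARIANTS = {
    "quick": "gpt-5.2-instant",     # fast, low-latency tasks
    "complex": "gpt-5.2-thinking",  # multi-step reasoning work
    "demanding": "gpt-5.2-pro",     # hardest queries, highest cost
}

def ask(prompt: str, tier: str = "quick") -> str:
    """Send a prompt to the variant matching the task tier."""
    response = client.chat.completions.create(
        model=VARIANTS[tier],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(ask("Summarize the GDPval benchmark in one sentence.", tier="complex"))
```
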

Blog


Measuring Agents in Production

Researchers present the first large-scale systematic study of AI agents in production, surveying 306 practitioners and conducting 20 in-depth case studies across 26 domains. The findings reveal that successful production agents rely on simple, controllable approaches rather than complex autonomous systems, with reliability remaining the top challenge.

  • Simple approaches win: 68% of production agents execute at most 10 steps before requiring human intervention. Complex multi-step autonomous workflows remain rare in successful deployments (see the sketch after this list).

  • Prompting over fine-tuning: 70% rely on prompting off-the-shelf models rather than fine-tuning weights, suggesting that effective agent development focuses on prompt engineering and orchestration rather than model customization.

  • Human evaluation dominates: 74% depend primarily on human assessment for evaluation. Automated evaluation methods for agents remain underdeveloped for production use cases.

  • Reliability is the top challenge: Ensuring and evaluating agent correctness drives most development difficulties. Organizations struggle to verify agent behavior meets production requirements.

  • Research-practice gap: The study bridges academic research and real-world deployment, offering researchers visibility into production constraints while providing practitioners with proven patterns from successful deployments.
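
Below is a minimal sketch of the pattern the survey describes: a prompted, off-the-shelf model driving a short loop with a hard step budget that escalates to a human reviewer rather than running autonomously. All names here (call_model, run_agent, MAX_STEPS) are illustrative, not taken from the paper.

```python
# Sketch of a bounded agent loop with human handoff, assuming a prompted
# off-the-shelf model (no fine-tuning) behind call_model().
from dataclasses import dataclass, field

MAX_STEPS = 10  # matches the survey's "at most 10 steps" observation

@dataclass
class AgentRun:
    task: str
    steps: list[str] = field(default_factory=list)
    needs_human: bool = False

def call_model(task: str, history: list[str]) -> str:
    """Placeholder for a chat-completion call that sees the task and prior steps."""
    # A real implementation would prompt the model with the task plus tool results.
    return "DONE" if len(history) >= 3 else f"step-{len(history) + 1}"

def run_agent(task: str) -> AgentRun:
    run = AgentRun(task=task)
    for _ in range(MAX_STEPS):
        action = call_model(task, run.steps)
        if action == "DONE":
            return run              # finished within the step budget
        run.steps.append(action)    # execute the chosen tool call here
    run.needs_human = True          # budget exhausted: hand off to a person
    return run

result = run_agent("Reconcile last week's invoices")
if result.needs_human:
    print("Escalating to a human reviewer after", len(result.steps), "steps")
else:
    print("Completed autonomously in", len(result.steps), "steps")
```

The design choice this illustrates is the survey's central finding: keeping the loop short and the exit conditions explicit makes agent behavior easier to verify, which is where most teams report their reliability problems.
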

Paper
