🤖 AI Agents Weekly: GPT-5, Genie 3, gpt-oss, Cursor CLI, Opus 4.1, Efficient AI Agents
In today’s issue:
OpenAI announces GPT-5
DeepMind introduces Genie 3
OpenAI releases gpt-oss models
Gemini CLI Deep Dive
Cursor releases Cursor CLI
Anthropic announces Opus 4.1
Groq has released Groq Code CLI
New research on designing efficient AI Agents
Top AI dev news, papers, tools, and much more.
Top Stories
GPT-5
OpenAI’s GPT-5 is the company’s most advanced model to date, unifying fast responses and deep reasoning within a single system that adapts dynamically to task complexity. A built-in router selects between a lightweight model for simple queries and a “GPT-5 thinking” mode for harder problems, and falls back to a mini model once usage limits are hit. GPT-5 significantly improves factual accuracy, instruction following, and style control while reducing hallucinations, sycophancy, and deceptive outputs.
It achieves state-of-the-art results in math (94.6% AIME 2025), coding (74.9% SWE-bench Verified), multimodal reasoning (84.2% MMMU), and health (46.2% HealthBench Hard), with GPT-5 Pro delivering even higher performance on complex, expert-level tasks.
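To make the routing idea concrete, here is a minimal Python sketch of how a router like the one described above might behave. It is a toy illustration, not OpenAI's implementation: the tier names, the `Router` class, the quota counter, and the complexity heuristic are all hypothetical, and the signals the real router uses are not described in this issue.

```python
from dataclasses import dataclass

# Hypothetical tier labels for illustration; these are not real API model identifiers.
FAST = "fast"
THINKING = "thinking"
MINI = "mini"


@dataclass
class Router:
    """Toy router: picks a tier per query and falls back to MINI once the quota is spent."""

    quota: int                          # assumed request budget for the full-size tiers
    complexity_threshold: float = 0.5   # assumed cutoff between FAST and THINKING
    used: int = 0                       # requests already served by full-size tiers

    def estimate_complexity(self, query: str) -> float:
        # Placeholder heuristic (an assumption): longer queries and
        # reasoning-style cue words score higher, capped at 1.0.
        cues = ("prove", "debug", "step by step", "why")
        score = min(len(query.split()) / 50, 1.0)
        if any(cue in query.lower() for cue in cues):
            score = max(score, 0.8)
        return score

    def route(self, query: str) -> str:
        # Once the budget is exhausted, every query goes to the mini fallback.
        if self.used >= self.quota:
            return MINI
        self.used += 1
        if self.estimate_complexity(query) >= self.complexity_threshold:
            return THINKING
        return FAST


if __name__ == "__main__":
    router = Router(quota=2)
    for q in (
        "What is the capital of France?",
        "Debug this race condition in my scheduler",
        "hello",
    ):
        print(q, "->", router.route(q))
```

The design point is that the tier decision is made per query rather than per conversation, so a short factual question and a hard debugging request in the same session can land on different tiers, and the fallback only applies after the budget runs out.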
Key advances include:
Domain-specific improvements – Stronger coding capabilities (especially in complex front-end generation and repo-scale debugging), richer creative writing, and enhanced health advice with proactive question-asking and contextual adaptation.
Evaluation gains – Outperforms GPT-4o and o3 across math, visual reasoning, agentic tool use, and economically valuable work, with GPT-5 Pro preferred by experts 67.8% of the time on challenging prompts.
Reliability and safety – Hallucinations reduced by ~45% vs GPT-4o and ~80% vs o3; less than half the deception rate of o3; new “safe completions” training enables nuanced responses in dual-use domains; robust safeguards in high-risk bio/chemistry contexts.
User experience enhancements – Reduced unnecessary agreement, improved custom instruction following, and new preset personalities for conversational style control.
Efficiency – Delivers better results with 50–80% fewer reasoning tokens than o3, dynamically allocating computation.