🤖AI Agents Weekly: LLMs in 2025, YOLO in the Sandbox, Plan Caching for Agents, DeepTutor

LLMs in 2025, YOLO in the Sandbox, Plan Caching for Agents, DeepTutor

Jan 03, 2026

In today’s issue:

Simon Willison’s 2025: The Year in LLMs
Prime Intellect announces Recursive Language Models
AgentReuse: Plan Caching for LLM-Driven Agents
Meta Acquires Manus AI for $2B+
YOLO in the Sandbox: How AI Agents Bypass Restrictions
OpenHands IDE Integration via Agent Client Protocol
TurboDiffusion: 119x Video Generation Speedup
DeepTutor: AI-Powered Learning Assistant
SAGA: Goal-Evolving Agents for Scientific Discovery
Step-DeepResearch: Cost-Effective Deep Research Agent

And all the top AI dev news, papers, and tools.

Top Stories

Simon Willison’s 2025: The Year in LLMs

Simon Willison’s annual year-end review covers 26 major trends that defined 2025 in the LLM space. This comprehensive analysis spans reasoning models, coding agents, Chinese open-weight dominance, and the normalization of risky AI practices.

Coding agents breakthrough: Claude Code's quiet February launch proved to be the most impactful event of 2025, with the product reaching $1bn in run-rate revenue by December. Every major lab released a CLI coding agent: Claude Code, Codex CLI, Gemini CLI, Qwen Code, and Mistral Vibe.
Reasoning models unlock agents: The real value of reasoning turned out to be driving tools. Models can now plan multi-step tasks, execute them, and reason about results to update plans. AI-assisted search actually works now, and reasoning models excel at debugging complex codebases.
Chinese models dominate open-weight: GLM-4.7, Kimi K2 Thinking, MiMo-V2-Flash, DeepSeek V3.2, and MiniMax-M2.1 now top the Artificial Analysis rankings. DeepSeek R1’s January release triggered NVIDIA’s $593bn market cap loss.
YOLO mode and normalization of deviance: Running agents without safety confirmations feels like a different product but poses serious risks. Johann Rehberger warns we’re approaching a Challenger disaster moment as risky behaviors become normalized.
Long task capabilities exploding: METR data shows that the length of task AI can complete is doubling roughly every 7 months. GPT-5 and Claude Opus 4.5 can now perform tasks that take humans multiple hours, up from under 30 minutes in 2024.
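
To make the 7-month doubling claim concrete, here is a minimal sketch of the arithmetic. It assumes steady exponential growth and uses an illustrative 30-minute starting horizon; the function name and the baseline are hypothetical choices for this example, not METR's methodology or data.

```python
def projected_horizon_minutes(baseline_minutes: float,
                              months_elapsed: float,
                              doubling_months: float = 7.0) -> float:
    """Task horizon after `months_elapsed`, assuming it doubles every `doubling_months`."""
    return baseline_minutes * 2 ** (months_elapsed / doubling_months)


# Illustrative baseline only: a 30-minute task horizon at month 0.
for months in (0, 7, 14, 21):
    minutes = projected_horizon_minutes(30, months)
    print(f"after {months:2d} months: {minutes / 60:.1f} hours")
```

Under these assumptions, a 30-minute horizon reaches multiple hours in a little over a year, which is consistent with the jump the bullet above describes.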
