🤖 AI Agents Weekly: Claude Opus 4.8, Claude Code Dynamic Workflows, Chrome DevTools for Agents 1.0, DeepSWE, Agent Harness Scaling Laws, and More
In today’s issue:
AutoScientists self-organize agent teams
Anthropic ships Claude Opus 4.8
Claude Code adds dynamic workflows
Chrome DevTools for agents hits 1.0
DeepSWE raises the coding-agent bar
xAI opens grok-build-0.1 in beta
Microsoft open-sources Webwright for agents
Scaling laws for agent harnesses land
Harness sensitivity proves non-monotone
SIA co-updates harness and weights
CUA-Gym scales computer-use RL data
Polar trains agents on real harnesses
Anthropic details how it contains Claude
Xiaomi slashes MiMo-V2.5 API prices
Language models learn to sleep
a16z maps the AI application layer
And all the top AI dev news, papers, and tools.
Top Stories
AutoScientists Self-Organize for Long-Running Science
Harvard’s Zitnik Lab introduced AutoScientists, a decentralized multi-agent system for long-running computational science where agents self-organize around promising research directions instead of following a fixed plan.
Self-organizing teams: Agents form around promising directions and vet proposals before allocating resources, so compute goes only to ideas that survive review.
Learning from failure: The system documents failures as well as successes, building a record that steers future exploration.
Validated broadly: Reaches a 74.4% mean leaderboard percentile on biomedical ML, converges 1.9x faster on language model training, and posts gains on protein fitness.
Claude Opus 4.8 Sharpens Agentic Judgment
Anthropic released Claude Opus 4.8, an incremental upgrade over Opus 4.7 tuned for sharper judgment, more honesty about its own progress, and longer independent runs.
Agentic gains: Posts 84% on Online-Mind2Web for computer-use and browser-agent tasks, and the team reports it is roughly 4x less likely than its predecessor to overlook code flaws.
Self-correction and honesty: Early testers cite improved reliability, better self-correction, and more accurate reporting of how far it has actually gotten on a task.
New controls: Ships alongside dynamic workflows, an effort control to dial response intensity, and a Systems API update that lets you change instructions mid-task without breaking the prompt cache.
Why it matters: The honesty and judgment gains target the exact failure modes that break long-horizon agents, where a model that overstates progress derails an entire run.
Available today via the claude-opus-4-8 API identifier at the same price as before ($5/$25 per million tokens), with a 3x cheaper Fast mode.
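To get a feel for what that pricing means for a long agent run, here is a rough back-of-the-envelope sketch. It assumes "$5/$25" means $5 per million input tokens and $25 per million output tokens, which is the usual ordering but is not spelled out above. The function name and the token counts are hypothetical, and Fast mode pricing is left out because the exact rate is not given.

```python
# Assumption: "$5/$25 per million tokens" is read as input / output prices in USD.
INPUT_USD_PER_MTOK = 5.0
OUTPUT_USD_PER_MTOK = 25.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a run under the assumed per-million-token prices."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical agent run: 200k input tokens and 20k output tokens.
    print(f"${estimate_cost_usd(200_000, 20_000):.2f}")  # prints $1.50
```

Under these assumptions, output tokens cost five times as much as input tokens, so verbose agent transcripts tend to dominate the bill. Prompt caching discounts are not modeled here.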


