AI Newsletter

AI Newsletter

🤖 AI Agents Weekly: Claude Opus 4.8, Claude Code Dynamic Workflows, Chrome DevTools for Agents 1.0, DeepSWE, Agent Harness Scaling Laws, and More

Claude Opus 4.8, Claude Code Dynamic Workflows, Chrome DevTools for Agents 1.0, DeepSWE, Agent Harness Scaling Laws, and More

May 30, 2026
∙ Paid

In today’s issue:

  • AutoScientists self-organize agent teams

  • Anthropic ships Claude Opus 4.8

  • Claude Code adds dynamic workflows

  • Chrome DevTools for agents hits 1.0

  • DeepSWE raises the coding-agent bar

  • xAI opens grok-build-0.1 in beta

  • Microsoft open-sources Webwright for agents

  • Scaling laws for agent harnesses land

  • Harness sensitivity proves non-monotone

  • SIA co-updates harness and weights

  • CUA-Gym scales computer-use RL data

  • Polar trains agents on real harnesses

  • Anthropic details how it contains Claude

  • Xiaomi slashes MiMo-V2.5 API prices

  • Language models learn to sleep

  • a16z maps the AI application layer

And all the top AI dev news, papers, and tools.



Top Stories

AutoScientists Self-Organize for Long-Running Science

AutoScientists

Harvard’s Zitnik Lab introduced AutoScientists, a decentralized multi-agent system for long-running computational science where agents self-organize around promising research directions instead of following a fixed plan.

  • Self-organizing teams: Agents form around promising directions and vet proposals before allocating resources, so compute goes only to ideas that survive review.

  • Learning from failure: The system documents failures as well as successes, building a record that steers future exploration.

  • Validated broadly: Reaches a 74.4% mean leaderboard percentile on biomedical ML, 1.9x faster convergence on language model training, and gains on protein fitness.

Paper


Claude Opus 4.8 Sharpens Agentic Judgment

Claude Opus 4.8

Anthropic released Claude Opus 4.8, an incremental upgrade over Opus 4.7 tuned for sharper judgment, more honesty about its own progress, and longer independent runs.

  • Agentic gains: Posts 84% on Online-Mind2Web for computer-use and browser-agent tasks, and the team reports it is roughly 4x less likely than its predecessor to overlook code flaws.

  • Self-correction and honesty: Early testers cite improved reliability, better self-correction, and more accurate reporting of how far it has actually gotten on a task.

  • New controls: Ships alongside dynamic workflows, an effort control to dial response intensity, and a Systems API update that lets you change mid-task instructions without breaking the prompt cache.

  • Why it matters: The honesty and judgment gains target the exact failure modes that break long-horizon agents, where a model that overstates progress derails an entire run.

Available today via the claude-opus-4-8 API identifier at the same price as before ($5/$25 per million tokens), with a 3x cheaper Fast mode.

Blog

This post is for paid subscribers

Already a paid subscriber? Sign in
© 2026 elvis · Privacy ∙ Terms ∙ Collection notice
Start your SubstackGet the app
Substack is the home for great culture