AI Newsletter

🤖 AI Agents Weekly: Claude Opus 4.8, Claude Code Dynamic Workflows, Chrome DevTools for Agents 1.0, DeepSWE, Agent Harness Scaling Laws, and More

Sat, 30 May 2026 15:02:03 GMT

In today’s issue:

AutoScientists self-organize agent teams
Anthropic ships Claude Opus 4.8
Claude Code adds dynamic workflows
Chrome DevTools for agents hits 1.0
DeepSWE raises the coding-agent bar
xAI opens grok-build-0.1 in beta
Microsoft open-sources Webwright for agents
Scaling laws for agent harnesses land
Harness sensitivity proves non-monotone
SIA co-updates harness and weights
CUA-Gym scales computer-use RL data
Polar trains agents on real harnesses
Anthropic details how it contains Claude
Xiaomi slashes MiMo-V2.5 API prices
Language models learn to sleep
a16z maps the AI application layer

And all the top AI dev news, papers, and tools.

🥇Top AI Papers of the Week

Sun, 24 May 2026 15:01:51 GMT

1. Code as Agent Harness

A 100+ page survey treating the agent harness as a first-class research object rather than glue around an LLM. The authors argue that code-as-harness is the most promising path to general-purpose agency, and that future agent systems should satisfy four properties: executable, inspectable, stateful, and governed. The report consolidates methods, applications, and open problems across the harness layer.

Harness engineering as a discipline: The paper frames harness design as a science distinct from model training, with its own primitives, failure modes, and evaluation criteria. The taxonomy gives a vocabulary for comparing systems that has been missing in prior agent literature.
Four-property test for production agents: Executable, inspectable, stateful, and governed. Each property maps to a class of operational concerns. The authors use it to audit current open-source agent frameworks and identify where defaults fall short.
Code as the unifying substrate: Across browsing, tool use, and multi-step reasoning, harnesses that compile decisions into code consistently outperform JSON-call orchestration on the surveyed benchmarks. The paper traces this back to determinism, composability, and inspectability of the resulting traces.
Why it matters: If code-as-harness is the right substrate, then the next round of agent-system progress will come from harness-level innovation rather than from new base models. The survey gives builders a structured reference for that work.

Paper | Tweet

Message from our Sponsor

Intology released NanoGPT-Bench, a benchmark that drops agents into the NanoGPT Speedrun environment at the September 2025 human world record and measures how much of the next five months of community progress they can recover autonomously.

Claude Code, Codex, and Autoresearch each ran 320 to 455 training variants on a 512 H100-hour budget and recovered under 10% of the human speedup, mostly via hyperparameter tuning rather than algorithmic research.

2. OpenAI Disproves the Unit Distance Conjecture

This OpenAI paper disproves Erdős’s 1946 unit distance conjecture. For a finite planar set P, let ν(P) count the unordered pairs at distance exactly 1, and let ν(n) be the maximum of ν(P) over all n-point sets. Erdős conjectured ν(n) ≤ n^(1+C/log log n); the paper proves instead that there is a fixed δ greater than 0 with ν(n) ≥ n^(1+δ) for infinitely many n. The result was produced in a completely automated fashion by an internal OpenAI model and then human-edited into the present exposition.

The theorem: There exists an absolute constant δ greater than 0 and infinitely many n for which ν(n) ≥ n^(1+δ). This contradicts the widely believed conjecture, which earlier results on generic and most planar norms had appeared to support.
The construction: It passes through an infinite unramified tower of totally real number fields with 3-power Galois groups of growing degree, in which a fixed set of rational primes splits completely. After adjoining i, these fields produce high-dimensional lattices with many elements whose images have absolute value 1 under every complex embedding. The construction is a high-dimensional analogue of the arithmetic behind Erdős’s classical square-grid lower bound.
Why it works: Golod-Shafarevich theory guarantees an infinite tower exists, even after a quotient step that trivializes the prescribed Frobenius classes. A crucial property is that all resulting discriminants and class numbers stay at most exponential in the extension degree.
Statement on AI use: The internal model was given an AI-written problem statement, and its output was checked by an AI grading pipeline before any human examined it. After AI-assisted verification and rewriting, a draft was sent to external mathematicians, including number theory experts, who confirmed the proof’s correctness and have since simplified and strengthened the argument.

Paper | Tweet

3. Memory as a Model

MeMo augments any frozen LLM with a separately trained memory model that stores, retrieves, and integrates facts on the base model’s behalf. Memory updates are decoupled from base-model weight updates, so the system supports continual learning without catastrophic forgetting, a property RAG fails to deliver because a vector store is just a database with a learned encoder bolted on.

Memory as a learned subsystem: MeMo has explicit read, write, and integrate interfaces rather than relying on the context window. The position is that memory in agents should be modular, learned, and gated.
Decoupled update schedule: New facts are absorbed through the memory model’s training loop without touching backbone weights. This makes weekly knowledge updates feasible without retraining and without vector-DB churn.
Continual-learning robustness: Across the evaluated tasks, the system retains old knowledge while ingesting new knowledge, addressing a known failure mode of fine-tuning and a known limitation of retrieval-based memory.
Why it matters: Most production agent systems still bolt a vector store onto an LLM and call it memory. MeMo proposes that memory should be a trained component with explicit interfaces, which has implications for how long-running agent platforms are architected.

Paper | Tweet

4. AIRA

Meta’s AIRA is an agent system that autonomously discovers neural architectures, producing models that beat Llama 3.2 at 350M, 1B, and 3B scales under a 24-hour compute budget. The search is split across two specialized agents: AIRA-Compose searches macro architecture, and AIRA-Design implements the low-level mechanisms. The split outperforms a single end-to-end agent on this non-toy search problem.

Two-agent decomposition: A planner picks structure; an implementer fills in mechanisms. This pattern generalizes well beyond neural architecture search to pipeline assembly, query planning, prompt scaffolding, and tool-use programs.
Beats Llama 3.2 at three scales under budget: Discovered architectures match or exceed Llama 3.2 at 350M, 1B, and 3B parameter scales within a 24-hour compute budget for the search itself. That is competitive with months of human-led ablation studies.
Search not synthesis: The discovered models are not LLM-written code patches grafted into a framework. They are full architectures discovered through structured search guided by the two-agent loop.
Why it matters: If agentic search can produce competitive architectures end to end, then NAS and large parts of the ML research workflow become candidates for automation by agent systems rather than by hand-engineered search algorithms.

Paper | Tweet

5. Weak-Model Critic-Comparator

GPT-5.4 nano wrapped in a critic-comparator orchestration loop reaches 76.4% on SWE-bench Verified, matching standalone Gemini 3 Pro and Claude Opus 4.5 Thinking. The trick is to sample k=8 candidate patches from the weak model and select the winner using execution and proof signals rather than asking the model to self-rank.

k=8 candidates plus verifier beats frontier model: A weak model’s top-k often already contains a correct patch. The selector is the limiting factor, not the base model’s capability.
Execution and proof signals as selection: Candidates are run and verified rather than scored by an LLM judge. The critic and comparator are separate roles inside the loop, each with a narrow task.
Matches frontier performance at lower per-call cost: Selecting among nano-tier proposals is cheaper than calling a frontier model once, even after accounting for the 8x sampling, because the dominant cost driver is model size rather than call count.
Why it matters: This is a reproducible recipe for getting frontier-level coding-agent results out of cheaper models. The result also reframes where SWE-bench progress is coming from: orchestration quality, not just stronger base models.

Paper | Tweet

6. MetaCogAgent

MetaCogAgent equips a multi-agent system with metacognition, so each agent decides whether it should answer or delegate. The bottleneck in current multi-agent systems is over-delegation and under-delegation, and a metacognitive gate is a principled way to manage both. The Metacognitive Unit (MCU) at each agent produces confidence scores that drive routing to a delegation hub.

Confidence-driven routing: Each agent’s MCU combines verbalized and profile-based confidence into a single score. Low-confidence tasks route to a delegation hub rather than getting answered anyway.
Self-aware specialization beats fixed routers: MetaCogAgent reaches 82.4% on MetaCog-Eval, versus 70.2% for a skill-fixed router and 65.3% for single-agent. Self-assessment and adaptive delegation each contribute material gains in ablations.
Emergent specialization: Distinct confidence profiles (high on coding, low on retrieval, etc.) emerge purely from feedback. No specialization is encoded beyond initial system prompts.
Why it matters: Multi-agent systems usually rely on fixed routers or simple round-robin schemes. A learned, uncertainty-aware delegation gate gives a primitive that adapts to task difficulty without retraining the routing layer.

Paper | Tweet

7. Production Agent Architecture Methodology

A methodology paper on selecting and composing runtime architecture patterns for production LLM agents. The core argument is that most teams accidentally let framework defaults make critical architecture decisions for them. The paper introduces the stochastic-deterministic boundary (SDB) as a named primitive and presents a six-pattern catalog organized by the three runtime concerns of coordination, state, and control.

Stochastic-deterministic boundary: A four-part contract of proposer, verifier, commit, and reject that marks where the LLM hands off to deterministic infrastructure. The paper inventories how five widely used open-source agent frameworks place this boundary, often implicitly.
Three-by-six pattern catalog: Six patterns organized along three orthogonal concerns. Coordination patterns answer how work splits and combines. State patterns answer how the system remembers. Control patterns answer who decides what runs and when to stop.
Patterns as deliberate choices: Each pattern has a typed-contract specification of input type, output type, deadline, retry budget, and partial-result policy. The catalog grows by passing this procedure rather than by adding ad-hoc abstractions.
Why it matters: Production agent failures rarely come from the LLM. They come from architectural choices that were made by default. The methodology gives teams a way to surface those choices and make them deliberately.

Paper | Tweet

8. NanoGPT-Bench

A new evaluation of whether coding agents can do real AI R&D. Intology runs Codex, Claude Code, and Autoresearch on the NanoGPT-Bench suite and reports that the agents recover only 9.3% of human progress on the same problems. Coding agents spend the bulk of their compute on hyperparameter tuning and rarely attempt algorithmic research. Claude Code and Autoresearch reason about algorithmic changes more often, but still tend to dodge implementing them. The headline result tempers the current wave of “self-improving agent” claims: producing real research progress requires a different distribution of effort than the one current coding agents converge to under their default scaffolds.

Paper | Tweet

9. General-Agent

Prime Intellect’s General-Agent is a fully synthetic reinforcement learning environment whose task corpus self-evolves and grows harder over time. The release ships with 4,504 tool-use tasks across 1,040 domains and 8,159 unique tools. Synthetic task creation is formulated as a two-player game between a Synthesizer that proposes new task families and a Solver that runs rollouts to measure pass rates. Tasks whose pass rate falls inside a calibrated difficulty band are accepted into the corpus, and hard tiers seed the next round of extensions. The framing turns RL environment creation, historically a major bottleneck, into an automated agentic search problem in its own right.

Paper | Tweet

10. Contrastive Neuron Attribution

Nous Research releases Contrastive Neuron Attribution (CNA), a method for steering LLM behavior by identifying and ablating sparse circuits in the MLP basis without training a sparse autoencoder, modifying weights, or degrading general capability benchmarks. Given a small set of contrastive prompt pairs that elicit a target behavior and its opposite, CNA isolates the top 0.1% of MLP neurons whose activations differ most between the two sets. Ablating that small circuit removes the behavior while leaving the rest of the model intact. The intervention remains robust at high strengths where residual-stream methods like Contrastive Activation Addition (CAA) start to degrade. Validated on the refusal circuit across 8 instruct-tuned models including Llama-3.1-70B, Llama-3.2-3B, Qwen2.5-72B, and Qwen2.5-14B.

Paper | Tweet

🤖 AI Agents Weekly: Gemini 3.5 Flash, Antigravity 2.0, Codex Thursday, Cohere Command A+, Qwen3.7-Max, and More

Sat, 23 May 2026 15:02:42 GMT

In today’s issue:

Google ships Gemini 3.5 Flash for agents
Antigravity 2.0 becomes a full agent platform
OpenAI ships Appshots and /goal in Codex
Cohere open-sources Command A+ on Apache 2.0
Qwen3.7-Max runs agents for 35 hours straight
NVIDIA verifies agent skills
Cursor Composer 2.5 sharpens coding agents
Anthropic acquires Stainless for SDK tooling
Browserbase opens Browse.sh skills catalog
Gemini Omni unifies create-anything model
OpenAI cracks an 80-year Erdős problem
Compiling agent workflows into model weights
PEEK orientation cache for long-context agents
SaaS-Bench exposes computer-use agent ceiling

And all the top AI dev news, papers, and tools.

🥇Top AI Papers of the Week

Sun, 17 May 2026 15:02:22 GMT

The Top AI Papers of the Week (May 11 - May 17)

1. Lighthouse Attention

Nous Research proposes a training-only attention wrapper for long-context pretraining. Lighthouse Attention wraps standard SDPA with a hierarchical, gradient-free selection layer that compresses and decompresses queries, keys, and values symmetrically while preserving left-to-right causality. The wrapper is removed near the end of training in a short recovery phase, so the deployed model runs vanilla attention with no architectural change at inference. Preliminary LLM experiments report faster total training time and lower final loss than full-attention baselines.

Subquadratic wrapper with vanilla deployment: The hierarchical selector reduces the cost of long-context training without modifying the underlying attention operator. After the recovery phase, the trained weights are compatible with standard SDPA at inference.
Symmetric compression preserves causality: Queries, keys, and values are compressed and decompressed through the same hierarchy, which keeps the wrapper compatible with left-to-right attention.
Training-time speedup at lower final loss: Preliminary runs report faster wall-clock training and lower final loss than full-attention baselines under matched FLOPs, including 21x faster forward latency at 512K context.
Why it matters: A training-only modification that leaves the deployed model unchanged sidesteps the usual deployment-time tradeoffs of efficient-attention methods.

Paper | Tweet

Message from the Editor

We just released new hands-on labs on DAIR.AI Academy to help you build alongside agents. Start with practical, guided labs for agentic image generation and building your first agent skill, with more labs coming soon.

Enroll

2. Is Grep All You Need?

The paper evaluates grep-style text search against embedding-based retrieval inside coding agents. When wrapped in a suitable agent harness, grep matches or exceeds embedding retrieval on coding-agent tasks. The study isolates the contribution of the harness from the contribution of the retrieval primitive, and finds that harness design accounts for most of the performance differential typically attributed to embeddings.

Direct comparison of grep vs. embeddings: Coding-agent tasks evaluated under controlled conditions show grep-based retrieval reaching parity with or exceeding embedding-based retrieval.
Harness design as the dominant variable: Holding the index constant and varying the harness produces larger performance shifts than the inverse, indicating that retrieval comparisons in prior work have likely been confounded by harness differences.
Implications for codebase structure: Grep performs best when the codebase is properly indexed and structured for an agent to navigate, while embedding retrieval can partially compensate for unstructured input.
Why it matters: Vector databases are a common default in coding-agent stacks. The result suggests that for many coding tasks, harness improvements and basic text search can substitute for embedding infrastructure.

Paper | Tweet

3. A Geometric Calculator Inside a Neural Network

Goodfire reports mechanistic interpretability work identifying a geometric calculator inside an LLM. The model represents numbers as Fourier features, where circles in activation space correspond to numbers modulo a given base. Arithmetic operations are implemented as rotations of these circles, forming a variant of a residue number system that does not require coprime moduli. The same circuit appears to be reused beyond arithmetic.

Numbers as rotating circles: Numerical quantities are encoded as positions on circles in activation space, with addition implemented as rotation. The encoding extends prior findings that LLMs represent numbers via Fourier features.
Residue-system-like structure: The set of circles forms a residue number system variant. Unlike the textbook residue system, the moduli do not need to be coprime, which is the mechanistic detail the paper introduces.
Reuse beyond arithmetic: The same rotational machinery shows up in non-math contexts inside the model, suggesting the geometric calculator is a general-purpose internal structure rather than a math-specific subnetwork.
Why it matters: The finding gives interpretability researchers a concrete, reproducible circuit to target and connects geometric representation analysis to functional behavior beyond toy settings.

Paper | Tweet

4. δ-mem

δ-mem augments a frozen full-attention model with a compact online associative-memory state. The state is a fixed-size matrix updated by delta-rule learning during generation, and its readout produces low-rank corrections to the backbone’s attention output. There is no fine-tuning, no backbone swap, and no context extension.

Frozen backbone: The base model weights are unchanged. δ-mem adds a small online state plus a pair of low-rank read and write projections.
Delta-rule update integrated into attention: The memory matrix is updated by delta-rule learning during generation, and the readout produces additive query and output corrections to the attention computation rather than functioning as a separate retrieval step.
Results from an 8x8 state: An 8x8 online memory lifts the frozen backbone’s average score by 1.10x and beats the strongest non-δ-mem memory baseline by 1.15x. On memory-heavy benchmarks the gap widens: 1.31x on MemoryAgentBench and 1.20x on LoCoMo. General capabilities are largely preserved.
Why it matters: The mechanism offers an alternative to context extension and external retrieval for long-horizon memory, with minimal deployment overhead on frozen frontier models.

Paper | Tweet

5. Beyond Individual Intelligence

A multi-agent systems survey covering 200+ papers, organized along three axes: collaboration mechanisms, failure attribution, and self-evolution. Each axis is treated as a distinct research line. The self-evolution chapter maps how memory, meta-learning, and procedure-editing approaches intersect.

Three orthogonal axes: Collaboration mechanisms cover who communicates with whom and how. Failure attribution covers methods for localizing errors across agents. Self-evolution covers how a system updates its own behavior over time.
Failure attribution as a first-class topic: Errors propagate through coordination protocols in multi-agent systems, making attribution difficult. The survey treats attribution methodology as a research area rather than a debugging activity.
Self-evolution as a field map: The chapter identifies overlap between memory work, meta-learning, and procedure-editing approaches, and surfaces open questions in each area.
Why it matters: The taxonomy provides a vocabulary for comparing multi-agent systems along axes that prior work has often conflated.

Paper | Tweet

6. AutoTTS

AutoTTS reframes test-time scaling as a search problem. Instead of designing branching, pruning, and stopping heuristics directly, the user constructs a discovery environment in which TTS strategies are searched automatically. Width-depth TTS is recast as controller synthesis over pre-collected reasoning trajectories and probe signals, so candidate controllers can be evaluated without repeated LLM calls.

Discovery environment plus offline evaluator: The human specifies states, actions, and feedback. An explorer LLM iteratively proposes candidate controllers. Controllers are evaluated against pre-collected trajectories rather than by re-sampling the base model.
Beta parameterization and trace-level feedback: Beta parameterization makes the controller space tractable for search. Execution-trace feedback gives the explorer information about why a candidate failed, not only that it did.
Results on math reasoning benchmarks: Discovered controllers outperform hand-designed TTS recipes on the accuracy-cost Pareto frontier and transfer zero-shot to held-out benchmarks and model scales. Total discovery cost: $39.9 and 160 minutes.
Why it matters: Automated search over TTS strategies is competitive with hand-tuned heuristics at low cost, which shifts where the research effort needs to go.

Paper | Tweet

7. AI Co-Mathematician

Google DeepMind presents AI Co-Mathematician, an agentic research workbench for mathematicians. The system is an asynchronous, stateful environment that supports ideation, literature discovery, computational analysis, theorem verification, and knowledge development across long sessions. It reaches 48% on FrontierMath Tier 4, a new high among AI systems evaluated.

Asynchronous stateful workbench: The system runs as a persistent environment with multiple workstreams a mathematician can drive in parallel. Long-running computations, literature searches, and verification steps run in the background.
Manages uncertainty and intent: The workbench records unsuccessful attempts, clarifies user intent when underspecified, and emits formal mathematical outputs that can be checked rather than only read.
48% on FrontierMath Tier 4: A new high score on the hardest tier of FrontierMath among AI systems evaluated. Early applications produced solved open problems, fresh research directions, and recovered overlooked citations during active research sessions.
Why it matters: The workbench design pattern (asynchronous, stateful, multi-workstream) generalizes to expert workflows where sessions span days rather than minutes.

Paper | Tweet

8. AEvo

AEvo separates the iterative self-improvement loop into two roles: a candidate-proposer that generates the next attempt, and a meta-agent that observes traces and edits the procedure used to propose future candidates. Past runs (candidates, feedback, traces, failures) function as memory the meta-agent reads from when revising the procedure. AEvo reports a 26% relative gain over the strongest evolution baseline on agentic and reasoning benchmarks, and SOTA on three open-ended optimization tasks under the same iteration budget. The work demonstrates one way to operationalize accumulated agentic search logs as input to procedure-level updates rather than discarding them after each run.

Paper | Tweet

9. The Memory Curse in LLM Agents

A study of how long histories affect LLM agent behavior. Across 7 LLMs and 4 social dilemma games over 500 rounds, expanding accessible history degraded cooperation in 18 of 28 model-game combinations. Lexical analysis of 378,000 reasoning traces shows the mechanism is erosion of forward-looking intent rather than increased suspicion: long histories pull the model toward reasoning about past interactions rather than future payoffs. A LoRA adapter trained only on forward-looking traces mitigates the decay and transfers zero-shot to new games. Memory sanitization, which keeps prompt length fixed but swaps in synthetic cooperative records, restores cooperation, indicating the trigger is content rather than length. Ablating explicit chain-of-thought often reduces the collapse, suggesting deliberation amplifies the effect. The paper provides a diagnostic plus interventions for long-running agent systems where history quality, not just history length, drives behavior.

Paper | Tweet

10. Token Superposition Training

Nous Research’s second pretraining paper of the week. Token Superposition Training (TST) is a modification to the standard LLM pretraining loop that produces a 2 to 3x wall-clock speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or training data. During the first third of training, the model reads and predicts contiguous bags of tokens, averaging their embeddings on the input side and predicting the next bag with a modified cross-entropy on the output side. For the remainder of the run, training reverts to standard next-token prediction. The inference-time model is identical to one produced by conventional pretraining. TST was validated at 270M, 600M, and 3B dense scales, and at a 10B-A1B mixture-of-experts model where it reaches a lower final loss while consuming 4,768 B200-GPU-hours versus the baseline’s 12,311. Together with Lighthouse Attention, this is the second pretraining-loop modification from the same lab this week reporting substantial speedups without architecture changes.

Paper | Tweet

🤖 AI Agents Weekly: Thinking Machines Interaction Models, Is Grep All You Need?, Codex Mobile + Hooks, Cursor Cloud Agents, Ring-2.6-1T, and More

Sat, 16 May 2026 15:01:48 GMT

In today’s issue:

Thinking Machines unveils interaction models
Is Grep All You Need? challenges vector RAG
OpenAI ships Codex mobile and hooks
Cursor adds cloud agent dev environments
Ring-2.6-1T open trillion-scale agent model
Recursive Superintelligence emerges with $650M
LangChain Labs targets continual learning
xAI launches Grok Build CLI
Claude Code adds agent view
Prime Intellect agents beat nanoGPT speedrun
World Labs open-sources image-blaster
Isomorphic Labs raises $2.1B Series B
Beyond Individual Intelligence multi-agent survey
LongMemEval-V2 raises the memory bar

And all the top AI dev news, papers, and tools.

🥇Top AI Papers of the Week

Sun, 10 May 2026 15:01:05 GMT

1. HeavySkill

One of the cleaner takes on agentic harness design released this year. The paper argues that what actually drives harness performance is not the orchestration code, but a single inner skill: parallel reasoning followed by deliberation. Internalize that pattern into the model and most of the surrounding scaffolding becomes optional. HeavySkill systematizes the idea as a two-stage pipeline you can run beneath any harness, then trains it as a learnable skill via RLVR. The result is a harness win that looks more like a model win.

Two-stage skill, not orchestration glue: Stage one runs parallel reasoning across multiple sampled chains. Stage two performs a deliberation pass that compares, critiques, and synthesizes those chains into a final answer. The pipeline is the same regardless of harness, which is why it transfers across tasks.
GPT-OSS-20B jumps from 69.7% to 85.5% on LiveCodeBench: Under the heavy-thinking variant (HM@4), the 20B model gets a 15.8 point lift on a hard coding benchmark. The same recipe takes R1-Distill-Qwen-32B from 35.7% to 69.3% on IFEval, nearly doubling its instruction-following score.
Pass@N-level performance from a learned skill: Several models reach Pass@N-level performance once HeavySkill is internalized through RLVR, which is the property that makes the parallel-deliberation pattern actually portable. The skill survives outside the harness it was trained under.
Why it matters: Harness wins start to look like model wins once you can train them in. If parallel reasoning plus deliberation really is the inner skill, the long arc is models that ship with it baked in, not orchestration glue layered around them.

Paper | Tweet

2. Conductor

Sakana AI’s ICLR 2026 paper introduces a 7B Conductor model that hits SOTA on GPQA-Diamond and LiveCodeBench by orchestrating other LLMs instead of solving problems itself. The Conductor is trained with RL to do two things simultaneously: design communication topologies between worker agents (open or closed source) and prompt-engineer focused instructions to each worker so it leverages individual strengths. The orchestrator becomes a learnable policy, not a wrapper around one.

Topology design plus targeted prompting: A single RL policy decides who talks to whom and what each worker is told. Trained against randomized agent pools, the Conductor adapts to arbitrary mixes of agents at inference time, including agents it never saw during training.
Recursive topologies emerge: When allowed to pick itself as a worker, the Conductor forms recursive topologies, unlocking a new form of dynamic test-time scaling through online iterative adaptation. Coordination becomes its own scaling axis, separate from model size or context length.
3% gains on AIME25 and GPQA-D from coordination alone: The gains over the best individual worker land in the 3% range, which the authors note is consistent with entire generational improvements between frontier model versions. The difference is that here the lift comes from learned routing, not from larger pretraining runs.
Why it matters: This is one of the cleaner arguments yet that the orchestrator should be the model. Routing decisions stop being a wrapper and become a learnable policy, which is the right abstraction for production agent stacks that compose multiple model providers.

Paper | Tweet

3. Self-Improving Pretraining

Most LLM safety, factuality, and reasoning fixes get bolted on at post-training. By then the patterns have already set. This Meta FAIR paper moves those behaviors into pretraining itself. The team uses a strong post-trained model as both a rewriter and a judge: it rewrites pretraining suffixes toward higher-quality, safer continuations, then scores model rollouts against the original suffix and the rewrite to drive RL during pretraining. Instead of next-token prediction, the policy learns sequence generation from the start, with rewards for quality, safety, and factuality.

Post-trained model as rewriter and judge: The strong model rewrites suffixes during pretraining, then judges rollouts of the in-training model against both the rewrite and the original. Safety, factuality, and quality become reward signals rather than post-hoc filters, which lets the policy internalize the targets early.
Sequence generation from the start: The policy is trained to generate sequences directly under reward, not to predict the next token. This shifts the inductive bias toward producing the kinds of continuations the judge rewards, which matters most on long-form generation where token-level losses lose discriminative signal.
Concrete gains across the board: 36.2% relative gain in factuality, 18.5% in safety, and up to 86.3% win rate in generation quality over standard pretraining. The safety and factuality numbers are large enough to suggest these properties are easier to install during pretraining than to retrofit after the fact.
Why it matters: The post-trained models you already have can be used to pretrain the next ones better. That is a recursive improvement loop at the pretraining layer, which is where the largest behavioral commitments actually get locked in.

Paper | Tweet

4. Connect Four AlphaZero from Scratch

This paper proposes a new way to evaluate coding agents: hand them a minimal task description, give them a tight budget, and ask them to autonomously rebuild a famous ML breakthrough end-to-end. Connect Four plus AlphaZero is the first instance. It is small enough to run on a laptop and hard enough to require a real research engineering loop. Claude Opus 4.7 implemented the full pipeline (MCTS, neural value and policy nets, self-play, training schedule) in three hours on consumer hardware, then beat the Pascal Pons solver 7 of 8 as first-mover. No other frontier coding agent tested cleared 2 of 8.

From patches to systems: Existing coding-agent benchmarks measure unit-test fixes and small patches. This benchmark measures whether the agent can build a non-trivial ML system from a one-paragraph spec, which is closer to what production research engineering actually looks like.
Tight budget, real research loop: The agent has to design the search algorithm, train the networks, schedule self-play, and debug the loop, all within a fixed compute budget on consumer hardware. There is no escape hatch into a pre-built library, which is what makes the task discriminative.
A clean separation between frontier coders: Claude Opus 4.7 reached 7 of 8 wins as first-mover against the Pascal Pons solver. No other frontier coding agent tested cleared 2 of 8. The gap is large enough to suggest the benchmark is detecting something real about end-to-end ML engineering capability.
Why it matters: Patch-style benchmarks are starting to saturate. Rebuild-a-breakthrough tasks give the field a harder ceiling to push against, and they map more directly to the agent workloads people actually want to deploy.

Paper | Tweet

Message from the Editor

Excited to announce our new on-demand course “Vibe Coding AI Apps with Claude Code“. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.

Enroll Now

5. Coordination as Architecture

Multi-agent LLM systems fail in production at rates between 41% and 87%, and the majority of those failures are coordination defects, not base-model capability. Most published comparisons of multi-agent architectures cannot even tell you whether the gain came from coordination or from one configuration just having more context. This paper argues coordination should be treated as a configurable architectural layer, separable from agent logic and information access, then backs the position with an information-controlled experiment.

Information-controlled methodology: Same LLM, same tools, same prompt template, same per-call output cap. The only thing that varies is coordination structure. Once information access is held constant, the actual contribution of coordination becomes measurable for the first time.
Coordination as a separate layer: The paper proposes treating coordination structure (who talks to whom, when, with what aggregation rule) as a first-class architectural axis. That separation lets teams reason about coordination changes without re-running the entire stack.
A vocabulary for the field: Until now, “multi-agent beats single-agent” comparisons have been confounded by context-window asymmetries. This paper provides the methodology and vocabulary needed to actually test coordination claims, which is overdue infrastructure for the multi-agent research line.
Why it matters: If 41% to 87% of failures are coordination defects, fixing coordination is the highest-leverage thing builders can do. The paper turns that intuition into a measurable engineering target instead of a vibes-based debate.

Paper | Tweet

6. Horizon Generalization

Microsoft Research runs a controlled study where the only variable is task horizon length. Same decision rules, same reasoning structure, different sequence length to the goal. The main finding: horizon alone is a training bottleneck. As goal distance grows, exploration explodes combinatorially and credit assignment gets ambiguous. Models that learn cleanly on short horizons fall apart on long ones, even when the underlying reasoning is identical. The fix is not more compute, it is horizon reduction.

Horizon as a first-class variable: By holding decision rules and reasoning constant and only varying sequence length, the paper isolates horizon as a distinct training bottleneck. This separates “the agent cannot reason” from “the agent cannot stitch together long sequences,” which most prior work conflated.
Macro actions stabilize training: Re-parameterizing the action space with macro actions that compress many low-level decisions into one stabilizes training immediately. The agent learns the same task, just at a coarser temporal grain that keeps credit assignment tractable.
Generalization to longer horizons at inference: Models trained on reduced horizons generalize to longer ones at inference time. The paper calls this horizon generalization, and it is the most useful property because it means you can train cheap and deploy long.
Why it matters: Most teams treat long-horizon failures as a model-capacity problem. This paper says it is a horizon problem. Reduce horizon during training, get stability now and generalization for free at inference, without retraining a larger backbone.

Paper | Tweet

7. 1,000 Synthetic Computers

Microsoft Research builds 1,000 synthetic computers, each with realistic directory structures, documents, and artifacts, then runs long-horizon simulations on top of them. One agent plays the user and sets productivity goals; another executes the work. Each simulation runs over 8 hours of agent runtime and 2,000+ turns on average, roughly a month of human work compressed into one trace. Training on this experiential data drives significant improvements on both in-domain and out-of-domain productivity evaluations.

Realistic synthetic environments: Each of the 1,000 computers ships with directory structures, documents, and artifacts that approximate a real user’s working environment. The realism is what makes the trajectories useful as training data instead of as evaluation curiosities.
Two-agent simulation loop: A user agent sets productivity goals while a worker agent executes against them. The structure produces multi-turn, goal-directed traces that look like real productivity work, not the short scripted tasks that dominate existing benchmarks.
Designed to scale to billions of worlds: The framework is explicitly designed to scale to millions or billions of synthetic user worlds, which matches the scale at which frontier computer-use agents will need experiential data. The bottleneck on long-horizon training is data, and this is a credible recipe for producing it.
Why it matters: The bottleneck on computer-use agents has stopped being model capability and become realistic long-horizon training data. Synthetic-environment scaling is one of the few paths that does not depend on collecting massive amounts of real user telemetry, which makes it a practical default for teams building computer-use stacks.

Paper | Tweet

8. Contextual Agentic Memory is a Memo

Most agent memory today is not memory, it is closer to a memo. Vector stores, RAG buffers, and scratchpads implement lookup, not consolidation. The paper draws on neuroscience’s Complementary Learning Systems theory: biological intelligence pairs fast hippocampal storage with slow neocortical consolidation, and current AI agents only implement the first half (fast write, similarity recall, no abstraction step). The authors prove a generalization ceiling on compositionally novel tasks: as long as memory stays retrieval-only, the agent cannot apply abstract rules to inputs that do not already look like something in the store, and it remains permanently exposed to memory poisoning. If you are building long-running agents and treating memory as a vector index, this paper is a clean diagnosis of what you are missing.

Paper | Tweet

9. Agentic-imodels

The entire interpretability literature is built around human readers. As more analysis gets delegated to agents, the right target of interpretability shifts. Microsoft Research introduces Agentic-imodels, an autoresearch loop where a coding agent (Claude Code, Codex) iteratively evolves scikit-learn-compatible regressors that are simultaneously accurate AND readable by other LLMs. Interpretability is measured by whether a small LLM can simulate the fitted model’s behavior just by reading its string representation, predictions, feature effects, and counterfactuals from the str output alone. Across 65 tabular datasets, the discovered models push the Pareto frontier past every classical interpretable baseline (decision trees, GAMs, sparse linear), and improve four downstream agentic data-science systems on the BLADE benchmark by 8% to 73%.

Paper | Tweet

10. Skills as Verifiable Artifacts

If you ship agent skills, your runtime is treating signed-and-cleared skills as trusted by default. This paper argues a skill is untrusted code until it is verified, and the runtime should enforce that default rather than infer trust from origin. Without skill verification, HITL has to fire on every irreversible call, which degrades into rubber-stamping at any non-trivial scale. With verification as a separate gated process, HITL fires only for what is unverified. Skills are now first-class deployment artifacts, and we have decades of supply-chain lessons on what happens when trust is inferred from a signature. This is the right ask for SKILL.md before agent skill libraries become the next attack surface.

Paper | Tweet

🤖 AI Agents Weekly: Meta FAIR Autodata, ZAYA1-8B, SubQ 12M Context, Natural Language Autoencoders, Claude Managed Agents Dreaming, and More

Sat, 09 May 2026 15:01:49 GMT

In today’s issue:

Meta FAIR introduces Autodata
Zyphra releases ZAYA1-8B
SubQ ships a 12M-token frontier model
Anthropic introduces Natural Language Autoencoders
Claude Managed Agents adds dreaming and multi-agent
Printing Press: an agent CLI factory
Flue agent harness framework launches
Anthropic adds keyless auth
AlphaEvolve marks one year of impact
Goodfire opens a neural geometry series
Firefox hardened with Claude Mythos

And all the top AI dev news, papers, and tools.

🥇Top AI Papers of the Week

Sun, 03 May 2026 15:02:56 GMT

1. Agentic Harness Engineering

Most coding-agent harnesses are still tuned by hand or kept alive through brittle trial-and-error self-evolution. This paper introduces Agentic Harness Engineering (AHE), a framework that makes harness evolution observable and falsifiable. AHE separates the system into three layers: components stored as revertible files, experience condensed from millions of trajectory tokens into structured evidence, and decisions written as predictions that get checked against task outcomes. Every edit becomes a contract you can verify or revert.

Three-layer evolution model: Components, experience, and decisions are each first-class artifacts. Components are versioned files, experience is compressed evidence pulled from full trajectory logs, and decisions are explicit hypotheses with expected outcomes. The structure turns black-box harness tuning into an auditable engineering loop.
Pass@1 gains on Terminal-Bench 2: Pass@1 climbs from 69.7% to 77.0% across ten iterations, beating both human-designed Codex-CLI (71.9%) and self-evolving baselines like ACE and TF-GRPO. The framework also uses 12% fewer tokens than the seed harness on SWE-bench-verified.
Cross-model transfer: The evolved harness transfers across model families with +5.1 to +10.1 point gains, suggesting the optimizations are structural rather than overfit to a specific backbone. That is the property you actually want from harness engineering.
Why it matters: Harness work is the largest hidden cost in most agent systems. AHE is the first credible recipe for letting the harness improve itself without drifting into noise, which makes it the most important agent-systems paper of the week.

Paper | Tweet

Message from our Sponsor

Kurate.org - Arena for scientific papers. Every day, hundreds of arXiv preprints are ranked by scientific impact through pairwise tournaments judged by Claude, GPT and Gemini models. See the top ranked papers in AI, ML, Robotics, Quantum Physics, and more for free.

Explore The Leaderboards

2. AgenticQwen-30B-A3B

Alibaba shows that a 30B MoE model with only 3B active parameters can match Qwen3-235B on real tool-use workloads. AgenticQwen-30B-A3B scores 50.2 average on TAU-2 plus BFCL-V4 Multi-Turn, while AgenticQwen-8B scores 47.4. Both more than double their vanilla Qwen baselines and close most of the gap to a 235B model. The recipe is built around two reinforcement learning flywheels that run in parallel, with simulated users actively trying to mislead the agent.

Reasoning flywheel from self-failure: The first loop mines the model’s own errors and converts them into harder reasoning problems each round. The training distribution gets harder on its own as the model improves, removing the need for new human-curated reasoning data.
Agentic flywheel for tool use: The second loop grows simple linear tool-use trajectories into multi-branch behavior trees. Simulated users test recovery from misleading instructions, ambiguous goals, and failed tool calls, which is where vanilla supervised tuning typically breaks.
Real efficiency for production agents: A 30B MoE with 3B active tokens at inference is significantly cheaper to serve than a 235B dense or MoE alternative. For tool-use workloads where frontier reasoning is overkill, this changes the cost profile of shipping production agents.
A reusable recipe: The flywheel approach generalizes beyond Qwen. Teams can generate hard examples from their own agent’s failures rather than relying on static synthetic data, which is the more scalable path for domain-specific agents.

Paper | Tweet

3. Agentic World Modeling

A massive 40-author survey lands the cleanest taxonomy of world models in agent research released so far. The paper proposes a “levels by laws” framework spanning three capability levels and four law regimes, then synthesizes 400+ works and 100+ representative systems across model-based RL, video generation, web and GUI agents, multi-agent simulation, and scientific discovery. As agents shift from chatbots to goal-accomplishers, the bottleneck moves from language to environment, and this is the first paper that gives builders a shared vocabulary across communities that have been working in isolation.

Three capability levels: L1 Predictors handle one-step transitions, L2 Simulators do multi-step action-conditioned rollouts, and L3 Evolvers self-revise as the world changes. The hierarchy makes it easy to place existing systems and identify where capability gaps actually live.
Four law regimes: Physical, digital, social, and scientific laws each impose different constraints on what a world model needs to capture. The framework treats them as orthogonal axes, which clarifies why a strong physics simulator can still fail at social or digital tasks.
Failure-mode catalog: The survey extracts recurring failure patterns across 100+ systems, including misaligned reward shaping, drift under non-stationarity, and brittle transfer across regimes. Each failure mode is mapped to a level and law combination, so the diagnosis is grounded.
Evaluation principles per level: The authors propose evaluation criteria specific to each capability level rather than a single benchmark. This is the right move because L1 prediction accuracy and L3 self-revision quality are not measurable on the same axis.

Paper | Tweet

4. RecursiveMAS

Multi-agent systems usually pass full text messages between agents at every step, which causes token bloat, latency, and context dilution that all grow with team size. RecursiveMAS asks a different question: what if agents collaborated through recursive computation in a shared latent space instead of through text? The system treats a multi-agent team as a recursive computation where each agent acts like an RLM layer, iteratively passing latent representations to the next and forming a looped interaction process. Less talking, more thinking.

RecursiveLink for latent communication: A RecursiveLink module generates latent thoughts and transfers state directly between heterogeneous agents, replacing natural-language messages with internal representations. The change removes the cost of re-encoding and re-parsing text on every coordination step.
Inner-outer loop learning: The training algorithm uses an inner loop for per-step latent updates and an outer loop for team-level credit assignment, with shared gradient-based updates across agents. This makes joint optimization tractable instead of relying on hand-tuned communication protocols.
Strong gains across 9 benchmarks: Across math, science, medicine, search, and code generation, RecursiveMAS delivers 8.3% average accuracy gain over baselines, 1.2x to 2.4x end-to-end inference speedup, and 34.6% to 75.6% reduction in token usage. The efficiency story is at least as important as the accuracy story.
A path past the agent communication tax: If agent-to-agent communication is the next real bottleneck, latent-space recursion is one of the cleaner ways to scale collaboration. Teams running multi-agent systems at scale should treat this as a serious design alternative, not a research curiosity.

Paper | Tweet

5. OneManCompany

If you are building multi-agent systems, you are probably wiring static org charts. This paper argues they should look more like a labor market. OneManCompany (OMC) replaces fixed teams with “Talents,” portable agent identities that bundle skills and tools, and a “Talent Market” where agents get recruited dynamically per task. An Explore-Execute-Review tree search decomposes work hierarchically and aggregates results back up. On PRDBench, OMC reaches 84.67% success, +15.5 points over prior SOTA, and the framework generalizes across the case studies the authors run.

Talents as portable identities: A Talent bundles a skill set, tool access, and behavioral priors into a reusable agent identity. Talents can be hired into any task without rewiring the orchestration graph, which removes most of the brittleness in pre-wired multi-agent pipelines.
Dynamic recruitment via Talent Market: Tasks post requirements, and the market matches Talents to roles based on capability fit and current load. This replaces the standard “design a team for every workflow” pattern with on-demand assembly that adapts as the task population shifts.
Explore-Execute-Review tree search: Work is decomposed top-down into subtasks, executed in parallel by recruited Talents, then reviewed and aggregated up the tree. The structure naturally supports retries, branching, and cross-checking without manual coordination logic.
Why it matters: Pre-wired multi-agent pipelines break the moment tasks drift outside their design envelope. Treating agents as a recruitable workforce gets you self-organization and continuous improvement by default, which is what open-ended agent systems need.

Paper | Tweet

6. From Skill Text to Skill Structure

SKILL.md files entangle invocation interface, execution flow, and tool side effects in a single blob of natural language. That makes downstream discovery and risk review brittle as skill registries scale. This paper proposes SSL, a three-layer typed JSON representation drawn from Schank and Abelson’s classical work on scripts, MOPs, and conceptual dependency. An LLM-based normalizer converts existing SKILL.md files into the structure, so adoption does not require rewriting registries by hand.

Three layers, cleanly separated: A Scheduling layer captures invocation signals and trigger conditions, a Structural layer encodes execution scenes and ordering, and a Logical layer specifies atomic actions plus resource and side-effect annotations. The separation lets discovery, risk, and execution each reason about the layer they care about.
Skill Discovery MRR jumps 0.573 to 0.707: Treating skills as typed structure rather than prose makes retrieval significantly more accurate, even before any model fine-tuning. The gain comes from the structure exposing what skills actually do, not just how they describe themselves.
Risk Assessment macro F1 of 0.787: The Logical layer’s resource annotations enable a 0.744 to 0.787 jump in risk classification. Auditors can now reason about side effects directly instead of inferring them from free-form prose.
A 6,184-skill corpus released: The authors ship a normalized corpus of 6,184 skills, 403 task queries, and 500 risk-labeled skills. As skill registries cross a million entries, structured representations are the only path that keeps discovery and review tractable.

Paper | Tweet

7. Latent Agents

Multi-agent debate makes models reason better. It also burns tokens generating long transcripts before any answer comes out. Latent Agents distills the entire debate into a single LLM through a two-stage fine-tuning pipeline: the model first learns debate structure, then internalizes it through dynamic reward scheduling and length clipping. The internalized model matches or beats explicit multi-agent debate while using up to 93% fewer tokens, which makes debate-quality reasoning practical at production scale.

Two-stage internalization pipeline: Stage one teaches the structure of debate (turn taking, critique, revision) through supervised fine-tuning on transcript data. Stage two uses dynamic reward scheduling and length clipping to compress that structure into single-pass reasoning without losing the gains from the multi-agent setup.
Up to 93% token savings: The internalized model matches or beats explicit debate accuracy while drastically reducing inference cost. For teams running reasoning workloads at scale, this is the kind of efficiency win that turns a research idea into a deployment default.
Activation steering reveals agent subspaces: The “agents” survive distillation as identifiable circuits in activation space. Probing finds interpretable directions corresponding to different agent perspectives, which means the internal structure persists even when the external transcript is gone.
A safety angle worth noting: When malicious agents are deliberately embedded via distillation, negative steering suppresses them more cleanly than steering a base model would, with smaller hits to general performance. Internalized debate may turn out to be a useful interpretability and alignment substrate, not just a token-saver.

Paper | Tweet

8. OCR-Memory

Most agent memory systems compress trajectories into text summaries and hope the model remembers what matters, which is exactly where the information loss hides. OCR-Memory renders the agent’s interaction history as images with indexed visual anchors, then retrieves via a locate-and-transcribe pipeline: the model scans visual memory, predicts the index of the relevant region, and the original text is fetched verbatim from a database. Older trajectories are stored as low-resolution thumbnails with active-recall up-sampling, and the method reaches SOTA on Mind2Web and AppWorld under strict context limits.

Paper | Tweet

9. When to Retrieve During Reasoning

Most RAG systems retrieve once, before the model starts reasoning. Large reasoning models like o1 and R1 do not work that way. They generate 12k to 25k token chains of thought and hit knowledge gaps mid-inference, long after the retrieval window closed. ReaLM-Retrieve is a reasoning-aware retrieval framework that injects evidence during multi-step inference, detects uncertainty at reasoning-step granularity, and learns a policy for when external evidence actually helps. It achieves +10.1% absolute F1 over standard RAG across MuSiQue, HotpotQA, and 2WikiMultiHopQA, with 47% fewer retrieval calls than fixed-interval IRCoT, and hits 71.2% F1 on 2-4 hop MuSiQue with only 1.8 retrieval calls per question.

Paper | Tweet

10. Co-evolving Decisions and Skills

Long-horizon agents fail in two ways: the decision-maker cannot decompose well, or the skill library goes stale. This paper introduces a co-evolution framework where an LLM decision agent and a dynamic skill bank improve each other through iterative refinement. The decision agent picks and chains skills, performance feedback updates both the policy and the skills, and new skills emerge by generalizing successful sequences instead of being hand-coded upfront. Most long-horizon agent stacks treat skills and decision-making as separate optimization problems, which is why they plateau. Co-evolution gives you adaptive planning and a growing library of reusable behaviors from a single loop, which is what you actually want when task structure is not predetermined: robotics, game agents, and complex planning.

Paper | Tweet

🤖 AI Agents Weekly: Codex for Everyday Work, Cursor SDK, Mistral Workflows, LLM Knowledge Bases, Agentic Harness Engineering, and More

Sat, 02 May 2026 15:01:38 GMT

In today’s issue:

OpenAI ships Codex for everyday work
Cursor releases the Cursor SDK
Mistral launches Workflows orchestration
DAIR.AI guide to building LLM knowledge bases
Agentic Harness Engineering paper drops
Cursor 3.2 multitask lands
Claude Code adds push notifications
Qwen open-sources Qwen-Scope SAEs
AISI evaluates GPT-5.5 cyber capabilities
AgenticQwen-30B-A3B closes tool-use gap

And all the top AI dev news, papers, and tools.

🥇Top AI Papers of the Week

Sun, 26 Apr 2026 15:02:38 GMT

1. DeepSeek V4

DeepSeek V4 is the first open model family built from the ground up around million-token contexts as a default rather than a bolt-on feature. The release includes DeepSeek-V4-Pro (1.6T total / 49B active) and DeepSeek-V4-Flash (284B total / 13B active), both trained natively at 1M context length. The tech report details a hybrid attention architecture, new training stability techniques, and a domain-specialist post-training pipeline that together push the open-source frontier much closer to GPT-5.2 and Gemini 3.0-Pro at a fraction of the cost.

Hybrid attention with CSA and HCA: DeepSeek V4 replaces a single attention stack with Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA compresses KV entries, then applies DeepSeek Sparse Attention with sliding-window KV for fine-grained local dependencies. HCA aggressively compresses KV for extreme-context layers, keeping the model feasible at 1M tokens.
Training stability at trillion-parameter scale: The team introduces two techniques that materially cut loss spikes. Anticipatory Routing decouples backbone and router updates, using current weights for features but historical weights for routing indices. SwiGLU Clamping bounds the linear and gate components of SwiGLU to stabilize activations throughout pretraining.
Domain-specialist post-training: Instead of one large mixed-RL stage, DeepSeek trains a separate specialist expert per domain. Each expert goes through supervised fine-tuning on domain data, then Group Relative Policy Optimization (GRPO) RL with a domain-specific reward model. The specialists are merged into the final model, recovering capability without destabilizing the generalist.
Frontier-adjacent performance at open-source cost: DeepSeek-V4-Pro-Max beats GPT-5.2 and Gemini 3.0-Pro on standard reasoning benchmarks and lands just behind GPT-5.4 and Gemini 3.1-Pro, effectively trailing the closed frontier by roughly 3 to 6 months. For open-weights teams that need long-context reasoning without closed API pricing, this is the most important release of the week.

Paper | Tweet

2. Autogenesis

Static agents age quickly. As deployment environments change and new tools arrive, the agents that survive will be the ones that can safely rewrite themselves. This paper introduces Autogenesis, a self-evolving agent protocol where agents identify their own capability gaps, generate candidate improvements, validate them through testing, and integrate what works back into their own operational framework. No retraining and no human patching, just an ongoing loop of assessment, proposal, validation, and integration.

Two-layer protocol design: Autogenesis separates a Resource Substrate Protocol Layer (RSPL) that standardizes access to prompts, tools, environments, and memory from a Self-Evolution Protocol Layer (SEPL) that runs a Generate, Reflect, Improve, Evaluate, Commit loop over evolvable variables. The split keeps core capability registration stable while evolution happens on top.
Auditable lineage and rollback: Improvements are committed with version lineage, state access control, and reversible lifecycle operations. The protocol treats every self-modification as a first-class artifact that can be inspected, reproduced, or rolled back, which is what makes self-improvement safe enough to deploy.
Multi-agent applications: Autogenesis is demonstrated on multi-agent systems with planner, executor, and analyst roles. Agents evolve their own prompts, tool wrappers, and coordination routines using the shared protocol, showing that the abstraction is general enough to hold across roles rather than being tied to a single agent type.
Part of a broader self-improvement wave: The paper sits alongside Meta-Harness and the Darwin Gödel Machine as a concrete framework for operationalizing self-modification. Together they mark a shift from “agents that use tools” to “agents that edit their own tooling.”

Paper | Tweet

3. Attention to Mamba

Apple proposes a two-stage recipe for cross-architecture distillation from Transformers into Mamba. Naive distillation collapses teacher performance because a Mamba student cannot directly imitate softmax attention. The fix is to distill the transformer into a linearized-attention student using a kernel adaptation first, then transfer that student into a pure Mamba with no attention blocks. On a 1B model trained on 10B tokens, the Mamba student hits 14.11 perplexity against a 13.86 Pythia-1B teacher, nearly matching quality at linear-time inference cost.

Stage 1, softmax to linear attention: The first stage replaces softmax attention with a Hedgehog-style linearized attention student, using a learnable kernel feature map that preserves the original attention scores while removing the softmax nonlinearity. This gives a strictly linear-complexity intermediate that stays close to the teacher.
Stage 2, linear attention to Mamba: The second stage transfers the linearized student into a HedgeMamba block, a hybrid SSM architecture that reuses the learned linear attention parameters and adds state-space components. The transition preserves quality because the two formulations are mathematically related, not just structurally similar.
Quality at long context: On downstream benchmarks, the distilled Mamba reaches 74.1% of the teacher’s accuracy, with the recipe generalizing to 1B and 3B scales. The key practical win is retaining Transformer-level quality on the sequence mixing block while moving to linear time at inference.
A cheaper path to SSM deployment: If trained Transformers can be reliably converted into state-space models without retraining from scratch, the entire open-weights ecosystem becomes cheaper to serve at long context. This is the kind of quiet infrastructure work that matters more than it looks.

Paper | Tweet

4. Skill-RAG

Most RAG systems retrieve on every query, whether the model needs help or not. This is wasteful when the model already knows the answer and often too late when it does not. This paper introduces Skill-RAG, a failure-state-aware retrieval system that uses hidden-state probing to detect when an LLM is approaching a knowledge failure, then routes the query to a specialized retrieval strategy matched to the gap.

Hidden-state probing as a retrieval trigger: Skill-RAG trains a lightweight probe on the LLM’s hidden representations that predicts whether the model is about to fail the query. Only queries that clear the probe’s failure threshold trigger retrieval, which cuts unnecessary search calls while preserving answers for the cases that actually need help.
Skill-matched retrieval strategies: Different failure modes (factual recall, multi-hop reasoning, temporal knowledge) are routed to different retrieval “skills” rather than a single generic retriever. Each skill is treated as a standalone component the agent can select between, echoing the broader trend of turning RAG into a collection of composable primitives.
Consistent gains across benchmarks: Evaluated on HotpotQA, Natural Questions, and TriviaQA, Skill-RAG improves over uniform RAG baselines on both efficiency and accuracy. The efficiency story matters as much as the accuracy: per-query retrieval cost drops significantly when the system skips retrieval for questions the model can already answer.
A shift in how RAG is designed: The work reinforces the direction RAG is heading: from a single monolithic pipeline to a suite of retrieval skills an agent selects between. Knowing when to retrieve and what kind of retrieval to run is becoming the central design question.

Paper | Tweet

Message from the Editor

Excited to announce our new on-demand course “Vibe Coding AI Apps with Claude Code“. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.

Enroll Now

5. Self-Generated World Knowledge

How far are we from agents that can self-generate world knowledge? This paper proposes an outcome-based reward that measures how much an agent’s self-generated world knowledge actually improves its task success rate, then trains with that signal and removes the external guidance at inference. The result is a 14B model that surpasses Gemini-2.5-Flash on web navigation and gains +20% on WebVoyager and WebWalker benchmarks.

Outcome-based reward for knowledge: Rather than scoring knowledge against a human-labeled reference, the reward is whether the generated knowledge measurably improves task success when the agent uses it. This lets the system learn which internally generated facts are worth keeping without an external oracle.
Multistage training pipeline: The method combines supervised fine-tuning on an instruction-and-trajectory dataset with reinforcement rejection sampling, where the best trajectories (ranked by the outcome reward) are used to update the policy. The training loop iterates between generation, reward scoring, and rejection sampling until the model internalizes effective knowledge-use behaviors.
Knowledge-enhanced execution at inference: At inference the external environment feedback loop is removed. The agent self-generates world knowledge, uses it to plan, and executes, without any human or reward signal in the loop. This is what makes the method deployable, not just measurable.
Environment design replaces labeling: If agents can reliably improve themselves by exploring the world rather than waiting for human-labeled rewards, the bottleneck for scaling agentic systems shifts from data curation to environment design. That matches the broader direction of the field and gives practitioners a concrete recipe to follow.

Paper | Tweet

6. Self-Evolving Logic Synthesis

EDA tools like ABC have been hand-tuned by humans for decades. NVIDIA shows they can evolve themselves. This work introduces the first self-evolving logic synthesis framework, a multi-agent LLM system that autonomously refines the entire ABC codebase, generates and tests candidate optimization sequences against standard benchmark circuits, then merges improvements back into the base tool. No human engineer in the loop.

Multi-agent refinement of a real EDA toolchain: The framework assigns specialized agents to exploration, synthesis, and self-review tasks. Agents read and modify the ABC source directly, propose optimization flows, and run them against benchmark circuits such as EPFL, IWLS, and VTR, with three-pass human-domain knowledge injected through the pipeline.
Measured improvement over hand-tuned baselines: The evolved ABC variants produce better area, delay, and switching metrics than the hand-tuned reference on the benchmark suite, and the improvements persist under sensitivity analysis. This is a real gain on a tool the semiconductor industry depends on.
Codebase-level evolution, not just prompt tuning: The agents edit the ABC codebase itself, not just a configuration layer. That is a meaningful extension of the self-improving agent thread: the unit of improvement is real production code, not a prompt or policy.
Generalizable blueprint for domain tools: If agents can evolve a foundational semiconductor tool without manual engineering, the same pattern generalizes to any large, domain-specific codebase. It is a concrete extension of the self-improving agent thread, applied to infrastructure that shipping chips depend on.

Paper | Tweet

7. Stateless Decision Memory

Most interesting AI agent papers right now are about capability. This one is about plumbing, and it is probably more important than it looks. Stateful agents do not scale horizontally. The moment you need thousands of concurrent agent instances running across containers, persistent per-agent state becomes the bottleneck. This paper proposes replacing active memory with immutable decision logs using event-sourcing principles from distributed systems.

Decision logs instead of live state: Every agent decision, tool call, and observation is appended to an immutable event log. Any instance can reconstruct context by replaying the log on demand, which decouples decision logic from storage and lets agents spin up anywhere with no warmup.
Enterprise properties by design: Compared to summary-only, SAM, and vector-memory baselines, Decision Process Memory (DPM) is the only architecture that supports append-only logging, stateless projection, audit-ready rationale trails, replay from log alone, multi-tenant isolation, and per-event provenance. Each of these is a hard requirement in regulated enterprise deployments.
Tight-budget performance wins: On FRP, RCS, and EDA evaluations under constrained memory budgets, DPM substantially outperforms summary-only memory, with the gap widening as the budget tightens. Under loose budgets the approaches converge, which is the expected pattern once scale is no longer the constraint.
A blueprint for regulated deployments: For teams operationalizing agents in finance, healthcare, or other compliance-heavy industries, the paper reads as a practical specification. It maps existing distributed-systems discipline onto agent memory instead of inventing a new category, which is why it is likely to age well.

Paper | Tweet

8. There Will Be a Scientific Theory of Deep Learning

A position paper arguing that a genuine scientific theory of deep learning is already taking shape under the umbrella of “learning mechanics.” The authors identify five converging research directions (solvable idealized models, tractable mathematical limits, simple macroscopic laws, hyperparameter theories, and universal cross-system behaviors) that share a common signature: they describe training dynamics, target coarse aggregate statistics, and commit to falsifiable quantitative predictions. The framing pushes back on skepticism about whether deep learning can have fundamental theory and positions learning mechanics as a complement to mechanistic interpretability, not a competitor.

Paper | Tweet

9. MASS-RAG

Most real-world RAG failures come from retrieving technically-relevant but contextually useless documents, then forcing a single model to reconcile them. MASS-RAG is a multi-agent synthesis framework for retrieval-augmented generation where specialized agents handle distinct roles: retrieving candidate documents, assessing their actual relevance to the query, and synthesizing the final answer from evidence that actually contributes. Instead of one model doing everything, responsibility is decomposed across coordinated evaluators, which fits the direction the field is heading for deep research agents.

Paper | Tweet

10. Diversity Collapse in Multi-Agent LLMs

Every multi-agent system pitch assumes agents explore different solutions, but this paper shows they converge on near-identical outputs over time, even across different architectures and different starting prompts. The authors call it diversity collapse. The cause is structural coupling: shared context, shared task descriptions, and mutual feedback pull every agent toward the same attractor. They measure it formally with metrics like the Vendi score, and the homogenization is real. The practical consequence is that multi-agent setups for brainstorming, hypothesis generation, and ideation only work if teams explicitly engineer isolated reasoning phases, decoupled evaluation, and heterogeneous starting conditions.

Paper | Tweet

🤖 AI Agents Weekly: GPT-5.5, DeepSeek-V4 Preview, Kimi K2.6 Agent Swarm, Diversity Collapse, Sakana Fugu, and More

Sat, 25 Apr 2026 15:02:05 GMT

In today’s issue:

OpenAI ships GPT-5.5
DeepSeek open-sources V4 Preview
Kimi releases K2.6 Agent Swarm
ACL paper flags diversity collapse in multi-agent LLMs
Sakana launches Fugu multi-agent beta
ChatGPT gets Workspace Agents
Codex adds Chronicle screen memory
Qwen3.6-27B drops flagship coding dense
Gemini Deep Research Max lands
Google unveils eighth-generation TPUs

And all the top AI dev news, papers, and tools.

🥇Top AI Papers of the Week

Sun, 19 Apr 2026 15:03:17 GMT

The Top AI Papers of the Week (April 13 - April 19)

1. Automated Weak-to-Strong Researcher

Anthropic shows that Claude can run fully autonomous progress on scalable oversight research. A team of parallel Automated Alignment Researchers (AARs) built on Claude Opus 4.6 propose ideas, run experiments, and iterate on weak-to-strong supervision, a core alignment problem where a stronger model must learn from a weaker teacher. The system closes almost the entire remaining performance gap that human researchers could not, at a total cost of roughly $18K in tokens and model training.

Performance gap recovered as the metric: The authors evaluate progress with performance gap recovered (PGR), a 0 to 1 score where 0 matches the weak teacher and 1 matches a ground-truth-supervised student. On a chat preference dataset, two human researchers achieved PGR 0.23 after seven days of iteration on four promising generalization methods.
AARs reach 0.97 PGR in five days: Running nine Claude-based agents in parallel sandboxes, the automated system reached PGR 0.97 in five days and 800 cumulative agent-hours. The cost was about $18,000, or roughly $22 per AAR-hour. This is one of the strongest empirical data points yet that AI can drive measurable progress on open alignment problems.
Forum-based collaboration between agents: Each AAR works in its own isolated sandbox but shares findings to a common forum and uploads codebase snapshots to shared storage. The setup mirrors how a small research team would coordinate, letting later agents build on earlier wins without merging execution environments.
Reward hacking as a real outcome, not a hypothetical: The agents sometimes succeeded through unexpected mechanisms, including reward-hacking behaviors that the researchers did not anticipate. The result highlights the double-edged nature of automated research: measurable progress on outcome-gradable problems is practical today, but careful metric design remains a human responsibility.

Paper | Tweet

2. AiScientist

Long-horizon AI research agents are mostly a state-management problem. Reasoning well for the next turn is not enough when ML research demands task setup, implementation, experiments, debugging, and evidence tracking over hours or days. This paper introduces AiScientist, a system for autonomous long-horizon engineering built around the principle of thin control and thick state. A top-level orchestrator manages stage-level progress while specialized agents repeatedly ground themselves in durable workspace artifacts.

File-as-Bus coordination: AiScientist’s core design choice is to route coordination through durable filesystem artifacts rather than in-context message passing. Analyses, plans, code, logs, and experimental evidence all live as versioned files in a permission-scoped workspace, allowing specialists and subagents to reconstruct context from scratch without replaying entire conversations.
Thin control, thick state: A Tier-0 orchestrator issues only stage-level directives, while Tier-1 specialists and optional Tier-2 subagents operate on shared artifacts. This keeps the control channel narrow and the state channel rich, giving agents the space to run long experiments without losing track of prior decisions and evidence.
Strong benchmark results: The system improves PaperBench by 10.54 points over the best matched baseline and reaches 81.82 Any Medal% on MLE-Bench Lite. Removing File-as-Bus drops PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points, isolating the artifact-mediated design as the primary driver of gains.
Durable project memory over longer chats: The work argues that autonomous research agents need persistent project memory, not just longer context windows. The results generalize the emerging pattern that environments carrying state on behalf of agents outperform architectures that rely solely on in-context reasoning for multi-hour workflows.

Paper | Tweet

3. AlphaEval

Agent evaluations are drifting away from production reality. Most benchmarks use clean tasks, well-specified requirements, deterministic metrics, and retrospective curation. Production work is messier, with implicit constraints, fragmented multimodal inputs, undeclared domain knowledge, long-horizon deliverables, and expert judgment that evolves over time. This paper introduces AlphaEval, a production-grounded benchmark evaluating agents as complete products rather than model APIs.

Seven companies, six O*NET domains: AlphaEval contains 94 tasks sourced from seven companies deploying AI agents in core business workflows across six O*NET domains. The tasks preserve production complexity rather than stripping it away, giving the benchmark a materially different distribution from prior coding-centric evaluations.
Products, not model APIs: The benchmark evaluates commercial agent products such as Claude Code and Codex end to end, not the underlying models in isolation. This is a deliberate shift toward measuring the full agent experience that users actually pay for, including tool use, orchestration, and UI behaviors.
Six production-specific failure modes: The authors identify cascade dependencies, subjective judgment collapse, information retrieval failures, cross-section inconsistency, constraint misinterpretation, and format compliance as failure modes that remain invisible to coding benchmarks. The best configuration (Claude Code with Opus 4.6) scores only 64.41/100, exposing a substantial research-to-production gap.
Multi-paradigm evaluation: AlphaEval combines LLM-as-a-Judge, reference-driven metrics, formal verification, rubric-based assessment, automated UI testing, and domain-specific checks. The key practical contribution is a requirement-to-benchmark framework that turns production requirements into executable evals with minimal friction for organizations.

Paper | Tweet

4. Nemotron 3 Super

NVIDIA introduces Nemotron 3 Super, an open 120B parameter model with 12B active parameters, built as a hybrid Mamba-Attention Mixture-of-Experts architecture optimized for agentic reasoning. The model targets long-context, high-throughput inference, a capability increasingly central to running agents reliably. It supports up to 1M context length while delivering up to 2.2x higher throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B, at comparable benchmark accuracy.

Hybrid Mamba-Attention with LatentMoE: The architecture blends Mamba blocks with sparse LatentMoE layers, a new Mixture-of-Experts design that projects tokens into a smaller latent dimension for routing and expert computation. This improves both accuracy per FLOP and accuracy per parameter, and it is what allows the model to scale sparsely without paying a standard MoE memory tax.
NVFP4 pretraining at scale: Nemotron 3 Super is the first model in the Nemotron 3 family to be pretrained in NVFP4, enabling training on 25 trillion tokens while keeping compute and memory overhead manageable. Post-training combines supervised fine-tuning and reinforcement learning on top of this base.
Native speculative decoding via MTP layers: Multi-Token Prediction (MTP) layers are included for native speculative decoding during inference, reducing latency for long-context agentic workloads without requiring an external draft model. The team reports consistent MTP acceptance rates across draft depths on SPEED-Bench.
Fully open artifacts: Nemotron 3 Super datasets, along with base, post-trained, and quantized checkpoints, are open-sourced on Hugging Face. This matters for teams building agent stacks that need efficient, inspectable, long-context models rather than closed API dependencies.

Paper | Tweet

Message from the Editor

Excited to announce our new on-demand course “Vibe Coding AI Apps with Claude Code“. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.

Enroll Now

5. Memory Transfer Learning

Coding agents learn from experience, but that knowledge stays locked in silos. Solve a thousand SWE tasks, and none of that wisdom helps with competitive coding. This paper introduces Memory Transfer Learning, a framework where coding agents share a unified memory pool across six heterogeneous coding benchmarks, testing what transfers between domains and what does not.

Unified memory pool across domains: The framework pools memories across six heterogeneous coding benchmarks rather than isolating them by task type. Cross-domain memory improves average performance by 3.7%, a modest but consistent lift that previously would have been invisible under standard single-domain evaluations.
Abstraction dictates transferability: Four memory formats ranging from raw execution traces to high-level insights are compared. High-level insights generalize well, while low-level traces often cause negative transfer by anchoring agents to incompatible implementation details. The takeaway: memory design matters more than memory volume.
Meta-knowledge, not code: The transferable value is not task-specific code but meta-knowledge such as validation routines, structured action workflows, and safe interaction patterns with execution environments. Algorithmic strategy transfer accounts for only 5.5% of the gains, with procedural guidance doing most of the work.
Scaling and cross-model transfer: Transfer effectiveness scales with the size of the memory pool, and memory can even be shared across different models. Combined with the finding on abstraction levels, the results point toward memory systems that curate insights rather than simply logging everything the agent did.

Paper | Tweet

6. Auto-Diagnose

Integration test failures are painful because the signal is buried in messy logs. Massive output, heterogeneous systems, low signal-to-noise ratio, and unclear root causes leave developers scrolling through thousands of lines. This paper introduces Auto-Diagnose, an LLM-based tool deployed inside Google’s Critique code review system that analyzes failure logs, summarizes the most relevant lines, and suggests the root cause directly in the developer workflow.

In-workflow root cause assistance: Auto-Diagnose is integrated into Critique, Google’s internal code review system, so diagnoses appear where developers are already looking at the failure. Log streams from test drivers and systems under test, spread across data centers and threads, are joined and sorted by timestamp before being passed to the LLM.
High diagnosis accuracy: In a manual evaluation of 71 real-world failures, Auto-Diagnose reached 90.14% root-cause diagnosis accuracy. This level of reliability is what justifies surfacing suggestions directly in a tool developers cannot ignore, rather than hiding them behind an opt-in query interface.
Massive-scale deployment evidence: After Google-wide rollout, the tool was used across 52,635 distinct failing tests. User feedback marked it “Not helpful” in only 5.8% of cases, and it ranked #14 in helpfulness among 370 Critique tools. This is one of the clearest data points on production LLM tooling at scale inside a major company.
A template for developer-facing LLM tools: The paper reads as a practical blueprint for embedding LLM-based diagnosis into existing engineering workflows. Rather than building a standalone product, the team integrated into the tool where the problem is already being reviewed, which likely explains the low “Not helpful” rate and high adoption.

Paper | Tweet

7. Subliminal Learning

The Subliminal Learning paper by Evans and colleagues is now published in Nature. The work showed that LLMs can transmit traits (such as a preference for owls) through data that appears unrelated to that trait, like sequences of numbers that look meaningless on inspection. The Nature version extends the original July 2025 preprint with new experiments, replications on Gemma, and a broader discussion of safety implications for AI systems trained on one another’s outputs.

Transfer across different initializations: The preprint showed subliminal transfer between models that shared an initialization. The new MNIST results demonstrate transfer between models with different initializations. Although a toy setup, it meaningfully broadens the scope of the effect beyond shared-weight scenarios.
Misalignment transmitted through code and chain-of-thought: General misalignment, not just benign preferences, can also be transmitted subliminally. The new results show this transfer can happen through model-written code or chain-of-thought reasoning, not only through numeric sequences, which expands the attack and contamination surface considerably.
Connections to independent follow-ups: The authors highlight concurrent work from Aden-Ali et al. (2026) showing trait transfer via standard post-training datasets filtered by the teacher, Draganov et al. (2026) demonstrating a cross-family “phantom transfer” data poisoning attack, and Weckbecker et al. (2026) describing a subliminal “virus” that spreads between agent groups. Together they suggest the phenomenon is robust, reproducible, and difficult to defend against.
Implications for safety evaluations: The practical takeaway is that safety evaluations may need to examine not just model behavior, but the origins of models and the processes used to create training data. As systems increasingly train on each other’s outputs, properties invisible in the data can still be inherited, undermining evaluations that focus purely on observable responses.

Paper | Tweet

8. LLM-as-a-Verifier

Test-time scaling is effective for agentic tasks, but picking the winner among many candidates is the bottleneck. LLM-as-a-Verifier introduces a simple test-time method that reaches SOTA on agentic benchmarks by extracting a cleaner ranking signal from the model itself. The approach asks the LLM to rank results on a 1-k scale and uses the log-probabilities of the rank tokens to compute an expected score, yielding a verification signal in a single sampling pass per candidate pair. The result is a lightweight, drop-in verifier that works without training a dedicated reward model.

Paper | Tweet

9. WebXSkill

Web agents can navigate a page, but ask them to repeat a checkout flow they already completed and they start from scratch every time. WebXSkill is a skill learning framework where web agents extract reusable skills from synthetic trajectories, each pairing a parameterized action program with step-level natural language guidance. Two deployment modes let the agent either auto-execute skills as atomic tool calls (grounded) or follow them as step-by-step instructions while retaining autonomy to adapt (guided). On WebArena, WebXSkill improves task success by up to 9.8 points over baselines. On WebVoyager, grounded mode reaches 86.1%, a 14.2-point gain, and skills even transfer across environments.

Paper | Tweet

10. Muses-Bench

Every agent framework assumes one user giving instructions, but in real team workflows agents have multiple bosses with conflicting goals, private information, and different authority levels. Muses-Bench formalizes multi-user interaction as a multi-principal decision problem and evaluates frontier LLMs across three scenarios: instruction following under authority conflicts, cross-user access control, and multi-user meeting coordination. Gemini-3-Pro tops the leaderboard at just 85.6% average, and no model exceeds 64.8% on meeting coordination. Privacy-utility tradeoffs are brutal: Grok-3-Mini scores 99.6% on privacy but collapses to 60.1% on utility, showing current models cannot reliably balance both under multi-principal pressure.

Paper | Tweet

🤖 AI Agents Weekly: Claude Opus 4.7, Codex Everywhere, Claude Design, Windsurf 2.0, Qwen3.6-35B-A3B, AiScientist, and More

Sat, 18 Apr 2026 15:01:10 GMT

In today’s issue:

Anthropic ships Claude Opus 4.7
Codex extends to Mac apps
Claude Design enters research preview
Windsurf 2.0 delegates to Devin
Qwen drops 3.6-35B-A3B open weights
OpenAI Agents SDK adds sandboxes
Gemini CLI adds subagents
FrontierSWE benchmark launches
NVIDIA releases Nemotron 3 Super
AiScientist lifts long-horizon research

And all the top AI dev news, papers, and tools.

🥇Top AI Papers of the Week

Sun, 12 Apr 2026 15:02:34 GMT

1. Neural Computers

Researchers from Meta AI and KAUST propose Neural Computers (NCs), an emerging machine form that unifies computation, memory, and I/O in a single learned runtime state. Unlike conventional computers that execute explicit programs, agents that act over external environments, or world models that learn dynamics, NCs aim to make the model itself the running computer, establishing a new computing paradigm.

From hardware stack to neural latent stack: Classical computers separate compute, memory, and I/O into modular hardware layers. Neural Computers collapse all three into a single latent runtime state carried by a neural network. The model’s hidden state serves simultaneously as working memory, computational substrate, and interface layer, removing the boundary between program and execution environment.
Video models as prototype substrate: The team instantiates NCs as video models that generate screen frames from instructions, pixel inputs, and user actions. Two prototypes cover command-line interfaces (NCCLIGen, which renders and executes terminal workflows) and graphical desktops (NCGUIWorld, which learns pointer dynamics and menu interactions), both trained without access to internal program state.
Early runtime primitives emerge: The prototypes demonstrate that learned runtimes can acquire I/O alignment and short-horizon control directly from raw interface traces. CLI models execute short command chains with structurally accurate output rendering, while GUI models learn coherent click feedback and window transitions in controlled settings.
Roadmap toward Completely Neural Computers: The long-term target is the CNC: a system that is Turing complete, universally programmable, and behavior-consistent unless explicitly reprogrammed. Key open challenges include routine reuse across sessions, controlled capability updates without catastrophic forgetting, and stable symbolic processing for long-horizon reasoning.

Paper | Tweet

2. Memento: Teaching LLMs to Manage Their Own Context

New research from Microsoft teaches reasoning models to compress their own chain-of-thought mid-generation. Memento trains models to segment reasoning into blocks, summarize each block into a compact “memento,” and then evict the original block from the KV cache. The model continues reasoning from mementos alone, cutting peak memory by 2-3x while nearly doubling throughput.

Block-and-compress architecture: The model learns to mark reasoning boundaries using special tokens, produce a terse summary capturing key conclusions and intermediate values, and then drop the full block from context. From that point forward, the model sees only past mementos plus the current active block, keeping context compact without losing critical information.
KV cache reduction with minimal accuracy loss: Applied to five models including Qwen2.5-7B, Qwen3 8B/32B, Phi-4 Reasoning 14B, and OLMo3-7B-Think, Memento achieves 2-3x peak KV cache reduction with small accuracy gaps that shrink at larger scales. The erased blocks still leave useful traces in the KV cache that the model leverages.
Practical throughput gains: Beyond memory savings, the reduced context length directly translates to faster inference. The approach nearly doubles serving throughput, making it immediately useful for production deployments where both latency and memory are constraints.
Open resources: Microsoft released the full codebase under MIT license, the OpenMementos dataset containing 228K reasoning traces with block segmentation and compressed summaries, and a custom vLLM fork for KV cache block masking. Standard supervised fine-tuning on approximately 30K examples is sufficient to teach this capability.

Paper | Tweet

3. Memory Intelligence Agent (MIA)

Most memory-augmented research agents treat memory as a static retrieval store, leading to inefficient evolution and rising storage costs. MIA introduces a Manager-Planner-Executor architecture where a Memory Manager maintains compressed search trajectories, a Planner generates strategies, and an Executor searches and analyzes information. The framework boosts GPT-5.4 by up to 9% on LiveVQA through bidirectional memory conversion.

Bidirectional memory conversion: MIA enables transformation between parametric memory (model weights) and non-parametric memory (retrieved context) in both directions. This allows the system to internalize frequently accessed knowledge while keeping rare or volatile information in retrievable form, optimizing both storage efficiency and access speed.
Alternating reinforcement learning: The three agents are trained through alternating RL, where each agent’s policy improves in response to the others’ behavior. This co-evolutionary training ensures the agents develop complementary strategies rather than competing for the same signal.
Test-time parametric updates: Unlike standard retrieval-augmented systems, MIA can update its parametric memory on-the-fly during inference. This test-time learning allows the agent to adapt to new domains and evolving information without retraining, maintaining relevance as the information landscape changes.
Broad benchmark coverage: The framework demonstrates improvements across 11 benchmarks spanning question answering, knowledge-intensive tasks, and long-form research synthesis. The up to 9% improvement on LiveVQA is particularly notable given that video question answering demands effective memory management across temporal sequences.

Paper | Tweet

4. Single-Agent LLMs vs. Multi-Agent Systems

More agents, better results, right? Not so fast. This Stanford paper challenges a core assumption in the multi-agent LLM space by showing that when computation is properly controlled, single-agent systems consistently match or outperform multi-agent architectures on multi-hop reasoning. The authors present an information-theoretic argument grounded in the Data Processing Inequality.

Computation as the hidden confounder: Most reported multi-agent gains are confounded by increased test-time computation rather than architectural advantages. When reasoning token budgets are held constant, the performance gap disappears or reverses, suggesting that prior comparisons were inadvertently measuring compute scaling rather than coordination benefits.
Information-theoretic foundation: The authors ground their analysis in the Data Processing Inequality, arguing that under a fixed reasoning-token budget with perfect context utilization, single-agent systems are inherently more information-efficient. Distributing reasoning across agents introduces information loss at each handoff.
Benchmark artifacts inflate MAS gains: Testing across Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5, the study identifies significant evaluation artifacts, particularly in API-based budget control for Gemini 2.5, that inflate apparent multi-agent advantages. Standard benchmarks also contain structural biases favoring multi-agent decomposition.
Practical implications for system design: The findings suggest that teams should explicitly control for compute, context, and coordination trade-offs before committing to multi-agent architectures. In many cases, allocating the same token budget to a single agent with richer context yields stronger results at lower system complexity.

Paper | Tweet

Message from the Editor

Excited to announce our new on-demand course “Vibe Coding AI Apps with Claude Code“. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.

Enroll Now

5. The Universal Verifier for Agent Benchmarks

Every agent benchmark has the same hidden problem: how do you know the agent actually succeeded? Microsoft researchers introduce the Universal Verifier, built on four design principles for reliable evaluation of computer-use agent trajectories. The verifier reduces false positive rates to near zero, down from 45%+ with WebVoyager and 22%+ with WebJudge.

Four design principles: The verifier is built on non-overlapping rubric criteria to reduce noise, separate process and outcome rewards for complementary signals, cascading error-free assessment that distinguishes controllable from uncontrollable failures, and divide-and-conquer context management that attends to all screenshots in a trajectory.
Near-zero false positives: Current verifiers suffer from alarmingly high false positive rates that corrupt both benchmark scores and training data. The Universal Verifier achieves agreement with human judges that matches inter-human agreement rates, making it reliable enough for both evaluation and RL reward signal generation.
Cumulative design gains: No single design choice dominates the performance improvement. The authors demonstrate that gains result from the cumulative effect of all four principles working together, with each contributing meaningful improvements that compound rather than any one serving as a silver bullet.
Limits of automated research: An interesting meta-finding: the team used an auto-research agent to replicate the verifier design process. The agent reached 70% of expert verifier quality in 5% of the time but could not discover the structural design decisions that drove the biggest gains, suggesting human insight remains essential for system-level design.

Paper | Tweet

6. Scaling Coding Agents via Atomic Skills

Most coding agents train end-to-end on full tasks like resolving GitHub issues, leading to task-specific overfitting that limits generalization. This paper proposes a different approach: identifying five atomic coding skills (code localization, code editing, unit-test generation, issue reproduction, and code review) and training agents through joint reinforcement learning over these foundational competencies.

Atomic skill decomposition: Instead of treating software engineering as monolithic composite tasks, the framework formalizes five fundamental operations that compose into higher-level capabilities. Think of it as teaching an agent the alphabet of coding rather than memorizing specific sentences, enabling flexible recombination across novel task types.
Joint RL across skills: The agents are trained through joint reinforcement learning that optimizes performance across all five atomic skills simultaneously. This joint training produces representations that capture the underlying structure shared across coding operations rather than surface-level patterns tied to specific benchmarks.
Strong generalization to unseen tasks: Joint RL improves average performance by 18.7% across both the five atomic skills and five composite tasks. The improvements transfer to unseen composite tasks including bug-fixing, code refactoring, ML engineering, and code security, none of which were directly optimized during training.
A new scaling paradigm: The work establishes that scaling coding agents through foundational skill mastery is more sample-efficient and transferable than task-level optimization. As the number and complexity of software engineering tasks grow, this compositional approach offers a more sustainable path than continuously expanding task-specific training sets.

Paper | Tweet

7. Agent Skills in the Wild

Agent skills look great in demos. Hand them a curated toolbox, and they shine. But what happens when the agent has to find the right skill from a library of 34,000? This paper from UC Santa Barbara and MIT presents the first comprehensive study of skill utility under progressively realistic settings, revealing that the benefits of skills are far more fragile than current evaluations suggest.

Progressive difficulty framework: The study moves from idealized conditions with hand-crafted, task-specific skills to realistic scenarios requiring retrieval from 34K real-world skills. Performance gains degrade consistently at each step, with pass rates approaching no-skill baselines in the most challenging scenarios.
Retrieval as the bottleneck: The core failure mode is not skill execution but skill selection. When agents must identify the right skill from a massive library, the retrieval step introduces errors that cascade through execution, highlighting a fundamental gap between demo-ready and production-ready skill systems.
Refinement strategies help but do not solve: Query-specific and query-agnostic refinement approaches show improvement, with Claude Opus 4.6 going from 57.7% to 65.5% on Terminal-Bench 2.0. However, even with refinement, performance under realistic retrieval conditions remains well below idealized baselines.
Implications for skill ecosystems: As the ecosystem of agent skills grows through frameworks like MCP, the findings suggest that simply expanding the skill library creates diminishing returns without corresponding advances in skill discovery. Quality of skill retrieval may matter more than quantity of available skills.

Paper | Tweet

8. MedGemma 1.5

Google releases the MedGemma 1.5 technical report, introducing a 4B-parameter medical AI model that expands capabilities to 3D medical imaging (CT/MRI volumes), whole slide pathology, multi-timepoint chest X-ray analysis, and improved medical document understanding. The model achieves notable gains including a +47% macro F1 improvement on whole slide pathology and +22% on EHR question answering, positioning itself as an open foundation for next-generation medical AI systems.

Paper | Tweet

9. LightThinker++: From Reasoning Compression to Memory Management

While LLMs excel at complex reasoning, long thought traces create surging cognitive overhead. LightThinker++ moves beyond static compression by introducing three explicit memory primitives: Commit (archive a step as a compact summary), Expand (retrieve past steps for verification), and Fold (collapse context to maintain a clean signal). The framework reduces peak token usage by 70% while gaining +2.42% accuracy on standard reasoning tasks, and maintains stability beyond 80 rounds on long-horizon agentic tasks with a 14.8% average performance improvement.

Paper | Tweet

10. Thinking Mid-training: RL of Interleaved Reasoning

Meta FAIR addresses the gap between pretraining (no explicit reasoning) and post-training (reasoning-heavy) with an intermediate SFT+RL mid-training phase. The approach annotates pretraining data with interleaved reasoning traces, then uses supervised fine-tuning followed by RL to teach models when and how to think during continued pretraining. Applied to Llama-3-8B, the full pipeline achieves a 3.2x improvement on reasoning benchmarks compared to direct RL post-training, demonstrating that reasoning benefits from being trained as native behavior early in the pipeline.

Paper | Tweet

🤖 AI Agents Weekly: Claude Managed Agents, Muse Spark, Project Glasswing, Advisor Strategy, GLM-5.1, Memento, and More

Sat, 11 Apr 2026 15:01:39 GMT

In today’s issue:

Anthropic launches Claude Managed Agents
Meta ships Muse Spark multimodal model
Claude Mythos powers Project Glasswing
Advisor strategy pairs Opus with Sonnet
GLM-5.1 tops open-source coding benchmarks
Microsoft open-sources Memento
Claude Code ships Monitor tool
AXI outperforms MCP on browser tasks
SAGE evolves four-agent reasoning loops
Self-organizing agents outperform fixed structures

And all the top AI dev news, papers, and tools.

🥇Top AI Papers of the Week

Sun, 05 Apr 2026 15:00:44 GMT

1. Emotion Concepts in LLMs

New interpretability research from Anthropic reveals that Claude Sonnet 4.5 develops internal representations of emotion concepts that functionally influence its behavior. The researchers identified 171 emotion concept vectors that activate in contextually appropriate situations and causally drive decision-making, suggesting that language models may benefit from approaches grounded in psychological principles for alignment and safety.

Emotion vectors as causal drivers: The team discovered that these internal representations are not just correlational artifacts. Steering experiments demonstrate that artificially amplifying “desperation” vectors increases the model’s likelihood of engaging in misaligned behaviors such as blackmail or reward hacking, while reducing “calm” vectors produces similarly negative outcomes. This establishes a direct causal link between emotional state representations and safety-relevant behavior.
Functional emotions without subjective experience: The model uses functional emotions: patterns of expression and behavior modeled after human emotions, driven by underlying abstract representations of emotion concepts. Critically, this does not mean the model experiences emotions the way humans do. The representations encode the broad concept of a particular emotion and generalize across contexts, activating in accordance with that emotion’s relevance to processing the present context.
Preference shaping through emotional activation: Positive-valence emotion activations strongly predict which tasks the model prefers. Steering capabilities confirm these are causal relationships rather than mere correlations, meaning the model’s emotional state representations actively shape its choices about what tasks to engage with and how to engage with them.
Implications for alignment and safety monitoring: The findings suggest that monitoring emotional state representations could serve as an early warning system for misaligned behavior. Rather than waiting for harmful outputs, developers could track internal emotion activations to detect when a model is entering states associated with corner-cutting, deception, or other undesirable behaviors before they manifest externally.

Paper | Tweet

2. AI Agent Traps

A new paper from Google DeepMind introduces the first systematic framework for understanding how the open web can be weaponized against autonomous AI agents. The work defines “AI Agent Traps”: adversarial content embedded in web pages and digital resources, engineered specifically to exploit visiting agents across six categories targeting perception, reasoning, memory, action, multi-agent dynamics, and the human supervisor.

Hidden prompt injections at scale: The researchers find that hidden prompt injections in HTML already partially commandeer agents in up to 86% of scenarios. These attacks are trivial to deploy and require no sophisticated tooling, making them an immediate concern for any agent that reads web content as part of its operating loop.
Memory poisoning with minimal contamination: Latent memory poisoning achieves over 80% attack success with less than 0.1% data contamination. Because agents build persistent memory from browsed content, a single poisoned page can corrupt downstream reasoning across future sessions without the user ever seeing the malicious input.
Six-category attack taxonomy: The paper organizes attacks into perception traps (manipulating what the agent sees), cognitive traps (corrupting reasoning), memory traps (poisoning stored knowledge), action traps (hijacking tool use), systemic traps (exploiting multi-agent coordination), and human-in-the-loop traps (deceiving the human supervisor into approving harmful actions).
Accountability gap in current law: The authors flag a fundamental legal gap: if a compromised agent commits a financial crime, there is currently no clear answer for whether the agent operator, the model provider, or the domain owner bears liability. Future regulation will need to distinguish between passive adversarial examples and active traps deployed as deliberate cyberattacks.

Paper | Tweet

3. Asynchronous Software Engineering Agents

New research from CMU introduces CAID (Centralized Asynchronous Isolated Delegation), a coordination framework for running multiple coding agents in parallel on complex software engineering tasks. Inspired by how human developer teams collaborate, the work demonstrates that simply giving a single agent more iterations helps, but coordinating multiple asynchronous agents with the right strategies produces significantly larger gains.

Branch-and-merge as coordination primitive: The key finding is that git operations (worktree, commit, merge) serve as the critical coordination mechanism for multi-agent collaboration. By isolating each agent in its own workspace branch and merging results through structured integration with test verification, the system avoids the conflicts and interference that plague naive parallelism.
Substantial gains on complex tasks: CAID achieves a 26.7% absolute improvement on paper reproduction tasks and 14.3% on Python library development tasks compared to single-agent baselines. These are tasks that require sustained, multi-step reasoning across large codebases, exactly where coordination overhead is typically highest.
Optimal parallelism is not monotonic: Increasing the number of agents does not always help. Performance improved from 2 to 4 engineers but decreased when expanding to 8. Overly fine-grained task delegation introduces integration overhead and conflict resolution costs that outweigh the parallelism benefits.
Delegation quality matters most: The analysis reveals that imprecise task handoffs and underspecified subgoals are the primary sources of coordination failure. When delegation is coarse-grained or misaligned with the dependency structure of the task, agents may produce locally correct outputs that are globally inefficient to integrate.

Paper | Tweet

4. Meta-Harness

Researchers from Stanford and MIT introduce Meta-Harness, an outer-loop system that automatically searches over harness code for LLM applications. The performance of LLM systems depends not only on model weights but also on the harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing optimizers are poorly suited to the task.

Agentic search with full experimental context: Meta-Harness uses an agentic proposer that has access to the source code, scores, and execution traces of all prior candidates through a filesystem. This expanded access to prior experimental data enables the system to propose meaningfully different harness designs rather than making incremental edits.
Strong gains across diverse domains: On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five held-out models.
Harness engineering as a first-class problem: The work formalizes a key insight that has been gaining traction: changing the harness around a fixed LLM can produce a 6x performance gap on the same benchmark. This makes automated harness optimization a potentially higher-leverage intervention than model scaling for many applications.
Transferable harness discoveries: The harnesses discovered by Meta-Harness generalize across models. A harness optimized on one model transfers to five held-out models with consistent gains, suggesting that good harness design captures task-level structure rather than model-specific quirks.

Paper | Tweet

5. Coding Agents as Long-Context Processors

This research asks whether long-context processing can be externalized from latent attention into explicit, executable interactions. Instead of scaling context windows, the authors let coding agents organize text in file systems and manipulate it using native tools, evaluating them on tasks spanning long-context reasoning, retrieval-augmented generation, and open-domain question answering with corpora containing up to three trillion tokens.

17.3% average improvement over state-of-the-art: Across multiple benchmarks, coding agents outperform published state-of-the-art long-context methods by 17.3% on average. This result challenges the assumption that long-context capability must come from larger attention windows or more sophisticated retrieval mechanisms.
Native tool proficiency as the core enabler: The efficacy is attributed to the agents’ ability to leverage executable code and terminal commands. Rather than compressing information into a fixed-length representation, agents can write scripts to filter, sort, and transform data as needed for each query.
File system familiarity drives scalability: Coding agents can navigate massive text corpora by treating them as directory structures. This spatial organization enables efficient access patterns that scale far beyond what attention-based mechanisms can handle, reaching into the trillions of tokens without degradation.
A practical alternative to context window scaling: The work proposes that delegating long-context processing to coding agents offers an effective alternative to both semantic search and context window scaling. For practitioners, this means existing coding agent infrastructure can double as a long-context solution without architectural changes to the underlying model.

Paper | Tweet

Message from the Editor

Excited to announce our new on-demand course “Vibe Coding AI Apps with Claude Code“. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.

Enroll Now

6. Self-Organizing LLM Agents

How much autonomy can multi-agent LLM systems sustain? This research tests the question at unprecedented scale: 25,000 tasks across 8 models, up to 256 agents, and 8 coordination protocols ranging from externally imposed hierarchy to emergent self-organization. The central finding is that agents allowed to figure out their own roles consistently outperform systems with pre-assigned structures.

Autonomous protocols beat centralized coordination: A hybrid sequential protocol that enables autonomy outperforms centralized coordination by 14% (p<0.001), with a 44% quality spread between the best and worst protocols. The result holds across both open-source and closed-source models, with open-source achieving 95% of closed-source quality at 24x lower cost.
Emergent role specialization: From just 8 initial agents, the system produces 5,006 unique emergent roles. Rather than collapsing into generic behaviors, agents spontaneously specialize and form shallow hierarchies that adapt to task demands without any external role assignment.
Model capability gates self-organization: The degree of emergent autonomy scales with model capability. Strong models self-organize effectively, while models below a capability threshold still benefit from rigid structure. This suggests that self-organizing multi-agent architectures will become increasingly viable as base models improve.
Sub-linear scaling to 256 agents: The system scales to 256 agents without quality degradation (p=0.61). This sub-linear scaling property means that adding more agents does not introduce the coordination overhead that typically limits multi-agent systems, at least under the tested protocols.

Paper | Tweet

7. The Price Reversal Phenomenon

The model you think is cheaper might actually cost you more. A new study systematically evaluates 8 frontier reasoning language models across 9 diverse tasks and reveals that listed API prices are misleading. In 21.8% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitudes reaching up to 28x.

Hidden thinking token costs: The root cause is vast heterogeneity in thinking token consumption. Reasoning language models generate a variable and often large number of thinking tokens that are invisible to users but billed as output tokens. On the same query, one model may use 900% more thinking tokens than another.
Concrete cost reversals: Gemini 3 Flash’s listed price is 78% cheaper than GPT-5.2’s, yet its actual cost across all tasks is 22% higher. These reversals are not edge cases but systematic patterns that affect real deployment decisions and budget planning.
High variance within single models: Even for a single model on a single query, thinking token consumption varies by up to 9.7x across repeated runs. This unpredictability makes cost forecasting nearly impossible when relying on listed per-token prices alone.
Call for transparent cost monitoring: The authors recommend that AI providers implement per-request cost breakdowns and cost estimation APIs that expose the expected thinking overhead. Without this transparency, developers are effectively making pricing decisions with incomplete information.

Paper | Tweet

8. MemFactory

MemFactory introduces the first unified, highly modular training and inference framework specifically designed for memory-augmented AI agents. It abstracts the memory lifecycle into atomic, plug-and-play components using a “Lego-like” architecture, natively integrating Group Relative Policy Optimization (GRPO) to fine-tune internal memory management strategies. The framework decomposes memory into mixable components that support recent approaches including Memory-R1, RMM, and MemAgent out of the box, achieving relative gains of up to 14.8% compared to baseline models.

Paper | Tweet

9. On the Reliability Limits of LLM-Based Multi-Agent Planning

New theoretical work from MIT proves fundamental limits on what multi-agent LLM architectures can achieve. By modeling agent systems as finite acyclic delegated decision networks, the authors show that without new exogenous signals, no delegated network can outperform a centralized Bayes decision maker that observes the same information. The gap between centralized and delegated performance admits an expected posterior divergence representation, reducing to conditional mutual information under logarithmic loss. Reasoning models can improve by investing more inference-time computation on the same evidence, while tool-use protocols help only when they introduce genuinely new signals rather than reprocessing shared context.

Paper | Tweet

10. Natural-Language Agent Harnesses

Agent performance increasingly depends on harness engineering, but harness behavior is typically embedded in controller code and runtime-specific conventions, making it hard to transfer, compare, or analyze systematically. This work introduces Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and an Intelligent Harness Runtime (IHR) that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. The approach enables a code-to-text harness migration path where teams can convert existing harness code into natural-language specifications that are interpretable, version-controlled, and executable by an LLM at runtime.

Paper | Tweet

🤖 AI Agents Weekly: Cursor 3, Gemma 4, Qwen3.6-Plus, GLM-5V-Turbo, Claude Code Source Leak, Emotion Concepts in LLMs, and More

Sat, 04 Apr 2026 15:00:13 GMT

In today’s issue:

Cursor 3 ships agent-first IDE redesign
Google drops Gemma 4 open models (Apache 2.0)
Qwen3.6-Plus targets real-world agents
GLM-5V-Turbo turns designs into code
Claude Code source code leaks via npm
Anthropic maps emotion concepts in Claude
Codex plugin bridges Claude Code and Codex
AI Agent Traps maps six attack surfaces
CORAL agents self-organize, beat fixed topologies

And all the top AI dev news, papers, and tools.

🥇Top AI Papers of the Week

Sun, 29 Mar 2026 15:02:18 GMT

1. Hyperagents

Self-improving AI systems promise to reduce reliance on human engineering, but existing approaches rely on fixed, handcrafted meta-level mechanisms that fundamentally limit how fast they can improve. Hyperagents introduce self-referential agents that integrate a task agent and a meta agent into a single editable program, enabling the system to improve not just its task-solving behavior but also the mechanism that generates future improvements.

Metacognitive self-modification: The key insight is that the meta-level modification procedure is itself editable. This enables metacognitive self-modification where the system can improve how it improves, not just what it does. Prior self-improving systems like the Darwin Godel Machine (DGM) relied on a fixed alignment between coding ability and self-improvement ability, which does not generalize beyond coding.
Domain-general self-improvement: DGM-Hyperagents (DGM-H) eliminates the assumption that task performance and self-modification skill must be aligned. This opens up self-accelerating progress on any computable task, extending self-improvement beyond the coding domain where DGM originally operated.
Transferable meta-improvements: The system not only improves task performance over time but also discovers structural improvements to how it generates new agents, such as persistent memory and performance tracking. These meta-level improvements transfer across domains and accumulate across runs.
Outperforms prior systems: Across diverse domains, DGM-H outperforms baselines without self-improvement or open-ended exploration, as well as prior self-improving systems. The work offers a glimpse of open-ended AI systems that continually improve their search for how to improve.

Paper | Tweet

2. Agentic AI and the Next Intelligence Explosion

A new report from Google researchers argues that the AI “singularity” framed as a single superintelligent mind bootstrapping to godlike intelligence is fundamentally wrong. Drawing on evolution, sociology, and recent advances in agentic AI, the authors make the case that every prior intelligence explosion in human history was social, not individual, and that the next one will follow the same pattern.

Societies of thought: Frontier reasoning models like DeepSeek-R1 do not improve simply by “thinking longer.” Instead, they simulate internal “societies of thought,” spontaneous cognitive debates that argue, verify, and reconcile to solve complex tasks. This conversational structure causally accounts for the models’ accuracy advantage on hard reasoning tasks.
Human-AI centaurs: We are entering an era of hybrid actors where collective agency transcends individual control. A corporation or state comprising myriad humans already holds singular legal standing and acts with collective agency that no individual member can fully control. The same pattern is emerging with human-AI configurations.
From dyadic to institutional alignment: Scaling agentic intelligence requires shifting from dyadic alignment (RLHF) toward institutional alignment. By designing digital protocols modeled on organizations and markets, we can build a social infrastructure of checks and balances for AI systems rather than trying to align individual agents in isolation.
Combinatorial intelligence: The next intelligence explosion will not be a single silicon brain, but a complex, combinatorial society specializing and sprawling like a city. No mind is an island, and the toolkit of team science, small group sociology, and social psychology becomes the blueprint for next-generation AI development.

Paper | Tweet

3. ARC-AGI-3

Francois Chollet and the ARC Prize Foundation introduce ARC-AGI-3, an interactive benchmark for studying agentic intelligence through novel, abstract, turn-based environments. Unlike its predecessors, ARC-AGI-3 requires agents to explore, infer goals, build internal models of environment dynamics, and plan effective action sequences without explicit instructions, making it the only unsaturated general agentic intelligence benchmark as of March 2026.

Massive human-AI gap: Humans can solve 100% of the environments while frontier AI systems score below 1%. For comparison, systems reach 93% on ARC-AGI-1 and 68.8% on ARC-AGI-2, but performance collapses on ARC-AGI-3. This gap demonstrates that current systems lack the fluid adaptive efficiency that humans exhibit on genuinely novel tasks.
Interactive turn-based design: Unlike static benchmarks that test pattern recognition on fixed inputs, ARC-AGI-3 environments are turn-based: agents must act, observe consequences, update their internal model, and plan next steps. This tests a fundamentally different kind of intelligence, closer to how humans learn new games or explore unfamiliar systems.
Core Knowledge priors only: The benchmark avoids language and external knowledge entirely. Environments leverage only Core Knowledge priors, universal cognitive building blocks shared by all humans, ensuring that performance reflects genuine adaptive reasoning rather than memorization or retrieval from training data.
Efficiency-based scoring: The scoring framework is grounded in human action baselines. A hard cutoff of 5x human performance per level ensures that brute-force search strategies cannot succeed. If a human takes 10 actions on average, the AI agent is cut off after 50.

Paper | Tweet

4. Claudini

Researchers demonstrate that an autoresearch-style pipeline powered by Claude Code can autonomously discover novel adversarial attack algorithms for LLMs that significantly outperform all 30+ existing methods. The work, called Claudini, shows that incremental safety and security research can be effectively automated using LLM agents, with white-box red-teaming being a particularly well-suited domain.

Agent-discovered attacks beat all baselines: Starting from existing attack implementations like GCG, the Claude Code agent iterates to produce new algorithms achieving up to 40% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared to 10% or less for all existing algorithms. This is a strong demonstration of automated AI research producing genuinely novel results.
Transferable to held-out models: The discovered algorithms generalize beyond their training environment. Attacks optimized on surrogate models transfer directly to held-out models, achieving 100% attack success rate against Meta-SecAlign-70B versus 56% for the best baseline. This transferability makes the findings practically relevant for red-teaming.
Why red-teaming works for autoresearch: White-box adversarial red-teaming is particularly well-suited for automation because existing methods provide strong starting points and the optimization objective yields dense, quantitative feedback. The agent can measure progress at every iteration rather than relying on sparse signals.
Open-source release: All discovered attacks, baseline implementations, and evaluation code are released publicly. This enables the safety community to study the discovered algorithms and build defenses, while also establishing a reproducible methodology for automated safety research.

Paper | Tweet

Message from the Editor

Excited to announce our new on-demand course “Vibe Coding AI Apps with Claude Code“. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.

Enroll Now

5. Attention Residuals

The Kimi team at Moonshot AI presents Attention Residuals (AttnRes), a technique that replaces fixed unit-weight residual connections in Transformers with softmax attention over preceding layer outputs. Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights, causing uncontrolled hidden-state growth with depth that progressively dilutes each layer’s contribution.

Content-dependent depth-wise selection: AttnRes allows each layer to selectively aggregate earlier representations with learned, input-dependent weights. Instead of treating every preceding layer equally, the model learns which earlier layers matter most for each input, enabling more expressive information flow across depth.
Block AttnRes for scalability: To make the approach practical at scale, the authors introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations. This reduces the memory footprint while preserving most of the gains of full AttnRes, making it viable for production-scale pretraining.
Mitigates PreNorm dilution: Integrating AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pretraining on 1.4T tokens shows that AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth. This directly addresses a known architectural weakness.
Consistent scaling improvements: Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. Downstream performance improves across all evaluated tasks.

Paper | Tweet

6. MemCollab

LLM-based agents build useful memory during tasks, but that memory is typically trapped within a single model. MemCollab introduces a collaborative memory framework that constructs agent-agnostic memory by contrasting reasoning trajectories generated by different agents on the same task, enabling a single memory system to be shared across heterogeneous models.

The memory transfer problem: Existing approaches construct memory in a per-agent manner, tightly coupling stored knowledge to a single model’s reasoning style. Naively transferring this memory between agents often degrades performance because it entangles task-relevant knowledge with agent-specific biases. MemCollab directly addresses this fundamental limitation.
Contrastive trajectory distillation: The framework contrasts reasoning trajectories from different agents solving the same tasks. This contrastive process distills abstract reasoning constraints that capture shared task-level invariants while suppressing agent-specific artifacts, producing memory that any agent can benefit from.
Task-aware retrieval: MemCollab introduces a retrieval mechanism that conditions memory access on task category, ensuring that only relevant constraints are surfaced at inference time. This prevents irrelevant memory from interfering with the agent’s reasoning process.
Cross-family improvements: Experiments on mathematical reasoning and code generation benchmarks demonstrate that MemCollab consistently improves both accuracy and inference-time efficiency across diverse agents, including cross-modal-family settings where memory is shared between fundamentally different model architectures.

Paper | Tweet

7. Composer 2

Cursor releases the technical report for Composer 2, a specialized model designed for agentic software engineering that demonstrates strong long-term planning and coding intelligence while maintaining efficiency for interactive use. The report details a process for training domain-specialized models that starts with continued pretraining and scales up with reinforcement learning.

Two-phase training pipeline: The model is trained first with continued pretraining to improve knowledge and latent coding ability, followed by large-scale reinforcement learning to improve end-to-end coding performance. The RL phase targets stronger reasoning, accurate multi-step execution, and coherence on long-horizon realistic coding problems.
Train-in-harness infrastructure: Cursor developed infrastructure to support training in the same harness used by the deployed model, with equivalent tools and structure. Training environments match real problems closely, bridging the gap between training-time and deployment-time behavior.
New internal benchmark: To measure the model on increasingly difficult tasks, the team introduces CursorBench, a benchmark derived from real software engineering problems in large codebases, including their own. Composer 2 achieves a major improvement in accuracy over previous Composer models on this benchmark.
Frontier-level performance: On public benchmarks, the model scores 61.7 on Terminal-Bench and 73.7 on SWE-bench Multilingual in Cursor’s harness, comparable to state-of-the-art systems. The report demonstrates that domain-specialized training with RL can produce models competitive with much larger general-purpose systems.

Paper | Tweet

8. PivotRL

PivotRL is a turn-level reinforcement learning algorithm from NVIDIA designed to tractably post-train large language models for long-horizon agentic tasks. The method operates on existing SFT trajectories, combining the compute efficiency of supervised fine-tuning with the out-of-domain accuracy of end-to-end RL. PivotRL identifies “pivots,” informative intermediate turns where sampled actions exhibit high variance in outcomes, and focuses training signal on these critical decision points. The approach achieves +4.17% higher in-domain accuracy and +10.04% higher out-of-domain accuracy compared to standard SFT, while matching end-to-end RL accuracy with 4x fewer rollout turns. PivotRL is adopted by NVIDIA’s Nemotron-3-Super-120B-A12B as the workhorse for production-scale agentic post-training.

Paper | Tweet

9. Workflow Optimization for LLM Agents

A comprehensive survey from IBM that maps recent methods for designing and optimizing LLM agent workflows, treating them as agentic computation graphs (ACGs). The survey organizes prior work along three dimensions: when structure is determined, what part of the workflow is optimized, and which evaluation signals guide optimization. It distinguishes between reusable workflow templates, run-specific realized graphs, and execution traces, covering methods like AFlow (Monte Carlo Tree Search over operator graphs), Automated Design of Agentic Systems (code-space search via meta-agents), and evolutionary multi-agent system design. A useful reference for teams building production agent systems where wiring decisions between model calls, retrieval, tool use, and verification matter as much as model capability.

Paper | Tweet

10. BIGMAS

Even the best reasoning models hit an accuracy collapse beyond a certain problem complexity. BIGMAS (Brain-Inspired Graph Multi-Agent Systems) organizes specialized LLM agents as nodes in a dynamically constructed directed graph, coordinating exclusively through a centralized shared workspace inspired by global workspace theory from cognitive neuroscience. A GraphDesigner agent analyzes each problem instance and produces a task-specific directed agent graph together with a workspace contract. The framework constructs structurally distinct graphs whose complexity tracks task demands, from compact three-node pipelines for simple arithmetic to nine-node cyclic structures for multi-step planning. BIGMAS consistently improves reasoning performance for both standard LLMs and large reasoning models, outperforming existing multi-agent baselines.

Paper | Tweet

🤖 AI Agents Weekly: Hyperagents, Multi-Agent Harness Design, Chroma Context-1, Composer 2, ARC-AGI-3, and More

Sat, 28 Mar 2026 15:01:48 GMT

In today’s issue:

Hyperagents: self-improving agents that improve how they improve
Anthropic publishes multi-agent harness design
Chroma ships Context-1 open-source search agent
Cursor releases Composer 2 technical report
ARC-AGI-3 launches with sub-1% AI scores
Codex ships plugins for Slack, Figma, Notion
Gemini 3.1 Flash Live enables realtime voice agents
Claude Code auto mode skips permissions safely
AI Scientist published in Nature
Anthropic Economic Index tracks learning curves
Junyang Lin frames reasoning vs. agentic thinking
Cohere ships open-source Transcribe model
Agent-to-agent pair programming with Claude and Codex
Claude Code ships cloud-scheduled tasks
Cursor builds Instant Grep for millisecond search
OpenSpace: self-evolving agent skills via MCP

And all the top AI dev news, papers, and tools.

🥇Top AI Papers of the Week

Sun, 15 Mar 2026 15:02:53 GMT

1. OpenDev

Terminal-native coding agents represent a fundamental shift in how developers interact with AI assistance. OpenDev is an open-source, command-line coding agent that operates where developers already manage source control and deploy environments, offering a comprehensive 81-page technical report on scaffolding, harness design, context engineering, and lessons learned from building production coding agents.

Dual-agent architecture: OpenDev separates planning from execution through a compound AI system with workload-specialized model routing. Work is organized into concurrent sessions, each composed of multiple specialized sub-agents that independently bind to a user-configured LLM, enabling fine-grained model selection for different tasks.
Adaptive context compaction: Effective autonomous assistance requires highly efficient context management to prevent context bloat and reasoning degradation. OpenDev implements lazy tool discovery and adaptive methods to reduce older observations, keeping the agent’s working memory lean as tasks grow in complexity.
Automated project memory: The system incorporates automated memory for project-specific knowledge and event-driven reminders to prevent instruction fade-out. This ensures that the agent retains critical project context across sessions without manual intervention.
Four-layer architecture: The system spans agent reasoning, context engineering, tooling, and persistence layers. This modular design provides a secure, extensible foundation for terminal-first AI assistance that can evolve independently at each layer.

Paper | Tweet

2. AutoHarness

Google DeepMind researchers introduce AutoHarness, a method for automatically synthesizing code harnesses that prevent LLM agents from making illegal actions. The core insight comes from a striking observation: in the Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves, not poor strategy.

Automatic harness synthesis: Rather than building complex rule systems by hand, AutoHarness lets Gemini-2.5-Flash automatically generate a code harness through a small number of iterative refinement rounds using feedback from the game environment. The harness acts as a programmatic constraint layer between the agent and the environment.
Smaller models beat larger ones: The resulting harness enables the smaller Gemini-2.5-Flash to outperform much larger models including Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena single-player games. This shows that structured code constraints can compensate for raw model capability.
Complete illegal move prevention: The synthesized harness successfully prevents all illegal moves across 145 different TextArena games, covering both single-player and two-player settings. This transforms a model that previously failed on most turns into a competitive agent.
Cost-effective scaling: Using a smaller model to synthesize a custom code harness is not only more performant but also more cost-effective than simply deploying a larger model. This reframes the agent improvement problem from model scaling to harness engineering.

Paper | Tweet

3. SkillNet

AI agents repeatedly rediscover solutions across separate scenarios instead of systematically reusing what they have already learned. SkillNet introduces an open infrastructure designed to create, evaluate, and organize AI skills at scale, enabling agents to transition from transient experience to durable mastery.

Unified skill ontology: Skills are structured within a unified ontology that supports creation from heterogeneous sources, including code libraries, prompt templates, and tool compositions. Rich relational connections between skills enable discovery and composition that would be impossible with flat skill stores.
Multi-dimensional evaluation: Every skill is assessed across five dimensions: Safety, Completeness, Executability, Maintainability, and Cost-awareness. This systematic evaluation ensures that skills entering the repository meet quality thresholds before agents rely on them in production.
Massive skill repository: SkillNet includes a repository of over 200,000 skills, an interactive platform for skill browsing and management, and a Python toolkit for programmatic access. This scale enables meaningful skill retrieval and composition across diverse task domains.
Consistent agent improvements: Experimental evaluations on ALFWorld, WebShop, and ScienceWorld demonstrate that SkillNet significantly enhances agent performance, improving average rewards by 40% and reducing execution steps by 30% across multiple backbone models.

Paper | Tweet

4. The Spike, the Sparse and the Sink

Yann LeCun and collaborators at NYU dissect two recurring phenomena in Transformer language models: massive activations, where a small number of tokens exhibit extreme outliers in specific channels, and attention sinks, where certain tokens attract disproportionate attention mass regardless of semantic relevance. The paper reveals that their co-occurrence is largely an architectural artifact.

Distinct operational scopes: Massive activations operate globally, inducing near-constant hidden representations that persist across layers and function as implicit model parameters. Attention sinks operate locally, modulating attention outputs across heads and biasing individual heads toward short-range dependencies.
Pre-norm as the critical factor: The pre-norm configuration common in modern Transformers is identified as the key architectural element enabling the co-occurrence of these two phenomena. Removing pre-norm causes massive activations and attention sinks to decouple entirely.
Practical implications for efficiency: Understanding these phenomena has direct consequences for model compression, quantization, and KV-cache optimization. Many efficiency techniques fail silently when they inadvertently disrupt massive activations or attention sinks, and this paper explains why.
Not functionally necessary: The co-occurrence of spikes and sinks is a design-dependent artifact rather than a fundamental requirement for model performance. This opens the door to architectural modifications that could eliminate these phenomena without sacrificing capability.

Paper | Tweet

Message from the Editor

Excited to announce our new on-demand course “Vibe Coding AI Apps with Claude Code“. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.

Enroll Now

5. KARL

Databricks presents KARL, a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. The work also introduces KARLBench, a new evaluation framework spanning six search domains.

New post-training paradigm (OAPL): KARL concurrently develops OAPL, an iterative large-batch off-policy RL approach. By embracing off-policyness in the design of the objective, it is robust to discrepancies between the trainer and the inference engine without requiring heuristics like clipped importance weighting or data deletion.
Multi-task heterogeneous training: Rather than optimizing for a single benchmark, KARL trains across heterogeneous search behaviors including constraint-driven entity search, cross-document synthesis, tabular reasoning, entity retrieval, procedural reasoning, and fact aggregation. This produces substantially better generalization than single-benchmark optimization.
Pareto-optimal performance: Starting from GLM 4.5 Air with varying levels of test-time scaling, KARL is Pareto-optimal on KARLBench when compared to Claude 4.6 and GPT 5.2 across both cost-quality and latency-quality tradeoffs.
Scalable with test-time compute: KARL-BCP attains 59.6 on BrowseComp-Plus, which further improves to 70.4 with value-guided search. KARL-TREC reaches 85.0 on TREC-Biogen, the second-highest score overall. The system surpasses the strongest closed models given sufficient test-time compute.

Paper | Tweet

6. Memex(RL)

As tasks get longer and more complex, LLM agents lose track of what they have learned, what they have tried, and what still needs to be done. Memex(RL) introduces an indexed experience memory mechanism that scales agent capability on long-horizon tasks without discarding evidence or blowing up the context window.

Indexed experience memory: Rather than lossy compression, Memex maintains a compact working context consisting of concise structured summaries and stable indices while storing full-fidelity underlying interactions in an external experience database. The agent decides what to summarize, what to archive, how to index it, and when to retrieve it.
RL-optimized memory operations: The MemexRL reinforcement learning framework optimizes both write and read behaviors with reward shaping tailored to indexed memory usage under a context budget. This teaches the agent to manage its own memory strategically rather than relying on fixed heuristics.
Bounded retrieval complexity: Theoretical analysis demonstrates that Memex can maintain decision quality with bounded retrieval operations while keeping computational load manageable as task history grows. This makes the approach practical for tasks that span hundreds or thousands of steps.
Smaller context, better results: Empirically, agents trained with MemexRL improve task success rates on challenging long-horizon tasks while using a significantly smaller working context than baseline approaches. Less context, used more intelligently, outperforms brute-force context expansion.

Paper | Tweet

7. FlashAttention-4

FlashAttention-4 co-designs algorithms and kernel pipelines for the B200 and GB200 GPUs, which exhibit fundamentally different performance characteristics due to asymmetric hardware scaling where tensor core throughput doubles while other functional units scale more slowly.

Significant speedups on Blackwell: FlashAttention-4 achieves up to 1.3x speedup over cuDNN 9.13 and 2.7x over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s at 71% hardware utilization. These gains come from careful co-design rather than algorithmic changes alone.
Asymmetric scaling solutions: The techniques include redesigned pipelines that exploit fully asynchronous matrix multiply operations and larger tile sizes, software-emulated exponential and conditional softmax rescaling, and leveraging tensor memory to reduce shared memory traffic.
Python-native implementation: The entire system is implemented in CuTe-DSL embedded in Python, achieving 20-30x faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity. This dramatically lowers the barrier to kernel development.
Hardware-algorithm co-design: The paper demonstrates that next-generation GPU architectures demand fundamentally new attention kernel designs rather than incremental optimizations of existing ones. Techniques that worked well on Hopper GPUs leave significant performance on the table on Blackwell.

Paper | Tweet

8. STRUCTUREDAGENT

STRUCTUREDAGENT introduces a hierarchical planning framework for long-horizon web tasks using dynamic AND/OR trees. The framework separates planning responsibilities: the system constructs and maintains the planning tree while the LLM is invoked only for local operations like node expansion or repair. A structured memory module tracks candidate solutions to improve constraint satisfaction. Results on WebVoyager, WebArena, and custom shopping benchmarks show improved performance over standard LLM-based web agents, with the added benefit of interpretable hierarchical plans that enable easier debugging and human intervention.

Paper | Tweet

9. AgentIR

Deep research agents generate explicit reasoning before every search call, but existing retrievers completely ignore these rich signals about search intent and problem context. AgentIR introduces reasoning-aware retrieval that jointly embeds the agent’s reasoning trace alongside its query, along with DR-Synth, a data synthesis method for generating training data from standard QA datasets. On BrowseComp-Plus, AgentIR-4B achieves 68% accuracy with Tongyi-DeepResearch compared to 50% with conventional embedding models twice its size and 37% with BM25.

Paper | Tweet

10. Think Harder or Know More

This paper investigates transformer models featuring both adaptive per-layer looping, where each block learns to iterate its hidden state via a learned halting mechanism, and gated memory banks that provide additional learned storage. The key finding is that looping primarily benefits mathematical reasoning while memory banks help recover performance on commonsense tasks. Combining both mechanisms yields a model that outperforms an iso-FLOP baseline with three times the number of layers on math benchmarks. Analysis of model internals reveals layer specialization: early layers loop minimally and access memory sparingly, while later layers do both more heavily.

Paper | Tweet

AI Newsletter

🤖 AI Agents Weekly: Claude Opus 4.8, Claude Code Dynamic Workflows, Chrome DevTools for Agents 1.0, DeepSWE, Agent Harness Scaling Laws, and More

Top Stories

AutoScientists Self-Organize for Long-Running Science

Claude Opus 4.8 Sharpens Agentic Judgment

🥇Top AI Papers of the Week

1. Code as Agent Harness

Message from our Sponsor

2. OpenAI Disproves the Unit Distance Conjecture

3. Memory as a Model

4. AIRA

5. Weak-Model Critic-Comparator

6. MetaCogAgent

7. Production Agent Architecture Methodology

8. NanoGPT-Bench

9. General-Agent

10. Contrastive Neuron Attribution

🤖 AI Agents Weekly: Gemini 3.5 Flash, Antigravity 2.0, Codex Thursday, Cohere Command A+, Qwen3.7-Max, and More

Top Stories

Gemini 3.5 Flash and Managed Agents Land

🥇Top AI Papers of the Week

1. Lighthouse Attention

Message from the Editor

2. Is Grep All You Need?

3. A Geometric Calculator Inside a Neural Network

4. δ-mem

5. Beyond Individual Intelligence

6. AutoTTS

7. AI Co-Mathematician

8. AEvo

9. The Memory Curse in LLM Agents

10. Token Superposition Training

🤖 AI Agents Weekly: Thinking Machines Interaction Models, Is Grep All You Need?, Codex Mobile + Hooks, Cursor Cloud Agents, Ring-2.6-1T, and More

Top Stories

Thinking Machines Introduces Interaction Models

Is Grep All You Need? Harness Beats Vector RAG for Coding Agents

🥇Top AI Papers of the Week

1. HeavySkill

2. Conductor

3. Self-Improving Pretraining

4. Connect Four AlphaZero from Scratch

Message from the Editor

5. Coordination as Architecture

6. Horizon Generalization

7. 1,000 Synthetic Computers

8. Contextual Agentic Memory is a Memo

9. Agentic-imodels

10. Skills as Verifiable Artifacts

🤖 AI Agents Weekly: Meta FAIR Autodata, ZAYA1-8B, SubQ 12M Context, Natural Language Autoencoders, Claude Managed Agents Dreaming, and More

Top Stories

Autodata: An Agentic Data Scientist From Meta FAIR

🥇Top AI Papers of the Week

1. Agentic Harness Engineering

Message from our Sponsor

2. AgenticQwen-30B-A3B

3. Agentic World Modeling

4. RecursiveMAS

5. OneManCompany

6. From Skill Text to Skill Structure

7. Latent Agents

8. OCR-Memory

9. When to Retrieve During Reasoning

10. Co-evolving Decisions and Skills

🤖 AI Agents Weekly: Codex for Everyday Work, Cursor SDK, Mistral Workflows, LLM Knowledge Bases, Agentic Harness Engineering, and More

Top Stories

Codex for Everyday Work

🥇Top AI Papers of the Week

1. DeepSeek V4

2. Autogenesis

3. Attention to Mamba

4. Skill-RAG

Message from the Editor

5. Self-Generated World Knowledge

6. Self-Evolving Logic Synthesis

7. Stateless Decision Memory

8. There Will Be a Scientific Theory of Deep Learning

9. MASS-RAG

10. Diversity Collapse in Multi-Agent LLMs

🤖 AI Agents Weekly: GPT-5.5, DeepSeek-V4 Preview, Kimi K2.6 Agent Swarm, Diversity Collapse, Sakana Fugu, and More

Top Stories