🥇 Top AI Papers of the Week
The Top AI Papers of the Week (June 21 - 28)
1. Sakana Fugu
Frontier LLMs keep advancing, and different providers are increasingly specializing in distinct domains, which raises a natural next objective: how do you combine those individual specializations into one collectively intelligent system? Sakana Fugu answers with a family of orchestrator models that are themselves language models trained to read a user query and dynamically devise the agentic scaffold needed to solve it.
Orchestrator models, not a fixed pipeline: Fugu is trained to understand a query and build an adaptive agentic scaffold on the fly, harnessing and amplifying a team of LLM agents rather than routing to a single frozen workflow.
Performance beyond any single agent: Through these query-adaptive scaffolds, Fugu reaches state-of-the-art results against other publicly accessible models across SWE-Bench Pro, Terminal Bench, LiveCodeBench, GPQA-Diamond, Humanity’s Last Exam, and CharXiv Reasoning.
Two models for two regimes: They release Fugu, which balances answer quality against latency for everyday use, and Fugu-Ultra, which prioritizes quality on the hardest problems.
Why it matters: The training paradigm combines large-scale fine-tuning, evolutionary algorithms, and reinforcement learning, backed by the infrastructure needed to run it as a production system. The result points to dynamic, query-adaptive scaffolds and collective intelligence as a path toward the next frontier of AI capabilities.
2. Agent-Native Memory
Memory for LLM agents has quietly grown from a retrieval add-on into a full data system, with persistent storage, retrieval, update, consolidation, and lifecycle governance running throughout an agent’s execution. Yet most evaluations still score memory only through end-to-end task metrics like F1 and BLEU, treating the whole stack as a black box. This paper studies agent memory from a data management perspective and asks what we are actually missing when we measure it that way.
A data management view of memory: The authors argue that operational cost, architectural trade-offs across memory modules, and robustness under dynamic knowledge updates are first-class concerns that task-success metrics hide entirely.
A four-module decomposition: They break memory into representation and storage, extraction, retrieval and routing, and maintenance, then evaluate 12 representative memory systems plus two baselines across five workloads spanning 11 datasets.
No single architecture wins: Effectiveness depends on how well the memory structure matches the workload bottleneck, and fine-grained ablations quantify each module’s effect on representation fidelity, retrieval precision, update correctness, and long-horizon stability.
Why it matters: The study shows localized maintenance is more cost-efficient than global reorganization, and reframing memory as a system with measurable trade-offs is what gets us toward genuinely agent-native memory rather than another leaderboard number.
3. Autodata
Building synthetic training data has mostly stayed a fixed pipeline that you hand-tune once and then freeze. Autodata rethinks that by casting an AI agent as a data scientist that builds high-quality training and evaluation data, then meta-optimizes that agent so it learns to create even stronger data over time.
An agent as data scientist: Autodata is a general formulation in which an AI agent plays the role of a data scientist building both training and evaluation data, instantiated as a concrete, practical implementation the authors call Agentic Self-Instruct.
Meta-optimization compounds the gains: Beyond using the agent to generate data, they train (meta-optimize) the data scientist agent itself, and this self-improvement step delivers a larger performance uplift than base agentic data creation alone.
Consistent across domains: On computer science research tasks, legal reasoning, and reasoning with mathematical objects, Autodata beats classical synthetic dataset creation methods, showing the approach is not tied to a single problem type.
Why it matters: Agentic data creation turns increased inference compute into higher-quality training data, offering a path that could change how teams build datasets rather than freezing a pipeline and hoping it generalizes.
4. Critique of the Agent Model
The word agent now covers everything from a for-loop with tool calls to speculative machine superintelligence, which makes it nearly useless as a technical term. This position paper from Eric Xing and collaborators tries to fix that by asking what an agent actually is and what agency consists of, drawing on Descartes and on science-fiction portrayals of autonomous beings to ground the discussion.
Five dimensions of agency: The authors analyze agent architectures along goal, identity, decision-making, self-regulation, and learning, and argue that genuine agency requires these structures to be internalized in the system rather than assembled through external scaffolding.
Agentic versus agentive: They draw a sharp line between agentic systems, whose competence lives in engineered workflows, and agentive systems, whose capabilities, including social interaction, arise endogenously. That line marks the boundary between task-specific tools and open-world autonomy.
A concrete architecture: Building on the analysis, they propose the Goal-Identity-Configurator, combining hierarchical goal decomposition, identity evolution, simulative reasoning grounded in a separately trained world model, learned self-regulation, and self-directed learning from real and simulated experience.
Why it matters: Clear definitions are not academic hair-splitting here. They shape what we build and what we should reasonably fear, and the paper centers auditability, controllability, and safety for systems that hold more autonomy yet stay under human oversight.
Message from the Editor
We just released LLM-as-a-Judge, a hands-on DAIR Academy lab where you build an LLM judge from scratch to evaluate open-ended AI output. Across six short labs, you grade a support bot’s freeform replies on a rubric, then validate the judge against human labels and harden it against bias, ending with a small, trustworthy evaluation harness you can point at any open-ended task.
5. Agent-as-a-Router
Most users now have access to many LLMs that each excel in different domains, so routing each task to the right model matters for both quality and cost. Existing routers treat this as a static, one-off classification problem, and this paper shows that framing is exactly what holds them back.
Information deficit is the bottleneck: Simply augmenting a vanilla LLM router with performance statistics at the task-dimension level yields a 15.3% relative gain, surpassing a heuristic router built on the same priors, which pinpoints missing information rather than model choice as the real limiter.
Routing as a closed loop: Agent-as-a-Router formalizes routing as a Context → Action → Feedback → Context loop that accumulates execution-grounded experience during deployment instead of deciding once and moving on.
A concrete system and benchmark: The framework is instantiated as ACRouter, built from an Orchestrator, a Verifier, and a Memory module, and the authors release CodeRouterBench, roughly 10K task instances scored across 8 frontier LLMs for regret-based comparison on streaming tasks.
Why it matters: ACRouter achieves the lowest cumulative regret on in-distribution tasks and generalizes to out-of-distribution agentic programming, showing that treating routing as an experience-gathering agent, not a classifier, is what closes the information gap.
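The closed loop described above can be illustrated with a minimal Python sketch. It is a toy, not ACRouter itself: the class ToyExperienceRouter, the function run_stream, the model names, and the score table are all hypothetical, and a lookup table stands in for the Verifier. The router keeps per-task-type statistics (its context), picks a model (action), records the observed score (feedback), and cumulative regret is measured against the best model for each task.

```python
from collections import defaultdict


class ToyExperienceRouter:
    """Hypothetical router that learns per-task-type model statistics online."""

    def __init__(self, models):
        self.models = list(models)
        # (task_type, model) -> [sum of scores, number of observations]
        self.stats = defaultdict(lambda: [0.0, 0])

    def choose(self, task_type):
        # Action: try every model once, then exploit the best observed mean.
        for model in self.models:
            if self.stats[(task_type, model)][1] == 0:
                return model

        def mean_score(model):
            total, count = self.stats[(task_type, model)]
            return total / count

        return max(self.models, key=mean_score)

    def update(self, task_type, model, score):
        # Feedback: fold the observed score back into the context.
        entry = self.stats[(task_type, model)]
        entry[0] += score
        entry[1] += 1


def run_stream(router, tasks, score_table):
    """Route a stream of tasks and return the cumulative regret."""
    regret = 0.0
    for task_type in tasks:
        model = router.choose(task_type)
        score = score_table[task_type][model]  # stand-in for a verifier
        router.update(task_type, model, score)
        regret += max(score_table[task_type].values()) - score
    return regret


# Assumed, made-up scores: each model is strong on one task type.
score_table = {
    "sql": {"model_a": 0.9, "model_b": 0.4},
    "regex": {"model_a": 0.3, "model_b": 0.8},
}
tasks = ["sql", "regex"] * 5
router = ToyExperienceRouter(["model_a", "model_b"])
regret = run_stream(router, tasks, score_table)
print(f"cumulative regret: {regret:.2f}")
```

In this toy setup the router pays regret only while it is still exploring. A static router that sent every task to model_a would keep losing 0.5 on each regex task for the whole stream.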
6. Agent Communication Protocols
As multi-agent systems try to move past the limits of standalone agents, communication becomes the load-bearing infrastructure, and the protocol landscape for it is a fragmented mess. This study builds a technical taxonomy to classify and compare LLM agent communication protocols and to make the interoperability problem legible.
A five-dimensional taxonomy: Following an established iterative method, the authors classify protocols along counterparty, payload, interaction state, discovery mechanism, and schema flexibility, derived through five iterations over nine actively maintained open-source protocols with real adoption.
Recurring architectural patterns: Every sampled agent-to-agent protocol combines hybrid payloads with session-state persistence, most support multiple predefined schemas, and two negotiate schemas at runtime, signaling a clear trend toward schema flexibility.
Where the gaps are: Decentralized discovery remains rare, and the analysis suggests short-term convergence pressure toward protocols that unify agent-to-agent and agent-to-context communication for tools and data.
Why it matters: No single protocol is likely to maximize versatility, efficiency, and portability at once, so the field will probably evolve into a federated, layered protocol stack, and this taxonomy gives teams a way to choose protocols and surfaces open problems like privacy and policy enforcement.
7. A Pinch of Human Data
Self-play reinforcement learning can train driving policies with no human data at all, swapping expensive human demonstrations for cheap large-scale simulation. The catch is that pure self-play tends to discover effective but alien driving conventions that real people cannot work with, and the usual fixes lean on brittle reward engineering and domain randomization.
Human data as a regularizer: Instead of discarding demonstrations or imitating them wholesale, the method treats human data as a regularization objective layered on top of a minimal safe goal-reaching reward, keeping behavior compatible with people without hand-tuning conventions.
A little goes a long way: Just 30 minutes of human demonstrations, roughly 2,500 times less data than comparable imitation learning approaches use, is enough to pull self-play policies into human-compatible behavior.
Cheap to train: The resulting policies coordinate with held-out human trajectories and finish training in 15 hours on a single consumer-grade GPU, which keeps the recipe accessible rather than a frontier-lab luxury.
Why it matters: Behavioral alignment with humans is the hard part of deploying autonomous policies in shared environments, and this work shows that a tiny, well-placed dose of human data can fix what massive reward engineering struggles to, pointing to a cleaner path for human-AI coordination.
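One way to picture "human data as a regularizer" is a loss that adds a weighted human-likelihood term to the self-play objective. The summary above does not give the exact form of the regularizer, so this sketch assumes a negative log-likelihood (behavior-cloning style) term. The function names, the weight parameter, and the tiny tabular policy are hypothetical illustrations, not the paper's implementation.

```python
import math


def human_regularizer(policy_probs, human_pairs):
    """Mean negative log-likelihood of human (state, action) pairs under the policy."""
    losses = [-math.log(policy_probs[state][action]) for state, action in human_pairs]
    return sum(losses) / len(losses)


def regularized_loss(rl_loss, policy_probs, human_pairs, weight=0.1):
    """Self-play RL loss plus a weighted human-compatibility penalty (assumed form)."""
    return rl_loss + weight * human_regularizer(policy_probs, human_pairs)


# Assumed toy policy: action probabilities for one discrete driving state.
policy_probs = {"merge": {"yield": 0.5, "go": 0.5}}
human_pairs = [("merge", "yield")]  # the human demonstrator yields when merging

penalty = human_regularizer(policy_probs, human_pairs)
total = regularized_loss(1.0, policy_probs, human_pairs, weight=0.5)
print(f"penalty={penalty:.3f} total={total:.3f}")
```

The penalty shrinks as the policy assigns more probability to what humans actually did, so the weight controls how strongly self-play is pulled toward human conventions without replacing the reward.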
8. Skill-MAS
Automatic generation of multi-agent systems is stuck between inference-time methods that reuse frozen frontier models but never learn, and training-time methods that internalize experience through gradient updates but are capped by the weaker models small enough to fine-tune. Skill-MAS proposes a third path that treats high-level orchestration as an evolvable Meta-Skill, decoupling experience retention from weight updates so frontier models keep getting better at orchestration without any gradient steps. Across four complex benchmarks and four distinct LLMs it delivers strong, transferable gains at a favorable cost-performance trade-off.
9. Reliability without Validity
LLM-as-a-Judge is the default way to evaluate language models, but validating those judges with exact-match agreement never corrects for chance and systematically overstates how good they are. In the largest audit to date, spanning 21 judges from nine providers across MT-Bench, JudgeBench, and RewardBench (118 runs, roughly 541,000 judgments), the gap between raw agreement and chance-corrected Cohen’s kappa runs 33 to 41 percentage points. Rankings shift by up to 14 positions across benchmarks, and high test-retest reliability coexists with severe position bias. The authors distill their findings into a Minimum Viable Validation Protocol so teams can stress-test judges before trusting them.
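To see why raw agreement can overstate judge quality, it helps to compute both numbers on a small example. Cohen's kappa is defined as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below uses made-up labels (not data from the paper), and the function names are illustrative.

```python
from collections import Counter


def raw_agreement(labels_a, labels_b):
    """Fraction of items on which the two raters give the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_o = raw_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when both raters use a single identical label
    return (p_o - p_e) / (1 - p_e)


# Made-up example: humans prefer response "A" on 8 of 10 items,
# while a lazy judge always answers "A".
human = ["A"] * 8 + ["B"] * 2
judge = ["A"] * 10

print(f"raw agreement: {raw_agreement(human, judge):.2f}")
print(f"Cohen's kappa: {cohens_kappa(human, judge):.2f}")
```

Here the judge agrees with humans on 80% of items, yet kappa is zero, because a judge that ignores the input entirely would reach the same agreement on this label distribution.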
10. NatureBench
Can coding agents move past reproduction toward actual discovery on real scientific problems? NatureBench distills 90 cross-discipline tasks from peer-reviewed Nature-family papers and runs them in NatureGym, an automated pipeline that builds a standardized containerized environment per task to fix the environment-fragmentation problem. Under a strict web-search-disabled protocol, the strongest of ten frontier agent configurations beats published SOTA on only 17.8% of tasks, and analysis shows agents win mainly by translating problems into familiar supervised prediction rather than through genuine scientific invention.








