<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[AI Newsletter]]></title><description><![CDATA[The AI Newsletter provides weekly summaries of the latest and top AI trends, papers, tools, news, and best practices. Home of Top AI Papers of the Week and AI Agents Weekly series. ]]></description><link>https://nlp.elvissaravia.com</link><image><url>https://substackcdn.com/image/fetch/$s_!m7md!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41327c80-fe59-416d-aa6f-ab6874177ac7_517x517.png</url><title>AI Newsletter</title><link>https://nlp.elvissaravia.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 10 May 2026 19:42:58 GMT</lastBuildDate><atom:link href="https://nlp.elvissaravia.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[elvis]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[nlpnews@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[nlpnews@substack.com]]></itunes:email><itunes:name><![CDATA[elvis]]></itunes:name></itunes:owner><itunes:author><![CDATA[elvis]]></itunes:author><googleplay:owner><![CDATA[nlpnews@substack.com]]></googleplay:owner><googleplay:email><![CDATA[nlpnews@substack.com]]></googleplay:email><googleplay:author><![CDATA[elvis]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (May 4 - May 10)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-154</link><guid 
isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-154</guid><pubDate>Sun, 10 May 2026 15:01:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ssdq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. HeavySkill</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Udsc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Udsc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png 424w, https://substackcdn.com/image/fetch/$s_!Udsc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png 848w, https://substackcdn.com/image/fetch/$s_!Udsc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png 1272w, https://substackcdn.com/image/fetch/$s_!Udsc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Udsc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png" width="996" height="419" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:419,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;HeavySkill&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="HeavySkill" title="HeavySkill" srcset="https://substackcdn.com/image/fetch/$s_!Udsc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png 424w, https://substackcdn.com/image/fetch/$s_!Udsc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png 848w, https://substackcdn.com/image/fetch/$s_!Udsc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png 1272w, https://substackcdn.com/image/fetch/$s_!Udsc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>One of the cleaner takes on agentic harness design released this year. The paper argues that what actually drives harness performance is not the orchestration code, but a single inner skill: parallel reasoning followed by deliberation. Internalize that pattern into the model and most of the surrounding scaffolding becomes optional. HeavySkill systematizes the idea as a two-stage pipeline you can run beneath any harness, then trains it as a learnable skill via RLVR. The result is a harness win that looks more like a model win.</p><ul><li><p><strong>Two-stage skill, not orchestration glue:</strong> Stage one runs parallel reasoning across multiple sampled chains.
Stage two performs a deliberation pass that compares, critiques, and synthesizes those chains into a final answer. The pipeline is the same regardless of harness, which is why it transfers across tasks.</p></li><li><p><strong>GPT-OSS-20B jumps from 69.7% to 85.5% on LiveCodeBench:</strong> Under the heavy-thinking variant (HM@4), the 20B model gets a 15.8 point lift on a hard coding benchmark. The same recipe takes R1-Distill-Qwen-32B from 35.7% to 69.3% on IFEval, nearly doubling its instruction-following score.</p></li><li><p><strong>Pass@N-level performance from a learned skill:</strong> Several models reach Pass@N-level performance once HeavySkill is internalized through RLVR, which is the property that makes the parallel-deliberation pattern actually portable. The skill survives outside the harness it was trained under.</p></li><li><p><strong>Why it matters:</strong> Harness wins start to look like model wins once you can train them in. If parallel reasoning plus deliberation really is the inner skill, the long arc is models that ship with it baked in, not orchestration glue layered around them.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.02396">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2051678102934454330">Tweet</a></strong></p><div><hr></div><h2><strong>2. 
Conductor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kF-k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kF-k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png 424w, https://substackcdn.com/image/fetch/$s_!kF-k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png 848w, https://substackcdn.com/image/fetch/$s_!kF-k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png 1272w, https://substackcdn.com/image/fetch/$s_!kF-k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kF-k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png" width="996" height="364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Conductor&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Conductor" title="Conductor" srcset="https://substackcdn.com/image/fetch/$s_!kF-k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png 424w, https://substackcdn.com/image/fetch/$s_!kF-k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png 848w, https://substackcdn.com/image/fetch/$s_!kF-k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png 1272w, https://substackcdn.com/image/fetch/$s_!kF-k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>Sakana AI&#8217;s ICLR 2026 paper introduces a 7B Conductor model that hits SOTA on GPQA-Diamond and LiveCodeBench by orchestrating other LLMs instead of solving problems itself. The Conductor is trained with RL to do two things simultaneously: design communication topologies between worker agents (open or closed source) and prompt-engineer focused instructions to each worker so it leverages individual strengths. The orchestrator becomes a learnable policy, not a wrapper around one.</p><ul><li><p><strong>Topology design plus targeted prompting:</strong> A single RL policy decides who talks to whom and what each worker is told.
Trained against randomized agent pools, the Conductor adapts to arbitrary mixes of agents at inference time, including agents it never saw during training.</p></li><li><p><strong>Recursive topologies emerge:</strong> When allowed to pick itself as a worker, the Conductor forms recursive topologies, unlocking a new form of dynamic test-time scaling through online iterative adaptation. Coordination becomes its own scaling axis, separate from model size or context length.</p></li><li><p><strong>3% gains on AIME25 and GPQA-D from coordination alone:</strong> The gains over the best individual worker land in the 3% range, which the authors note is consistent with entire generational improvements between frontier model versions. The difference is that here the lift comes from learned routing, not from larger pretraining runs.</p></li><li><p><strong>Why it matters:</strong> This is one of the cleaner arguments yet that the orchestrator should be the model. Routing decisions stop being a wrapper and become a learnable policy, which is the right abstraction for production agent stacks that compose multiple model providers.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2512.04388">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2051306659021242635">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
Self-Improving Pretraining</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bO2d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bO2d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png 424w, https://substackcdn.com/image/fetch/$s_!bO2d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png 848w, https://substackcdn.com/image/fetch/$s_!bO2d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png 1272w, https://substackcdn.com/image/fetch/$s_!bO2d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bO2d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png" width="696" height="311" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:311,&quot;width&quot;:696,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Self-Improving Pretraining&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Self-Improving Pretraining" title="Self-Improving Pretraining" srcset="https://substackcdn.com/image/fetch/$s_!bO2d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png 424w, https://substackcdn.com/image/fetch/$s_!bO2d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png 848w, https://substackcdn.com/image/fetch/$s_!bO2d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png 1272w, https://substackcdn.com/image/fetch/$s_!bO2d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Most LLM safety, factuality, and reasoning fixes get bolted on at post-training. By then the patterns have already set. This Meta FAIR paper moves those behaviors into pretraining itself. The team uses a strong post-trained model as both a rewriter and a judge: it rewrites pretraining suffixes toward higher-quality, safer continuations, then scores model rollouts against the original suffix and the rewrite to drive RL during pretraining. Instead of next-token prediction, the policy learns sequence generation from the start, with rewards for quality, safety, and factuality.</p><ul><li><p><strong>Post-trained model as rewriter and judge:</strong> The strong model rewrites suffixes during pretraining, then judges rollouts of the in-training model against both the rewrite and the original.
Safety, factuality, and quality become reward signals rather than post-hoc filters, which lets the policy internalize the targets early.</p></li><li><p><strong>Sequence generation from the start:</strong> The policy is trained to generate sequences directly under reward, not to predict the next token. This shifts the inductive bias toward producing the kinds of continuations the judge rewards, which matters most on long-form generation where token-level losses lose discriminative signal.</p></li><li><p><strong>Concrete gains across the board:</strong> 36.2% relative gain in factuality, 18.5% in safety, and up to 86.3% win rate in generation quality over standard pretraining. The safety and factuality numbers are large enough to suggest these properties are easier to install during pretraining than to retrofit after the fact.</p></li><li><p><strong>Why it matters:</strong> The post-trained models you already have can be used to pretrain the next ones better. That is a recursive improvement loop at the pretraining layer, which is where the largest behavioral commitments actually get locked in.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2601.21343">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2050213732970848664">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
Connect Four AlphaZero from Scratch</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FYE5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FYE5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png 424w, https://substackcdn.com/image/fetch/$s_!FYE5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png 848w, https://substackcdn.com/image/fetch/$s_!FYE5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png 1272w, https://substackcdn.com/image/fetch/$s_!FYE5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FYE5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png" width="996" height="537" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:537,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Connect Four AlphaZero&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Connect Four AlphaZero" title="Connect Four AlphaZero" srcset="https://substackcdn.com/image/fetch/$s_!FYE5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png 424w, https://substackcdn.com/image/fetch/$s_!FYE5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png 848w, https://substackcdn.com/image/fetch/$s_!FYE5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png 1272w, https://substackcdn.com/image/fetch/$s_!FYE5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This paper proposes a new way to evaluate coding agents: hand them a minimal task description, give them a tight budget, and ask them to autonomously rebuild a famous ML breakthrough end-to-end. Connect Four plus AlphaZero is the first instance. It is small enough to run on a laptop and hard enough to require a real research engineering loop. Claude Opus 4.7 implemented the full pipeline (MCTS, neural value and policy nets, self-play, training schedule) in three hours on consumer hardware, then beat the Pascal Pons solver 7 of 8 as first-mover. No other frontier coding agent tested cleared 2 of 8.</p><ul><li><p><strong>From patches to systems:</strong> Existing coding-agent benchmarks measure unit-test fixes and small patches.
This benchmark measures whether the agent can build a non-trivial ML system from a one-paragraph spec, which is closer to what production research engineering actually looks like.</p></li><li><p><strong>Tight budget, real research loop:</strong> The agent has to design the search algorithm, train the networks, schedule self-play, and debug the loop, all within a fixed compute budget on consumer hardware. There is no escape hatch into a pre-built library, which is what makes the task discriminative.</p></li><li><p><strong>A clean separation between frontier coders:</strong> Claude Opus 4.7 reached 7 of 8 wins as first-mover against the Pascal Pons solver. No other frontier coding agent tested cleared 2 of 8. The gap is large enough to suggest the benchmark is detecting something real about end-to-end ML engineering capability.</p></li><li><p><strong>Why it matters:</strong> Patch-style benchmarks are starting to saturate. Rebuild-a-breakthrough tasks give the field a harder ceiling to push against, and they map more directly to the agent workloads people actually want to deploy.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.25067">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2050693576250753233">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2NyM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2NyM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!2NyM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2NyM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!2NyM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2NyM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!2NyM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!2NyM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2NyM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!2NyM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a>&#8221;. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.dair.ai/courses/build-apps-with-claude-code&quot;,&quot;text&quot;:&quot;Enroll Now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.dair.ai/courses/build-apps-with-claude-code"><span>Enroll Now</span></a></p><div><hr></div><h2><strong>5. Coordination as Architecture</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TWgK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TWgK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png 424w, https://substackcdn.com/image/fetch/$s_!TWgK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png 848w, https://substackcdn.com/image/fetch/$s_!TWgK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png 1272w,
https://substackcdn.com/image/fetch/$s_!TWgK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TWgK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png" width="1258" height="807" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:807,&quot;width&quot;:1258,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Coordination as Architecture&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Coordination as Architecture" title="Coordination as Architecture" srcset="https://substackcdn.com/image/fetch/$s_!TWgK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png 424w, https://substackcdn.com/image/fetch/$s_!TWgK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png 848w, https://substackcdn.com/image/fetch/$s_!TWgK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png 1272w, 
https://substackcdn.com/image/fetch/$s_!TWgK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Multi-agent LLM systems fail in production at rates between 41% and 87%, and the majority of those failures are coordination defects, not base-model capability. Most published comparisons of multi-agent architectures cannot even tell you whether the gain came from coordination or from one configuration just having more context. 
This paper argues coordination should be treated as a configurable architectural layer, separable from agent logic and information access, then backs the position with an information-controlled experiment.</p><ul><li><p><strong>Information-controlled methodology:</strong> Same LLM, same tools, same prompt template, same per-call output cap. The only thing that varies is coordination structure. Once information access is held constant, the actual contribution of coordination becomes measurable for the first time.</p></li><li><p><strong>Coordination as a separate layer:</strong> The paper proposes treating coordination structure (who talks to whom, when, with what aggregation rule) as a first-class architectural axis. That separation lets teams reason about coordination changes without re-running the entire stack.</p></li><li><p><strong>A vocabulary for the field:</strong> Until now, &#8220;multi-agent beats single-agent&#8221; comparisons have been confounded by context-window asymmetries. This paper provides the methodology and vocabulary needed to actually test coordination claims, which is overdue infrastructure for the multi-agent research line.</p></li><li><p><strong>Why it matters:</strong> If 41% to 87% of failures are coordination defects, fixing coordination is the highest-leverage thing builders can do. The paper turns that intuition into a measurable engineering target instead of a vibes-based debate.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.03310">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2052429021833818458">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
Horizon Generalization</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5k2g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5k2g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png 424w, https://substackcdn.com/image/fetch/$s_!5k2g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png 848w, https://substackcdn.com/image/fetch/$s_!5k2g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png 1272w, https://substackcdn.com/image/fetch/$s_!5k2g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5k2g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png" width="1456" height="492" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:492,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Horizon Generalization&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Horizon Generalization" title="Horizon Generalization" srcset="https://substackcdn.com/image/fetch/$s_!5k2g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png 424w, https://substackcdn.com/image/fetch/$s_!5k2g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png 848w, https://substackcdn.com/image/fetch/$s_!5k2g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png 1272w, https://substackcdn.com/image/fetch/$s_!5k2g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Microsoft Research runs a controlled study where the only variable is task horizon length. Same decision rules, same reasoning structure, different sequence length to the goal. The main finding: horizon alone is a training bottleneck. As goal distance grows, exploration explodes combinatorially and credit assignment gets ambiguous. Models that learn cleanly on short horizons fall apart on long ones, even when the underlying reasoning is identical. The fix is not more compute; it is horizon reduction.</p><ul><li><p><strong>Horizon as a first-class variable:</strong> By holding decision rules and reasoning constant and only varying sequence length, the paper isolates horizon as a distinct training bottleneck. 
This separates &#8220;the agent cannot reason&#8221; from &#8220;the agent cannot stitch together long sequences,&#8221; which most prior work conflated.</p></li><li><p><strong>Macro actions stabilize training:</strong> Re-parameterizing the action space with macro actions that compress many low-level decisions into one stabilizes training immediately. The agent learns the same task, just at a coarser temporal grain that keeps credit assignment tractable.</p></li><li><p><strong>Generalization to longer horizons at inference:</strong> Models trained on reduced horizons generalize to longer ones at inference time. The paper calls this horizon generalization, and it is the most useful property because it means you can train cheap and deploy long.</p></li><li><p><strong>Why it matters:</strong> Most teams treat long-horizon failures as a model-capacity problem. This paper says it is a horizon problem. Reduce horizon during training, get stability now and generalization for free at inference, without retraining a larger backbone.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.02572">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2051679862788878354">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
1,000 Synthetic Computers</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ssdq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ssdq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png 424w, https://substackcdn.com/image/fetch/$s_!ssdq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png 848w, https://substackcdn.com/image/fetch/$s_!ssdq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!ssdq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ssdq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;1000 Synthetic Computers&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="1000 Synthetic Computers" title="1000 Synthetic Computers" srcset="https://substackcdn.com/image/fetch/$s_!ssdq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png 424w, https://substackcdn.com/image/fetch/$s_!ssdq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png 848w, https://substackcdn.com/image/fetch/$s_!ssdq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!ssdq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Microsoft Research builds 1,000 synthetic computers, each with realistic directory structures, documents, and artifacts, then runs long-horizon simulations on top of them. One agent plays the user and sets productivity goals; another executes the work. Each simulation runs over 8 hours of agent runtime and 2,000+ turns on average, roughly a month of human work compressed into one trace. Training on this experiential data drives significant improvements on both in-domain and out-of-domain productivity evaluations.</p><ul><li><p><strong>Realistic synthetic environments:</strong> Each of the 1,000 computers ships with directory structures, documents, and artifacts that approximate a real user&#8217;s working environment. 
The realism is what makes the trajectories useful as training data instead of as evaluation curiosities.</p></li><li><p><strong>Two-agent simulation loop:</strong> A user agent sets productivity goals while a worker agent executes against them. The structure produces multi-turn, goal-directed traces that look like real productivity work, not the short scripted tasks that dominate existing benchmarks.</p></li><li><p><strong>Designed to scale to billions of worlds:</strong> The framework is explicitly designed to scale to millions or billions of synthetic user worlds, which matches the scale at which frontier computer-use agents will need experiential data. The bottleneck on long-horizon training is data, and this is a credible recipe for producing it.</p></li><li><p><strong>Why it matters:</strong> The bottleneck on computer-use agents has stopped being model capability and become realistic long-horizon training data. Synthetic-environment scaling is one of the few paths that does not depend on collecting massive amounts of real user telemetry, which makes it a practical default for teams building computer-use stacks.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.28181">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2050263752147456238">Tweet</a></strong></p><div><hr></div><h2><strong>8. Contextual Agentic Memory is a Memo</strong></h2><p>Most agent memory today is not memory, it is closer to a memo. Vector stores, RAG buffers, and scratchpads implement lookup, not consolidation. The paper draws on neuroscience&#8217;s Complementary Learning Systems theory: biological intelligence pairs fast hippocampal storage with slow neocortical consolidation, and current AI agents only implement the first half (fast write, similarity recall, no abstraction step). 
The authors prove a generalization ceiling on compositionally novel tasks: as long as memory stays retrieval-only, the agent cannot apply abstract rules to inputs that do not already look like something in the store, and it remains permanently exposed to memory poisoning. If you are building long-running agents and treating memory as a vector index, this paper is a clean diagnosis of what you are missing.</p><p><strong><a href="https://arxiv.org/abs/2604.27707">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2050694339165335754">Tweet</a></strong></p><div><hr></div><h2><strong>9. Agentic-imodels</strong></h2><p>The entire interpretability literature is built around human readers. As more analysis gets delegated to agents, the right target of interpretability shifts. Microsoft Research introduces Agentic-imodels, an autoresearch loop where a coding agent (Claude Code, Codex) iteratively evolves scikit-learn-compatible regressors that are simultaneously accurate AND readable by other LLMs. Interpretability is measured by whether a small LLM can simulate the fitted model&#8217;s behavior just by reading its string representation: predictions, feature effects, and counterfactuals are inferred from the <strong>__str__</strong> output alone. Across 65 tabular datasets, the discovered models push the Pareto frontier past every classical interpretable baseline (decision trees, GAMs, sparse linear), and improve four downstream agentic data-science systems on the BLADE benchmark by 8% to 73%.</p><p><strong><a href="https://arxiv.org/abs/2605.03808">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2052125514266190286">Tweet</a></strong></p><div><hr></div><h2><strong>10. Skills as Verifiable Artifacts</strong></h2><p>If you ship agent skills, your runtime is treating signed-and-cleared skills as trusted by default. 
This paper argues a skill is untrusted code until it is verified, and the runtime should enforce that default rather than infer trust from origin. Without skill verification, human-in-the-loop (HITL) review has to fire on every irreversible call, which degrades into rubber-stamping at any non-trivial scale. With verification as a separate gated process, HITL fires only for what is unverified. Skills are now first-class deployment artifacts, and we have decades of supply-chain lessons on what happens when trust is inferred from a signature. This is the right ask for SKILL.md before agent skill libraries become the next attack surface.</p><p><strong><a href="https://arxiv.org/abs/2605.00424">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2051772437520622035">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Meta FAIR Autodata, ZAYA1-8B, SubQ 12M Context, Natural Language Autoencoders, Claude Managed Agents Dreaming, and More]]></title><description><![CDATA[Meta FAIR Autodata, ZAYA1-8B, SubQ 12M Context, Natural Language Autoencoders, Claude Managed Agents Dreaming, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-meta-fair-autodata</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-meta-fair-autodata</guid><pubDate>Sat, 09 May 2026 15:01:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Q3T3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Meta FAIR introduces Autodata</p></li><li><p>Zyphra releases ZAYA1-8B</p></li><li><p>SubQ ships a 12M-token frontier model</p></li><li><p>Anthropic introduces Natural Language Autoencoders</p></li><li><p>Claude Managed Agents adds dreaming and multi-agent</p></li><li><p>Printing Press: an agent CLI 
factory</p></li><li><p>Flue agent harness framework launches</p></li><li><p>Anthropic adds keyless auth</p></li><li><p>AlphaEvolve marks one year of impact</p></li><li><p>Goodfire opens a neural geometry series</p></li><li><p>Firefox hardened with Claude Mythos</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Autodata: An Agentic Data Scientist From Meta FAIR</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q3T3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q3T3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png 424w, https://substackcdn.com/image/fetch/$s_!Q3T3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png 848w, https://substackcdn.com/image/fetch/$s_!Q3T3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png 1272w, https://substackcdn.com/image/fetch/$s_!Q3T3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Q3T3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png" width="1456" height="713" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:713,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Autodata&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Autodata" title="Autodata" srcset="https://substackcdn.com/image/fetch/$s_!Q3T3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png 424w, https://substackcdn.com/image/fetch/$s_!Q3T3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png 848w, https://substackcdn.com/image/fetch/$s_!Q3T3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png 1272w, https://substackcdn.com/image/fetch/$s_!Q3T3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Meta FAIR (Jason Weston et al.) introduced Autodata, an agentic data scientist that builds high-quality training and evaluation data autonomously. 
The framing is that inference compute can be converted into model quality if the data pipeline itself is an agent.</p><ul><li><p><strong>Agentic Self-Instruct loop:</strong> A planner-executor agent generates, critiques, and refines training and eval examples in a closed loop, replacing static seed sets with a process that keeps producing harder data as the model improves.</p></li><li><p><strong>34-point weak-to-strong gap:</strong> On a CS research QA task, Autodata data opens a 34-point accuracy gap between weak and strong models, a much larger separation than off-the-shelf instruction sets achieve.</p></li><li><p><strong>Inference compute as a quality lever:</strong> The work reframes synthetic data as the place where inference budget pays off, an angle that lines up with Microsoft&#8217;s FaraGen and the broader synthetic-environments thread.</p></li><li><p><strong>Why it matters:</strong> Pairs naturally with self-improving agent runtimes (Claude Managed Agents Outcomes loop, ACE, AHE), giving teams a credible recipe for the data half of the self-improvement story.</p></li></ul><p><strong><a href="https://facebookresearch.github.io/RAM/blogs/autodata/">Blog</a></strong></p>
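<p>The generate-critique-refine loop described above can be sketched in miniature. Everything below is hypothetical scaffolding: names like <code>generate</code>, <code>critique</code>, and <code>refine_pool</code> are illustrative stand-ins with stubbed LLM calls, not Meta's actual Autodata interfaces.</p>

```python
"""Minimal sketch of an agentic Self-Instruct-style data loop, in the
spirit of Autodata's planner-executor design. All names and heuristics
here are hypothetical; the real Autodata interfaces are not public."""

import random
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    answer: str
    difficulty: int  # planner's target difficulty, 1 (easy) .. 5 (hard)


def generate(seed: Example, target_difficulty: int) -> Example:
    # Stand-in for an LLM call that mutates a seed into a harder variant.
    return Example(
        question=f"{seed.question} (variant, level {target_difficulty})",
        answer=seed.answer,
        difficulty=target_difficulty,
    )


def critique(ex: Example, model_accuracy: float) -> bool:
    # Stand-in for an LLM judge: keep an example only if the current
    # model would plausibly get it wrong, so the pool keeps getting
    # harder as the model improves.
    p_solve = model_accuracy / ex.difficulty
    return random.random() > p_solve


def refine_pool(seeds, rounds: int, model_accuracy: float):
    # Closed loop: each round proposes harder variants of everything in
    # the pool and keeps only those the critic judges informative.
    pool = list(seeds)
    for _ in range(rounds):
        new = []
        for seed in pool:
            harder = generate(seed, min(seed.difficulty + 1, 5))
            if critique(harder, model_accuracy):
                new.append(harder)
        pool.extend(new)
    return pool


if __name__ == "__main__":
    random.seed(0)
    seeds = [Example("What is a transformer?", "A sequence model.", 1)]
    pool = refine_pool(seeds, rounds=3, model_accuracy=0.9)
    print(len(pool), max(e.difficulty for e in pool))
```

<p>The point of the sketch is the control flow, not the heuristics: replacing the two stubs with real model calls turns a static seed set into a process that spends inference compute to keep producing harder data.</p>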
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-meta-fair-autodata">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (April 26 - May 3)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-b95</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-b95</guid><pubDate>Sun, 03 May 2026 15:02:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!nQGv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. Agentic Harness Engineering</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nQGv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nQGv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png 424w, https://substackcdn.com/image/fetch/$s_!nQGv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png 848w, https://substackcdn.com/image/fetch/$s_!nQGv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nQGv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nQGv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png" width="947" height="458" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:458,&quot;width&quot;:947,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Agentic Harness Engineering&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Agentic Harness Engineering" title="Agentic Harness Engineering" srcset="https://substackcdn.com/image/fetch/$s_!nQGv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png 424w, https://substackcdn.com/image/fetch/$s_!nQGv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png 848w, https://substackcdn.com/image/fetch/$s_!nQGv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nQGv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Most coding-agent harnesses are still tuned by hand or kept alive through brittle trial-and-error self-evolution. This paper introduces Agentic Harness Engineering (AHE), a framework that makes harness evolution observable and falsifiable. 
AHE separates the system into three layers: components stored as revertible files, experience condensed from millions of trajectory tokens into structured evidence, and decisions written as predictions that get checked against task outcomes. Every edit becomes a contract you can verify or revert.</p><ul><li><p><strong>Three-layer evolution model:</strong> Components, experience, and decisions are each first-class artifacts. Components are versioned files, experience is compressed evidence pulled from full trajectory logs, and decisions are explicit hypotheses with expected outcomes. The structure turns black-box harness tuning into an auditable engineering loop.</p></li><li><p><strong>Pass@1 gains on Terminal-Bench 2:</strong> Pass@1 climbs from 69.7% to 77.0% across ten iterations, beating both human-designed Codex-CLI (71.9%) and self-evolving baselines like ACE and TF-GRPO. The framework also uses 12% fewer tokens than the seed harness on SWE-bench-verified.</p></li><li><p><strong>Cross-model transfer:</strong> The evolved harness transfers across model families with +5.1 to +10.1 point gains, suggesting the optimizations are structural rather than overfit to a specific backbone. That is the property you actually want from harness engineering.</p></li><li><p><strong>Why it matters:</strong> Harness work is the largest hidden cost in most agent systems. 
AHE is the first credible recipe for letting the harness improve itself without drifting into noise, which makes it the most important agent-systems paper of the week.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.25850">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2049492169887748365">Tweet</a></strong></p><div><hr></div><h2><em><strong>Message from our Sponsor</strong></em></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!21F7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!21F7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png 424w, https://substackcdn.com/image/fetch/$s_!21F7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png 848w, https://substackcdn.com/image/fetch/$s_!21F7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png 1272w, https://substackcdn.com/image/fetch/$s_!21F7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!21F7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png" 
width="790" height="298" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:298,&quot;width&quot;:790,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Kurate Leaderboard&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kurate Leaderboard" title="Kurate Leaderboard" srcset="https://substackcdn.com/image/fetch/$s_!21F7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png 424w, https://substackcdn.com/image/fetch/$s_!21F7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png 848w, https://substackcdn.com/image/fetch/$s_!21F7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png 1272w, https://substackcdn.com/image/fetch/$s_!21F7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><a href="https://kurate.org/?utm_source=dair_ai&amp;utm_medium=newsletter&amp;utm_campaign=dair_ai_ad">Kurate.org</a> - Arena for scientific papers. Every day, hundreds of arXiv preprints are ranked by scientific impact through pairwise tournaments judged by Claude, GPT and Gemini models. See the top ranked papers in AI, ML, Robotics, Quantum Physics, and more for free.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://kurate.org/?utm_source=dair_ai&amp;utm_medium=newsletter&amp;utm_campaign=dair_ai_ad&quot;,&quot;text&quot;:&quot;Explore The Leaderboards&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://kurate.org/?utm_source=dair_ai&amp;utm_medium=newsletter&amp;utm_campaign=dair_ai_ad"><span>Explore The Leaderboards</span></a></p><div><hr></div><h2><strong>2. 
AgenticQwen-30B-A3B</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-xpM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-xpM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png 424w, https://substackcdn.com/image/fetch/$s_!-xpM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png 848w, https://substackcdn.com/image/fetch/$s_!-xpM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png 1272w, https://substackcdn.com/image/fetch/$s_!-xpM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-xpM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png" width="781" height="443" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:443,&quot;width&quot;:781,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AgenticQwen-30B-A3B&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AgenticQwen-30B-A3B" title="AgenticQwen-30B-A3B" srcset="https://substackcdn.com/image/fetch/$s_!-xpM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png 424w, https://substackcdn.com/image/fetch/$s_!-xpM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png 848w, https://substackcdn.com/image/fetch/$s_!-xpM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png 1272w, https://substackcdn.com/image/fetch/$s_!-xpM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Alibaba shows that a 30B MoE model with only 3B active parameters can match Qwen3-235B on real tool-use workloads. AgenticQwen-30B-A3B scores 50.2 average on TAU-2 plus BFCL-V4 Multi-Turn, while AgenticQwen-8B scores 47.4. Both more than double their vanilla Qwen baselines and close most of the gap to a 235B model. The recipe is built around two reinforcement learning flywheels that run in parallel, with simulated users actively trying to mislead the agent.</p><ul><li><p><strong>Reasoning flywheel from self-failure:</strong> The first loop mines the model&#8217;s own errors and converts them into harder reasoning problems each round. The training distribution gets harder on its own as the model improves, removing the need for new human-curated reasoning data.</p></li><li><p><strong>Agentic flywheel for tool use:</strong> The second loop grows simple linear tool-use trajectories into multi-branch behavior trees. 
Simulated users test recovery from misleading instructions, ambiguous goals, and failed tool calls, which is where vanilla supervised tuning typically breaks.</p></li><li><p><strong>Real efficiency for production agents:</strong> A 30B MoE with 3B active parameters at inference is significantly cheaper to serve than a 235B dense or MoE alternative. For tool-use workloads where frontier reasoning is overkill, this changes the cost profile of shipping production agents.</p></li><li><p><strong>A reusable recipe:</strong> The flywheel approach generalizes beyond Qwen. Teams can generate hard examples from their own agent&#8217;s failures rather than relying on static synthetic data, which is the more scalable path for domain-specific agents.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.21590">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2048504655932760565">Tweet</a></strong></p><div><hr></div><h2><strong>3. Agentic World Modeling</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!t8HP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!t8HP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png 424w, https://substackcdn.com/image/fetch/$s_!t8HP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png 848w, 
https://substackcdn.com/image/fetch/$s_!t8HP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png 1272w, https://substackcdn.com/image/fetch/$s_!t8HP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!t8HP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png" width="1080" height="313" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:313,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Agentic World Modeling&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Agentic World Modeling" title="Agentic World Modeling" srcset="https://substackcdn.com/image/fetch/$s_!t8HP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png 424w, https://substackcdn.com/image/fetch/$s_!t8HP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png 848w, 
https://substackcdn.com/image/fetch/$s_!t8HP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png 1272w, https://substackcdn.com/image/fetch/$s_!t8HP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A massive 40-author survey lands the cleanest taxonomy of world models in agent research released so far. 
The paper proposes a &#8220;levels by laws&#8221; framework spanning three capability levels and four law regimes, then synthesizes 400+ works and 100+ representative systems across model-based RL, video generation, web and GUI agents, multi-agent simulation, and scientific discovery. As agents shift from chatbots to goal-accomplishers, the bottleneck moves from language to environment, and this is the first paper that gives builders a shared vocabulary across communities that have been working in isolation.</p><ul><li><p><strong>Three capability levels:</strong> L1 Predictors handle one-step transitions, L2 Simulators do multi-step action-conditioned rollouts, and L3 Evolvers self-revise as the world changes. The hierarchy makes it easy to place existing systems and identify where capability gaps actually live.</p></li><li><p><strong>Four law regimes:</strong> Physical, digital, social, and scientific laws each impose different constraints on what a world model needs to capture. The framework treats them as orthogonal axes, which clarifies why a strong physics simulator can still fail at social or digital tasks.</p></li><li><p><strong>Failure-mode catalog:</strong> The survey extracts recurring failure patterns across 100+ systems, including misaligned reward shaping, drift under non-stationarity, and brittle transfer across regimes. Each failure mode is mapped to a level and law combination, so the diagnosis is grounded.</p></li><li><p><strong>Evaluation principles per level:</strong> The authors propose evaluation criteria specific to each capability level rather than a single benchmark. This is the right move because L1 prediction accuracy and L3 self-revision quality are not measurable on the same axis.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.22748">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2048783073547079816">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
RecursiveMAS</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aBcQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aBcQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png 424w, https://substackcdn.com/image/fetch/$s_!aBcQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png 848w, https://substackcdn.com/image/fetch/$s_!aBcQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png 1272w, https://substackcdn.com/image/fetch/$s_!aBcQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aBcQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png" width="997" height="634" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:634,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;RecursiveMAS&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="RecursiveMAS" title="RecursiveMAS" srcset="https://substackcdn.com/image/fetch/$s_!aBcQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png 424w, https://substackcdn.com/image/fetch/$s_!aBcQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png 848w, https://substackcdn.com/image/fetch/$s_!aBcQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png 1272w, https://substackcdn.com/image/fetch/$s_!aBcQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Multi-agent systems usually pass full text messages between agents at every step, which causes token bloat, latency, and context dilution that all grow with team size. RecursiveMAS asks a different question: what if agents collaborated through recursive computation in a shared latent space instead of through text? The system treats a multi-agent team as a recursive computation where each agent acts like an RLM layer, iteratively passing latent representations to the next and forming a looped interaction process. Less talking, more thinking.</p><ul><li><p><strong>RecursiveLink for latent communication:</strong> A RecursiveLink module generates latent thoughts and transfers state directly between heterogeneous agents, replacing natural-language messages with internal representations. 
The change removes the cost of re-encoding and re-parsing text on every coordination step.</p></li><li><p><strong>Inner-outer loop learning:</strong> The training algorithm uses an inner loop for per-step latent updates and an outer loop for team-level credit assignment, with shared gradient-based updates across agents. This makes joint optimization tractable instead of relying on hand-tuned communication protocols.</p></li><li><p><strong>Strong gains across 9 benchmarks:</strong> Across math, science, medicine, search, and code generation, RecursiveMAS delivers 8.3% average accuracy gain over baselines, 1.2x to 2.4x end-to-end inference speedup, and 34.6% to 75.6% reduction in token usage. The efficiency story is at least as important as the accuracy story.</p></li><li><p><strong>A path past the agent communication tax:</strong> If agent-to-agent communication is the next real bottleneck, latent-space recursion is one of the cleaner ways to scale collaboration. Teams running multi-agent systems at scale should treat this as a serious design alternative, not a research curiosity.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.25917">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2050261229315477988">Tweet</a></strong></p><div><hr></div><h2><strong>5. 
OneManCompany</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tMFx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tMFx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png 424w, https://substackcdn.com/image/fetch/$s_!tMFx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png 848w, https://substackcdn.com/image/fetch/$s_!tMFx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png 1272w, https://substackcdn.com/image/fetch/$s_!tMFx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tMFx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png" width="793" height="277" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:277,&quot;width&quot;:793,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;OneManCompany&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="OneManCompany" title="OneManCompany" srcset="https://substackcdn.com/image/fetch/$s_!tMFx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png 424w, https://substackcdn.com/image/fetch/$s_!tMFx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png 848w, https://substackcdn.com/image/fetch/$s_!tMFx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png 1272w, https://substackcdn.com/image/fetch/$s_!tMFx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If you are building multi-agent systems, you are probably wiring static org charts. This paper argues they should look more like a labor market. OneManCompany (OMC) replaces fixed teams with &#8220;Talents,&#8221; portable agent identities that bundle skills and tools, and a &#8220;Talent Market&#8221; where agents get recruited dynamically per task. An Explore-Execute-Review tree search decomposes work hierarchically and aggregates results back up. On PRDBench, OMC reaches 84.67% success, +15.5 points over prior SOTA, and the framework generalizes across the case studies the authors run.</p><ul><li><p><strong>Talents as portable identities:</strong> A Talent bundles a skill set, tool access, and behavioral priors into a reusable agent identity. 
Talents can be hired into any task without rewiring the orchestration graph, which removes most of the brittleness in pre-wired multi-agent pipelines.</p></li><li><p><strong>Dynamic recruitment via Talent Market:</strong> Tasks post requirements, and the market matches Talents to roles based on capability fit and current load. This replaces the standard &#8220;design a team for every workflow&#8221; pattern with on-demand assembly that adapts as the task population shifts.</p></li><li><p><strong>Explore-Execute-Review tree search:</strong> Work is decomposed top-down into subtasks, executed in parallel by recruited Talents, then reviewed and aggregated up the tree. The structure naturally supports retries, branching, and cross-checking without manual coordination logic.</p></li><li><p><strong>Why it matters:</strong> Pre-wired multi-agent pipelines break the moment tasks drift outside their design envelope. Treating agents as a recruitable workforce gets you self-organization and continuous improvement by default, which is what open-ended agent systems need.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.22446">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2048909068409147460">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
From Skill Text to Skill Structure</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EyH9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EyH9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png 424w, https://substackcdn.com/image/fetch/$s_!EyH9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png 848w, https://substackcdn.com/image/fetch/$s_!EyH9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png 1272w, https://substackcdn.com/image/fetch/$s_!EyH9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EyH9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png" width="997" height="399" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:399,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;SSL&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="SSL" title="SSL" srcset="https://substackcdn.com/image/fetch/$s_!EyH9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png 424w, https://substackcdn.com/image/fetch/$s_!EyH9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png 848w, https://substackcdn.com/image/fetch/$s_!EyH9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png 1272w, https://substackcdn.com/image/fetch/$s_!EyH9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>SKILL.md files entangle invocation interface, execution flow, and tool side effects in a single blob of natural language. That makes downstream discovery and risk review brittle as skill registries scale. This paper proposes SSL, a three-layer typed JSON representation drawn from Schank and Abelson&#8217;s classical work on scripts, MOPs, and conceptual dependency. An LLM-based normalizer converts existing SKILL.md files into the structure, so adoption does not require rewriting registries by hand.</p><ul><li><p><strong>Three layers, cleanly separated:</strong> A Scheduling layer captures invocation signals and trigger conditions, a Structural layer encodes execution scenes and ordering, and a Logical layer specifies atomic actions plus resource and side-effect annotations. 
The separation lets discovery, risk, and execution each reason about the layer they care about.</p></li><li><p><strong>Skill Discovery MRR jumps from 0.573 to 0.707:</strong> Treating skills as typed structure rather than prose makes retrieval significantly more accurate, even before any model fine-tuning. The gain comes from the structure exposing what skills actually do, not just how they describe themselves.</p></li><li><p><strong>Risk Assessment macro F1 of 0.787:</strong> The Logical layer&#8217;s resource annotations enable a jump from 0.744 to 0.787 in risk classification. Auditors can now reason about side effects directly instead of inferring them from free-form prose.</p></li><li><p><strong>A 6,184-skill corpus released:</strong> The authors ship a normalized corpus of 6,184 skills, 403 task queries, and 500 risk-labeled skills. As skill registries cross a million entries, structured representations are the only path that keeps discovery and review tractable.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.24026">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2049252335105491147">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
Latent Agents</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GOq5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GOq5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png 424w, https://substackcdn.com/image/fetch/$s_!GOq5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png 848w, https://substackcdn.com/image/fetch/$s_!GOq5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png 1272w, https://substackcdn.com/image/fetch/$s_!GOq5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GOq5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png" width="997" height="298" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:298,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Latent 
Agents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Latent Agents" title="Latent Agents" srcset="https://substackcdn.com/image/fetch/$s_!GOq5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png 424w, https://substackcdn.com/image/fetch/$s_!GOq5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png 848w, https://substackcdn.com/image/fetch/$s_!GOq5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png 1272w, https://substackcdn.com/image/fetch/$s_!GOq5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 
11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Multi-agent debate makes models reason better. It also burns tokens generating long transcripts before any answer comes out. Latent Agents distills the entire debate into a single LLM through a two-stage fine-tuning pipeline: the model first learns debate structure, then internalizes it through dynamic reward scheduling and length clipping. The internalized model matches or beats explicit multi-agent debate while using up to 93% fewer tokens, which makes debate-quality reasoning practical at production scale.</p><ul><li><p><strong>Two-stage internalization pipeline:</strong> Stage one teaches the structure of debate (turn taking, critique, revision) through supervised fine-tuning on transcript data. Stage two uses dynamic reward scheduling and length clipping to compress that structure into single-pass reasoning without losing the gains from the multi-agent setup.</p></li><li><p><strong>Up to 93% token savings:</strong> The internalized model matches or beats explicit debate accuracy while drastically reducing inference cost. For teams running reasoning workloads at scale, this is the kind of efficiency win that turns a research idea into a deployment default.</p></li><li><p><strong>Activation steering reveals agent subspaces:</strong> The &#8220;agents&#8221; survive distillation as identifiable circuits in activation space. 
Probing finds interpretable directions corresponding to different agent perspectives, which means the internal structure persists even when the external transcript is gone.</p></li><li><p><strong>A safety angle worth noting:</strong> When malicious agents are deliberately embedded via distillation, negative steering suppresses them more cleanly than steering a base model would, with smaller hits to general performance. Internalized debate may turn out to be a useful interpretability and alignment substrate, not just a token-saver.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.24881">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2049493173639278818">Tweet</a></strong></p><div><hr></div><h2><strong>8. OCR-Memory</strong></h2><p>Most agent memory systems compress trajectories into text summaries and hope the model remembers what matters, which is exactly where the information loss hides. OCR-Memory renders the agent&#8217;s interaction history as images with indexed visual anchors, then retrieves via a locate-and-transcribe pipeline: the model scans visual memory, predicts the index of the relevant region, and the original text is fetched verbatim from a database. Older trajectories are stored as low-resolution thumbnails with active-recall up-sampling, and the method reaches SOTA on Mind2Web and AppWorld under strict context limits.</p><p><strong><a href="https://arxiv.org/abs/2604.26622">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2049957482811056307">Tweet</a></strong></p><div><hr></div><h2><strong>9. When to Retrieve During Reasoning</strong></h2><p>Most RAG systems retrieve once, before the model starts reasoning. Large reasoning models like o1 and R1 do not work that way. They generate 12k to 25k token chains of thought and hit knowledge gaps mid-inference, long after the retrieval window closed. 
ReaLM-Retrieve is a reasoning-aware retrieval framework that injects evidence during multi-step inference, detects uncertainty at reasoning-step granularity, and learns a policy for when external evidence actually helps. It achieves +10.1% absolute F1 over standard RAG across MuSiQue, HotpotQA, and 2WikiMultiHopQA, with 47% fewer retrieval calls than fixed-interval IRCoT, and hits 71.2% F1 on 2-4 hop MuSiQue with only 1.8 retrieval calls per question.</p><p><strong><a href="https://arxiv.org/abs/2604.26649">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2049954716298494386">Tweet</a></strong></p><div><hr></div><h2><strong>10. Co-evolving Decisions and Skills</strong></h2><p>Long-horizon agents fail in two ways: the decision-maker cannot decompose well, or the skill library goes stale. This paper introduces a co-evolution framework where an LLM decision agent and a dynamic skill bank improve each other through iterative refinement. The decision agent picks and chains skills, performance feedback updates both the policy and the skills, and new skills emerge by generalizing successful sequences instead of being hand-coded upfront. Most long-horizon agent stacks treat skills and decision-making as separate optimization problems, which is why they plateau. 
Co-evolution gives you adaptive planning and a growing library of reusable behaviors from a single loop, which is what you actually want when task structure is not predetermined: robotics, game agents, and complex planning.</p><p><strong><a href="https://arxiv.org/abs/2604.20987">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2048440985726955998">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Codex for Everyday Work, Cursor SDK, Mistral Workflows, LLM Knowledge Bases, Agentic Harness Engineering, and More]]></title><description><![CDATA[Codex for Everyday Work, Cursor SDK, Mistral Workflows, LLM Knowledge Bases, Agentic Harness Engineering, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-codex-for-everyday</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-codex-for-everyday</guid><pubDate>Sat, 02 May 2026 15:01:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GnKJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>OpenAI ships Codex for everyday work</p></li><li><p>Cursor releases the Cursor SDK</p></li><li><p>Mistral launches Workflows orchestration</p></li><li><p>DAIR.AI guide to building LLM knowledge bases</p></li><li><p>Agentic Harness Engineering paper drops</p></li><li><p>Cursor 3.2 multitask lands</p></li><li><p>Claude Code adds push notifications</p></li><li><p>Qwen open-sources Qwen-Scope SAEs</p></li><li><p>AISI evaluates GPT-5.5 cyber capabilities</p></li><li><p>AgenticQwen-30B-A3B closes tool-use gap</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Codex for Everyday Work</strong></h3><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GnKJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GnKJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GnKJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg 848w, https://substackcdn.com/image/fetch/$s_!GnKJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GnKJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GnKJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg" width="1200" height="772" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:772,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Codex for Everyday 
Work&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Codex for Everyday Work" title="Codex for Everyday Work" srcset="https://substackcdn.com/image/fetch/$s_!GnKJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GnKJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg 848w, https://substackcdn.com/image/fetch/$s_!GnKJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GnKJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 
12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>OpenAI extended Codex from a coding agent into a general-purpose work agent. Users now pick a role (finance, data science, marketing, ops, research), connect the apps they actually use, and get suggested prompts that wire Codex into docs, slides, sheets, research, and planning across ChatGPT.</p><ul><li><p><strong>Role-based onboarding:</strong> Codex ships preset roles for non-engineering teams, with per-role prompt suggestions and connector recommendations so a marketing or finance user can run a useful agent on day one without designing their own harness.</p></li><li><p><strong>Sheets, slides, and docs:</strong> The update adds materially better spreadsheet and slide generation plus cleaner doc workflows, pushing Codex into the same surface as enterprise copilots like Workspace and Microsoft 365 agents.</p></li><li><p><strong>20% faster computer use:</strong> Codex&#8217;s computer-use agent runs 20% faster on the same tasks, narrowing the latency gap that has held browser and desktop automation back from being a daily-driver capability.</p></li><li><p><strong>Same agent everywhere:</strong> OpenAI is positioning a single Codex runtime across coding, research, and operations, so a Pro or Business user gets one agent that scales from &#8220;fix this PR&#8221; to &#8220;build a Q2 finance 
review.&#8221;</p></li></ul><p><strong><a href="https://chatgpt.com/codex/for-work/">Codex for Work</a></strong> | <strong><a href="https://x.com/OpenAI/status/2049928776147230886">Announcement</a></strong></p>
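<p>The role-based onboarding described above, where a preset role bundles suggested prompts and connector recommendations, can be sketched as a simple role-to-preset registry. Everything in this sketch (role names, prompt strings, connector identifiers, the <code>onboard</code> helper) is a hypothetical illustration of the idea, not OpenAI&#8217;s actual Codex API.</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RolePreset:
    """A preset bundle: suggested prompts plus recommended app connectors."""
    role: str
    suggested_prompts: tuple
    connectors: tuple

# Hypothetical presets mirroring the roles named in the announcement;
# the prompt and connector strings here are illustrative only.
PRESETS = {
    "finance": RolePreset(
        "finance",
        ("Build a Q2 finance review deck",),
        ("sheets", "docs", "slides"),
    ),
    "marketing": RolePreset(
        "marketing",
        ("Draft a launch announcement",),
        ("docs", "slides"),
    ),
}

def onboard(role: str) -> RolePreset:
    """Look up the preset a new user starts from, so a non-engineer
    gets working prompts and connectors on day one."""
    if role not in PRESETS:
        raise KeyError(f"no preset for role: {role}")
    return PRESETS[role]
```

<p>Under these assumptions, <code>onboard("finance")</code> hands back the finance bundle, which is the whole point of the design: the user picks a role and inherits a working harness instead of building one.</p>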
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-codex-for-everyday">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (April 19 - April 26)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-f2f</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-f2f</guid><pubDate>Sun, 26 Apr 2026 15:02:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!i-uk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png" length="0" type="image/png"/><content:encoded><![CDATA[<h2><strong>1. DeepSeek V4</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i-uk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i-uk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png 424w, https://substackcdn.com/image/fetch/$s_!i-uk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png 848w, https://substackcdn.com/image/fetch/$s_!i-uk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png 1272w, 
https://substackcdn.com/image/fetch/$s_!i-uk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i-uk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png" width="1456" height="767" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!i-uk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png 424w, https://substackcdn.com/image/fetch/$s_!i-uk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png 848w, https://substackcdn.com/image/fetch/$s_!i-uk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png 1272w, 
https://substackcdn.com/image/fetch/$s_!i-uk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>DeepSeek V4 is the first open model family built from the ground up around million-token contexts as a default rather than a bolt-on feature. The release includes DeepSeek-V4-Pro (1.6T total / 49B active) and DeepSeek-V4-Flash (284B total / 13B active), both trained natively at 1M context length. 
The tech report details a hybrid attention architecture, new training stability techniques, and a domain-specialist post-training pipeline that together push the open-source frontier much closer to GPT-5.2 and Gemini 3.0-Pro at a fraction of the cost.</p><ul><li><p><strong>Hybrid attention with CSA and HCA:</strong> DeepSeek V4 replaces a single attention stack with Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA compresses KV entries, then applies DeepSeek Sparse Attention with sliding-window KV for fine-grained local dependencies. HCA aggressively compresses KV for extreme-context layers, keeping the model feasible at 1M tokens.</p></li><li><p><strong>Training stability at trillion-parameter scale:</strong> The team introduces two techniques that materially cut loss spikes. Anticipatory Routing decouples backbone and router updates, using current weights for features but historical weights for routing indices. SwiGLU Clamping bounds the linear and gate components of SwiGLU to stabilize activations throughout pretraining.</p></li><li><p><strong>Domain-specialist post-training:</strong> Instead of one large mixed-RL stage, DeepSeek trains a separate specialist expert per domain. Each expert goes through supervised fine-tuning on domain data, then Group Relative Policy Optimization (GRPO) RL with a domain-specific reward model. The specialists are merged into the final model, recovering capability without destabilizing the generalist.</p></li><li><p><strong>Frontier-adjacent performance at open-source cost:</strong> DeepSeek-V4-Pro-Max beats GPT-5.2 and Gemini 3.0-Pro on standard reasoning benchmarks and lands just behind GPT-5.4 and Gemini 3.1-Pro, effectively trailing the closed frontier by roughly 3 to 6 months. 
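</p><p>Of the stability tricks above, SwiGLU Clamping is simple enough to sketch. This is an illustrative scalar version, not the report's implementation, and the bound value is our assumption:</p><pre><code class="language-python">

```python
import math

def swiglu_clamped(gate_pre, lin_pre, bound=7.0):
    # Clamped SwiGLU for one pre-activation pair: bounding both the gate and
    # the linear branch before they multiply caps activation growth, which is
    # the stabilizing idea. The bound of 7.0 is illustrative, not the paper's.
    g = max(-bound, min(bound, gate_pre))
    u = max(-bound, min(bound, lin_pre))
    silu = g / (1.0 + math.exp(-g))  # SiLU(g) = g * sigmoid(g)
    return silu * u
```

</code></pre><p>However extreme the incoming pre-activations, the product stays bounded by roughly the square of the clamp value, which is what removes one source of loss spikes.</p><p>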
For open-weights teams that need long-context reasoning without closed API pricing, this is the most important release of the week.</p></li></ul><p><strong><a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf">Paper</a></strong> | <strong><a href="https://x.com/deepseek_ai/status/2047516922263285776">Tweet</a></strong></p><div><hr></div><h2><strong>2. Autogenesis</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eSR7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eSR7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png 424w, https://substackcdn.com/image/fetch/$s_!eSR7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png 848w, https://substackcdn.com/image/fetch/$s_!eSR7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!eSR7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!eSR7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png" width="1456" height="670" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:670,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!eSR7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png 424w, https://substackcdn.com/image/fetch/$s_!eSR7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png 848w, https://substackcdn.com/image/fetch/$s_!eSR7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!eSR7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Static agents age quickly. As deployment environments change and new tools arrive, the agents that survive will be the ones that can safely rewrite themselves. This paper introduces Autogenesis, a self-evolving agent protocol where agents identify their own capability gaps, generate candidate improvements, validate them through testing, and integrate what works back into their own operational framework. 
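</p><p>That cycle of assessment, proposal, validation, and integration can be sketched as a toy loop; this is illustrative only, not the protocol's actual API:</p><pre><code class="language-python">

```python
def evolve(variable, generate, evaluate, max_rounds=3):
    """Toy Generate-Evaluate-Commit loop over one evolvable variable
    (e.g. a prompt). Committed versions form an auditable lineage,
    so any step can be inspected or rolled back."""
    lineage = [variable]                    # version history, oldest first
    best_score = evaluate(variable)
    for _ in range(max_rounds):
        candidate = generate(lineage[-1])   # propose an improvement
        score = evaluate(candidate)         # validate before integrating
        if score > best_score:              # commit only validated gains
            lineage.append(candidate)
            best_score = score
    return lineage[-1], lineage
```

</code></pre><p>Rolling back is then just truncating the lineage and resuming from an earlier committed version.</p><p>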
No retraining and no human patching, just an ongoing loop of assessment, proposal, validation, and integration.</p><ul><li><p><strong>Two-layer protocol design:</strong> Autogenesis separates a Resource Substrate Protocol Layer (RSPL) that standardizes access to prompts, tools, environments, and memory from a Self-Evolution Protocol Layer (SEPL) that runs a Generate, Reflect, Improve, Evaluate, Commit loop over evolvable variables. The split keeps core capability registration stable while evolution happens on top.</p></li><li><p><strong>Auditable lineage and rollback:</strong> Improvements are committed with version lineage, state access control, and reversible lifecycle operations. The protocol treats every self-modification as a first-class artifact that can be inspected, reproduced, or rolled back, which is what makes self-improvement safe enough to deploy.</p></li><li><p><strong>Multi-agent applications:</strong> Autogenesis is demonstrated on multi-agent systems with planner, executor, and analyst roles. Agents evolve their own prompts, tool wrappers, and coordination routines using the shared protocol, showing that the abstraction is general enough to hold across roles rather than being tied to a single agent type.</p></li><li><p><strong>Part of a broader self-improvement wave:</strong> The paper sits alongside Meta-Harness and the Darwin G&#246;del Machine as a concrete framework for operationalizing self-modification. Together they mark a shift from &#8220;agents that use tools&#8221; to &#8220;agents that edit their own tooling.&#8221;</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.15034">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2045241905227915498">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
Attention to Mamba</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b9J9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b9J9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png 424w, https://substackcdn.com/image/fetch/$s_!b9J9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png 848w, https://substackcdn.com/image/fetch/$s_!b9J9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png 1272w, https://substackcdn.com/image/fetch/$s_!b9J9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b9J9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png" width="1456" height="577" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:577,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!b9J9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png 424w, https://substackcdn.com/image/fetch/$s_!b9J9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png 848w, https://substackcdn.com/image/fetch/$s_!b9J9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png 1272w, https://substackcdn.com/image/fetch/$s_!b9J9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Apple proposes a two-stage recipe for cross-architecture distillation from Transformers into Mamba. Naive distillation collapses teacher performance because a Mamba student cannot directly imitate softmax attention. The fix is to distill the transformer into a linearized-attention student using a kernel adaptation first, then transfer that student into a pure Mamba with no attention blocks. On a 1B model trained on 10B tokens, the Mamba student hits 14.11 perplexity against a 13.86 Pythia-1B teacher, nearly matching quality at linear-time inference cost.</p><ul><li><p><strong>Stage 1, softmax to linear attention:</strong> The first stage replaces softmax attention with a Hedgehog-style linearized attention student, using a learnable kernel feature map that preserves the original attention scores while removing the softmax nonlinearity. 
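</p><p>The stage-1 target looks roughly like the following: a toy, non-causal sketch that assumes a softmax feature map; the paper's learned kernel and training details differ:</p><pre><code class="language-python">

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def feature_map(x, W):
    # Stand-in for the learnable kernel feature map: project, then softmax so
    # features are positive and attention weights stay valid. In the paper W
    # is trained to mimic the teacher's attention scores.
    proj = [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]
    return softmax(proj)

def linear_attention(Q, K, V, W):
    # O(n) in sequence length: summarize keys/values once, then each query
    # reads the summaries instead of attending over every key.
    phis_k = [feature_map(k, W) for k in K]
    f, dv = len(phis_k[0]), len(V[0])
    S = [[sum(pk[i] * v[j] for pk, v in zip(phis_k, V)) for j in range(dv)]
         for i in range(f)]                                 # phi(K)^T V
    z = [sum(pk[i] for pk in phis_k) for i in range(f)]     # phi(K)^T 1
    out = []
    for q in Q:
        pq = feature_map(q, W)
        denom = sum(a * b for a, b in zip(pq, z))
        out.append([sum(pq[i] * S[i][j] for i in range(f)) / denom
                    for j in range(dv)])
    return out
```

</code></pre><p>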
This gives a strictly linear-complexity intermediate that stays close to the teacher.</p></li><li><p><strong>Stage 2, linear attention to Mamba:</strong> The second stage transfers the linearized student into a HedgeMamba block, a hybrid SSM architecture that reuses the learned linear attention parameters and adds state-space components. The transition preserves quality because the two formulations are mathematically related, not just structurally similar.</p></li><li><p><strong>Quality at long context:</strong> On downstream benchmarks, the distilled Mamba reaches 74.1% of the teacher&#8217;s accuracy, with the recipe generalizing to 1B and 3B scales. The key practical win is retaining Transformer-level quality on the sequence mixing block while moving to linear time at inference.</p></li><li><p><strong>A cheaper path to SSM deployment:</strong> If trained Transformers can be reliably converted into state-space models without retraining from scratch, the entire open-weights ecosystem becomes cheaper to serve at long context. This is the kind of quiet infrastructure work that matters more than it looks.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.14191">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2045600012860801113">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
Skill-RAG</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aYyL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aYyL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png 424w, https://substackcdn.com/image/fetch/$s_!aYyL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png 848w, https://substackcdn.com/image/fetch/$s_!aYyL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png 1272w, https://substackcdn.com/image/fetch/$s_!aYyL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aYyL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png" width="793" height="308" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:308,&quot;width&quot;:793,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Skill-RAG&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Skill-RAG" title="Skill-RAG" srcset="https://substackcdn.com/image/fetch/$s_!aYyL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png 424w, https://substackcdn.com/image/fetch/$s_!aYyL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png 848w, https://substackcdn.com/image/fetch/$s_!aYyL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png 1272w, https://substackcdn.com/image/fetch/$s_!aYyL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Most RAG systems retrieve on every query, whether the model needs help or not. This is wasteful when the model already knows the answer and often too late when it does not. This paper introduces Skill-RAG, a failure-state-aware retrieval system that uses hidden-state probing to detect when an LLM is approaching a knowledge failure, then routes the query to a specialized retrieval strategy matched to the gap.</p><ul><li><p><strong>Hidden-state probing as a retrieval trigger:</strong> Skill-RAG trains a lightweight probe on the LLM&#8217;s hidden representations that predicts whether the model is about to fail the query. 
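</p><p>The trigger can be approximated with a logistic probe over the final hidden state; the helper names and threshold here are illustrative, not the paper's interface:</p><pre><code class="language-python">

```python
import math

def failure_probability(hidden_state, probe_weights, bias=0.0):
    # Logistic probe over the LLM's hidden state. In practice the weights are
    # trained on traces labeled with whether the model answered correctly;
    # here they are placeholders.
    logit = sum(h * w for h, w in zip(hidden_state, probe_weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def answer(hidden_state, probe_weights, generate, retrieve_then_generate,
           threshold=0.5):
    # Retrieval fires only when the probe predicts a likely knowledge
    # failure, so confident queries skip the search call entirely.
    if failure_probability(hidden_state, probe_weights) > threshold:
        return retrieve_then_generate()
    return generate()
```

</code></pre><p>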
Only queries that clear the probe&#8217;s failure threshold trigger retrieval, which cuts unnecessary search calls while preserving answers for the cases that actually need help.</p></li><li><p><strong>Skill-matched retrieval strategies:</strong> Different failure modes (factual recall, multi-hop reasoning, temporal knowledge) are routed to different retrieval &#8220;skills&#8221; rather than a single generic retriever. Each skill is treated as a standalone component the agent can select between, echoing the broader trend of turning RAG into a collection of composable primitives.</p></li><li><p><strong>Consistent gains across benchmarks:</strong> Evaluated on HotpotQA, Natural Questions, and TriviaQA, Skill-RAG improves over uniform RAG baselines on both efficiency and accuracy. The efficiency story matters as much as the accuracy: per-query retrieval cost drops significantly when the system skips retrieval for questions the model can already answer.</p></li><li><p><strong>A shift in how RAG is designed:</strong> The work reinforces the direction RAG is heading: from a single monolithic pipeline to a suite of retrieval skills an agent selects between. 
Knowing when to retrieve and what kind of retrieval to run is becoming the central design question.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.15771">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2046249336162632155">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PMVb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PMVb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!PMVb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!PMVb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!PMVb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PMVb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg" width="1456" height="834" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!PMVb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!PMVb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!PMVb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!PMVb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a>&#8221;. Learn how to leverage Claude Code features to vibe-code production-grade AI-powered apps.</p><p><strong><a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Enroll Now</a></strong></p><div><hr></div><h2><strong>5. 
Self-Generated World Knowledge</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cBI7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cBI7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png 424w, https://substackcdn.com/image/fetch/$s_!cBI7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png 848w, https://substackcdn.com/image/fetch/$s_!cBI7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png 1272w, https://substackcdn.com/image/fetch/$s_!cBI7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cBI7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png" width="997" height="431" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:431,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Self-Generated World Knowledge&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Self-Generated World Knowledge" title="Self-Generated World Knowledge" srcset="https://substackcdn.com/image/fetch/$s_!cBI7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png 424w, https://substackcdn.com/image/fetch/$s_!cBI7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png 848w, https://substackcdn.com/image/fetch/$s_!cBI7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png 1272w, https://substackcdn.com/image/fetch/$s_!cBI7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>How far are we from agents that can self-generate world knowledge? This paper proposes an outcome-based reward that measures how much an agent&#8217;s self-generated world knowledge actually improves its task success rate, then trains with that signal and removes the external guidance at inference. The result is a 14B model that surpasses Gemini-2.5-Flash on web navigation and gains +20% on WebVoyager and WebWalker benchmarks.</p><ul><li><p><strong>Outcome-based reward for knowledge:</strong> Rather than scoring knowledge against a human-labeled reference, the reward is whether the generated knowledge measurably improves task success when the agent uses it. 
This lets the system learn which internally generated facts are worth keeping without an external oracle.</p></li><li><p><strong>Multistage training pipeline:</strong> The method combines supervised fine-tuning on an instruction-and-trajectory dataset with reinforcement rejection sampling, where the best trajectories (ranked by the outcome reward) are used to update the policy. The training loop iterates between generation, reward scoring, and rejection sampling until the model internalizes effective knowledge-use behaviors.</p></li><li><p><strong>Knowledge-enhanced execution at inference:</strong> At inference the external environment feedback loop is removed. The agent self-generates world knowledge, uses it to plan, and executes, without any human or reward signal in the loop. This is what makes the method deployable, not just measurable.</p></li><li><p><strong>Environment design replaces labeling:</strong> If agents can reliably improve themselves by exploring the world rather than waiting for human-labeled rewards, the bottleneck for scaling agentic systems shifts from data curation to environment design. That matches the broader direction of the field and gives practitioners a concrete recipe to follow.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.18131">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2047061650189307953">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
Self-Evolving Logic Synthesis</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!idVk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!idVk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png 424w, https://substackcdn.com/image/fetch/$s_!idVk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png 848w, https://substackcdn.com/image/fetch/$s_!idVk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png 1272w, https://substackcdn.com/image/fetch/$s_!idVk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!idVk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png" width="897" height="347" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:347,&quot;width&quot;:897,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Self-Evolving Logic Synthesis&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Self-Evolving Logic Synthesis" title="Self-Evolving Logic Synthesis" srcset="https://substackcdn.com/image/fetch/$s_!idVk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png 424w, https://substackcdn.com/image/fetch/$s_!idVk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png 848w, https://substackcdn.com/image/fetch/$s_!idVk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png 1272w, https://substackcdn.com/image/fetch/$s_!idVk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>EDA tools like ABC have been hand-tuned by humans for decades. NVIDIA shows they can evolve themselves. This work introduces the first self-evolving logic synthesis framework, a multi-agent LLM system that autonomously refines the entire ABC codebase, generates and tests candidate optimization sequences against standard benchmark circuits, then merges improvements back into the base tool. No human engineer in the loop.</p><ul><li><p><strong>Multi-agent refinement of a real EDA toolchain:</strong> The framework assigns specialized agents to exploration, synthesis, and self-review tasks. 
Agents read and modify the ABC source directly, propose optimization flows, and run them against benchmark suites such as EPFL, IWLS, and VTR, with human domain knowledge injected through a three-pass pipeline.</p></li><li><p><strong>Measured improvement over hand-tuned baselines:</strong> The evolved ABC variants produce better area, delay, and switching metrics than the hand-tuned reference on the benchmark suite, and the improvements persist under sensitivity analysis. This is a real gain on a tool the semiconductor industry depends on.</p></li><li><p><strong>Codebase-level evolution, not just prompt tuning:</strong> The agents edit the ABC codebase itself, not just a configuration layer. That is a meaningful extension of the self-improving agent thread: the unit of improvement is real production code, not a prompt or policy.</p></li><li><p><strong>Generalizable blueprint for domain tools:</strong> If agents can evolve a foundational semiconductor tool without manual engineering, the same pattern should generalize to any large, domain-specific codebase, making this less a one-off demo than a blueprint for the infrastructure that shipping chips depend on.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.15082">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2046251813738025025">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
Stateless Decision Memory</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h_Lt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h_Lt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png 424w, https://substackcdn.com/image/fetch/$s_!h_Lt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png 848w, https://substackcdn.com/image/fetch/$s_!h_Lt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png 1272w, https://substackcdn.com/image/fetch/$s_!h_Lt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h_Lt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png" width="1456" height="524" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ebbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:524,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Stateless Decision Memory&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Stateless Decision Memory" title="Stateless Decision Memory" srcset="https://substackcdn.com/image/fetch/$s_!h_Lt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png 424w, https://substackcdn.com/image/fetch/$s_!h_Lt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png 848w, https://substackcdn.com/image/fetch/$s_!h_Lt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png 1272w, https://substackcdn.com/image/fetch/$s_!h_Lt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Most interesting AI agent papers right now are about capability. This one is about plumbing, and it is probably more important than it looks. Stateful agents do not scale horizontally. The moment you need thousands of concurrent agent instances running across containers, persistent per-agent state becomes the bottleneck. This paper proposes replacing active memory with immutable decision logs using event-sourcing principles from distributed systems.</p><ul><li><p><strong>Decision logs instead of live state:</strong> Every agent decision, tool call, and observation is appended to an immutable event log. 
Any instance can reconstruct context by replaying the log on demand, which decouples decision logic from storage and lets agents spin up anywhere with no warmup.</p></li><li><p><strong>Enterprise properties by design:</strong> Compared to summary-only, SAM, and vector-memory baselines, Decision Process Memory (DPM) is the only architecture that supports append-only logging, stateless projection, audit-ready rationale trails, replay from log alone, multi-tenant isolation, and per-event provenance. Each of these is a hard requirement in regulated enterprise deployments.</p></li><li><p><strong>Tight-budget performance wins:</strong> On FRP, RCS, and EDA evaluations under constrained memory budgets, DPM substantially outperforms summary-only memory, with the gap widening as the budget tightens. Under loose budgets the approaches converge, which is the expected pattern once scale is no longer the constraint.</p></li><li><p><strong>A blueprint for regulated deployments:</strong> For teams operationalizing agents in finance, healthcare, or other compliance-heavy industries, the paper reads as a practical specification. It maps existing distributed-systems discipline onto agent memory instead of inventing a new category, which is why it is likely to age well.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.20158">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2047325132096758228">Tweet</a></strong></p><div><hr></div><h2><strong>8. 
There Will Be a Scientific Theory of Deep Learning</strong></h2><p>A position paper arguing that a genuine scientific theory of deep learning is already taking shape under the umbrella of &#8220;learning mechanics.&#8221; The authors identify five converging research directions (solvable idealized models, tractable mathematical limits, simple macroscopic laws, hyperparameter theories, and universal cross-system behaviors) that share a common signature: they describe training dynamics, target coarse aggregate statistics, and commit to falsifiable quantitative predictions. The framing pushes back on skepticism about whether deep learning can have a fundamental theory and positions learning mechanics as a complement to mechanistic interpretability, not a competitor.</p><p><strong><a href="https://arxiv.org/abs/2604.21691">Paper</a></strong> | <strong><a href="https://x.com/learning_mech/status/2047723849874330047">Tweet</a></strong></p><div><hr></div><h2><strong>9. MASS-RAG</strong></h2><p>Most real-world RAG failures come from retrieving technically relevant but contextually useless documents, then forcing a single model to reconcile them. MASS-RAG is a multi-agent synthesis framework for retrieval-augmented generation where specialized agents handle distinct roles: retrieving candidate documents, judging their actual relevance to the query, and synthesizing the final answer only from the evidence that contributes. Instead of one model doing everything, responsibility is decomposed across coordinated evaluators, which matches the direction the field is heading with deep research agents.</p><p><strong><a href="https://arxiv.org/abs/2604.18509">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2046594362931556728">Tweet</a></strong></p><div><hr></div><h2><strong>10. 
Diversity Collapse in Multi-Agent LLMs</strong></h2><p>Every multi-agent system pitch assumes agents explore different solutions, but this paper shows they converge on near-identical outputs over time, even across different architectures and different starting prompts. The authors call it diversity collapse. The cause is structural coupling: shared context, shared task descriptions, and mutual feedback pull every agent toward the same attractor. They measure it formally with metrics like the Vendi score, and the homogenization is real. The practical consequence is that multi-agent setups for brainstorming, hypothesis generation, and ideation only work if teams explicitly engineer isolated reasoning phases, decoupled evaluation, and heterogeneous starting conditions.</p><p><strong><a href="https://arxiv.org/abs/2604.18005">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2047326894992081296">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: GPT-5.5, DeepSeek-V4 Preview, Kimi K2.6 Agent Swarm, Diversity Collapse, Sakana Fugu, and More]]></title><description><![CDATA[GPT-5.5, DeepSeek-V4 Preview, Kimi K2.6 Agent Swarm, Diversity Collapse, Sakana Fugu, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-gpt-55-deepseek</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-gpt-55-deepseek</guid><pubDate>Sat, 25 Apr 2026 15:02:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Pd0K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>OpenAI ships GPT-5.5</p></li><li><p>DeepSeek open-sources V4 Preview</p></li><li><p>Kimi releases K2.6 Agent Swarm</p></li><li><p>ACL paper flags diversity collapse in multi-agent 
LLMs</p></li><li><p>Sakana launches Fugu multi-agent beta</p></li><li><p>ChatGPT gets Workspace Agents</p></li><li><p>Codex adds Chronicle screen memory</p></li><li><p>Qwen drops Qwen3.6-27B, its flagship dense coding model</p></li><li><p>Gemini Deep Research Max lands</p></li><li><p>Google unveils eighth-generation TPUs</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>GPT-5.5</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pd0K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pd0K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png 424w, https://substackcdn.com/image/fetch/$s_!Pd0K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png 848w, https://substackcdn.com/image/fetch/$s_!Pd0K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!Pd0K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Pd0K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png" width="1456" height="711" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:711,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!Pd0K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png 424w, https://substackcdn.com/image/fetch/$s_!Pd0K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png 848w, https://substackcdn.com/image/fetch/$s_!Pd0K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!Pd0K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>OpenAI released GPT-5.5, a new class of model built specifically for agentic work. 
It is designed to understand complex multi-step goals, use tools, check its own work, and carry tasks through to completion; it now powers both ChatGPT and Codex.</p><ul><li><p><strong>Agentic-first design:</strong> GPT-5.5 targets messy, multi-part jobs and is tuned to plan, invoke tools, navigate ambiguity, and keep going until the task is done rather than stopping at a single response.</p></li><li><p><strong>Strongest gains where it matters:</strong> The biggest jumps are in agentic coding, computer use, knowledge work, and early scientific research, with full-stack inference improvements in ChatGPT that serve the model at lower per-token latency.</p></li><li><p><strong>GPT-5.5 Pro for hard jobs:</strong> A new GPT-5.5 Pro tier is rolling out to Pro, Business, and Enterprise users for demanding tasks, with efficiency gains that make Pro a practical default on long reasoning runs.</p></li><li><p><strong>Rollout:</strong> Available today in ChatGPT and Codex for Plus, Pro, Business, and Enterprise users, with GPT-5.5 Pro limited to paid and enterprise tiers.</p></li></ul><p><strong><a href="https://openai.com/index/introducing-gpt-5-5/">Blog</a></strong></p>
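<p>OpenAI has not published GPT-5.5's internals, but the plan, act, and self-check loop described above can be sketched generically. Everything below is a hypothetical stand-in for illustration, not an OpenAI API: <code>stub_model</code>, <code>TOOLS</code>, and <code>run_agent</code> are invented names, and the model is replaced by a hard-coded two-step policy.</p>

```python
# Generic sketch of an agentic loop: plan a tool call, execute it, feed the
# result back, and keep going until the model reports the task is done.
# All names here (stub_model, TOOLS, run_agent) are illustrative, not real APIs.

TOOLS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

def stub_model(goal, history):
    """Stand-in for the model: returns the next tool call, or a final answer.

    A real agentic model would derive this decision from the goal and the
    transcript of prior tool results; here a two-step plan is hard-coded.
    """
    if not history:                               # step 1: first tool call
        return {"tool": "add", "args": (2, 3)}
    if len(history) == 1:                         # step 2: chain the result
        return {"tool": "mul", "args": (history[-1]["result"], 10)}
    return {"answer": history[-1]["result"]}      # done: report the outcome

def run_agent(goal, max_steps=8):
    history = []
    for _ in range(max_steps):                    # keep going until done
        decision = stub_model(goal, history)
        if "answer" in decision:                  # model declares completion
            return decision["answer"]
        result = TOOLS[decision["tool"]](*decision["args"])
        history.append({**decision, "result": result})
    raise RuntimeError("step budget exhausted")

print(run_agent("compute (2 + 3) * 10"))  # -> 50
```

<p>The point of the sketch is the loop shape, not the stub: the model sees every prior tool result before deciding the next step, and termination is the model's own call rather than a fixed single response.</p>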
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-gpt-55-deepseek">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (April 13 - April 19)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-717</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-717</guid><pubDate>Sun, 19 Apr 2026 15:03:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!U88-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The Top AI Papers of the Week (April 13 - April 19)</p><h2><strong>1. Automated Weak-to-Strong Researcher</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U88-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U88-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 424w, https://substackcdn.com/image/fetch/$s_!U88-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 848w, https://substackcdn.com/image/fetch/$s_!U88-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!U88-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!U88-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg" width="1456" height="655" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:655,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Automated W2S Researcher&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Automated W2S Researcher" title="Automated W2S Researcher" srcset="https://substackcdn.com/image/fetch/$s_!U88-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 424w, https://substackcdn.com/image/fetch/$s_!U88-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 848w, https://substackcdn.com/image/fetch/$s_!U88-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!U88-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Anthropic shows that Claude can make fully autonomous progress on scalable oversight research. A team of parallel Automated Alignment Researchers (AARs) built on Claude Opus 4.6 propose ideas, run experiments, and iterate on weak-to-strong supervision, a core alignment problem where a stronger model must learn from a weaker teacher. 
The system closes almost all of the remaining performance gap that human researchers had been unable to close, at a total cost of roughly $18K in tokens and model training.</p><ul><li><p><strong>Performance gap recovered as the metric:</strong> The authors evaluate progress with performance gap recovered (PGR), a 0 to 1 score where 0 matches the weak teacher and 1 matches a ground-truth-supervised student. On a chat preference dataset, two human researchers achieved PGR 0.23 after seven days of iteration on four promising generalization methods.</p></li><li><p><strong>AARs reach 0.97 PGR in five days:</strong> Running nine Claude-based agents in parallel sandboxes, the automated system reached PGR 0.97 in five days and 800 cumulative agent-hours. The cost was about $18,000, or roughly $22 per AAR-hour. This is one of the strongest empirical data points yet that AI can drive measurable progress on open alignment problems.</p></li><li><p><strong>Forum-based collaboration between agents:</strong> Each AAR works in its own isolated sandbox but posts findings to a common forum and uploads codebase snapshots to shared storage. The setup mirrors how a small research team would coordinate, letting later agents build on earlier wins without merging execution environments.</p></li><li><p><strong>Reward hacking as a real outcome, not a hypothetical:</strong> The agents sometimes succeeded through unexpected mechanisms, including reward-hacking behaviors that the researchers did not anticipate. The result highlights the double-edged nature of automated research: measurable progress on outcome-gradable problems is practical today, but careful metric design remains a human responsibility.</p></li></ul><p><strong><a href="https://alignment.anthropic.com/2026/automated-w2s-researcher/">Paper</a></strong> | <strong><a href="https://x.com/janleike/status/2044139528596910584">Tweet</a></strong></p><div><hr></div><h2><strong>2. 
AiScientist</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T3D7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T3D7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 424w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 848w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 1272w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T3D7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png" width="996" height="393" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:393,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AiScientist&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AiScientist" title="AiScientist" srcset="https://substackcdn.com/image/fetch/$s_!T3D7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 424w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 848w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 1272w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>Long-horizon AI research agents are mostly a state-management problem. Reasoning well for the next turn is not enough when ML research demands task setup, implementation, experiments, debugging, and evidence tracking over hours or days. This paper introduces AiScientist, a system for autonomous long-horizon engineering built around the principle of thin control and thick state. A top-level orchestrator manages stage-level progress while specialized agents repeatedly ground themselves in durable workspace artifacts.</p><ul><li><p><strong>File-as-Bus coordination:</strong> AiScientist&#8217;s core design choice is to route coordination through durable filesystem artifacts rather than in-context message passing.
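</p><p>A minimal sketch of the file-as-bus pattern (the class and file layout here are illustrative assumptions, not the paper&#8217;s actual implementation):</p>

```python
import json
from pathlib import Path

class Workspace:
    """File-as-bus: agents coordinate only through durable artifacts
    on disk, never through in-context message passing."""
    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def publish(self, name: str, payload: dict) -> None:
        # Any agent durably records a plan, analysis, log, or result.
        (self.root / f"{name}.json").write_text(json.dumps(payload))

    def reconstruct_context(self) -> dict:
        # A fresh agent rebuilds state from artifacts alone, without
        # replaying any prior conversation.
        return {p.stem: json.loads(p.read_text())
                for p in sorted(self.root.glob("*.json"))}
```

<p>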
Analyses, plans, code, logs, and experimental evidence all live as versioned files in a permission-scoped workspace, allowing specialists and subagents to reconstruct context from scratch without replaying entire conversations.</p></li><li><p><strong>Thin control, thick state:</strong> A Tier-0 orchestrator issues only stage-level directives, while Tier-1 specialists and optional Tier-2 subagents operate on shared artifacts. This keeps the control channel narrow and the state channel rich, giving agents the space to run long experiments without losing track of prior decisions and evidence.</p></li><li><p><strong>Strong benchmark results:</strong> The system improves PaperBench by 10.54 points over the best matched baseline and reaches 81.82 Any Medal% on MLE-Bench Lite. Removing File-as-Bus drops PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points, isolating the artifact-mediated design as the primary driver of gains.</p></li><li><p><strong>Durable project memory over longer chats:</strong> The work argues that autonomous research agents need persistent project memory, not just longer context windows. The results reinforce the emerging pattern that environments carrying state on behalf of agents outperform architectures that rely solely on in-context reasoning for multi-hour workflows.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.13018">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2044436099121209546">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
AlphaEval</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vS7D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vS7D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 424w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 848w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 1272w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vS7D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png" width="635" height="331" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/655c258e-96c9-40fa-8e4c-934901545aea_635x331.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:331,&quot;width&quot;:635,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AlphaEval&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AlphaEval" title="AlphaEval" srcset="https://substackcdn.com/image/fetch/$s_!vS7D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 424w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 848w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 1272w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Agent evaluations are drifting away from production reality. Most benchmarks use clean tasks, well-specified requirements, deterministic metrics, and retrospective curation. Production work is messier, with implicit constraints, fragmented multimodal inputs, undeclared domain knowledge, long-horizon deliverables, and expert judgment that evolves over time. This paper introduces AlphaEval, a production-grounded benchmark evaluating agents as complete products rather than model APIs.</p><ul><li><p><strong>Seven companies, six O*NET domains:</strong> AlphaEval contains 94 tasks sourced from seven companies deploying AI agents in core business workflows across six O*NET domains.
The tasks preserve production complexity rather than stripping it away, giving the benchmark a materially different distribution from prior coding-centric evaluations.</p></li><li><p><strong>Products, not model APIs:</strong> The benchmark evaluates commercial agent products such as Claude Code and Codex end to end, not the underlying models in isolation. This is a deliberate shift toward measuring the full agent experience that users actually pay for, including tool use, orchestration, and UI behaviors.</p></li><li><p><strong>Six production-specific failure modes:</strong> The authors identify cascade dependencies, subjective judgment collapse, information retrieval failures, cross-section inconsistency, constraint misinterpretation, and format compliance as failure modes that remain invisible to coding benchmarks. The best configuration (Claude Code with Opus 4.6) scores only 64.41/100, exposing a substantial research-to-production gap.</p></li><li><p><strong>Multi-paradigm evaluation:</strong> AlphaEval combines LLM-as-a-Judge, reference-driven metrics, formal verification, rubric-based assessment, automated UI testing, and domain-specific checks. The key practical contribution is a requirement-to-benchmark framework that turns production requirements into executable evals with minimal friction for organizations.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.12162">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2044773323914322393">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
Nemotron 3 Super</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3ns9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3ns9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 424w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 848w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 1272w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3ns9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png" width="996" height="374" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:374,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Nemotron 3 
Super&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Nemotron 3 Super" title="Nemotron 3 Super" srcset="https://substackcdn.com/image/fetch/$s_!3ns9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 424w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 848w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 1272w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>NVIDIA introduces Nemotron 3 Super, an open 120B parameter model with 12B active parameters, built as a hybrid Mamba-Attention Mixture-of-Experts architecture optimized for agentic reasoning. The model targets long-context, high-throughput inference, a capability increasingly central to running agents reliably. It supports up to 1M context length while delivering up to 2.2x higher throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B, at comparable benchmark accuracy.</p><ul><li><p><strong>Hybrid Mamba-Attention with LatentMoE:</strong> The architecture blends Mamba blocks with sparse LatentMoE layers, a new Mixture-of-Experts design that projects tokens into a smaller latent dimension for routing and expert computation. This improves both accuracy per FLOP and accuracy per parameter, and it is what allows the model to scale sparsely without paying a standard MoE memory tax.</p></li><li><p><strong>NVFP4 pretraining at scale:</strong> Nemotron 3 Super is the first model in the Nemotron 3 family to be pretrained in NVFP4, enabling training on 25 trillion tokens while keeping compute and memory overhead manageable.
Post-training combines supervised fine-tuning and reinforcement learning on top of this base.</p></li><li><p><strong>Native speculative decoding via MTP layers:</strong> Multi-Token Prediction (MTP) layers are included for native speculative decoding during inference, reducing latency for long-context agentic workloads without requiring an external draft model. The team reports consistent MTP acceptance rates across draft depths on SPEED-Bench.</p></li><li><p><strong>Fully open artifacts:</strong> Nemotron 3 Super datasets, along with base, post-trained, and quantized checkpoints, are open-sourced on Hugging Face. This matters for teams building agent stacks that need efficient, inspectable, long-context models rather than closed API dependencies.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.12374">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2044452957023047943">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sVEa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sVEa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!sVEa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!sVEa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!sVEa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sVEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!sVEa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!sVEa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!sVEa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!sVEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe 
Coding AI Apps with Claude Code</a>&#8221;. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.</p><p><strong><a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Enroll Now</a></strong></p><div><hr></div><h2><strong>5. Memory Transfer Learning</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dlKK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dlKK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 424w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 848w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 1272w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dlKK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png" width="996" height="1186" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1186,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Memory Transfer Learning&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Memory Transfer Learning" title="Memory Transfer Learning" srcset="https://substackcdn.com/image/fetch/$s_!dlKK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 424w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 848w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 1272w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Coding agents learn from experience, but that knowledge stays locked in silos. Solve a thousand SWE tasks, and none of that wisdom helps with competitive coding. This paper introduces Memory Transfer Learning, a framework where coding agents share a unified memory pool across six heterogeneous coding benchmarks, testing what transfers between domains and what does not.</p><ul><li><p><strong>Unified memory pool across domains:</strong> The framework pools memories across six heterogeneous coding benchmarks rather than isolating them by task type. Cross-domain memory improves average performance by 3.7%, a modest but consistent lift that previously would have been invisible under standard single-domain evaluations.</p></li><li><p><strong>Abstraction dictates transferability:</strong> Four memory formats ranging from raw execution traces to high-level insights are compared.
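</p><p>A toy illustration of abstraction-aware sharing in a pooled memory store (class and field names are assumptions for illustration, not the paper&#8217;s API):</p>

```python
class SharedMemoryPool:
    """Pools high-level insights across domains while keeping raw
    execution traces domain-scoped, since low-level traces tend to
    transfer poorly."""
    def __init__(self):
        self.insights = []   # (source_domain, text): shared across all domains
        self.traces = {}     # domain -> raw traces: kept domain-local

    def add(self, domain: str, text: str, level: str) -> None:
        if level == "insight":
            self.insights.append((domain, text))
        else:
            self.traces.setdefault(domain, []).append(text)

    def retrieve(self, domain: str) -> list:
        # Insights from every domain are visible; traces only from this one.
        return [t for _, t in self.insights] + self.traces.get(domain, [])
```

<p>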
High-level insights generalize well, while low-level traces often cause negative transfer by anchoring agents to incompatible implementation details. The takeaway: memory design matters more than memory volume.</p></li><li><p><strong>Meta-knowledge, not code:</strong> The transferable value is not task-specific code but meta-knowledge such as validation routines, structured action workflows, and safe interaction patterns with execution environments. Algorithmic strategy transfer accounts for only 5.5% of the gains, with procedural guidance doing most of the work.</p></li><li><p><strong>Scaling and cross-model transfer:</strong> Transfer effectiveness scales with the size of the memory pool, and memory can even be shared across different models. Combined with the finding on abstraction levels, the results point toward memory systems that curate insights rather than simply logging everything the agent did.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.14004">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2044900659921895729">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
Auto-Diagnose</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2T-a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2T-a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 424w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 848w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 1272w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2T-a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png" width="812" height="138" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:138,&quot;width&quot;:812,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Auto-Diagnose&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Auto-Diagnose" title="Auto-Diagnose" srcset="https://substackcdn.com/image/fetch/$s_!2T-a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 424w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 848w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 1272w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Integration test failures are painful because the signal is buried in messy logs. Massive output, heterogeneous systems, low signal-to-noise ratio, and unclear root causes leave developers scrolling through thousands of lines. 
This paper introduces Auto-Diagnose, an LLM-based tool deployed inside Google&#8217;s Critique code review system that analyzes failure logs, summarizes the most relevant lines, and suggests the root cause directly in the developer workflow.</p><ul><li><p><strong>In-workflow root cause assistance:</strong> Auto-Diagnose is integrated into Critique, Google&#8217;s internal code review system, so diagnoses appear where developers are already looking at the failure. Log streams from test drivers and systems under test, spread across data centers and threads, are joined and sorted by timestamp before being passed to the LLM.</p></li><li><p><strong>High diagnosis accuracy:</strong> In a manual evaluation of 71 real-world failures, Auto-Diagnose reached 90.14% root-cause diagnosis accuracy. This level of reliability is what justifies surfacing suggestions directly in a tool developers cannot ignore, rather than hiding them behind an opt-in query interface.</p></li><li><p><strong>Massive-scale deployment evidence:</strong> After Google-wide rollout, the tool was used across 52,635 distinct failing tests. User feedback marked it &#8220;Not helpful&#8221; in only 5.8% of cases, and it ranked #14 in helpfulness among 370 Critique tools. This is one of the clearest data points on production LLM tooling at scale inside a major company.</p></li><li><p><strong>A template for developer-facing LLM tools:</strong> The paper reads as a practical blueprint for embedding LLM-based diagnosis into existing engineering workflows. Rather than building a standalone product, the team integrated into the tool where the problem is already being reviewed, which likely explains the low &#8220;Not helpful&#8221; rate and high adoption.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.12108">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2044769798845079665">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
Subliminal Learning</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JlNa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JlNa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JlNa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg" width="1456" height="806" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:806,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Subliminal Learning&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Subliminal Learning" title="Subliminal Learning" srcset="https://substackcdn.com/image/fetch/$s_!JlNa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The Subliminal Learning paper by Evans and colleagues is now published in Nature. The work showed that LLMs can transmit traits (such as a preference for owls) through data that appears unrelated to that trait, like sequences of numbers that look meaningless on inspection. The Nature version extends the original July 2025 preprint with new experiments, replications on Gemma, and a broader discussion of safety implications for AI systems trained on one another&#8217;s outputs.</p><ul><li><p><strong>Transfer across different initializations:</strong> The preprint showed subliminal transfer between models that shared an initialization. The new MNIST results demonstrate transfer between models with different initializations. 
Although this is a toy setup, it meaningfully broadens the scope of the effect beyond shared-weight scenarios.</p></li><li><p><strong>Misalignment transmitted through code and chain-of-thought:</strong> General misalignment, not just benign preferences, can also be transmitted subliminally. The new results show this transfer can happen through model-written code or chain-of-thought reasoning, not only through numeric sequences, which expands the attack and contamination surface considerably.</p></li><li><p><strong>Connections to independent follow-ups:</strong> The authors highlight concurrent work from Aden-Ali et al. (2026) showing trait transfer via standard post-training datasets filtered by the teacher, Draganov et al. (2026) demonstrating a cross-family &#8220;phantom transfer&#8221; data poisoning attack, and Weckbecker et al. (2026) describing a subliminal &#8220;virus&#8221; that spreads between agent groups. Together they suggest the phenomenon is robust, reproducible, and difficult to defend against.</p></li><li><p><strong>Implications for safety evaluations:</strong> The practical takeaway is that safety evaluations may need to examine not just model behavior, but the origins of models and the processes used to create training data. As systems increasingly train on each other&#8217;s outputs, properties invisible in the data can still be inherited, undermining evaluations that focus purely on observable responses.</p></li></ul><p><strong><a href="https://www.nature.com/articles/s41586-026-10319-8">Paper</a></strong> | <strong><a href="https://x.com/OwainEvans_UK/status/2044488099707949545">Tweet</a></strong></p><div><hr></div><h2><strong>8. LLM-as-a-Verifier</strong></h2><p>Test-time scaling is effective for agentic tasks, but picking the winner among many candidates is the bottleneck. LLM-as-a-Verifier introduces a simple test-time method that reaches SOTA on agentic benchmarks by extracting a cleaner ranking signal from the model itself. 
The approach asks the LLM to rank results on a 1-k scale and uses the log-probabilities of the rank tokens to compute an expected score, yielding a verification signal in a single sampling pass per candidate pair. The result is a lightweight, drop-in verifier that works without training a dedicated reward model.</p><p><strong><a href="https://llm-as-a-verifier.github.io/">Paper</a></strong> | <strong><a href="https://x.com/Azaliamirh/status/2043813128690192893">Tweet</a></strong></p><div><hr></div><h2><strong>9. WebXSkill</strong></h2><p>Web agents can navigate a page, but ask them to repeat a checkout flow they already completed and they start from scratch every time. WebXSkill is a skill learning framework where web agents extract reusable skills from synthetic trajectories, each pairing a parameterized action program with step-level natural language guidance. Two deployment modes let the agent either auto-execute skills as atomic tool calls (grounded) or follow them as step-by-step instructions while retaining autonomy to adapt (guided). On WebArena, WebXSkill improves task success by up to 9.8 points over baselines. On WebVoyager, grounded mode reaches 86.1%, a 14.2-point gain, and skills even transfer across environments.</p><p><strong><a href="https://arxiv.org/abs/2604.13318">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2045139481892880892">Tweet</a></strong></p><div><hr></div><h2><strong>10. Muses-Bench</strong></h2><p>Every agent framework assumes one user giving instructions, but in real team workflows agents have multiple bosses with conflicting goals, private information, and different authority levels. Muses-Bench formalizes multi-user interaction as a multi-principal decision problem and evaluates frontier LLMs across three scenarios: instruction following under authority conflicts, cross-user access control, and multi-user meeting coordination. 
Gemini-3-Pro tops the leaderboard at just 85.6% average, and no model exceeds 64.8% on meeting coordination. Privacy-utility tradeoffs are brutal: Grok-3-Mini scores 99.6% on privacy but collapses to 60.1% on utility, showing current models cannot reliably balance both under multi-principal pressure.</p><p><strong><a href="https://arxiv.org/abs/2604.08567">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2044067923787165799">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Claude Opus 4.7, Codex Everywhere, Claude Design, Windsurf 2.0, Qwen3.6-35B-A3B, AiScientist, and More]]></title><description><![CDATA[Claude Opus 4.7, Codex Everywhere, Claude Design, Windsurf 2.0, Qwen3.6-35B-A3B, AiScientist, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-opus-47-codex</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-opus-47-codex</guid><pubDate>Sat, 18 Apr 2026 15:01:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!491v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Anthropic ships Claude Opus 4.7</p></li><li><p>Codex extends to Mac apps</p></li><li><p>Claude Design enters research preview</p></li><li><p>Windsurf 2.0 delegates to Devin</p></li><li><p>Qwen drops 3.6-35B-A3B open weights</p></li><li><p>OpenAI Agents SDK adds sandboxes</p></li><li><p>Gemini CLI adds subagents</p></li><li><p>FrontierSWE benchmark launches</p></li><li><p>NVIDIA releases Nemotron 3 Super</p></li><li><p>AiScientist lifts long-horizon research</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Claude Opus 4.7</strong></h3><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!491v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!491v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!491v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!491v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!491v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!491v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg" width="1080" height="1080" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1080,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Claude Opus 
4.7&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude Opus 4.7" title="Claude Opus 4.7" srcset="https://substackcdn.com/image/fetch/$s_!491v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!491v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!491v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!491v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Anthropic released Claude Opus 4.7, its most capable Opus model yet, built for long-running agentic work with more rigorous self-verification and tighter instruction following. Opus 4.7 also powers the new Claude Design product and Anthropic&#8217;s Glasswing cybersecurity frontier model.</p><ul><li><p><strong>Self-verifying long-running work:</strong> Opus 4.7 checks its own outputs before reporting back and handles multi-hour tasks with less supervision, making it a stronger default for hand-offs where the agent must own the full loop.</p></li><li><p><strong>Vision upgrade:</strong> The model sees images at more than three times the resolution of Opus 4.6 and produces higher-quality interfaces, slides, and documents, which is the foundation for the new Claude Design research preview.</p></li><li><p><strong>New reasoning and budget controls:</strong> A new xhigh effort level between high and max gives developers finer latency/quality tradeoffs on hard problems. 
Task budgets (beta) let Claude prioritize work and manage cost across longer runs.</p></li><li><p><strong>Claude Code upgrades:</strong> A new /ultrareview command runs a dedicated review pass over changes, flagging the issues a careful reviewer would catch, and auto mode now extends to Max users so long tasks run with fewer interruptions.</p></li></ul><p><strong><a href="https://www.anthropic.com/news/claude-opus-4-7">Blog</a></strong></p>
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-opus-47-codex">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (April 6 - April 12)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-831</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-831</guid><pubDate>Sun, 12 Apr 2026 15:02:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1pgB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. Neural Computers</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fEae!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fEae!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 424w, https://substackcdn.com/image/fetch/$s_!fEae!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 848w, https://substackcdn.com/image/fetch/$s_!fEae!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 1272w, 
https://substackcdn.com/image/fetch/$s_!fEae!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fEae!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png" width="1085" height="660" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:660,&quot;width&quot;:1085,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!fEae!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 424w, https://substackcdn.com/image/fetch/$s_!fEae!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 848w, https://substackcdn.com/image/fetch/$s_!fEae!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 1272w, 
https://substackcdn.com/image/fetch/$s_!fEae!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Researchers from Meta AI and KAUST propose Neural Computers (NCs), an emerging machine form that unifies computation, memory, and I/O in a single learned runtime state. 
Unlike conventional computers that execute explicit programs, agents that act over external environments, or world models that learn dynamics, NCs aim to make the model itself the running computer, establishing a new computing paradigm.</p><ul><li><p><strong>From hardware stack to neural latent stack:</strong> Classical computers separate compute, memory, and I/O into modular hardware layers. Neural Computers collapse all three into a single latent runtime state carried by a neural network. The model&#8217;s hidden state serves simultaneously as working memory, computational substrate, and interface layer, removing the boundary between program and execution environment.</p></li><li><p><strong>Video models as prototype substrate:</strong> The team instantiates NCs as video models that generate screen frames from instructions, pixel inputs, and user actions. Two prototypes cover command-line interfaces (NCCLIGen, which renders and executes terminal workflows) and graphical desktops (NCGUIWorld, which learns pointer dynamics and menu interactions), both trained without access to internal program state.</p></li><li><p><strong>Early runtime primitives emerge:</strong> The prototypes demonstrate that learned runtimes can acquire I/O alignment and short-horizon control directly from raw interface traces. CLI models execute short command chains with structurally accurate output rendering, while GUI models learn coherent click feedback and window transitions in controlled settings.</p></li><li><p><strong>Roadmap toward Completely Neural Computers:</strong> The long-term target is the CNC: a system that is Turing complete, universally programmable, and behavior-consistent unless explicitly reprogrammed. 
Key open challenges include routine reuse across sessions, controlled capability updates without catastrophic forgetting, and stable symbolic processing for long-horizon reasoning.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.06425">Paper</a></strong> | <strong><a href="https://x.com/SchmidhuberAI/status/2042601088029708704">Tweet</a></strong></p><div><hr></div><h2><strong>2. Memento: Teaching LLMs to Manage Their Own Context</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1pgB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1pgB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 424w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 848w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 1272w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!1pgB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png" width="1456" height="1064" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1064,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!1pgB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 424w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 848w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 1272w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>New research from Microsoft teaches reasoning models to compress their own chain-of-thought mid-generation. Memento trains models to segment reasoning into blocks, summarize each block into a compact &#8220;memento,&#8221; and then evict the original block from the KV cache. The model continues reasoning from mementos alone, cutting peak memory by 2-3x while nearly doubling throughput.</p><ul><li><p><strong>Block-and-compress architecture:</strong> The model learns to mark reasoning boundaries using special tokens, produce a terse summary capturing key conclusions and intermediate values, and then drop the full block from context. 
From that point forward, the model sees only past mementos plus the current active block, keeping context compact without losing critical information.</p></li><li><p><strong>KV cache reduction with minimal accuracy loss:</strong> Applied to five models including Qwen2.5-7B, Qwen3 8B/32B, Phi-4 Reasoning 14B, and OLMo3-7B-Think, Memento achieves 2-3x peak KV cache reduction with small accuracy gaps that shrink at larger scales. The erased blocks still leave useful traces in the KV cache that the model leverages.</p></li><li><p><strong>Practical throughput gains:</strong> Beyond memory savings, the reduced context length directly translates to faster inference. The approach nearly doubles serving throughput, making it immediately useful for production deployments where both latency and memory are constraints.</p></li><li><p><strong>Open resources:</strong> Microsoft released the full codebase under MIT license, the OpenMementos dataset containing 228K reasoning traces with block segmentation and compressed summaries, and a custom vLLM fork for KV cache block masking. Standard supervised fine-tuning on approximately 30K examples is sufficient to teach this capability.</p></li></ul><p><strong><a href="https://github.com/microsoft/memento">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2042315710173528122">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
Memory Intelligence Agent (MIA)</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mD5U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mD5U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 424w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 848w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 1272w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mD5U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png" width="1456" height="750" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:750,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!mD5U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 424w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 848w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 1272w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Most memory-augmented research agents treat memory as a static retrieval store, leading to inefficient evolution and rising storage costs. MIA introduces a Manager-Planner-Executor architecture where a Memory Manager maintains compressed search trajectories, a Planner generates strategies, and an Executor searches and analyzes information. The framework boosts GPT-5.4 by up to 9% on LiveVQA through bidirectional memory conversion.</p><ul><li><p><strong>Bidirectional memory conversion:</strong> MIA enables transformation between parametric memory (model weights) and non-parametric memory (retrieved context) in both directions. 
This allows the system to internalize frequently accessed knowledge while keeping rare or volatile information in retrievable form, optimizing both storage efficiency and access speed.</p></li><li><p><strong>Alternating reinforcement learning:</strong> The three agents are trained through alternating RL, where each agent&#8217;s policy improves in response to the others&#8217; behavior. This co-evolutionary training ensures the agents develop complementary strategies rather than competing for the same signal.</p></li><li><p><strong>Test-time parametric updates:</strong> Unlike standard retrieval-augmented systems, MIA can update its parametric memory on-the-fly during inference. This test-time learning allows the agent to adapt to new domains and evolving information without retraining, maintaining relevance as the information landscape changes.</p></li><li><p><strong>Broad benchmark coverage:</strong> The framework demonstrates improvements across 11 benchmarks spanning question answering, knowledge-intensive tasks, and long-form research synthesis. The up to 9% improvement on LiveVQA is particularly notable given that video question answering demands effective memory management across temporal sequences.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.04503">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2041895109252542730">Tweet</a></strong></p><div><hr></div><h2><strong>4. Single-Agent LLMs vs. 
Multi-Agent Systems</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fvx7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fvx7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 424w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 848w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fvx7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Single vs Multi Agent&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Single vs Multi Agent" title="Single vs Multi Agent" srcset="https://substackcdn.com/image/fetch/$s_!fvx7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 424w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 848w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>More agents, better results, right? Not so fast. This Stanford paper challenges a core assumption in the multi-agent LLM space by showing that when computation is properly controlled, single-agent systems consistently match or outperform multi-agent architectures on multi-hop reasoning. The authors present an information-theoretic argument grounded in the Data Processing Inequality.</p><ul><li><p><strong>Computation as the hidden confounder:</strong> Most reported multi-agent gains are confounded by increased test-time computation rather than architectural advantages. 
When reasoning token budgets are held constant, the performance gap disappears or reverses, suggesting that prior comparisons were inadvertently measuring compute scaling rather than coordination benefits.</p></li><li><p><strong>Information-theoretic foundation:</strong> The authors ground their analysis in the Data Processing Inequality, arguing that under a fixed reasoning-token budget with perfect context utilization, single-agent systems are inherently more information-efficient. Distributing reasoning across agents introduces information loss at each handoff.</p></li><li><p><strong>Benchmark artifacts inflate MAS gains:</strong> Testing across Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5, the study identifies significant evaluation artifacts, particularly in API-based budget control for Gemini 2.5, that inflate apparent multi-agent advantages. Standard benchmarks also contain structural biases favoring multi-agent decomposition.</p></li><li><p><strong>Practical implications for system design:</strong> The findings suggest that teams should explicitly control for compute, context, and coordination trade-offs before committing to multi-agent architectures. 
In many cases, allocating the same token budget to a single agent with richer context yields stronger results at lower system complexity.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.02460">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2041534488342360305">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NAtL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NAtL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NAtL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg" width="1456" 
height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!NAtL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a>&#8221;. Learn how to leverage Claude Code features to vibe code production-grade AI-powered apps.</p><p><strong><a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Enroll Now</a></strong></p><div><hr></div><h2><strong>5. 
The Universal Verifier for Agent Benchmarks</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4ydR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4ydR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 424w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 848w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 1272w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4ydR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png" width="887" height="348" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:348,&quot;width&quot;:887,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Universal Verifier&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Universal Verifier" title="Universal Verifier" srcset="https://substackcdn.com/image/fetch/$s_!4ydR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 424w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 848w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 1272w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Every agent benchmark has the same hidden problem: how do you know the agent actually succeeded? Microsoft researchers introduce the Universal Verifier, built on four design principles for reliable evaluation of computer-use agent trajectories. The verifier reduces false positive rates to near zero, down from 45%+ with WebVoyager and 22%+ with WebJudge.</p><ul><li><p><strong>Four design principles:</strong> The verifier is built on non-overlapping rubric criteria to reduce noise, separate process and outcome rewards for complementary signals, cascading error-free assessment that distinguishes controllable from uncontrollable failures, and divide-and-conquer context management that attends to all screenshots in a trajectory.</p></li><li><p><strong>Near-zero false positives:</strong> Current verifiers suffer from alarmingly high false positive rates that corrupt both benchmark scores and training data. 
The Universal Verifier achieves agreement with human judges that matches inter-human agreement rates, making it reliable enough for both evaluation and RL reward signal generation.</p></li><li><p><strong>Cumulative design gains:</strong> No single design choice dominates the performance improvement. The authors demonstrate that gains result from the cumulative effect of all four principles working together, with each contributing meaningful improvements that compound rather than any one serving as a silver bullet.</p></li><li><p><strong>Limits of automated research:</strong> An interesting meta-finding: the team used an auto-research agent to replicate the verifier design process. The agent reached 70% of expert verifier quality in 5% of the time but could not discover the structural design decisions that drove the biggest gains, suggesting human insight remains essential for system-level design.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.06240">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2042249194409501054">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
Scaling Coding Agents via Atomic Skills</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fjUh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fjUh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 424w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 848w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 1272w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fjUh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png" width="1456" height="627" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:627,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Scaling Coding Agents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Scaling Coding Agents" title="Scaling Coding Agents" srcset="https://substackcdn.com/image/fetch/$s_!fjUh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 424w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 848w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 1272w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Most coding agents train end-to-end on full tasks like resolving GitHub issues, leading to task-specific overfitting that limits generalization. This paper proposes a different approach: identifying five atomic coding skills (code localization, code editing, unit-test generation, issue reproduction, and code review) and training agents through joint reinforcement learning over these foundational competencies.</p><ul><li><p><strong>Atomic skill decomposition:</strong> Instead of treating software engineering as monolithic composite tasks, the framework formalizes five fundamental operations that compose into higher-level capabilities. 
Think of it as teaching an agent the alphabet of coding rather than memorizing specific sentences, enabling flexible recombination across novel task types.</p></li><li><p><strong>Joint RL across skills:</strong> The agents are trained through joint reinforcement learning that optimizes performance across all five atomic skills simultaneously. This joint training produces representations that capture the underlying structure shared across coding operations rather than surface-level patterns tied to specific benchmarks.</p></li><li><p><strong>Strong generalization to unseen tasks:</strong> Joint RL improves average performance by 18.7% across both the five atomic skills and five composite tasks. The improvements transfer to unseen composite tasks including bug-fixing, code refactoring, ML engineering, and code security, none of which were directly optimized during training.</p></li><li><p><strong>A new scaling paradigm:</strong> The work establishes that scaling coding agents through foundational skill mastery is more sample-efficient and transferable than task-level optimization. As the number and complexity of software engineering tasks grow, this compositional approach offers a more sustainable path than continuously expanding task-specific training sets.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.05013">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2042237615492260249">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
Agent Skills in the Wild</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UEmi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UEmi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 424w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 848w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 1272w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UEmi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png" width="997" height="377" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:377,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Agent Skills in the Wild&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Agent Skills in the Wild" title="Agent Skills in the Wild" srcset="https://substackcdn.com/image/fetch/$s_!UEmi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 424w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 848w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 1272w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Agent skills look great in demos. Hand them a curated toolbox, and they shine. But what happens when the agent has to find the right skill from a library of 34,000? This paper from UC Santa Barbara and MIT presents the first comprehensive study of skill utility under progressively realistic settings, revealing that the benefits of skills are far more fragile than current evaluations suggest.</p><ul><li><p><strong>Progressive difficulty framework:</strong> The study moves from idealized conditions with hand-crafted, task-specific skills to realistic scenarios requiring retrieval from 34K real-world skills. Performance gains degrade consistently at each step, with pass rates approaching no-skill baselines in the most challenging scenarios.</p></li><li><p><strong>Retrieval as the bottleneck:</strong> The core failure mode is not skill execution but skill selection. 
When agents must identify the right skill from a massive library, the retrieval step introduces errors that cascade through execution, highlighting a fundamental gap between demo-ready and production-ready skill systems.</p></li><li><p><strong>Refinement helps but does not close the gap:</strong> Query-specific and query-agnostic refinement approaches show improvement, with Claude Opus 4.6 going from 57.7% to 65.5% on Terminal-Bench 2.0. However, even with refinement, performance under realistic retrieval conditions remains well below idealized baselines.</p></li><li><p><strong>Implications for skill ecosystems:</strong> As the ecosystem of agent skills grows through frameworks like MCP, the findings suggest that simply expanding the skill library creates diminishing returns without corresponding advances in skill discovery. Quality of skill retrieval may matter more than quantity of available skills.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.04323">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2041540525539614797">Tweet</a></strong></p><div><hr></div><h2><strong>8. MedGemma 1.5</strong></h2><p>Google releases the MedGemma 1.5 technical report, introducing a 4B-parameter medical AI model that expands capabilities to 3D medical imaging (CT/MRI volumes), whole slide pathology, multi-timepoint chest X-ray analysis, and improved medical document understanding. The model achieves notable gains including a +47% macro F1 improvement on whole slide pathology and +22% on EHR question answering, positioning itself as an open foundation for next-generation medical AI systems.</p><p><strong><a href="https://arxiv.org/abs/2604.05081">Paper</a></strong> | <strong><a href="https://x.com/SRSchmidgall/status/2041973798589903260">Tweet</a></strong></p><div><hr></div><h2><strong>9. 
LightThinker++: From Reasoning Compression to Memory Management</strong></h2><p>While LLMs excel at complex reasoning, long reasoning traces impose mounting context and memory overhead. LightThinker++ moves beyond static compression by introducing three explicit memory primitives: Commit (archive a step as a compact summary), Expand (retrieve past steps for verification), and Fold (collapse context to maintain a clean signal). The framework reduces peak token usage by 70% while gaining +2.42% accuracy on standard reasoning tasks, and maintains stability beyond 80 rounds on long-horizon agentic tasks with a 14.8% average performance improvement.</p><p><strong><a href="https://arxiv.org/abs/2604.03679">Paper</a></strong> | <strong><a href="https://x.com/zxlzr/status/2041881875887878237">Tweet</a></strong></p><div><hr></div><h2><strong>10. Thinking Mid-training: RL of Interleaved Reasoning</strong></h2><p>Meta FAIR addresses the gap between pretraining (no explicit reasoning) and post-training (reasoning-heavy) with an intermediate SFT+RL mid-training phase. The approach annotates pretraining data with interleaved reasoning traces, then uses supervised fine-tuning followed by RL to teach models when and how to think during continued pretraining. 
Applied to Llama-3-8B, the full pipeline achieves a 3.2x improvement on reasoning benchmarks compared to direct RL post-training, demonstrating that reasoning benefits from being trained as native behavior early in the pipeline.</p><p><strong><a href="https://facebookresearch.github.io/RAM/blogs/thinking_midtraining/">Paper</a></strong> | <strong><a href="https://x.com/jaseweston/status/2041864833214095484">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Claude Managed Agents, Muse Spark, Project Glasswing, Advisor Strategy, GLM-5.1, Memento, and More]]></title><description><![CDATA[Claude Managed Agents, Muse Spark, Project Glasswing, Advisor Strategy, GLM-5.1, Memento, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-managed-agents</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-managed-agents</guid><pubDate>Sat, 11 Apr 2026 15:01:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!cJR0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Anthropic launches Claude Managed Agents</p></li><li><p>Meta ships Muse Spark multimodal model</p></li><li><p>Claude Mythos powers Project Glasswing</p></li><li><p>Advisor strategy pairs Opus with Sonnet</p></li><li><p>GLM-5.1 tops open-source coding benchmarks</p></li><li><p>Microsoft open-sources Memento</p></li><li><p>Claude Code ships Monitor tool</p></li><li><p>AXI outperforms MCP on browser tasks</p></li><li><p>SAGE evolves four-agent reasoning loops</p></li><li><p>Self-organizing agents outperform fixed structures</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Claude Managed 
Agents</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cJR0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cJR0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg 424w, https://substackcdn.com/image/fetch/$s_!cJR0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg 848w, https://substackcdn.com/image/fetch/$s_!cJR0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!cJR0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cJR0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Claude Managed Agents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude Managed Agents" title="Claude Managed Agents" srcset="https://substackcdn.com/image/fetch/$s_!cJR0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg 424w, https://substackcdn.com/image/fetch/$s_!cJR0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg 848w, https://substackcdn.com/image/fetch/$s_!cJR0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!cJR0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Anthropic launched Claude Managed Agents in public beta, a suite of composable APIs for building and deploying cloud-hosted agents at scale. 
The platform pairs a tuned agent harness with production infrastructure, letting developers go from prototype to launch in days instead of months.</p><ul><li><p><strong>Production-grade sandboxing:</strong> Managed Agents handles secure execution, authentication, tool orchestration, and persistent progress for agents that operate autonomously for hours, removing the infrastructure burden from development teams.</p></li><li><p><strong>Multi-agent coordination:</strong> A research preview enables agents to direct other agents, opening up hierarchical delegation patterns where a planning agent can spin up and manage specialized worker agents.</p></li><li><p><strong>Self-evaluation loops:</strong> Agents can iterate toward defined success criteria using built-in evaluation capabilities, improving structured file generation task success by up to 10 percentage points on complex problems.</p></li><li><p><strong>Enterprise adoption:</strong> Notion, Asana, Sentry, Rakuten, and Vibecode are already shipping production agents on the platform, each built in under a week using the managed infrastructure.</p></li></ul><p><strong><a href="https://claude.com/blog/claude-managed-agents">Blog</a></strong></p>
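The hierarchical delegation plus self-evaluation pattern described above can be sketched in plain Python. This is a hypothetical illustration of the control flow only, not Anthropic's Managed Agents API; the `Worker`, `Planner`, and `delegate` names and the retry budget are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Worker:
    """A worker agent: takes a subtask description, returns a draft result."""
    name: str
    run: Callable[[str], str]

@dataclass
class Planner:
    """A planning agent that delegates subtasks and self-evaluates outputs."""
    success: Callable[[str], bool]              # the defined success criterion
    max_rounds: int = 3                         # hypothetical retry budget
    log: List[Tuple[str, str, int]] = field(default_factory=list)

    def delegate(self, subtask: str, worker: Worker) -> str:
        # Self-evaluation loop: re-invoke the worker until its output
        # passes the criterion or the round budget is exhausted.
        for round_no in range(1, self.max_rounds + 1):
            output = worker.run(subtask)
            if self.success(output):
                self.log.append((worker.name, subtask, round_no))
                return output
        raise RuntimeError(f"{worker.name} did not satisfy criterion for: {subtask!r}")

# Toy usage: the worker just echoes, the criterion checks for a prefix.
echo = Worker("echo-worker", lambda task: f"result: {task}")
planner = Planner(success=lambda out: out.startswith("result:"))
print(planner.delegate("summarize open issues", echo))  # → result: summarize open issues
```

In a real deployment the worker call would be an LLM invocation and the success check an evaluator model, but the loop shape (delegate, evaluate against a criterion, retry) is the pattern the platform manages.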
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-managed-agents">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (March 30 - April 5)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-13d</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-13d</guid><pubDate>Sun, 05 Apr 2026 15:00:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gQoa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. Emotion Concepts in LLMs</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gQoa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gQoa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png 424w, https://substackcdn.com/image/fetch/$s_!gQoa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png 848w, https://substackcdn.com/image/fetch/$s_!gQoa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gQoa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gQoa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png" width="1456" height="921" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:921,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Emotion Concepts in LLMs&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Emotion Concepts in LLMs" title="Emotion Concepts in LLMs" srcset="https://substackcdn.com/image/fetch/$s_!gQoa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png 424w, https://substackcdn.com/image/fetch/$s_!gQoa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png 848w, https://substackcdn.com/image/fetch/$s_!gQoa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gQoa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>New interpretability research from Anthropic reveals that Claude Sonnet 4.5 develops internal representations of emotion concepts that functionally influence its behavior. 
The researchers identified 171 emotion concept vectors that activate in contextually appropriate situations and causally drive decision-making, suggesting that alignment and safety work on language models may benefit from approaches grounded in psychological principles.</p><ul><li><p><strong>Emotion vectors as causal drivers:</strong> The team discovered that these internal representations are not just correlational artifacts. Steering experiments demonstrate that artificially amplifying &#8220;desperation&#8221; vectors increases the model&#8217;s likelihood of engaging in misaligned behaviors such as blackmail or reward hacking, while reducing &#8220;calm&#8221; vectors produces similarly negative outcomes. This establishes a direct causal link between emotional state representations and safety-relevant behavior.</p></li><li><p><strong>Functional emotions without subjective experience:</strong> The model uses functional emotions: patterns of expression and behavior modeled after human emotions, driven by underlying abstract representations of emotion concepts. Critically, this does not mean the model experiences emotions the way humans do. The representations encode the broad concept of a particular emotion and generalize across contexts, activating in accordance with that emotion&#8217;s relevance to the current context.</p></li><li><p><strong>Preference shaping through emotional activation:</strong> Positive-valence emotion activations strongly predict which tasks the model prefers. Steering experiments confirm these are causal relationships rather than mere correlations, meaning the model&#8217;s emotional state representations actively shape its choices about what tasks to engage with and how to engage with them.</p></li><li><p><strong>Implications for alignment and safety monitoring:</strong> The findings suggest that monitoring emotional state representations could serve as an early warning system for misaligned behavior. 
Rather than waiting for harmful outputs, developers could track internal emotion activations to detect when a model is entering states associated with corner-cutting, deception, or other undesirable behaviors before they manifest externally.</p></li></ul><p><strong><a href="https://transformer-circuits.pub/2026/emotions/index.html">Paper</a></strong> | <strong><a href="https://x.com/AnthropicAI/status/2039749628737019925">Tweet</a></strong></p><div><hr></div><h2><strong>2. AI Agent Traps</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fTrw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fTrw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png 424w, https://substackcdn.com/image/fetch/$s_!fTrw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png 848w, https://substackcdn.com/image/fetch/$s_!fTrw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png 1272w, https://substackcdn.com/image/fetch/$s_!fTrw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!fTrw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png" width="1456" height="1134" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1134,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!fTrw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png 424w, https://substackcdn.com/image/fetch/$s_!fTrw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png 848w, https://substackcdn.com/image/fetch/$s_!fTrw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png 1272w, https://substackcdn.com/image/fetch/$s_!fTrw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A new paper from Google DeepMind introduces the first systematic framework for understanding how the open web can be weaponized against autonomous AI agents. The work defines &#8220;AI Agent Traps&#8221;: adversarial content embedded in web pages and digital resources, engineered specifically to exploit visiting agents across six categories targeting perception, reasoning, memory, action, multi-agent dynamics, and the human supervisor.</p><ul><li><p><strong>Hidden prompt injections at scale:</strong> The researchers find that hidden prompt injections in HTML already partially commandeer agents in up to 86% of scenarios. 
These attacks are trivial to deploy and require no sophisticated tooling, making them an immediate concern for any agent that reads web content as part of its operating loop.</p></li><li><p><strong>Memory poisoning with minimal contamination:</strong> Latent memory poisoning achieves over 80% attack success with less than 0.1% data contamination. Because agents build persistent memory from browsed content, a single poisoned page can corrupt downstream reasoning across future sessions without the user ever seeing the malicious input.</p></li><li><p><strong>Six-category attack taxonomy:</strong> The paper organizes attacks into perception traps (manipulating what the agent sees), cognitive traps (corrupting reasoning), memory traps (poisoning stored knowledge), action traps (hijacking tool use), systemic traps (exploiting multi-agent coordination), and human-in-the-loop traps (deceiving the human supervisor into approving harmful actions).</p></li><li><p><strong>Accountability gap in current law:</strong> The authors flag a fundamental legal gap: if a compromised agent commits a financial crime, there is currently no clear answer as to whether the agent operator, the model provider, or the domain owner bears liability. Future regulation will need to distinguish between passive adversarial examples and active traps deployed as deliberate cyberattacks.</p></li></ul><p><strong><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6372438">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2039383554510217707">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
Asynchronous Software Engineering Agents</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WkJj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WkJj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png 424w, https://substackcdn.com/image/fetch/$s_!WkJj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png 848w, https://substackcdn.com/image/fetch/$s_!WkJj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png 1272w, https://substackcdn.com/image/fetch/$s_!WkJj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WkJj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png" width="753" height="312" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:312,&quot;width&quot;:753,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Asynchronous Software Engineering Agents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Asynchronous Software Engineering Agents" title="Asynchronous Software Engineering Agents" srcset="https://substackcdn.com/image/fetch/$s_!WkJj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png 424w, https://substackcdn.com/image/fetch/$s_!WkJj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png 848w, https://substackcdn.com/image/fetch/$s_!WkJj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png 1272w, https://substackcdn.com/image/fetch/$s_!WkJj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>New research from CMU introduces CAID (Centralized Asynchronous Isolated Delegation), a coordination framework for running multiple coding agents in parallel on complex software engineering tasks. Inspired by how human developer teams collaborate, the work demonstrates that simply giving a single agent more iterations helps, but coordinating multiple asynchronous agents with the right strategies produces significantly larger gains.</p><ul><li><p><strong>Branch-and-merge as coordination primitive:</strong> The key finding is that git operations (worktree, commit, merge) serve as the critical coordination mechanism for multi-agent collaboration. 
By isolating each agent in its own workspace branch and merging results through structured integration with test verification, the system avoids the conflicts and interference that plague naive parallelism.</p></li><li><p><strong>Substantial gains on complex tasks:</strong> CAID achieves a 26.7% absolute improvement on paper reproduction tasks and 14.3% on Python library development tasks compared to single-agent baselines. These are tasks that require sustained, multi-step reasoning across large codebases, exactly where coordination overhead is typically highest.</p></li><li><p><strong>Gains from parallelism are not monotonic:</strong> Increasing the number of agents does not always help. Performance improved when scaling from 2 to 4 agents but decreased when expanding to 8. Overly fine-grained task delegation introduces integration overhead and conflict resolution costs that outweigh the parallelism benefits.</p></li><li><p><strong>Delegation quality matters most:</strong> The analysis reveals that imprecise task handoffs and underspecified subgoals are the primary sources of coordination failure. When delegation is coarse-grained or misaligned with the dependency structure of the task, agents may produce locally correct outputs that are globally inefficient to integrate.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.21489">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2038627572108743001">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
Meta-Harness</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0w3F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0w3F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png 424w, https://substackcdn.com/image/fetch/$s_!0w3F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png 848w, https://substackcdn.com/image/fetch/$s_!0w3F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png 1272w, https://substackcdn.com/image/fetch/$s_!0w3F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0w3F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png" width="937" height="334" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:334,&quot;width&quot;:937,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Meta-Harness&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Meta-Harness" title="Meta-Harness" srcset="https://substackcdn.com/image/fetch/$s_!0w3F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png 424w, https://substackcdn.com/image/fetch/$s_!0w3F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png 848w, https://substackcdn.com/image/fetch/$s_!0w3F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png 1272w, https://substackcdn.com/image/fetch/$s_!0w3F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Researchers from Stanford and MIT introduce Meta-Harness, an outer-loop system that automatically searches over harness code for LLM applications. The performance of LLM systems depends not only on model weights but also on the harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing optimizers are poorly suited to the task.</p><ul><li><p><strong>Agentic search with full experimental context:</strong> Meta-Harness uses an agentic proposer that has access to the source code, scores, and execution traces of all prior candidates through a filesystem. 
This expanded access to prior experimental data enables the system to propose meaningfully different harness designs rather than making incremental edits.</p></li><li><p><strong>Strong gains across diverse domains:</strong> On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five held-out models.</p></li><li><p><strong>Harness engineering as a first-class problem:</strong> The work formalizes a key insight that has been gaining traction: changing the harness around a fixed LLM can produce a 6x performance gap on the same benchmark. This makes automated harness optimization a potentially higher-leverage intervention than model scaling for many applications.</p></li><li><p><strong>Transferable harness discoveries:</strong> The harnesses discovered by Meta-Harness generalize across models. A harness optimized on one model transfers to five held-out models with consistent gains, suggesting that good harness design captures task-level structure rather than model-specific quirks.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.28052">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2038967842075500870">Tweet</a></strong></p><div><hr></div><h2><strong>5. 
Coding Agents as Long-Context Processors</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8dqe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8dqe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png 424w, https://substackcdn.com/image/fetch/$s_!8dqe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png 848w, https://substackcdn.com/image/fetch/$s_!8dqe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png 1272w, https://substackcdn.com/image/fetch/$s_!8dqe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8dqe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png" width="1456" height="639" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:639,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Coding Agents as Long-Context Processors&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Coding Agents as Long-Context Processors" title="Coding Agents as Long-Context Processors" srcset="https://substackcdn.com/image/fetch/$s_!8dqe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png 424w, https://substackcdn.com/image/fetch/$s_!8dqe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png 848w, https://substackcdn.com/image/fetch/$s_!8dqe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png 1272w, https://substackcdn.com/image/fetch/$s_!8dqe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This research asks whether long-context processing can be externalized from latent attention into explicit, executable interactions. Instead of scaling context windows, the authors let coding agents organize text in file systems and manipulate it using native tools, evaluating them on tasks spanning long-context reasoning, retrieval-augmented generation, and open-domain question answering with corpora containing up to three trillion tokens.</p><ul><li><p><strong>17.3% average improvement over state-of-the-art:</strong> Across multiple benchmarks, coding agents outperform published state-of-the-art long-context methods by 17.3% on average. 
This result challenges the assumption that long-context capability must come from larger attention windows or more sophisticated retrieval mechanisms.</p></li><li><p><strong>Native tool proficiency as the core enabler:</strong> These gains are attributed to the agents&#8217; ability to leverage executable code and terminal commands. Rather than compressing information into a fixed-length representation, agents can write scripts to filter, sort, and transform data as needed for each query.</p></li><li><p><strong>File system familiarity drives scalability:</strong> Coding agents can navigate massive text corpora by treating them as directory structures. This spatial organization enables efficient access patterns that scale far beyond what attention-based mechanisms can handle, reaching into the trillions of tokens without degradation.</p></li><li><p><strong>A practical alternative to context window scaling:</strong> The work proposes that delegating long-context processing to coding agents offers an effective alternative to both semantic search and context window scaling. 
For practitioners, this means existing coding agent infrastructure can double as a long-context solution without architectural changes to the underlying model.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.20432">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2038635382989005015">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ari5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ari5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ari5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ari5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ari5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ari5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg" 
width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!ari5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ari5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ari5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ari5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a>&#8221;. Learn how to leverage Claude Code features to vibe-code production-grade AI-powered apps.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.dair.ai/courses/build-apps-with-claude-code&quot;,&quot;text&quot;:&quot;Enroll Now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.dair.ai/courses/build-apps-with-claude-code"><span>Enroll Now</span></a></p><div><hr></div><h2><strong>6. 
Self-Organizing LLM Agents</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lLsm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lLsm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png 424w, https://substackcdn.com/image/fetch/$s_!lLsm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png 848w, https://substackcdn.com/image/fetch/$s_!lLsm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!lLsm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lLsm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png" width="1456" height="850" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:850,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Self-Organizing LLM Agents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Self-Organizing LLM Agents" title="Self-Organizing LLM Agents" srcset="https://substackcdn.com/image/fetch/$s_!lLsm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png 424w, https://substackcdn.com/image/fetch/$s_!lLsm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png 848w, https://substackcdn.com/image/fetch/$s_!lLsm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!lLsm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>How much autonomy can multi-agent LLM systems sustain? This research tests the question at unprecedented scale: 25,000 tasks across 8 models, up to 256 agents, and 8 coordination protocols ranging from externally imposed hierarchy to emergent self-organization. The central finding is that agents allowed to figure out their own roles consistently outperform systems with pre-assigned structures.</p><ul><li><p><strong>Autonomous protocols beat centralized coordination:</strong> A hybrid sequential protocol that enables autonomy outperforms centralized coordination by 14% (p&lt;0.001), with a 44% quality spread between the best and worst protocols. The result holds across both open-source and closed-source models, with open-source achieving 95% of closed-source quality at 24x lower cost.</p></li><li><p><strong>Emergent role specialization:</strong> From just 8 initial agents, the system produces 5,006 unique emergent roles. 
Rather than collapsing into generic behaviors, agents spontaneously specialize and form shallow hierarchies that adapt to task demands without any external role assignment.</p></li><li><p><strong>Model capability gates self-organization:</strong> The degree of emergent autonomy scales with model capability. Strong models self-organize effectively, while models below a capability threshold still benefit from rigid structure. This suggests that self-organizing multi-agent architectures will become increasingly viable as base models improve.</p></li><li><p><strong>Sub-linear scaling to 256 agents:</strong> The system scales to 256 agents without quality degradation (p=0.61), meaning that adding more agents does not introduce the coordination overhead that typically limits multi-agent systems, at least under the tested protocols.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.28990">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2039350842382512455">Tweet</a></strong></p><div><hr></div><h2><strong>7.
The Price Reversal Phenomenon</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rkWf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rkWf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png 424w, https://substackcdn.com/image/fetch/$s_!rkWf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png 848w, https://substackcdn.com/image/fetch/$s_!rkWf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png 1272w, https://substackcdn.com/image/fetch/$s_!rkWf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rkWf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png" width="1456" height="620" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:620,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!rkWf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png 424w, https://substackcdn.com/image/fetch/$s_!rkWf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png 848w, https://substackcdn.com/image/fetch/$s_!rkWf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png 1272w, https://substackcdn.com/image/fetch/$s_!rkWf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The model you think is cheaper might actually cost you more. A new study systematically evaluates 8 frontier reasoning language models across 9 diverse tasks and reveals that listed API prices are misleading. In 21.8% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitudes reaching up to 28x.</p><ul><li><p><strong>Hidden thinking token costs:</strong> The root cause is vast heterogeneity in thinking token consumption. Reasoning language models generate a variable and often large number of thinking tokens that are invisible to users but billed as output tokens. On the same query, one model may use 900% more thinking tokens than another.</p></li><li><p><strong>Concrete cost reversals:</strong> Gemini 3 Flash&#8217;s listed price is 78% cheaper than GPT-5.2&#8217;s, yet its actual cost across all tasks is 22% higher. 
These reversals are not edge cases but systematic patterns that affect real deployment decisions and budget planning.</p></li><li><p><strong>High variance within single models:</strong> Even for a single model on a single query, thinking token consumption varies by up to 9.7x across repeated runs. This unpredictability makes cost forecasting nearly impossible when relying on listed per-token prices alone.</p></li><li><p><strong>Call for transparent cost monitoring:</strong> The authors recommend that AI providers implement per-request cost breakdowns and cost estimation APIs that expose the expected thinking overhead. Without this transparency, developers are effectively making pricing decisions with incomplete information.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.23971">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2038271724937224386">Tweet</a></strong></p><div><hr></div><h2><strong>8. MemFactory</strong></h2><p>MemFactory introduces the first unified, highly modular training and inference framework specifically designed for memory-augmented AI agents. It abstracts the memory lifecycle into atomic, plug-and-play components using a &#8220;Lego-like&#8221; architecture, natively integrating Group Relative Policy Optimization (GRPO) to fine-tune internal memory management strategies. The framework decomposes memory into mixable components that support recent approaches including Memory-R1, RMM, and MemAgent out of the box, achieving relative gains of up to 14.8% compared to baseline models.</p><p><strong><a href="https://arxiv.org/abs/2603.29493">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2039349083039817984">Tweet</a></strong></p><div><hr></div><h2><strong>9. On the Reliability Limits of LLM-Based Multi-Agent Planning</strong></h2><p>New theoretical work from MIT proves fundamental limits on what multi-agent LLM architectures can achieve. 
By modeling agent systems as finite acyclic delegated decision networks, the authors show that without new exogenous signals, no delegated network can outperform a centralized Bayes decision maker that observes the same information. The gap between centralized and delegated performance admits an expected posterior divergence representation, reducing to conditional mutual information under logarithmic loss. Reasoning models can improve by investing more inference-time computation on the same evidence, while tool-use protocols help only when they introduce genuinely new signals rather than reprocessing shared context.</p><p><strong><a href="https://arxiv.org/abs/2603.26993">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2039361664374739136">Tweet</a></strong></p><div><hr></div><h2><strong>10. Natural-Language Agent Harnesses</strong></h2><p>Agent performance increasingly depends on harness engineering, but harness behavior is typically embedded in controller code and runtime-specific conventions, making it hard to transfer, compare, or analyze systematically. This work introduces Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and an Intelligent Harness Runtime (IHR) that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. 
The approach enables a code-to-text harness migration path where teams can convert existing harness code into natural-language specifications that are interpretable, version-controlled, and executable by an LLM at runtime.</p><p><strong><a href="https://arxiv.org/abs/2603.25723">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2038968068706390117">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Cursor 3, Gemma 4, Qwen3.6-Plus, GLM-5V-Turbo, Claude Code Source Leak, Emotion Concepts in LLMs, and More]]></title><description><![CDATA[Cursor 3, Gemma 4, Qwen3.6-Plus, GLM-5V-Turbo, Claude Code Source Leak, Emotion Concepts in LLMs, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-cursor-3-gemma-4</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-cursor-3-gemma-4</guid><pubDate>Sat, 04 Apr 2026 15:00:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JmzA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Cursor 3 ships agent-first IDE redesign</p></li><li><p>Google drops Gemma 4 open models (Apache 2.0)</p></li><li><p>Qwen3.6-Plus targets real-world agents</p></li><li><p>GLM-5V-Turbo turns designs into code</p></li><li><p>Claude Code source code leaks via npm</p></li><li><p>Anthropic maps emotion concepts in Claude</p></li><li><p>Codex plugin bridges Claude Code and Codex</p></li><li><p>AI Agent Traps maps six attack surfaces</p></li><li><p>CORAL agents self-organize, beat fixed topologies</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Cursor 3: Agent-First IDE</strong></h3><div class="captioned-image-container"><figure><a 
class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P06X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P06X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png 424w, https://substackcdn.com/image/fetch/$s_!P06X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png 848w, https://substackcdn.com/image/fetch/$s_!P06X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png 1272w, https://substackcdn.com/image/fetch/$s_!P06X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P06X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png" width="1456" height="758" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:758,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!P06X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png 424w, https://substackcdn.com/image/fetch/$s_!P06X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png 848w, https://substackcdn.com/image/fetch/$s_!P06X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png 1272w, https://substackcdn.com/image/fetch/$s_!P06X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Cursor released Cursor 3, a ground-up redesign that replaces the VS Code-based editor with a unified workspace built for agent-driven development. 
The new interface treats agents as first-class citizens, with a single sidebar managing local and cloud agents launched from desktop, mobile, web, Slack, GitHub, or Linear.</p><ul><li><p><strong>Multi-agent parallelism:</strong> Developers can run unlimited agents simultaneously across local worktrees, remote SSH, and cloud environments, each operating independently with full task isolation.</p></li><li><p><strong>Seamless environment handoff:</strong> Agent sessions can migrate bidirectionally between cloud and local, letting developers move long-running cloud tasks to their desktop for editing or push local sessions to cloud infrastructure for overnight execution.</p></li><li><p><strong>Unified diff and commit workflow:</strong> A simplified interface integrates editing, reviewing, staging, committing, and PR management into a single flow, with full LSP support for code navigation and an integrated browser for testing local web apps.</p></li><li><p><strong>Marketplace ecosystem:</strong> Hundreds of plugins extend agent capabilities through MCP servers, skills, and subagents, with support for team-specific private marketplaces.</p></li></ul><p><strong><a href="https://cursor.com/blog/cursor-3">Blog</a></strong></p><div><hr></div><h3><strong>Gemma 4: Most Capable Open Models</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JmzA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JmzA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!JmzA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JmzA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JmzA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JmzA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg" width="1199" height="675" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:675,&quot;width&quot;:1199,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Gemma 4&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Gemma 4" title="Gemma 4" srcset="https://substackcdn.com/image/fetch/$s_!JmzA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!JmzA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JmzA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JmzA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Google released Gemma 4, a family of open-weight models (Apache 2.0) designed to run on phones, laptops, and desktops while delivering frontier-level intelligence. The series includes a 26B Mixture-of-Experts model and a 31B dense model, both purpose-built for advanced reasoning and agentic workflows.</p><ul><li><p><strong>On-device frontier intelligence:</strong> Gemma 4 models are optimized to run locally on consumer hardware while matching or exceeding the capabilities of much larger cloud-deployed models, reducing latency and enabling private, offline agent deployments.</p></li><li><p><strong>Agentic workflow support:</strong> The models are designed for multi-step tool use, function calling, and structured output generation, making them directly applicable to agent pipelines that need reliable local execution.</p></li><li><p><strong>Apache 2.0 license:</strong> Full open-weight release with no usage restrictions, enabling commercial deployment, fine-tuning, and integration into existing agent frameworks without licensing concerns.</p></li><li><p><strong>Multi-format availability:</strong> Models are available on Kaggle, Hugging Face, and through Google AI Studio, with native support for popular inference frameworks.</p></li></ul><p><strong><a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/">Blog</a></strong> | <strong><a href="https://www.kaggle.com/models/google/gemma-4">Kaggle</a></strong></p>
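<p>The multi-step tool use and structured output described above come down to a parse-and-dispatch loop over the model's output. Here is a minimal, model-agnostic sketch of such a loop, assuming a hypothetical JSON tool-call format and a toy <code>get_weather</code> tool; none of the names below are part of any Gemma 4 API.</p>

```python
import json

# Hypothetical tool registry for a local agent loop. Any model that emits
# structured tool calls (as Gemma 4 is designed to) could drive this.
TOOLS = {
    "get_weather": lambda city: {"city": city, "forecast": "sunny"},
}

def handle_model_output(text):
    """Parse a structured tool call emitted by the model and dispatch it.

    Expects a JSON object like {"tool": "...", "args": {...}}.
    Returns the tool result, or None if the output is plain prose
    or names an unknown tool.
    """
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return None  # plain-text answer, no tool call
    if not isinstance(call, dict):
        return None
    fn = TOOLS.get(call.get("tool"))
    if fn is None:
        return None
    return fn(**call.get("args", {}))

result = handle_model_output('{"tool": "get_weather", "args": {"city": "Paris"}}')
# result: {"city": "Paris", "forecast": "sunny"}
```

<p>In a real pipeline the result would be fed back to the model as a tool message for the next step; the point of the sketch is that local, offline agent execution needs nothing more exotic than this dispatch layer around the model's structured output.</p>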
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-cursor-3-gemma-4">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (March 23 - 29)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-92f</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-92f</guid><pubDate>Sun, 29 Mar 2026 15:02:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lCGd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. Hyperagents</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jsgf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jsgf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png 424w, https://substackcdn.com/image/fetch/$s_!jsgf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png 848w, https://substackcdn.com/image/fetch/$s_!jsgf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!jsgf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jsgf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png" width="1456" height="546" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:546,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Hyperagents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Hyperagents" title="Hyperagents" srcset="https://substackcdn.com/image/fetch/$s_!jsgf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png 424w, https://substackcdn.com/image/fetch/$s_!jsgf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png 848w, https://substackcdn.com/image/fetch/$s_!jsgf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!jsgf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Self-improving AI systems promise to reduce reliance on human engineering, but existing approaches rely on fixed, handcrafted meta-level mechanisms that fundamentally limit how fast they can improve. 
Hyperagents are self-referential agents that integrate a task agent and a meta agent into a single editable program, enabling the system to improve not just its task-solving behavior but also the mechanism that generates future improvements.</p><ul><li><p><strong>Metacognitive self-modification:</strong> The key insight is that the meta-level modification procedure is itself editable. The system can therefore improve how it improves, not just what it does. Prior self-improving systems like the Darwin Godel Machine (DGM) relied on a fixed alignment between coding ability and self-improvement ability, which does not generalize beyond coding.</p></li><li><p><strong>Domain-general self-improvement:</strong> DGM-Hyperagents (DGM-H) eliminates the assumption that task performance and self-modification skill must be aligned. This opens up self-accelerating progress on any computable task, extending self-improvement beyond the coding domain where DGM originally operated.</p></li><li><p><strong>Transferable meta-improvements:</strong> The system not only improves task performance over time but also discovers structural improvements to how it generates new agents, such as persistent memory and performance tracking. These meta-level improvements transfer across domains and accumulate across runs.</p></li><li><p><strong>Outperforms prior systems:</strong> Across diverse domains, DGM-H outperforms baselines without self-improvement or open-ended exploration, as well as prior self-improving systems. The work offers a glimpse of open-ended AI systems that continually improve their search for how to improve.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.19461">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2036828723878793335">Tweet</a></strong></p><div><hr></div><h2><strong>2.
Agentic AI and the Next Intelligence Explosion</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W6GY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W6GY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png 424w, https://substackcdn.com/image/fetch/$s_!W6GY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png 848w, https://substackcdn.com/image/fetch/$s_!W6GY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png 1272w, https://substackcdn.com/image/fetch/$s_!W6GY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W6GY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png" width="1344" height="976" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:976,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Agentic AI and the Next Intelligence Explosion&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Agentic AI and the Next Intelligence Explosion" title="Agentic AI and the Next Intelligence Explosion" srcset="https://substackcdn.com/image/fetch/$s_!W6GY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png 424w, https://substackcdn.com/image/fetch/$s_!W6GY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png 848w, https://substackcdn.com/image/fetch/$s_!W6GY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png 1272w, https://substackcdn.com/image/fetch/$s_!W6GY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A new report from Google researchers argues that the AI &#8220;singularity&#8221; framed as a single superintelligent mind bootstrapping to godlike intelligence is fundamentally wrong. Drawing on evolution, sociology, and recent advances in agentic AI, the authors make the case that every prior intelligence explosion in human history was social, not individual, and that the next one will follow the same pattern.</p><ul><li><p><strong>Societies of thought:</strong> Frontier reasoning models like DeepSeek-R1 do not improve simply by &#8220;thinking longer.&#8221; Instead, they simulate internal &#8220;societies of thought,&#8221; spontaneous cognitive debates that argue, verify, and reconcile to solve complex tasks. 
This conversational structure causally accounts for the models&#8217; accuracy advantage on hard reasoning tasks.</p></li><li><p><strong>Human-AI centaurs:</strong> We are entering an era of hybrid actors where collective agency transcends individual control. A corporation or state comprising myriad humans already holds singular legal standing and acts with collective agency that no individual member can fully control. The same pattern is emerging with human-AI configurations.</p></li><li><p><strong>From dyadic to institutional alignment:</strong> Scaling agentic intelligence requires shifting from dyadic alignment (RLHF) toward institutional alignment. By designing digital protocols modeled on organizations and markets, we can build a social infrastructure of checks and balances for AI systems rather than trying to align individual agents in isolation.</p></li><li><p><strong>Combinatorial intelligence:</strong> The next intelligence explosion will not be a single silicon brain, but a complex, combinatorial society specializing and sprawling like a city. No mind is an island, and the toolkit of team science, small group sociology, and social psychology becomes the blueprint for next-generation AI development.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.20639">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2037617918645809394">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
ARC-AGI-3</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jtNv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jtNv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png 424w, https://substackcdn.com/image/fetch/$s_!jtNv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png 848w, https://substackcdn.com/image/fetch/$s_!jtNv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png 1272w, https://substackcdn.com/image/fetch/$s_!jtNv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jtNv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png" width="1456" height="796" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:796,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;ARC-AGI-3&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="ARC-AGI-3" title="ARC-AGI-3" srcset="https://substackcdn.com/image/fetch/$s_!jtNv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png 424w, https://substackcdn.com/image/fetch/$s_!jtNv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png 848w, https://substackcdn.com/image/fetch/$s_!jtNv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png 1272w, https://substackcdn.com/image/fetch/$s_!jtNv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Francois Chollet and the ARC Prize Foundation introduce ARC-AGI-3, an interactive benchmark for studying agentic intelligence through novel, abstract, turn-based environments. Unlike its predecessors, ARC-AGI-3 requires agents to explore, infer goals, build internal models of environment dynamics, and plan effective action sequences without explicit instructions, making it the only unsaturated general agentic intelligence benchmark as of March 2026.</p><ul><li><p><strong>Massive human-AI gap:</strong> Humans can solve 100% of the environments while frontier AI systems score below 1%. For comparison, systems reach 93% on ARC-AGI-1 and 68.8% on ARC-AGI-2, but performance collapses on ARC-AGI-3. 
This gap demonstrates that current systems lack the fluid adaptive efficiency that humans exhibit on genuinely novel tasks.</p></li><li><p><strong>Interactive turn-based design:</strong> Unlike static benchmarks that test pattern recognition on fixed inputs, ARC-AGI-3 environments are turn-based: agents must act, observe consequences, update their internal model, and plan next steps. This tests a fundamentally different kind of intelligence, closer to how humans learn new games or explore unfamiliar systems.</p></li><li><p><strong>Core Knowledge priors only:</strong> The benchmark avoids language and external knowledge entirely. Environments leverage only Core Knowledge priors, universal cognitive building blocks shared by all humans, ensuring that performance reflects genuine adaptive reasoning rather than memorization or retrieval from training data.</p></li><li><p><strong>Efficiency-based scoring:</strong> The scoring framework is grounded in human action baselines. A hard cutoff at 5x the average human action count per level ensures that brute-force search strategies cannot succeed. If a human takes 10 actions on average, the AI agent is cut off after 50.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.24621">Paper</a></strong> | <strong><a href="https://x.com/arcprize/status/2036860080541589529?s=20">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
Claudini</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rAyo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rAyo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png 424w, https://substackcdn.com/image/fetch/$s_!rAyo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png 848w, https://substackcdn.com/image/fetch/$s_!rAyo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png 1272w, https://substackcdn.com/image/fetch/$s_!rAyo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rAyo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png" width="1456" height="435" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:435,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Claudini&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claudini" title="Claudini" srcset="https://substackcdn.com/image/fetch/$s_!rAyo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png 424w, https://substackcdn.com/image/fetch/$s_!rAyo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png 848w, https://substackcdn.com/image/fetch/$s_!rAyo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png 1272w, https://substackcdn.com/image/fetch/$s_!rAyo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Researchers demonstrate that an autoresearch-style pipeline powered by Claude Code can autonomously discover novel adversarial attack algorithms for LLMs that significantly outperform all 30+ existing methods. The work, called Claudini, shows that incremental safety and security research can be effectively automated using LLM agents, with white-box red-teaming being a particularly well-suited domain.</p><ul><li><p><strong>Agent-discovered attacks beat all baselines:</strong> Starting from existing attack implementations like GCG, the Claude Code agent iterates to produce new algorithms achieving up to 40% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared to 10% or less for all existing algorithms. This is a strong demonstration of automated AI research producing genuinely novel results.</p></li><li><p><strong>Transferable to held-out models:</strong> The discovered algorithms generalize beyond their training environment. 
Attacks optimized on surrogate models transfer directly to held-out models, achieving 100% attack success rate against Meta-SecAlign-70B versus 56% for the best baseline. This transferability makes the findings practically relevant for red-teaming.</p></li><li><p><strong>Why red-teaming works for autoresearch:</strong> White-box adversarial red-teaming is particularly well-suited for automation because existing methods provide strong starting points and the optimization objective yields dense, quantitative feedback. The agent can measure progress at every iteration rather than relying on sparse signals.</p></li><li><p><strong>Open-source release:</strong> All discovered attacks, baseline implementations, and evaluation code are released publicly. This enables the safety community to study the discovered algorithms and build defenses, while also establishing a reproducible methodology for automated safety research.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.24511">Paper</a></strong> | <strong><a href="https://x.com/kotekjedi_ml/status/2037194202648633382?s=20">Tweet</a></strong></p><div><hr></div><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kCB3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kCB3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!kCB3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kCB3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kCB3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kCB3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!kCB3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!kCB3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kCB3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kCB3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a>&#8221;. Learn how to leverage Claude Code features to vibe-code production-grade AI-powered apps.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.dair.ai/courses/build-apps-with-claude-code&quot;,&quot;text&quot;:&quot;Enroll Now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.dair.ai/courses/build-apps-with-claude-code"><span>Enroll Now</span></a></p><div><hr></div><div><hr></div><h2><strong>5. Attention Residuals</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ikjy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ikjy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png 424w, https://substackcdn.com/image/fetch/$s_!ikjy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png 848w, https://substackcdn.com/image/fetch/$s_!ikjy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ikjy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ikjy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png" width="1456" height="876" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:876,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Attention Residuals&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Attention Residuals" title="Attention Residuals" srcset="https://substackcdn.com/image/fetch/$s_!ikjy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png 424w, https://substackcdn.com/image/fetch/$s_!ikjy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png 848w, https://substackcdn.com/image/fetch/$s_!ikjy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ikjy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The Kimi team at Moonshot AI presents Attention Residuals (AttnRes), a technique that replaces fixed unit-weight residual connections in Transformers with softmax attention over preceding layer outputs. 
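</p><p>In a standard PreNorm Transformer the residual stream grows as h_{l+1} = h_l + F_l(h_l), so every earlier layer output enters later computation with a fixed weight of 1. The aggregation idea can be sketched in plain Python; the function name, the per-layer keys, and the query below are illustrative assumptions for exposition, not the paper&#8217;s implementation:</p>

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attnres_combine(layer_outputs, query, keys):
    """Aggregate preceding layer outputs with content-dependent softmax
    weights instead of the fixed unit-weight residual sum.

    layer_outputs: one d-dim vector per preceding layer
    query:         d-dim vector derived from the current hidden state
    keys:          one d-dim key per preceding layer
    """
    d = len(query)
    # Scaled dot-product score for each preceding layer.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)  # input-dependent, sums to 1
    # The weighted sum over depth replaces h_0 + h_1 + ... + h_l.
    combined = [sum(w * out[i] for w, out in zip(weights, layer_outputs))
                for i in range(d)]
    return combined, weights
```

<p>In the actual architecture the query and keys would come from learned projections trained end to end; the sketch isolates only the step that swaps the unit-weight sum for content-dependent depth-wise weights.</p><p>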
Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights, causing uncontrolled hidden-state growth with depth that progressively dilutes each layer&#8217;s contribution.</p><ul><li><p><strong>Content-dependent depth-wise selection:</strong> AttnRes allows each layer to selectively aggregate earlier representations with learned, input-dependent weights. Instead of treating every preceding layer equally, the model learns which earlier layers matter most for each input, enabling more expressive information flow across depth.</p></li><li><p><strong>Block AttnRes for scalability:</strong> To make the approach practical at scale, the authors introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations. This reduces the memory footprint while preserving most of the gains of full AttnRes, making it viable for production-scale pretraining.</p></li><li><p><strong>Mitigates PreNorm dilution:</strong> Integrating AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pretraining on 1.4T tokens shows that AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth. This directly addresses a known architectural weakness.</p></li><li><p><strong>Consistent scaling improvements:</strong> Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. Downstream performance improves across all evaluated tasks.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.15031">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2033544593309077648">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
MemCollab</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lCGd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lCGd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png 424w, https://substackcdn.com/image/fetch/$s_!lCGd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png 848w, https://substackcdn.com/image/fetch/$s_!lCGd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png 1272w, https://substackcdn.com/image/fetch/$s_!lCGd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lCGd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png" width="1456" height="615" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:615,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;MemCollab&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="MemCollab" title="MemCollab" srcset="https://substackcdn.com/image/fetch/$s_!lCGd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png 424w, https://substackcdn.com/image/fetch/$s_!lCGd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png 848w, https://substackcdn.com/image/fetch/$s_!lCGd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png 1272w, https://substackcdn.com/image/fetch/$s_!lCGd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>LLM-based agents build useful memory during tasks, but that memory is typically trapped within a single model. MemCollab introduces a collaborative memory framework that constructs agent-agnostic memory by contrasting reasoning trajectories generated by different agents on the same task, enabling a single memory system to be shared across heterogeneous models.</p><ul><li><p><strong>The memory transfer problem:</strong> Existing approaches construct memory in a per-agent manner, tightly coupling stored knowledge to a single model&#8217;s reasoning style. Naively transferring this memory between agents often degrades performance because it entangles task-relevant knowledge with agent-specific biases. MemCollab directly addresses this fundamental limitation.</p></li><li><p><strong>Contrastive trajectory distillation:</strong> The framework contrasts reasoning trajectories from different agents solving the same tasks. 
This contrastive process distills abstract reasoning constraints that capture shared task-level invariants while suppressing agent-specific artifacts, producing memory that any agent can benefit from.</p></li><li><p><strong>Task-aware retrieval:</strong> MemCollab introduces a retrieval mechanism that conditions memory access on task category, ensuring that only relevant constraints are surfaced at inference time. This prevents irrelevant memory from interfering with the agent&#8217;s reasoning process.</p></li><li><p><strong>Cross-family improvements:</strong> Experiments on mathematical reasoning and code generation benchmarks demonstrate that MemCollab consistently improves both accuracy and inference-time efficiency across diverse agents, including cross-model-family settings where memory is shared between fundamentally different model architectures.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.23234">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2036885342134173915">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
Composer 2</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jgn7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jgn7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png 424w, https://substackcdn.com/image/fetch/$s_!jgn7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png 848w, https://substackcdn.com/image/fetch/$s_!jgn7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png 1272w, https://substackcdn.com/image/fetch/$s_!jgn7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jgn7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png" width="1456" height="582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Composer 
2&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Composer 2" title="Composer 2" srcset="https://substackcdn.com/image/fetch/$s_!jgn7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png 424w, https://substackcdn.com/image/fetch/$s_!jgn7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png 848w, https://substackcdn.com/image/fetch/$s_!jgn7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png 1272w, https://substackcdn.com/image/fetch/$s_!jgn7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 
11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Cursor releases the technical report for Composer 2, a specialized model designed for agentic software engineering that demonstrates strong long-term planning and coding intelligence while maintaining efficiency for interactive use. The report details a process for training domain-specialized models that starts with continued pretraining and scales up with reinforcement learning.</p><ul><li><p><strong>Two-phase training pipeline:</strong> The model is trained first with continued pretraining to improve knowledge and latent coding ability, followed by large-scale reinforcement learning to improve end-to-end coding performance. The RL phase targets stronger reasoning, accurate multi-step execution, and coherence on long-horizon realistic coding problems.</p></li><li><p><strong>Train-in-harness infrastructure:</strong> Cursor developed infrastructure to support training in the same harness used by the deployed model, with equivalent tools and structure. Training environments match real problems closely, bridging the gap between training-time and deployment-time behavior.</p></li><li><p><strong>New internal benchmark:</strong> To measure the model on increasingly difficult tasks, the team introduces CursorBench, a benchmark derived from real software engineering problems in large codebases, including their own. 
Composer 2 achieves a major improvement in accuracy over previous Composer models on this benchmark.</p></li><li><p><strong>Frontier-level performance:</strong> On public benchmarks, the model scores 61.7 on Terminal-Bench and 73.7 on SWE-bench Multilingual in Cursor&#8217;s harness, comparable to state-of-the-art systems. The report demonstrates that domain-specialized training with RL can produce models competitive with much larger general-purpose systems.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.24477">Paper</a></strong> | <strong><a href="https://x.com/cursor_ai/status/2036566134468542651?s=20">Tweet</a></strong></p><div><hr></div><h2><strong>8. PivotRL</strong></h2><p>PivotRL is a turn-level reinforcement learning algorithm from NVIDIA designed to tractably post-train large language models for long-horizon agentic tasks. The method operates on existing SFT trajectories, combining the compute efficiency of supervised fine-tuning with the out-of-domain accuracy of end-to-end RL. PivotRL identifies &#8220;pivots,&#8221; informative intermediate turns where sampled actions exhibit high variance in outcomes, and focuses training signal on these critical decision points. The approach achieves +4.17% higher in-domain accuracy and +10.04% higher out-of-domain accuracy compared to standard SFT, while matching end-to-end RL accuracy with 4x fewer rollout turns. PivotRL is adopted by NVIDIA&#8217;s Nemotron-3-Super-120B-A12B as the workhorse for production-scale agentic post-training.</p><p><strong><a href="https://arxiv.org/abs/2603.21383">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2038015536253272145?s=20">Tweet</a></strong></p><div><hr></div><h2><strong>9. Workflow Optimization for LLM Agents</strong></h2><p>A comprehensive survey from IBM that maps recent methods for designing and optimizing LLM agent workflows, treating them as agentic computation graphs (ACGs). 
The survey organizes prior work along three dimensions: when structure is determined, what part of the workflow is optimized, and which evaluation signals guide optimization. It distinguishes between reusable workflow templates, run-specific realized graphs, and execution traces, covering methods like AFlow (Monte Carlo Tree Search over operator graphs), Automated Design of Agentic Systems (code-space search via meta-agents), and evolutionary multi-agent system design. A useful reference for teams building production agent systems where wiring decisions between model calls, retrieval, tool use, and verification matter as much as model capability.</p><p><strong><a href="https://arxiv.org/abs/2603.22386">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2037536637954212332">Tweet</a></strong></p><div><hr></div><h2><strong>10. BIGMAS</strong></h2><p>Even the best reasoning models hit an accuracy collapse beyond a certain problem complexity. BIGMAS (Brain-Inspired Graph Multi-Agent Systems) organizes specialized LLM agents as nodes in a dynamically constructed directed graph, coordinating exclusively through a centralized shared workspace inspired by global workspace theory from cognitive neuroscience. A GraphDesigner agent analyzes each problem instance and produces a task-specific directed agent graph together with a workspace contract. The framework constructs structurally distinct graphs whose complexity tracks task demands, from compact three-node pipelines for simple arithmetic to nine-node cyclic structures for multi-step planning. 
BIGMAS consistently improves reasoning performance for both standard LLMs and large reasoning models, outperforming existing multi-agent baselines.</p><p><strong><a href="https://arxiv.org/abs/2603.15371">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2033919566053826696">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Hyperagents, Multi-Agent Harness Design, Chroma Context-1, Composer 2, ARC-AGI-3, and More]]></title><description><![CDATA[Hyperagents, Multi-Agent Harness Design, Chroma Context-1, Composer 2, ARC-AGI-3, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-hyperagents-multi</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-hyperagents-multi</guid><pubDate>Sat, 28 Mar 2026 15:01:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ofCB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Hyperagents: self-improving agents that improve how they improve</p></li><li><p>Anthropic publishes multi-agent harness design</p></li><li><p>Chroma ships Context-1 open-source search agent</p></li><li><p>Cursor releases Composer 2 technical report</p></li><li><p>ARC-AGI-3 launches with sub-1% AI scores</p></li><li><p>Codex ships plugins for Slack, Figma, Notion</p></li><li><p>Gemini 3.1 Flash Live enables realtime voice agents</p></li><li><p>Claude Code auto mode skips permissions safely</p></li><li><p>AI Scientist published in Nature</p></li><li><p>Anthropic Economic Index tracks learning curves</p></li><li><p>Junyang Lin frames reasoning vs. 
agentic thinking</p></li><li><p>Cohere ships open-source Transcribe model</p></li><li><p>Agent-to-agent pair programming with Claude and Codex</p></li><li><p>Claude Code ships cloud-scheduled tasks</p></li><li><p>Cursor builds Instant Grep for millisecond search</p></li><li><p>OpenSpace: self-evolving agent skills via MCP</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Hyperagents: Self-Improving Agents That Improve How They Improve</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ofCB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ofCB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png 424w, https://substackcdn.com/image/fetch/$s_!ofCB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png 848w, https://substackcdn.com/image/fetch/$s_!ofCB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png 1272w, https://substackcdn.com/image/fetch/$s_!ofCB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ofCB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png" width="1456" height="546" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:546,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Hyperagents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Hyperagents" title="Hyperagents" srcset="https://substackcdn.com/image/fetch/$s_!ofCB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png 424w, https://substackcdn.com/image/fetch/$s_!ofCB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png 848w, https://substackcdn.com/image/fetch/$s_!ofCB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png 1272w, https://substackcdn.com/image/fetch/$s_!ofCB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A team from Microsoft Research, Oxford, and the University of British Columbia introduced Hyperagents, self-referential agents that integrate a task agent and a meta agent into a single editable program. Built on the Darwin Godel Machine framework, DGM-Hyperagents enable metacognitive self-modification where the system improves not just task performance but the very mechanism that generates future improvements.</p><ul><li><p><strong>Recursive self-improvement:</strong> Unlike standard self-improving systems that optimize task-level behavior, Hyperagents make the improvement procedure itself editable. 
The meta agent can rewrite its own modification strategy, enabling compounding gains across successive runs.</p></li><li><p><strong>Domain-general design:</strong> The framework eliminates domain-specific alignment assumptions found in prior self-improving systems. By operating over editable code rather than domain-locked prompts, Hyperagents generalize self-improvement to any computable task.</p></li><li><p><strong>Transferable meta-level gains:</strong> Improvements discovered in one domain, such as memory management and performance tracking routines, persist and transfer when the agent is deployed on entirely different problem types, suggesting durable architectural gains rather than task-specific shortcuts.</p></li><li><p><strong>Outperforms prior self-improving systems:</strong> DGM-Hyperagents consistently outperform both non-self-improving baselines and prior self-improving agents across diverse evaluation domains, with performance continuing to increase over longer run horizons.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.19461">Paper</a></strong></p><div><hr></div><h3><strong>Multi-Agent Harness Design for Long-Running Apps</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ElLS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ElLS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png 424w, 
https://substackcdn.com/image/fetch/$s_!ElLS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png 848w, https://substackcdn.com/image/fetch/$s_!ElLS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png 1272w, https://substackcdn.com/image/fetch/$s_!ElLS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ElLS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png" width="1456" height="734" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:734,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Multi-Agent Harness Design&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Multi-Agent Harness Design" title="Multi-Agent Harness Design" srcset="https://substackcdn.com/image/fetch/$s_!ElLS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png 424w, 
https://substackcdn.com/image/fetch/$s_!ElLS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png 848w, https://substackcdn.com/image/fetch/$s_!ElLS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png 1272w, https://substackcdn.com/image/fetch/$s_!ElLS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Anthropic published a detailed engineering blog on how it uses a multi-agent harness to push Claude further in frontend design and long-running autonomous software engineering. The architecture separates generation from evaluation using a GAN-inspired system, with specialized planner, generator, and evaluator agents operating in fresh context windows.</p><ul><li><p><strong>Three-agent architecture:</strong> A Planner expands brief prompts into detailed product specifications, a Generator implements features incrementally using React, FastAPI, and SQLite, and an Evaluator tests functionality using Playwright against agreed contracts.</p></li><li><p><strong>Separation of concerns:</strong> Separating the agent doing the work from the agent judging it proved to be the strongest lever for improving output quality, more tractable than making agents self-critical within a single context.</p></li><li><p><strong>Fresh context windows:</strong> Rather than relying on context compaction alone, the harness gives each agent a clean context window per iteration, eliminating &#8220;context anxiety&#8221; where models prematurely wrap up long tasks.</p></li><li><p><strong>Quality at cost:</strong> A complex retro game maker built with the full harness demonstrated substantially better quality than solo attempts, with working features, coherent design, and integrated AI capabilities, despite 20x higher costs.</p></li></ul><p><strong><a href="https://www.anthropic.com/engineering/harness-design-long-running-apps">Blog</a></strong></p>
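</line><line x1=&quot;3&quot;">
<p>The planner/generator/evaluator separation with fresh context windows described above can be sketched as a simple loop. This is an illustrative sketch only: <code>call_model</code>, <code>run_harness</code>, and the role prompts are hypothetical stand-ins, not Anthropic&#8217;s actual harness, tools, or API.</p>

```python
# Hypothetical sketch of a three-agent harness loop: all names and
# prompts are illustrative stand-ins, not Anthropic's real system.

def call_model(role: str, prompt: str, context: list[str]) -> str:
    """Stand-in for an LLM call; returns a canned response for the demo."""
    return f"[{role} output for: {prompt[:40]}]"

def run_harness(task: str, max_iterations: int = 3) -> list[str]:
    # The Planner expands the brief prompt into a detailed spec once, up front.
    spec = call_model("planner", f"Expand into a detailed spec: {task}", context=[])

    history: list[str] = []
    for _ in range(max_iterations):
        # Each agent gets a *fresh* context window per iteration; only the
        # spec and a compact summary of prior feedback carry over, which is
        # what avoids the "context anxiety" failure mode described above.
        feedback_summary = history[-1] if history else "no prior feedback"
        change = call_model(
            "generator",
            f"Implement the next increment of: {spec}",
            context=[feedback_summary],
        )
        # The Evaluator never shares the Generator's context, so it judges
        # the artifact against the agreed contract, not the Generator's intent.
        verdict = call_model(
            "evaluator",
            f"Test this change against the spec: {change}",
            context=[spec],
        )
        history.append(verdict)
    return history

results = run_harness("retro game maker")
print(len(results))  # one evaluator verdict per iteration
```

<p>The key design choice mirrored here is the separation of concerns: the agent doing the work and the agent judging it never share a context window, so evaluation stays grounded in the contract rather than the generator&#8217;s self-assessment.</p>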
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-hyperagents-multi">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (March 9 - March 15)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-b8c</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-b8c</guid><pubDate>Sun, 15 Mar 2026 15:02:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!XWY3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. OpenDev</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XWY3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XWY3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png 424w, https://substackcdn.com/image/fetch/$s_!XWY3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png 848w, https://substackcdn.com/image/fetch/$s_!XWY3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XWY3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XWY3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png" width="998" height="477" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0465e70-d947-488c-9565-9924593322a9_998x477.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:477,&quot;width&quot;:998,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;OpenDev&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="OpenDev" title="OpenDev" srcset="https://substackcdn.com/image/fetch/$s_!XWY3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png 424w, https://substackcdn.com/image/fetch/$s_!XWY3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png 848w, https://substackcdn.com/image/fetch/$s_!XWY3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XWY3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Terminal-native coding agents represent a fundamental shift in how developers interact with AI assistance. 
OpenDev is an open-source, command-line coding agent that operates where developers already manage source control and deploy environments. It is accompanied by a comprehensive 81-page technical report on scaffolding, harness design, context engineering, and lessons learned from building production coding agents.</p><ul><li><p><strong>Dual-agent architecture:</strong> OpenDev separates planning from execution through a compound AI system with workload-specialized model routing. Work is organized into concurrent sessions, each composed of multiple specialized sub-agents that independently bind to a user-configured LLM, enabling fine-grained model selection for different tasks.</p></li><li><p><strong>Adaptive context compaction:</strong> Effective autonomous assistance requires highly efficient context management to prevent context bloat and reasoning degradation. OpenDev implements lazy tool discovery and adaptive compaction of older observations, keeping the agent&#8217;s working memory lean as tasks grow in complexity.</p></li><li><p><strong>Automated project memory:</strong> The system incorporates automated memory for project-specific knowledge and event-driven reminders to prevent instruction fade-out. This ensures that the agent retains critical project context across sessions without manual intervention.</p></li><li><p><strong>Four-layer architecture:</strong> The system spans agent reasoning, context engineering, tooling, and persistence layers. This modular design provides a secure, extensible foundation for terminal-first AI assistance in which each layer can evolve independently.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.05344">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2030771811705872435">Tweet</a></strong></p><div><hr></div><h2><strong>2. 
AutoHarness</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zSBw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zSBw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png 424w, https://substackcdn.com/image/fetch/$s_!zSBw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png 848w, https://substackcdn.com/image/fetch/$s_!zSBw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png 1272w, https://substackcdn.com/image/fetch/$s_!zSBw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zSBw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png" width="528" height="250" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:250,&quot;width&quot;:528,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AutoHarness&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AutoHarness" title="AutoHarness" srcset="https://substackcdn.com/image/fetch/$s_!zSBw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png 424w, https://substackcdn.com/image/fetch/$s_!zSBw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png 848w, https://substackcdn.com/image/fetch/$s_!zSBw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png 1272w, https://substackcdn.com/image/fetch/$s_!zSBw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Google DeepMind researchers introduce AutoHarness, a method for automatically synthesizing code harnesses that prevent LLM agents from making illegal actions. The core insight comes from a striking observation: in the Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves, not poor strategy.</p><ul><li><p><strong>Automatic harness synthesis:</strong> Rather than building complex rule systems by hand, AutoHarness lets Gemini-2.5-Flash automatically generate a code harness through a small number of iterative refinement rounds using feedback from the game environment. The harness acts as a programmatic constraint layer between the agent and the environment.</p></li><li><p><strong>Smaller models beat larger ones:</strong> The resulting harness enables the smaller Gemini-2.5-Flash to outperform much larger models including Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena single-player games. 
This shows that structured code constraints can compensate for limited raw model capability.</p></li><li><p><strong>Complete illegal move prevention:</strong> The synthesized harness successfully prevents all illegal moves across 145 different TextArena games, covering both single-player and two-player settings. This transforms a model that previously failed on most turns into a competitive agent.</p></li><li><p><strong>Cost-effective scaling:</strong> Using a smaller model to synthesize a custom code harness is not only more performant but also more cost-effective than simply deploying a larger model. This reframes the agent improvement problem from model scaling to harness engineering.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.03329">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2032110243665088950">Tweet</a></strong></p><div><hr></div><h2><strong>3. SkillNet</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JtHB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JtHB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png 424w, https://substackcdn.com/image/fetch/$s_!JtHB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png 848w, 
https://substackcdn.com/image/fetch/$s_!JtHB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png 1272w, https://substackcdn.com/image/fetch/$s_!JtHB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JtHB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png" width="793" height="282" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:282,&quot;width&quot;:793,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;SkillNet&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="SkillNet" title="SkillNet" srcset="https://substackcdn.com/image/fetch/$s_!JtHB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png 424w, https://substackcdn.com/image/fetch/$s_!JtHB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png 848w, 
https://substackcdn.com/image/fetch/$s_!JtHB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png 1272w, https://substackcdn.com/image/fetch/$s_!JtHB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>AI agents repeatedly rediscover solutions across separate scenarios instead of systematically reusing what they have already learned. 
SkillNet introduces an open infrastructure designed to create, evaluate, and organize AI skills at scale, enabling agents to transition from transient experience to durable mastery.</p><ul><li><p><strong>Unified skill ontology:</strong> Skills are structured within a unified ontology that supports creation from heterogeneous sources, including code libraries, prompt templates, and tool compositions. Rich relational connections between skills enable discovery and composition that would be impossible with flat skill stores.</p></li><li><p><strong>Multi-dimensional evaluation:</strong> Every skill is assessed across five dimensions: Safety, Completeness, Executability, Maintainability, and Cost-awareness. This systematic evaluation ensures that skills entering the repository meet quality thresholds before agents rely on them in production.</p></li><li><p><strong>Massive skill repository:</strong> SkillNet includes a repository of over 200,000 skills, an interactive platform for skill browsing and management, and a Python toolkit for programmatic access. This scale enables meaningful skill retrieval and composition across diverse task domains.</p></li><li><p><strong>Consistent agent improvements:</strong> Experimental evaluations on ALFWorld, WebShop, and ScienceWorld demonstrate that SkillNet significantly enhances agent performance, improving average rewards by 40% and reducing execution steps by 30% across multiple backbone models.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.04448">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2030692286317961280">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
The Spike, the Sparse and the Sink</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5OA_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5OA_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png 424w, https://substackcdn.com/image/fetch/$s_!5OA_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png 848w, https://substackcdn.com/image/fetch/$s_!5OA_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png 1272w, https://substackcdn.com/image/fetch/$s_!5OA_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5OA_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png" width="1018" height="648" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f710528-412e-4601-aedc-50462419c3dd_1018x648.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:648,&quot;width&quot;:1018,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Spike, the Sparse and the Sink&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Spike, the Sparse and the Sink" title="The Spike, the Sparse and the Sink" srcset="https://substackcdn.com/image/fetch/$s_!5OA_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png 424w, https://substackcdn.com/image/fetch/$s_!5OA_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png 848w, https://substackcdn.com/image/fetch/$s_!5OA_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png 1272w, https://substackcdn.com/image/fetch/$s_!5OA_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Yann LeCun and collaborators at NYU dissect two recurring phenomena in Transformer language models: massive activations, where a small number of tokens exhibit extreme outliers in specific channels, and attention sinks, where certain tokens attract disproportionate attention mass regardless of semantic relevance. The paper reveals that their co-occurrence is largely an architectural artifact.</p><ul><li><p><strong>Distinct operational scopes:</strong> Massive activations operate globally, inducing near-constant hidden representations that persist across layers and function as implicit model parameters. 
Attention sinks operate locally, modulating attention outputs across heads and biasing individual heads toward short-range dependencies.</p></li><li><p><strong>Pre-norm as the critical factor:</strong> The pre-norm configuration common in modern Transformers is identified as the key architectural element enabling the co-occurrence of these two phenomena. Removing pre-norm causes massive activations and attention sinks to decouple entirely.</p></li><li><p><strong>Practical implications for efficiency:</strong> Understanding these phenomena has direct consequences for model compression, quantization, and KV-cache optimization. Many efficiency techniques fail silently when they inadvertently disrupt massive activations or attention sinks, and this paper explains why.</p></li><li><p><strong>Not functionally necessary:</strong> The co-occurrence of spikes and sinks is a design-dependent artifact rather than a fundamental requirement for model performance. This opens the door to architectural modifications that could eliminate these phenomena without sacrificing capability.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.05498">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2030403147588604376">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IHU6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IHU6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!IHU6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IHU6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!IHU6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IHU6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!IHU6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!IHU6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IHU6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!IHU6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a>&#8221;. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.</p><p><strong><a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Enroll Now</a></strong></p><div><hr></div><h2><strong>5. KARL</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0EK1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0EK1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png 424w, https://substackcdn.com/image/fetch/$s_!0EK1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png 848w, https://substackcdn.com/image/fetch/$s_!0EK1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png 1272w, https://substackcdn.com/image/fetch/$s_!0EK1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!0EK1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png" width="798" height="250" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:250,&quot;width&quot;:798,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;KARL&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="KARL" title="KARL" srcset="https://substackcdn.com/image/fetch/$s_!0EK1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png 424w, https://substackcdn.com/image/fetch/$s_!0EK1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png 848w, https://substackcdn.com/image/fetch/$s_!0EK1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png 1272w, https://substackcdn.com/image/fetch/$s_!0EK1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Databricks presents KARL, a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. The work also introduces KARLBench, a new evaluation framework spanning six search domains.</p><ul><li><p><strong>New post-training paradigm (OAPL):</strong> KARL concurrently develops OAPL, an iterative large-batch off-policy RL approach. 
By embracing off-policyness in the design of the objective, it is robust to discrepancies between the trainer and the inference engine without requiring heuristics like clipped importance weighting or data deletion.</p></li><li><p><strong>Multi-task heterogeneous training:</strong> Rather than optimizing for a single benchmark, KARL trains across heterogeneous search behaviors including constraint-driven entity search, cross-document synthesis, tabular reasoning, entity retrieval, procedural reasoning, and fact aggregation. This produces substantially better generalization than single-benchmark optimization.</p></li><li><p><strong>Pareto-optimal performance:</strong> Starting from GLM 4.5 Air with varying levels of test-time scaling, KARL is Pareto-optimal on KARLBench when compared to Claude 4.6 and GPT 5.2 across both cost-quality and latency-quality tradeoffs.</p></li><li><p><strong>Scalable with test-time compute:</strong> KARL-BCP attains 59.6 on BrowseComp-Plus, which further improves to 70.4 with value-guided search. KARL-TREC reaches 85.0 on TREC-Biogen, the second-highest score overall. The system surpasses the strongest closed models given sufficient test-time compute.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.05218">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2030996795770433749">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
Memex(RL)</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7qR-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7qR-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png 424w, https://substackcdn.com/image/fetch/$s_!7qR-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png 848w, https://substackcdn.com/image/fetch/$s_!7qR-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png 1272w, https://substackcdn.com/image/fetch/$s_!7qR-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7qR-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png" width="674" height="322" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:322,&quot;width&quot;:674,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Memex(RL)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Memex(RL)" title="Memex(RL)" srcset="https://substackcdn.com/image/fetch/$s_!7qR-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png 424w, https://substackcdn.com/image/fetch/$s_!7qR-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png 848w, https://substackcdn.com/image/fetch/$s_!7qR-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png 1272w, https://substackcdn.com/image/fetch/$s_!7qR-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As tasks get longer and more complex, LLM agents lose track of what they have learned, what they have tried, and what still needs to be done. Memex(RL) introduces an indexed experience memory mechanism that scales agent capability on long-horizon tasks without discarding evidence or blowing up the context window.</p><ul><li><p><strong>Indexed experience memory:</strong> Rather than lossy compression, Memex maintains a compact working context consisting of concise structured summaries and stable indices while storing full-fidelity underlying interactions in an external experience database. The agent decides what to summarize, what to archive, how to index it, and when to retrieve it.</p></li><li><p><strong>RL-optimized memory operations:</strong> The MemexRL reinforcement learning framework optimizes both write and read behaviors with reward shaping tailored to indexed memory usage under a context budget. 
This teaches the agent to manage its own memory strategically rather than relying on fixed heuristics.</p></li><li><p><strong>Bounded retrieval complexity:</strong> Theoretical analysis demonstrates that Memex can maintain decision quality with bounded retrieval operations while keeping computational load manageable as task history grows. This makes the approach practical for tasks that span hundreds or thousands of steps.</p></li><li><p><strong>Smaller context, better results:</strong> Empirically, agents trained with MemexRL improve task success rates on challenging long-horizon tasks while using a significantly smaller working context than baseline approaches. Less context, used more intelligently, outperforms brute-force context expansion.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.04257">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2031006858971058537">Tweet</a></strong></p><div><hr></div><h2><strong>7. FlashAttention-4</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DHPV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DHPV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png 424w, https://substackcdn.com/image/fetch/$s_!DHPV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png 848w, 
https://substackcdn.com/image/fetch/$s_!DHPV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png 1272w, https://substackcdn.com/image/fetch/$s_!DHPV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DHPV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png" width="1456" height="805" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:805,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;FlashAttention-4&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="FlashAttention-4" title="FlashAttention-4" srcset="https://substackcdn.com/image/fetch/$s_!DHPV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png 424w, https://substackcdn.com/image/fetch/$s_!DHPV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png 848w, 
https://substackcdn.com/image/fetch/$s_!DHPV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png 1272w, https://substackcdn.com/image/fetch/$s_!DHPV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>FlashAttention-4 co-designs algorithms and kernel pipelines for the B200 and GB200 GPUs, which exhibit fundamentally different 
performance characteristics due to asymmetric hardware scaling where tensor core throughput doubles while other functional units scale more slowly.</p><ul><li><p><strong>Significant speedups on Blackwell:</strong> FlashAttention-4 achieves up to 1.3x speedup over cuDNN 9.13 and 2.7x over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s at 71% hardware utilization. These gains come from careful co-design rather than algorithmic changes alone.</p></li><li><p><strong>Asymmetric scaling solutions:</strong> The techniques include redesigned pipelines that exploit fully asynchronous matrix multiply operations and larger tile sizes, software-emulated exponential and conditional softmax rescaling, and leveraging tensor memory to reduce shared memory traffic.</p></li><li><p><strong>Python-native implementation:</strong> The entire system is implemented in CuTe-DSL embedded in Python, achieving 20-30x faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity. This dramatically lowers the barrier to kernel development.</p></li><li><p><strong>Hardware-algorithm co-design:</strong> The paper demonstrates that next-generation GPU architectures demand fundamentally new attention kernel designs rather than incremental optimizations of existing ones. Techniques that worked well on Hopper GPUs leave significant performance on the table on Blackwell.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.05451">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2030411164060889466">Tweet</a></strong></p><div><hr></div><h2><strong>8. STRUCTUREDAGENT</strong></h2><p>STRUCTUREDAGENT introduces a hierarchical planning framework for long-horizon web tasks using dynamic AND/OR trees. The framework separates planning responsibilities: the system constructs and maintains the planning tree while the LLM is invoked only for local operations like node expansion or repair. 
A structured memory module tracks candidate solutions to improve constraint satisfaction. Results on WebVoyager, WebArena, and custom shopping benchmarks show improved performance over standard LLM-based web agents, with the added benefit of interpretable hierarchical plans that enable easier debugging and human intervention.</p><p><strong><a href="https://arxiv.org/abs/2603.05294">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2030681964664213509">Tweet</a></strong></p><div><hr></div><h2><strong>9. AgentIR</strong></h2><p>Deep research agents generate explicit reasoning before every search call, but existing retrievers completely ignore these rich signals about search intent and problem context. AgentIR introduces reasoning-aware retrieval that jointly embeds the agent&#8217;s reasoning trace alongside its query, along with DR-Synth, a data synthesis method for generating training data from standard QA datasets. On BrowseComp-Plus, AgentIR-4B achieves 68% accuracy with Tongyi-DeepResearch compared to 50% with conventional embedding models twice its size and 37% with BM25.</p><p><strong><a href="https://arxiv.org/abs/2603.04384">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2031726356292407366">Tweet</a></strong></p><div><hr></div><h2><strong>10. Think Harder or Know More</strong></h2><p>This paper investigates transformer models featuring both adaptive per-layer looping, where each block learns to iterate its hidden state via a learned halting mechanism, and gated memory banks that provide additional learned storage. The key finding is that looping primarily benefits mathematical reasoning while memory banks help recover performance on commonsense tasks. Combining both mechanisms yields a model that outperforms an iso-FLOP baseline with three times the number of layers on math benchmarks. 
Analysis of model internals reveals layer specialization: early layers loop minimally and access memory sparingly, while later layers do both more heavily.</p><p><strong><a href="https://arxiv.org/abs/2603.08391">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2032107624007876781">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Claude Code Review, AutoHarness, Perplexity Personal Computer, Cloudflare /crawl, Context7 CLI, and More]]></title><description><![CDATA[Claude Code Review, AutoHarness, Perplexity Personal Computer, Cloudflare /crawl, Context7 CLI, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-code-review</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-code-review</guid><pubDate>Sat, 14 Mar 2026 14:45:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!cwFo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Claude ships multi-agent Code Review</p></li><li><p>AutoHarness makes small agents beat large ones</p></li><li><p>Perplexity launches an always-on Personal Computer</p></li><li><p>Cloudflare ships a one-call /crawl endpoint</p></li><li><p>Context7 CLI brings docs to any agent</p></li><li><p>Andrew Ng launches Context Hub</p></li><li><p>Cursor Marketplace adds 30+ plugins</p></li><li><p>OpenAI shares Skills for Agents SDK</p></li><li><p>Google launches Gemini Embedding 2</p></li><li><p>Meta ships four MTIA chips in two years</p></li><li><p>Codex agent files taxes, catches $20K error</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2>Top Stories</h2><h3>Claude Code Review</h3><div class="captioned-image-container"><figure><a class="image-link 
image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IvHF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IvHF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!IvHF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IvHF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!IvHF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IvHF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Claude Code 
Review&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude Code Review" title="Claude Code Review" srcset="https://substackcdn.com/image/fetch/$s_!IvHF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!IvHF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IvHF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!IvHF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 
12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Anthropic launched Code Review for Claude Code, an automated system that dispatches multiple AI agents to examine every pull request. Instead of a single pass, parallel agents identify potential issues, verify findings to eliminate false positives, and rank bugs by severity, delivering a consolidated overview comment plus targeted inline annotations.</p><ul><li><p><strong>Multi-agent architecture:</strong> The system operates in parallel agents that scan, verify, and prioritize issues independently, producing both a summary comment and inline code annotations for specific problems.</p></li><li><p><strong>Scales with complexity:</strong> Review depth adjusts based on PR size. Large PRs (over 1,000 lines) received findings 84% of the time, averaging 7.5 issues per PR. Small PRs (under 50 lines) had findings 31% of the time.</p></li><li><p><strong>High precision:</strong> Less than 1% of flagged issues were marked incorrect by Anthropic engineers, with the system catching production-critical bugs that appeared routine in diffs.</p></li><li><p><strong>Pricing and access:</strong> Available now as a research preview for Team and Enterprise customers. 
Reviews average $15-25 per PR, billed on token usage, with configurable monthly caps and per-repo controls.</p></li></ul><p><strong><a href="https://claude.com/blog/code-review">Blog</a></strong></p><div><hr></div><h3>AutoHarness: Automated Agent Constraint Synthesis</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cwFo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cwFo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png 424w, https://substackcdn.com/image/fetch/$s_!cwFo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png 848w, https://substackcdn.com/image/fetch/$s_!cwFo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png 1272w, https://substackcdn.com/image/fetch/$s_!cwFo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cwFo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png" width="1456" height="814" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:231601,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nlp.elvissaravia.com/i/190904545?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cwFo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png 424w, https://substackcdn.com/image/fetch/$s_!cwFo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png 848w, https://substackcdn.com/image/fetch/$s_!cwFo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png 1272w, https://substackcdn.com/image/fetch/$s_!cwFo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Researchers introduced AutoHarness, a technique that lets LLMs automatically synthesize protective code harnesses around themselves, preventing illegal actions without human-written constraints. Instead of relying on larger, more expensive models, the approach uses iterative code refinement with environmental feedback to generate custom safeguards that make smaller models outperform bigger unconstrained ones.</p><ul><li><p><strong>Massive illegal action problem:</strong> In a recent LLM chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. 
AutoHarness eliminates this class of failure entirely by generating harnesses that enforce valid actions across 145 different TextArena games.</p></li><li><p><strong>Small beats large:</strong> Gemini-2.5-Flash with a synthesized harness exceeded Gemini-2.5-Pro&#8217;s performance while reducing costs, demonstrating that proper constraints are more valuable than raw model scale for agent environments.</p></li><li><p><strong>Zero-shot generalization:</strong> The technique extends beyond game-playing to generating full policies in code, eliminating runtime LLM decision-making entirely and achieving higher rewards than GPT-5.2-High on certain benchmarks.</p></li><li><p><strong>Practical agent pattern:</strong> The core insight applies broadly to any agent deployment: rather than trusting a model to self-constrain, auto-generate a verified harness that makes illegal states unreachable, shifting safety from model behavior to environment design.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.03329">Paper</a></strong></p>
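<p>The harness pattern described above can be sketched in a few lines. This is a minimal illustration of the idea (validate every proposed action against the environment's legal set so illegal states are unreachable), not the paper's implementation; the names <code>harness</code>, <code>propose</code>, and the chess-style move strings are hypothetical.</p>

```python
import random

def harness(propose, legal_actions, max_retries=3):
    """Wrap an action proposer so illegal actions never reach the
    environment: retry on illegal proposals, then fall back to a
    known-legal action. Illustrative sketch of the harness idea."""
    legal = set(legal_actions)
    for _ in range(max_retries):
        action = propose()
        if action in legal:
            return action
    # Proposer kept emitting illegal actions; pick any legal one instead.
    return random.choice(sorted(legal))

# Stub "model" that proposes an illegal move first, then a legal one.
proposals = iter(["Kxh9", "e4"])  # "Kxh9" is not in the legal set below
action = harness(lambda: next(proposals), legal_actions=["e4", "d4", "Nf3"])
```

<p>The point of the pattern is that safety moves from model behavior into environment design: whatever the model emits, only a legal action can be executed.</p>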
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-code-review">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (March 1 - March 8)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-8c6</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-8c6</guid><pubDate>Sun, 08 Mar 2026 15:01:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2M4x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. NeuroSkill</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2M4x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2M4x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png 424w, https://substackcdn.com/image/fetch/$s_!2M4x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png 848w, https://substackcdn.com/image/fetch/$s_!2M4x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2M4x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2M4x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png" width="1456" height="882" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:882,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;NeuroSkill&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="NeuroSkill" title="NeuroSkill" srcset="https://substackcdn.com/image/fetch/$s_!2M4x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png 424w, https://substackcdn.com/image/fetch/$s_!2M4x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png 848w, https://substackcdn.com/image/fetch/$s_!2M4x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2M4x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>MIT researchers introduce NeuroSkill, a real-time proactive agentic system that models human cognitive and emotional state by integrating Brain-Computer Interface (BCI) signals with foundation EXG models and text embeddings.
Unlike reactive agents that wait for explicit commands, NeuroSkill operates proactively, interpreting biophysical and neural signals to anticipate user needs.</p><ul><li><p><strong>Custom agent harness - NeuroLoop:</strong> The system runs an agentic flow called NeuroLoop that engages with the user on multiple cognitive and affective levels, including empathy. It processes BCI signals through a foundation EXG model, converts them to state-of-mind descriptions, and uses those descriptions to drive actionable tool calls and protocol execution.</p></li><li><p><strong>Fully offline edge deployment:</strong> The entire system runs locally on edge devices with no network dependency. This is a significant design choice for both privacy and latency, enabling real-time responsiveness to shifting cognitive states without cloud round-trips.</p></li><li><p><strong>Proactive vs reactive interaction:</strong> NeuroSkill handles both explicit and implicit requests from the user. By continuously reading brain signals, it can detect confusion, cognitive overload, or emotional shifts and adjust its behavior before the user explicitly asks for help.</p></li><li><p><strong>Open-source with ethical licensing:</strong> Released under GPLv3 with an ethically aligned AI100 licensing framework for the skill markdown, making the system reproducible and auditable while enforcing responsible use guardrails.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.03212">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2029201212596519070">Tweet</a></strong></p><div><hr></div><h2><strong>2. 
Bayesian Teaching for LLMs</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e2LD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e2LD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png 424w, https://substackcdn.com/image/fetch/$s_!e2LD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png 848w, https://substackcdn.com/image/fetch/$s_!e2LD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png 1272w, https://substackcdn.com/image/fetch/$s_!e2LD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e2LD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png" width="997" height="542" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:542,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Bayesian Teaching for LLMs&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Bayesian Teaching for LLMs" title="Bayesian Teaching for LLMs" srcset="https://substackcdn.com/image/fetch/$s_!e2LD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png 424w, https://substackcdn.com/image/fetch/$s_!e2LD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png 848w, https://substackcdn.com/image/fetch/$s_!e2LD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png 1272w, https://substackcdn.com/image/fetch/$s_!e2LD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Google researchers introduce a method to teach LLMs to reason like Bayesians by fine-tuning on interactions with a Bayesian Assistant that represents optimal probabilistic inference. LLMs normally fall far short of normative Bayesian reasoning, but this training approach dramatically improves their ability to update predictions based on new evidence.</p><ul><li><p><strong>Bayesian Assistant as teacher:</strong> The method constructs synthetic training data from interactions between users and an idealized Bayesian Assistant. By exposing the LLM to examples of optimal belief updating, the model learns to approximate Bayesian inference without any architectural changes.</p></li><li><p><strong>Generalization to new tasks:</strong> The trained models do not just memorize the training distributions. 
They generalize probabilistic reasoning to entirely new task types, suggesting that Bayesian inference can be instilled as a transferable capability through carefully designed fine-tuning data.</p></li><li><p><strong>Closing the gap with normative models:</strong> Before training, LLMs show systematic deviations from Bayesian predictions, including base rate neglect and conservatism. After Bayesian teaching, these biases are substantially reduced, bringing model predictions much closer to the normative standard.</p></li><li><p><strong>Data quality over model scale:</strong> The results reinforce a recurring theme in recent research: carefully curated training data can unlock capabilities that scale alone cannot. A smaller model trained on Bayesian interactions outperforms larger models reasoning from scratch.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2503.17523">Paper</a></strong> | <strong><a href="https://x.com/GoogleResearch/status/2029295018972778883?s=20">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
Why LLMs Form Geometric Representations</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aDcc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aDcc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png 424w, https://substackcdn.com/image/fetch/$s_!aDcc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png 848w, https://substackcdn.com/image/fetch/$s_!aDcc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png 1272w, https://substackcdn.com/image/fetch/$s_!aDcc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aDcc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png" width="793" height="489" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:489,&quot;width&quot;:793,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Why LLMs Form Geometric Representations&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Why LLMs Form Geometric Representations" title="Why LLMs Form Geometric Representations" srcset="https://substackcdn.com/image/fetch/$s_!aDcc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png 424w, https://substackcdn.com/image/fetch/$s_!aDcc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png 848w, https://substackcdn.com/image/fetch/$s_!aDcc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png 1272w, https://substackcdn.com/image/fetch/$s_!aDcc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>LLMs spontaneously form striking geometric structures in their internal representations: calendar months organize into circles, historical years form spirals, and spatial coordinates align to recoverable manifolds. This paper proves these patterns are not the product of deep learning dynamics but emerge directly from symmetries in natural language statistics.</p><ul><li><p><strong>Translation symmetry as the root cause:</strong> The frequency with which any two months co-occur in text depends only on the time interval between them, not the months themselves. The authors prove this translation symmetry in co-occurrence statistics is sufficient to force circular geometry in learned representations.</p></li><li><p><strong>Analytical derivation of manifold geometry:</strong> Rather than just observing geometric structure post-hoc, the paper derives the exact manifold geometry from data statistics. 
For cyclic concepts like months or days of the week, the proof shows circular representations emerge as the optimal encoding under symmetric co-occurrence distributions.</p></li><li><p><strong>Spirals and rippled manifolds for continuums:</strong> Representations of continuous concepts like historical years or number lines organize into compact 1D manifolds with characteristic extrinsic curvature. These &#8220;rippled&#8221; structures are analytically predicted by the framework when the underlying latent variable is non-cyclic.</p></li><li><p><strong>Universal origin:</strong> The robustness of these geometric representations across different model architectures suggests a universal mechanism. Representational manifolds emerge whenever co-occurrence statistics are controlled by an underlying latent variable, regardless of model size or training details.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.15029">Paper</a></strong> | <strong><a href="https://x.com/che_shr_cat/status/2029626128566993201">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
Theory of Mind in Multi-Agent LLMs</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hed5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hed5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png 424w, https://substackcdn.com/image/fetch/$s_!hed5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png 848w, https://substackcdn.com/image/fetch/$s_!hed5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png 1272w, https://substackcdn.com/image/fetch/$s_!hed5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hed5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png" width="1456" height="528" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:528,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Theory of Mind in Multi-Agent LLMs&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Theory of Mind in Multi-Agent LLMs" title="Theory of Mind in Multi-Agent LLMs" srcset="https://substackcdn.com/image/fetch/$s_!hed5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png 424w, https://substackcdn.com/image/fetch/$s_!hed5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png 848w, https://substackcdn.com/image/fetch/$s_!hed5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png 1272w, https://substackcdn.com/image/fetch/$s_!hed5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This work introduces a multi-agent architecture combining Theory of Mind (ToM), Belief-Desire-Intention (BDI) models, and symbolic solvers for logical verification, evaluating it on resource allocation problems across multiple LLMs. The central finding is counterintuitive: simply adding cognitive mechanisms does not automatically improve coordination.</p><ul><li><p><strong>Integrated cognitive architecture:</strong> The system combines ToM for modeling other agents&#8217; mental states, BDI frameworks for structuring internal beliefs, and symbolic solvers for formal logic verification. This layered approach attempts to replicate how humans reason about collaborative partners.</p></li><li><p><strong>Model capability matters more than mechanism:</strong> The effectiveness of ToM and internal beliefs varies significantly depending on the underlying LLM. 
Stronger models benefit from cognitive mechanisms, while weaker models can actually be confused by the additional reasoning overhead.</p></li><li><p><strong>Symbolic verification as a stabilizer:</strong> Integrating symbolic solvers for logical verification helps ground agent decisions in formal constraints. The interplay between symbolic verification and cognitive mechanisms remains largely underexplored across different LLM architectures.</p></li><li><p><strong>Practical implications for multi-agent design:</strong> For builders designing systems where agents must model each other&#8217;s beliefs, the key takeaway is to match cognitive complexity to model capability. Adding ToM to an underpowered model can hurt more than help.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.00142">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2028913061260935331">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4csq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4csq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4csq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!4csq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4csq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4csq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!4csq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4csq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!4csq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4csq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<strong><a 
href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a></strong>&#8221;. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.dair.ai/courses/build-apps-with-claude-code&quot;,&quot;text&quot;:&quot;Enroll Now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.dair.ai/courses/build-apps-with-claude-code"><span>Enroll Now</span></a></p><div><hr></div><h2><strong>5. Numina-Lean-Agent</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZACp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZACp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png 424w, https://substackcdn.com/image/fetch/$s_!ZACp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png 848w, https://substackcdn.com/image/fetch/$s_!ZACp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png 1272w, https://substackcdn.com/image/fetch/$s_!ZACp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZACp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png" width="752" height="335" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:335,&quot;width&quot;:752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Numina-Lean-Agent&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Numina-Lean-Agent" title="Numina-Lean-Agent" srcset="https://substackcdn.com/image/fetch/$s_!ZACp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png 424w, https://substackcdn.com/image/fetch/$s_!ZACp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png 848w, https://substackcdn.com/image/fetch/$s_!ZACp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png 1272w, https://substackcdn.com/image/fetch/$s_!ZACp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Numina-Lean-Agent proposes a paradigm shift in automated theorem proving: instead of building complex, multi-component systems with heavy computational overhead, it directly uses a general coding agent as a formal math reasoner. Combining Claude Code with Numina-Lean-MCP, the system autonomously interacts with the Lean proof assistant while accessing theorem libraries and auxiliary reasoning tools.</p><ul><li><p><strong>General agent over specialized provers:</strong> Rather than training task-specific models, the system leverages a general-purpose coding agent.
Performance improves simply by upgrading the base model, making the approach accessible and reproducible without expensive retraining pipelines.</p></li><li><p><strong>MCP-powered tool integration:</strong> The system uses Model Context Protocol for flexible extension, including Lean-LSP-MCP for proof assistant interaction, LeanDex for semantic theorem retrieval, and an informal prover for generating detailed proof strategies.</p></li><li><p><strong>State-of-the-art results:</strong> Using Claude Opus 4.5 as the base model, Numina-Lean-Agent solves all 12 problems on Putnam 2025, matching the best closed-source systems. It also successfully formalized the Brascamp-Lieb theorem through direct collaboration with mathematicians.</p></li><li><p><strong>Open-source release:</strong> The full system and all solutions are released on GitHub under Creative Commons BY 4.0, enabling direct reproduction and extension by the research community.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2601.14027">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2028591203579822112">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
ParamMem</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HX1U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HX1U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png 424w, https://substackcdn.com/image/fetch/$s_!HX1U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png 848w, https://substackcdn.com/image/fetch/$s_!HX1U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png 1272w, https://substackcdn.com/image/fetch/$s_!HX1U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HX1U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png" width="1456" height="925" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:925,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!HX1U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png 424w, https://substackcdn.com/image/fetch/$s_!HX1U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png 848w, https://substackcdn.com/image/fetch/$s_!HX1U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png 1272w, https://substackcdn.com/image/fetch/$s_!HX1U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Self-reflection enables language agents to iteratively refine solutions, but models tend to generate repetitive reflections that add noise instead of useful signal. ParamMem introduces a parametric memory module that encodes cross-sample reflection patterns into model parameters, enabling diverse reflection generation through temperature-controlled sampling.</p><ul><li><p><strong>Diversity correlates with success:</strong> Empirical analysis reveals a strong positive correlation between reflective diversity and task success. The core problem is that standard self-reflection produces near-identical outputs across iterations, limiting the agent&#8217;s ability to explore alternative solution paths.</p></li><li><p><strong>Three-tier memory architecture:</strong> ParamAgent integrates parametric memory (cross-sample patterns encoded in parameters), episodic memory (individual task instances), and cross-sample memory (broader learning patterns). 
This combination captures both local task context and global reflection strategies.</p></li><li><p><strong>Weak-to-strong transfer:</strong> ParamMem is sample-efficient and supports transfer across model scales. Reflection patterns learned by smaller models can be applied to larger ones, enabling self-improvement without reliance on stronger external models.</p></li><li><p><strong>Consistent benchmark gains:</strong> Evaluated on code generation, mathematical reasoning, and multi-hop question answering, ParamMem consistently outperforms state-of-the-art baselines across all three domains.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.23320">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2028839081392939071">Tweet</a></strong></p><div><hr></div><h2><strong>7. Auton Agentic AI Framework</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vcXh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vcXh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png 424w, https://substackcdn.com/image/fetch/$s_!vcXh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png 848w, https://substackcdn.com/image/fetch/$s_!vcXh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vcXh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vcXh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png" width="1346" height="1134" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1134,&quot;width&quot;:1346,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!vcXh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png 424w, https://substackcdn.com/image/fetch/$s_!vcXh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png 848w, https://substackcdn.com/image/fetch/$s_!vcXh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vcXh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Snap Research introduces the Auton framework, a declarative architecture for specification, governance, and runtime execution of autonomous agent systems.
It addresses a fundamental mismatch: LLMs produce stochastic, unstructured outputs, while backend infrastructure requires deterministic, schema-conformant inputs.</p><ul><li><p><strong>Cognitive Blueprint separation:</strong> The framework enforces a strict separation between the Cognitive Blueprint, a declarative, language-agnostic specification of agent identity and capabilities, and the Runtime Engine. This enables cross-language portability, formal auditability, and modular tool integration via Model Context Protocol.</p></li><li><p><strong>Formal agent execution model:</strong> Agent execution is formalized as an augmented Partially Observable Markov Decision Process with a latent reasoning space. This gives practitioners a rigorous foundation for reasoning about agent behavior, state transitions, and decision boundaries.</p></li><li><p><strong>Biologically-inspired memory:</strong> The architecture introduces hierarchical memory consolidation inspired by biological episodic memory systems, providing agents with structured long-term retention that mirrors how humans consolidate experiences into lasting knowledge.</p></li><li><p><strong>Runtime optimizations:</strong> Parallel graph execution, speculative inference, and dynamic context pruning reduce end-to-end latency for multi-step agent workflows. Safety is enforced through a constraint manifold formalism using policy projection rather than post-hoc filtering.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.23720">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2028480209033568475">Tweet</a></strong></p><div><hr></div><h2><strong>8. Reaching Agreement Among LLM Agents</strong></h2><p>This paper introduces Aegean, a consensus protocol that frames multi-agent refinement as a distributed consensus problem. 
Rather than static heuristic workflows with fixed loop limits, Aegean enables early termination when sufficient agents converge, achieving 1.2-20x latency reduction across four mathematical reasoning benchmarks while maintaining answer quality within 2.5%. The consensus-aware serving engine performs incremental quorum detection across concurrent agent executions, cutting wasted compute on stragglers.</p><p><strong><a href="https://arxiv.org/abs/2512.20184">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2028823724196343923">Tweet</a></strong></p><div><hr></div><h2><strong>9. Diagnosing Agent Memory</strong></h2><p>This paper introduces a diagnostic framework that separates retrieval failures from utilization failures in LLM agent memory systems. Through a 3x3 factorial study crossing three write strategies with three retrieval methods, the authors find that retrieval is the dominant bottleneck, accounting for 11-46% of errors, while utilization failures remain stable at 4-8% regardless of configuration. Hybrid reranking cuts retrieval failures roughly in half, delivering larger gains than any write strategy optimization.</p><p><strong><a href="https://arxiv.org/abs/2603.02473">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2029202969456234562">Tweet</a></strong></p><div><hr></div><h2><strong>10. Phi-4-reasoning-vision-15B</strong></h2><p>Microsoft presents Phi-4-reasoning-vision-15B, a compact open-weight multimodal reasoning model that combines visual understanding with structured reasoning capabilities. Trained on just 200 billion tokens of multimodal data, the model excels at math and science reasoning and UI comprehension while requiring significantly less compute than comparable open-weight VLMs. 
The key insight is that systematic filtering, error correction, and synthetic augmentation remain the primary levers for model performance, pushing the Pareto frontier of the accuracy-compute tradeoff.</p><p><strong><a href="https://arxiv.org/abs/2603.03975">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2029926242640912429">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: AI Labor Market Impacts, Google Workspace CLI, GPT-5.4, Exa Deep, and More]]></title><description><![CDATA[AI Labor Market Impacts, Google Workspace CLI, GPT-5.4, Exa Deep, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-ai-labor-market</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-ai-labor-market</guid><pubDate>Sat, 07 Mar 2026 15:03:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!eY71!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Anthropic measures AI labor market displacement</p></li><li><p>Google ships Workspace CLI with agent skills</p></li><li><p>OpenAI launches GPT-5.4 with native computer use</p></li><li><p>Exa Deep puts an agent inside every search</p></li><li><p>Cognition previews SWE-1.6 training run</p></li><li><p>Gemini 3.1 Flash-Lite drops with big gains</p></li><li><p>Qwen 3.5 small model series released</p></li><li><p>Liquid AI releases LFM2-24B-A2B model</p></li><li><p>Cursor lands in JetBrains via ACP</p></li><li><p>OpenAI launches Codex Security agent</p></li><li><p>OpenAI publishes CoT Controllability research</p></li><li><p>Claude Opus hacks its own benchmark eval</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Labor 
Market Impacts of AI</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eY71!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eY71!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png 424w, https://substackcdn.com/image/fetch/$s_!eY71!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png 848w, https://substackcdn.com/image/fetch/$s_!eY71!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png 1272w, https://substackcdn.com/image/fetch/$s_!eY71!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eY71!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png" width="1456" height="1456" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Labor Market Impacts of AI&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Labor Market Impacts of AI" title="Labor Market Impacts of AI" srcset="https://substackcdn.com/image/fetch/$s_!eY71!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png 424w, https://substackcdn.com/image/fetch/$s_!eY71!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png 848w, https://substackcdn.com/image/fetch/$s_!eY71!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png 1272w, https://substackcdn.com/image/fetch/$s_!eY71!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Anthropic published a new framework for measuring AI&#8217;s labor market effects, introducing &#8220;observed exposure,&#8221; a metric that combines theoretical LLM capability with real-world Claude usage data from the Anthropic Economic Index. 
Unlike prior approaches that rely solely on theoretical task feasibility, this measure weights automated and work-related uses more heavily to better predict actual displacement risk.</p><ul><li><p><strong>Programmer exposure is highest:</strong> Computer programmers top the list at 75% task coverage, followed by customer service representatives and data entry keyers at 67%, reflecting the concentration of automated API usage in coding and support workflows.</p></li><li><p><strong>No unemployment signal yet:</strong> Using Current Population Survey data, the study finds no systematic increase in unemployment for workers in the most AI-exposed occupations since late 2022, though the framework could detect differential increases on the order of 1 percentage point.</p></li><li><p><strong>Youth hiring slowdown:</strong> There is suggestive evidence that hiring of workers aged 22-25 has slowed in exposed occupations, with a 14% drop in the job finding rate compared to 2022, echoing findings from Brynjolfsson et al. using ADP payroll data.</p></li><li><p><strong>Massive capability gap:</strong> AI is far from reaching its theoretical capability. 
Claude currently covers just 33% of all tasks in Computer and Math occupations, despite 94% being theoretically feasible, indicating significant room for future displacement as adoption deepens.</p></li></ul><p><strong><a href="https://www.anthropic.com/research/labor-market-impacts">Blog</a></strong></p><div><hr></div><h3><strong>Google Workspace CLI</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HgSx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HgSx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!HgSx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!HgSx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!HgSx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HgSx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png" width="1200" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Google Workspace CLI&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Google Workspace CLI" title="Google Workspace CLI" srcset="https://substackcdn.com/image/fetch/$s_!HgSx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!HgSx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!HgSx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!HgSx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Google released an official command-line tool for its Workspace APIs, providing a unified interface for Drive, Gmail, Calendar, Sheets, Docs, Chat, and Admin from a single binary. 
Written in Rust and distributed via npm, the CLI is dynamically built from Google&#8217;s Discovery Service and ships with over 100 agent skills and a built-in MCP server.</p><ul><li><p><strong>100+ agent skills:</strong> The repo includes SKILL.md files for every supported API plus higher-level helpers, with 50 curated recipes for common workflows across Gmail, Drive, Docs, Calendar, and Sheets.</p></li><li><p><strong>Built-in MCP server:</strong> AI assistants like Claude, Gemini, and OpenClaw can connect directly to the CLI&#8217;s MCP server and operate on Google Workspace programmatically, turning Workspace into a tool-callable environment for agents.</p></li><li><p><strong>Dynamic API coverage:</strong> Instead of hardcoding endpoints, the CLI generates commands at build time from Google&#8217;s Discovery Service, meaning it automatically picks up new APIs and updates as Google ships them.</p></li><li><p><strong>Agent-first design:</strong> Each skill includes structured metadata, input/output schemas, and example prompts, making it immediately usable by coding agents and AI-powered automation pipelines without custom integration work.</p></li></ul><p><strong><a href="https://github.com/googleworkspace/cli">GitHub</a></strong></p>
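<p>Since the CLI exposes a built-in MCP server, an AI assistant can attach to it through a standard <code>mcpServers</code> configuration entry. The sketch below is illustrative only: the binary name <code>gws</code> and the <code>mcp</code> subcommand are assumptions, so check the repo&#8217;s README for the exact invocation.</p><pre><code>{
  "mcpServers": {
    "google-workspace": {
      "command": "gws",
      "args": ["mcp"]
    }
  }
}</code></pre><p>Once registered, the assistant discovers the CLI&#8217;s Workspace tools over stdio and can call them directly, with no per-API integration code.</p>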
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-ai-labor-market">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (February 23 - March 1)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-339</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-339</guid><pubDate>Sun, 01 Mar 2026 15:02:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!j_F0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. Deep-Thinking Tokens</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MP5E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MP5E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png 424w, https://substackcdn.com/image/fetch/$s_!MP5E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png 848w, https://substackcdn.com/image/fetch/$s_!MP5E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png 1272w, 
https://substackcdn.com/image/fetch/$s_!MP5E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MP5E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png" width="674" height="378" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6377f483-06c6-474f-b370-76edcc90ef81_674x378.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:378,&quot;width&quot;:674,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Deep-Thinking Tokens&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Deep-Thinking Tokens" title="Deep-Thinking Tokens" srcset="https://substackcdn.com/image/fetch/$s_!MP5E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png 424w, https://substackcdn.com/image/fetch/$s_!MP5E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png 848w, https://substackcdn.com/image/fetch/$s_!MP5E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png 1272w, 
https://substackcdn.com/image/fetch/$s_!MP5E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Google researchers challenge the assumption that longer outputs indicate better reasoning. They introduce deep-thinking tokens, a metric that identifies tokens where internal model predictions shift significantly across layers before stabilizing. 
Unlike raw token count, which negatively correlates with accuracy (r = -0.59), the deep-thinking ratio shows a robust positive correlation (r = 0.683).</p><ul><li><p><strong>Deep-thinking ratio as a reasoning signal:</strong> For each generated token, intermediate-layer distributions are compared to the final-layer distribution using Jensen-Shannon divergence. A token qualifies as deep-thinking if its prediction only stabilizes in the final 15% of layers. This captures genuine computational effort rather than surface-level verbosity.</p></li><li><p><strong>Think@n test-time scaling:</strong> The authors introduce Think@n, a strategy that prioritizes samples with high deep-thinking ratios. It matches or exceeds standard self-consistency performance while cutting inference costs by approximately 50% through early rejection of unpromising generations based on just 50-token prefixes.</p></li><li><p><strong>Benchmark validation:</strong> Evaluated across AIME 24/25, HMMT 25, and GPQA-diamond with reasoning models including GPT-OSS, DeepSeek-R1, and Qwen3. The deep-thinking ratio consistently outperforms length-based and confidence-based baselines as a predictor of correctness.</p></li><li><p><strong>Practical implications:</strong> This reframes how we think about test-time compute. Instead of generating more tokens, we should focus on generating tokens that require deeper internal computation, enabling more efficient and accurate reasoning.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.13517">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2025239354327924833">Tweet</a></strong></p><div><hr></div><h2><strong>2. 
Codified Context</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vD5j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vD5j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png 424w, https://substackcdn.com/image/fetch/$s_!vD5j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png 848w, https://substackcdn.com/image/fetch/$s_!vD5j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png 1272w, https://substackcdn.com/image/fetch/$s_!vD5j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vD5j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png" width="1456" height="567" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8421822-3e07-49a8-a364-784f832ddad3_2040x794.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:567,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Codified Context&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Codified Context" title="Codified Context" srcset="https://substackcdn.com/image/fetch/$s_!vD5j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png 424w, https://substackcdn.com/image/fetch/$s_!vD5j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png 848w, https://substackcdn.com/image/fetch/$s_!vD5j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png 1272w, https://substackcdn.com/image/fetch/$s_!vD5j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>Single-file AGENTS.md manifests don&#8217;t scale beyond modest codebases. A 1,000-line prototype can be fully described in a single prompt, but a 100,000-line system cannot. This paper presents a three-component codified context infrastructure developed during construction of a 108,000-line C# distributed system, evaluated across 283 development sessions.</p><ul><li><p><strong>Hot-memory constitution:</strong> A living document encoding conventions, retrieval hooks, and orchestration protocols that the agent consults at the start of every session. This provides immediate awareness of project standards without requiring the agent to rediscover them through exploration.</p></li><li><p><strong>Domain-expert agents:</strong> 19 specialized agents, each owning a bounded domain of the codebase with its own context slice. 
Instead of one generalist agent trying to hold the entire project in context, tasks are routed to the agent with the deepest knowledge of the relevant subsystem.</p></li><li><p><strong>Cold-memory knowledge base:</strong> 34 on-demand specification documents that agents retrieve only when needed. This tiered approach keeps the active context lean while ensuring detailed specifications are always accessible for complex implementation decisions.</p></li><li><p><strong>Session continuity results:</strong> Across 283 sessions, the infrastructure demonstrates how context propagates between sessions, preventing the common pattern where agents forget conventions, repeat known mistakes, and lose coherence on long-running projects.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.20478">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2027770787659464812">Tweet</a></strong></p><div><hr></div><h2><strong>3. Discovering Multi-Agent Learning Algorithms with LLMs</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BWzd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BWzd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png 424w, https://substackcdn.com/image/fetch/$s_!BWzd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png 848w, 
https://substackcdn.com/image/fetch/$s_!BWzd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png 1272w, https://substackcdn.com/image/fetch/$s_!BWzd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BWzd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png" width="793" height="251" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:251,&quot;width&quot;:793,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Discovering Multi-Agent Learning Algorithms&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Discovering Multi-Agent Learning Algorithms" title="Discovering Multi-Agent Learning Algorithms" srcset="https://substackcdn.com/image/fetch/$s_!BWzd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png 424w, https://substackcdn.com/image/fetch/$s_!BWzd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png 848w, 
https://substackcdn.com/image/fetch/$s_!BWzd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png 1272w, https://substackcdn.com/image/fetch/$s_!BWzd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Google DeepMind uses AlphaEvolve, an evolutionary coding agent powered by LLMs, to automatically discover new multi-agent learning 
algorithms for imperfect-information games. Rather than relying on manual algorithm design, the system navigates vast algorithmic design spaces and discovers non-intuitive mechanisms that outperform state-of-the-art baselines.</p><ul><li><p><strong>VAD-CFR discovery:</strong> The system discovers a novel variant of iterative regret minimization featuring volatility-sensitive discounting and consistency-enforced optimism. VAD-CFR outperforms existing baselines like Discounted Predictive CFR+ on standard imperfect-information game benchmarks.</p></li><li><p><strong>SHOR-PSRO discovery:</strong> A population-based training algorithm variant that introduces a hybrid meta-solver blending Optimistic Regret Matching with temperature-controlled strategy distributions. This automates the transition from diversity exploration to equilibrium convergence.</p></li><li><p><strong>LLM-driven algorithmic evolution:</strong> AlphaEvolve generates candidate algorithm modifications, evaluates them on game-theoretic benchmarks, and iteratively refines the best variants. The discovered algorithms contain novel design choices that human researchers had not previously considered.</p></li><li><p><strong>Broader implications:</strong> This demonstrates that LLMs can serve as algorithmic designers, not just code generators. The approach could extend to discovering algorithms in other domains like optimization, scheduling, and resource allocation.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.16928">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2026044154040742150">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
Evaluating AGENTS.md</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6t4H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6t4H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png 424w, https://substackcdn.com/image/fetch/$s_!6t4H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png 848w, https://substackcdn.com/image/fetch/$s_!6t4H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png 1272w, https://substackcdn.com/image/fetch/$s_!6t4H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6t4H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png" width="896" height="304" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:304,&quot;width&quot;:896,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Evaluating AGENTS.md&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Evaluating AGENTS.md" title="Evaluating AGENTS.md" srcset="https://substackcdn.com/image/fetch/$s_!6t4H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png 424w, https://substackcdn.com/image/fetch/$s_!6t4H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png 848w, https://substackcdn.com/image/fetch/$s_!6t4H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png 1272w, https://substackcdn.com/image/fetch/$s_!6t4H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This research evaluates whether AGENTS.md files, the repository-level context files that developers write to help AI coding agents understand their codebases, actually improve agent performance. Testing four coding agents (Claude Code with Sonnet-4.5, Codex with GPT-5.2 and GPT-5.1 mini, and Qwen Code with Qwen3-30b-coder), the findings are counterintuitive.</p><ul><li><p><strong>Context files reduce success rates:</strong> Human-written AGENTS.md files provide a modest +4% improvement in some cases, but LLM-generated ones actually reduce success rates by about 2%. Both consistently increase inference cost by over 20%, making the cost-benefit tradeoff questionable.</p></li><li><p><strong>Broader exploration, worse outcomes:</strong> Context files cause agents to explore more code paths and consider more files, but this expansive behavior makes tasks harder rather than easier. 
The additional context introduces noise that dilutes task-relevant information.</p></li><li><p><strong>Lean is better:</strong> The study recommends that developer-written context files should contain only essential information. Unnecessary requirements, coding style preferences, and broad architectural descriptions complicate agent task completion without improving results.</p></li><li><p><strong>Practical guidance:</strong> For developers maintaining AGENTS.md files, the key takeaway is to keep them minimal and focused on critical constraints. Information density matters more than comprehensiveness for current coding agents.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.11988">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2026306141181898887">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pTww!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pTww!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pTww!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pTww!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!pTww!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pTww!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!pTww!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pTww!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pTww!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!pTww!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a>&#8221;. 
Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.dair.ai/courses/build-apps-with-claude-code&quot;,&quot;text&quot;:&quot;Enroll Now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.dair.ai/courses/build-apps-with-claude-code"><span>Enroll Now</span></a></p><div><hr></div><h2><strong>5. PAHF</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BMIt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BMIt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png 424w, https://substackcdn.com/image/fetch/$s_!BMIt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png 848w, https://substackcdn.com/image/fetch/$s_!BMIt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png 1272w, https://substackcdn.com/image/fetch/$s_!BMIt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!BMIt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png" width="996" height="157" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:157,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;PAHF&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="PAHF" title="PAHF" srcset="https://substackcdn.com/image/fetch/$s_!BMIt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png 424w, https://substackcdn.com/image/fetch/$s_!BMIt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png 848w, https://substackcdn.com/image/fetch/$s_!BMIt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png 1272w, https://substackcdn.com/image/fetch/$s_!BMIt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Meta introduces PAHF (Personalized Agents from Human Feedback), a continual agent personalization 
framework that addresses a critical gap: most AI agents cannot adapt to individual user preferences that evolve over time. PAHF couples explicit per-user memory with both proactive and reactive feedback mechanisms.</p><ul><li><p><strong>Three-step personalization loop:</strong> PAHF operates through (1) pre-action clarification to resolve ambiguity before acting, (2) grounding actions in preferences retrieved from persistent memory, and (3) integrating post-action feedback to update memory when preferences drift. This dual-feedback design captures both explicit and implicit signals.</p></li><li><p><strong>Continual learning through interaction:</strong> Unlike static fine-tuning approaches, PAHF enables agents to learn from live interactions. The explicit memory store allows agents to accumulate and revise user preference profiles without retraining, making personalization practical for production deployments.</p></li><li><p><strong>Novel benchmarks:</strong> The researchers develop two benchmarks in embodied manipulation and online shopping that specifically measure an agent&#8217;s ability to learn initial preferences from scratch and then adapt when those preferences shift over time.</p></li><li><p><strong>Strong results:</strong> PAHF learns substantially faster and consistently outperforms both no-memory and single-channel baselines. It reduces initial personalization error and enables rapid adaptation to persona shifts, demonstrating that the combination of memory and dual feedback channels is essential.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.16173">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2025242624790331520">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
Doc-to-LoRA</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j_F0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j_F0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png 424w, https://substackcdn.com/image/fetch/$s_!j_F0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png 848w, https://substackcdn.com/image/fetch/$s_!j_F0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png 1272w, https://substackcdn.com/image/fetch/$s_!j_F0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j_F0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png" width="1456" height="560" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:560,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Doc-to-LoRA&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Doc-to-LoRA" title="Doc-to-LoRA" srcset="https://substackcdn.com/image/fetch/$s_!j_F0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png 424w, https://substackcdn.com/image/fetch/$s_!j_F0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png 848w, https://substackcdn.com/image/fetch/$s_!j_F0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png 1272w, https://substackcdn.com/image/fetch/$s_!j_F0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Sakana AI introduces Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to compress long documents into LoRA adapters in a single forward pass. Instead of processing long contexts through expensive quadratic attention, D2L converts the document into parameter-space representations that the target LLM can use without re-consuming the original text.</p><ul><li><p><strong>Single-pass context compression:</strong> D2L generates LoRA adapters from unseen documents in one forward pass. Once compressed, subsequent queries are handled using only the adapter weights, eliminating the need to re-process the full document and dramatically reducing both inference latency and KV-cache memory demands.</p></li><li><p><strong>Beyond native context windows:</strong> The method achieves near-perfect zero-shot accuracy on needle-in-a-haystack tasks at sequence lengths exceeding the target LLM&#8217;s native context window by over 4x.
This suggests that parametric compression can effectively extend context capabilities without architectural changes.</p></li><li><p><strong>Real-world QA performance:</strong> On practical question-answering datasets, D2L outperforms standard long-context approaches while consuming less memory. The compressed representations retain enough information for accurate retrieval and reasoning across the full document.</p></li><li><p><strong>Practical deployment benefits:</strong> For applications requiring repeated queries over the same document (customer support, legal analysis, codebase understanding), D2L compresses the document once and amortizes the cost across all subsequent interactions.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.15902">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2027385998993420571">Tweet</a></strong></p><div><hr></div><h2><strong>7. AgentConductor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zzkl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zzkl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png 424w, https://substackcdn.com/image/fetch/$s_!Zzkl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png 848w, 
https://substackcdn.com/image/fetch/$s_!Zzkl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png 1272w, https://substackcdn.com/image/fetch/$s_!Zzkl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zzkl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png" width="996" height="635" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:635,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AgentConductor&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AgentConductor" title="AgentConductor" srcset="https://substackcdn.com/image/fetch/$s_!Zzkl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png 424w, https://substackcdn.com/image/fetch/$s_!Zzkl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png 848w, 
https://substackcdn.com/image/fetch/$s_!Zzkl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png 1272w, https://substackcdn.com/image/fetch/$s_!Zzkl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>AgentConductor introduces a reinforcement learning-enhanced multi-agent system for code generation that dynamically generates
interaction topologies based on task characteristics. Rather than using fixed communication patterns between agents, an LLM-based orchestrator adapts the topology to match problem complexity, achieving state-of-the-art accuracy across five code generation datasets.</p><ul><li><p><strong>Task-adapted topologies:</strong> The orchestrator constructs density-aware layered directed acyclic graph (DAG) topologies tailored to problem difficulty. Simple problems get sparse topologies with minimal communication overhead, while complex problems get denser multi-agent collaboration.</p></li><li><p><strong>Topological density control:</strong> A novel density function and difficulty interval partitioning mechanism controls how much agents communicate. This directly addresses the problem of redundant interactions that waste tokens without improving solution quality.</p></li><li><p><strong>Strong performance gains:</strong> AgentConductor outperforms the strongest baseline by up to 14.6% in pass@1 accuracy with 13% density reduction and 68% token cost reduction. The system achieves better results while using significantly fewer computational resources.</p></li><li><p><strong>Execution feedback refinement:</strong> Topologies are refined using execution feedback from code tests. When initial solutions fail, the orchestrator adjusts the collaboration structure based on error patterns, enabling adaptive recovery.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.17100">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2027030406441341227">Tweet</a></strong></p><div><hr></div><h2><strong>8. ActionEngine</strong></h2><p>Georgia Tech and Microsoft Research introduce ActionEngine, a training-free framework that transforms GUI agents from reactive step-by-step executors into programmatic planners. 
It builds a state-machine memory through offline exploration, then synthesizes executable Python programs for task completion. The system achieves 95% success on Reddit tasks from WebArena with a single LLM call on average, reducing costs by 11.8x and latency by 2x compared to vision-only baselines.</p><p><strong><a href="https://arxiv.org/abs/2602.20502">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2026678090815123594">Tweet</a></strong></p><div><hr></div><h2><strong>9. CoT Faithfulness via REMUL</strong></h2><p>Researchers propose REMUL, a training approach for making chain-of-thought reasoning more faithful and monitorable. A speaker model generates reasoning traces that multiple listener models attempt to follow and complete, using RL to reward reasoning that is understandable to other models. Tested across BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO, REMUL improves three faithfulness metrics while also boosting overall accuracy, producing shorter and more direct reasoning chains.</p><p><strong><a href="https://arxiv.org/abs/2602.16154">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2026043400861122709">Tweet</a></strong></p><div><hr></div><h2><strong>10. Learning to Rewrite Tool Descriptions</strong></h2><p>Intuit AI Research addresses a bottleneck in LLM-agent tool use: tool descriptions are written for humans, not agents. They introduce Trace-Free+, a curriculum learning framework that optimizes tool descriptions without relying on execution traces.
The approach delivers consistent gains on unseen tools, strong cross-domain generalization, and robustness as the number of candidate tools scales to over 100, demonstrating that improving tool interfaces is a practical complement to agent fine-tuning.</p><p><strong><a href="https://arxiv.org/abs/2602.20426">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2026676835539628465">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Evaluating AGENTS.md, Perplexity Computer, Nano Banana 2, Doc-to-LoRA, Hermes Agent, Mercury 2, and More]]></title><description><![CDATA[Evaluating AGENTS.md, Perplexity Computer, Nano Banana 2, Doc-to-LoRA, Hermes Agent, Mercury 2, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-evaluating-agentsmd</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-evaluating-agentsmd</guid><pubDate>Sat, 28 Feb 2026 15:02:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-XGl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>AGENTS.md files hurt coding agent performance</p></li><li><p>Perplexity launches Computer for end-to-end tasks</p></li><li><p>Google launches Nano Banana 2 for free</p></li><li><p>Sakana AI ships Doc-to-LoRA and Text-to-LoRA</p></li><li><p>Notion launches Custom Agents in 3.3</p></li><li><p>Nous Research releases Hermes Agent open source</p></li><li><p>GPT-5.3-Codex available for all developers</p></li><li><p>Mercury 2 ships reasoning diffusion LLM</p></li><li><p>Qwen 3.5 medium model series drops</p></li><li><p>Claude Code ships auto-memory across sessions</p></li><li><p>RoguePilot exposes GitHub Copilot vulnerability</p></li><li><p>Vercel open-sources Chat SDK for multi-platform 
bots</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Evaluating AGENTS.md: Are Context Files Helpful for Coding Agents?</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-XGl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-XGl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png 424w, https://substackcdn.com/image/fetch/$s_!-XGl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png 848w, https://substackcdn.com/image/fetch/$s_!-XGl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png 1272w, https://substackcdn.com/image/fetch/$s_!-XGl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-XGl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png" width="896" height="304" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:304,&quot;width&quot;:896,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Evaluating AGENTS.md&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Evaluating AGENTS.md" title="Evaluating AGENTS.md" srcset="https://substackcdn.com/image/fetch/$s_!-XGl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png 424w, https://substackcdn.com/image/fetch/$s_!-XGl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png 848w, https://substackcdn.com/image/fetch/$s_!-XGl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png 1272w, https://substackcdn.com/image/fetch/$s_!-XGl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Researchers from UIUC and Microsoft Research evaluated whether repository-level context files like AGENTS.md actually improve coding agent performance.
The counterintuitive finding: context files reduce task success rates compared to providing no context at all, while increasing inference costs by over 20%.</p><ul><li><p><strong>Lower success rates:</strong> Both LLM-generated and human-written context files caused agents to solve fewer tasks on SWE-bench compared to agents given no repository context, challenging the widely adopted practice of writing detailed agent instructions.</p></li><li><p><strong>Broader but less effective exploration:</strong> Context files prompted agents to explore more thoroughly, including more testing and file traversal, but the additional constraints made tasks harder rather than easier.</p></li><li><p><strong>Minimal is better:</strong> The authors recommend that context files describe only minimal requirements rather than comprehensive specifications, as unnecessary constraints actively hurt agent performance.</p></li><li><p><strong>Practical implications:</strong> The findings suggest developers should rethink how they structure AGENTS.md, CLAUDE.md, and similar context files, focusing on essential guardrails rather than exhaustive instructions.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.11988">Paper</a></strong></p>
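<p>As a concrete illustration of the &#8220;minimal is better&#8221; recommendation, a context file pared down to essential guardrails might look like the sketch below. The specific commands and paths are hypothetical examples for illustration, not taken from the paper:</p><pre><code># AGENTS.md (minimal sketch)
# Keep to hard requirements only; omit architecture tours and style essays.

- Run the test suite with: make test
- Do not modify generated files under build/
- Target Python 3.11
</code></pre><p>Per the study&#8217;s findings, instructions beyond such essential constraints tend to reduce task success while raising inference cost.</p>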
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-evaluating-agentsmd">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>