<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[AI Newsletter]]></title><description><![CDATA[The AI Newsletter provides weekly summaries of the latest and top AI trends, papers, tools, news, and best practices. Home of Top AI Papers of the Week and AI Agents Weekly series. ]]></description><link>https://nlp.elvissaravia.com</link><image><url>https://substackcdn.com/image/fetch/$s_!m7md!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41327c80-fe59-416d-aa6f-ab6874177ac7_517x517.png</url><title>AI Newsletter</title><link>https://nlp.elvissaravia.com</link></image><generator>Substack</generator><lastBuildDate>Mon, 20 Apr 2026 10:13:43 GMT</lastBuildDate><atom:link href="https://nlp.elvissaravia.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[elvis]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[nlpnews@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[nlpnews@substack.com]]></itunes:email><itunes:name><![CDATA[elvis]]></itunes:name></itunes:owner><itunes:author><![CDATA[elvis]]></itunes:author><googleplay:owner><![CDATA[nlpnews@substack.com]]></googleplay:owner><googleplay:email><![CDATA[nlpnews@substack.com]]></googleplay:email><googleplay:author><![CDATA[elvis]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (April 13 - April 19)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-717</link><guid 
isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-717</guid><pubDate>Sun, 19 Apr 2026 15:03:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!U88-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The Top AI Papers of the Week (April 13 - April 19)</p><h2><strong>1. Automated Weak-to-Strong Researcher</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U88-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U88-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 424w, https://substackcdn.com/image/fetch/$s_!U88-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 848w, https://substackcdn.com/image/fetch/$s_!U88-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!U88-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!U88-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg" width="1456" height="655" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:655,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Automated W2S Researcher&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Automated W2S Researcher" title="Automated W2S Researcher" srcset="https://substackcdn.com/image/fetch/$s_!U88-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 424w, https://substackcdn.com/image/fetch/$s_!U88-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 848w, https://substackcdn.com/image/fetch/$s_!U88-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!U88-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Anthropic shows that Claude can make fully autonomous progress on scalable oversight research. A team of parallel Automated Alignment Researchers (AARs) built on Claude Opus 4.6 proposes ideas, runs experiments, and iterates on weak-to-strong supervision, a core alignment problem where a stronger model must learn from a weaker teacher.
The system closes almost the entire performance gap that human researchers were unable to close, at a total cost of roughly $18K in tokens and model training.</p><ul><li><p><strong>Performance gap recovered as the metric:</strong> The authors evaluate progress with performance gap recovered (PGR), a 0 to 1 score where 0 matches the weak teacher and 1 matches a ground-truth-supervised student. On a chat preference dataset, two human researchers achieved PGR 0.23 after seven days of iteration on four promising generalization methods.</p></li><li><p><strong>AARs reach 0.97 PGR in five days:</strong> Running nine Claude-based agents in parallel sandboxes, the automated system reached PGR 0.97 in five days and 800 cumulative agent-hours. The cost was about $18,000, or roughly $22 per AAR-hour. This is one of the strongest empirical data points yet that AI can drive measurable progress on open alignment problems.</p></li><li><p><strong>Forum-based collaboration between agents:</strong> Each AAR works in its own isolated sandbox but posts findings to a common forum and uploads codebase snapshots to shared storage. The setup mirrors how a small research team would coordinate, letting later agents build on earlier wins without merging execution environments.</p></li><li><p><strong>Reward hacking as a real outcome, not a hypothetical:</strong> The agents sometimes succeeded through unexpected mechanisms, including reward-hacking behaviors that the researchers did not anticipate. The result highlights the double-edged nature of automated research: measurable progress on outcome-gradable problems is practical today, but careful metric design remains a human responsibility.</p></li></ul><p><strong><a href="https://alignment.anthropic.com/2026/automated-w2s-researcher/">Paper</a></strong> | <strong><a href="https://x.com/janleike/status/2044139528596910584">Tweet</a></strong></p><div><hr></div><h2><strong>2.
AiScientist</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T3D7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T3D7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 424w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 848w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 1272w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T3D7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png" width="996" height="393" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:393,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AiScientist&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AiScientist" title="AiScientist" srcset="https://substackcdn.com/image/fetch/$s_!T3D7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 424w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 848w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 1272w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Long-horizon AI research agents are mostly a state-management problem. Reasoning well for the next turn is not enough when ML research demands task setup, implementation, experiments, debugging, and evidence tracking over hours or days. This paper introduces AiScientist, a system for autonomous long-horizon engineering built around the principle of thin control and thick state. A top-level orchestrator manages stage-level progress while specialized agents repeatedly ground themselves in durable workspace artifacts.</p><ul><li><p><strong>File-as-Bus coordination:</strong> AiScientist&#8217;s core design choice is to route coordination through durable filesystem artifacts rather than in-context message passing. 
Analyses, plans, code, logs, and experimental evidence all live as versioned files in a permission-scoped workspace, allowing specialists and subagents to reconstruct context from scratch without replaying entire conversations.</p></li><li><p><strong>Thin control, thick state:</strong> A Tier-0 orchestrator issues only stage-level directives, while Tier-1 specialists and optional Tier-2 subagents operate on shared artifacts. This keeps the control channel narrow and the state channel rich, giving agents the space to run long experiments without losing track of prior decisions and evidence.</p></li><li><p><strong>Strong benchmark results:</strong> The system improves PaperBench by 10.54 points over the best matched baseline and reaches 81.82 Any Medal% on MLE-Bench Lite. Removing File-as-Bus drops PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points, isolating the artifact-mediated design as the primary driver of gains.</p></li><li><p><strong>Durable project memory over longer chats:</strong> The work argues that autonomous research agents need persistent project memory, not just longer context windows. The results generalize the emerging pattern that environments carrying state on behalf of agents outperform architectures that rely solely on in-context reasoning for multi-hour workflows.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.13018">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2044436099121209546">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
AlphaEval</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vS7D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vS7D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 424w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 848w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 1272w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vS7D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png" width="635" height="331" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/655c258e-96c9-40fa-8e4c-934901545aea_635x331.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:331,&quot;width&quot;:635,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AlphaEval&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AlphaEval" title="AlphaEval" srcset="https://substackcdn.com/image/fetch/$s_!vS7D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 424w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 848w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 1272w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Agent evaluations are drifting away from production reality. Most benchmarks use clean tasks, well-specified requirements, deterministic metrics, and retrospective curation. Production work is messier, with implicit constraints, fragmented multimodal inputs, undeclared domain knowledge, long-horizon deliverables, and expert judgment that evolves over time. This paper introduces AlphaEval, a production-grounded benchmark evaluating agents as complete products rather than model APIs.</p><ul><li><p><strong>Seven companies, six O*NET domains:</strong> AlphaEval contains 94 tasks sourced from seven companies deploying AI agents in core business workflows across six O*NET domains. 
The tasks preserve production complexity rather than stripping it away, giving the benchmark a materially different distribution from prior coding-centric evaluations.</p></li><li><p><strong>Products, not model APIs:</strong> The benchmark evaluates commercial agent products such as Claude Code and Codex end to end, not the underlying models in isolation. This is a deliberate shift toward measuring the full agent experience that users actually pay for, including tool use, orchestration, and UI behaviors.</p></li><li><p><strong>Six production-specific failure modes:</strong> The authors identify cascade dependencies, subjective judgment collapse, information retrieval failures, cross-section inconsistency, constraint misinterpretation, and format compliance as failure modes that remain invisible to coding benchmarks. The best configuration (Claude Code with Opus 4.6) scores only 64.41/100, exposing a substantial research-to-production gap.</p></li><li><p><strong>Multi-paradigm evaluation:</strong> AlphaEval combines LLM-as-a-Judge, reference-driven metrics, formal verification, rubric-based assessment, automated UI testing, and domain-specific checks. The key practical contribution is a requirement-to-benchmark framework that turns production requirements into executable evals with minimal friction for organizations.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.12162">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2044773323914322393">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
Nemotron 3 Super</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3ns9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3ns9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 424w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 848w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 1272w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3ns9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png" width="996" height="374" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:374,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Nemotron 3 
Super&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Nemotron 3 Super" title="Nemotron 3 Super" srcset="https://substackcdn.com/image/fetch/$s_!3ns9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 424w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 848w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 1272w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>NVIDIA introduces Nemotron 3 Super, an open 120B parameter model with 12B active parameters, built as a hybrid Mamba-Attention Mixture-of-Experts architecture optimized for agentic reasoning. The model targets long-context, high-throughput inference, a capability increasingly central to running agents reliably. It supports up to 1M context length while delivering up to 2.2x higher throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B, at comparable benchmark accuracy.</p><ul><li><p><strong>Hybrid Mamba-Attention with LatentMoE:</strong> The architecture blends Mamba blocks with sparse LatentMoE layers, a new Mixture-of-Experts design that projects tokens into a smaller latent dimension for routing and expert computation. This improves both accuracy per FLOP and accuracy per parameter, and it is what allows the model to scale sparsely without paying a standard MoE memory tax.</p></li><li><p><strong>NVFP4 pretraining at scale:</strong> Nemotron 3 Super is the first model in the Nemotron 3 family to be pretrained in NVFP4, enabling training on 25 trillion tokens while keeping compute and memory overhead manageable.
Post-training combines supervised fine-tuning and reinforcement learning on top of this base.</p></li><li><p><strong>Native speculative decoding via MTP layers:</strong> Multi-Token Prediction (MTP) layers are included for native speculative decoding during inference, reducing latency for long-context agentic workloads without requiring an external draft model. The team reports consistent MTP acceptance rates across draft depths on SPEED-Bench.</p></li><li><p><strong>Fully open artifacts:</strong> Nemotron 3 Super datasets, along with base, post-trained, and quantized checkpoints, are open-sourced on Hugging Face. This matters for teams building agent stacks that need efficient, inspectable, long-context models rather than closed API dependencies.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.12374">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2044452957023047943">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sVEa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sVEa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!sVEa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!sVEa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!sVEa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sVEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!sVEa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!sVEa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!sVEa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!sVEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe 
Coding AI Apps with Claude Code</a>&#8221;. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.</p><p><strong><a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Enroll Now</a></strong></p><div><hr></div><h2><strong>5. Memory Transfer Learning</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dlKK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dlKK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 424w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 848w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 1272w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dlKK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png" width="996" height="1186" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1186,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Memory Transfer Learning&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Memory Transfer Learning" title="Memory Transfer Learning" srcset="https://substackcdn.com/image/fetch/$s_!dlKK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 424w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 848w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 1272w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Coding agents learn from experience, but that knowledge stays locked in silos. Solve a thousand SWE tasks, and none of that wisdom helps with competitive coding. This paper introduces Memory Transfer Learning, a framework where coding agents share a unified memory pool across six heterogeneous coding benchmarks, testing what transfers between domains and what does not.</p><ul><li><p><strong>Unified memory pool across domains:</strong> The framework pools memories across six heterogeneous coding benchmarks rather than isolating them by task type. Cross-domain memory improves average performance by 3.7%, a modest but consistent lift that previously would have been invisible under standard single-domain evaluations.</p></li><li><p><strong>Abstraction dictates transferability:</strong> Four memory formats ranging from raw execution traces to high-level insights are compared. 
High-level insights generalize well, while low-level traces often cause negative transfer by anchoring agents to incompatible implementation details. The takeaway: memory design matters more than memory volume.</p></li><li><p><strong>Meta-knowledge, not code:</strong> The transferable value is not task-specific code but meta-knowledge such as validation routines, structured action workflows, and safe interaction patterns with execution environments. Algorithmic strategy transfer accounts for only 5.5% of the gains, with procedural guidance doing most of the work.</p></li><li><p><strong>Scaling and cross-model transfer:</strong> Transfer effectiveness scales with the size of the memory pool, and memory can even be shared across different models. Combined with the finding on abstraction levels, the results point toward memory systems that curate insights rather than simply logging everything the agent did.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.14004">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2044900659921895729">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
Auto-Diagnose</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2T-a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2T-a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 424w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 848w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 1272w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2T-a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png" width="812" height="138" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:138,&quot;width&quot;:812,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Auto-Diagnose&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Auto-Diagnose" title="Auto-Diagnose" srcset="https://substackcdn.com/image/fetch/$s_!2T-a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 424w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 848w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 1272w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Integration test failures are painful because the signal is buried in messy logs. Massive output, heterogeneous systems, low signal-to-noise ratio, and unclear root causes leave developers scrolling through thousands of lines. 
This paper introduces Auto-Diagnose, an LLM-based tool deployed inside Google&#8217;s Critique code review system that analyzes failure logs, summarizes the most relevant lines, and suggests the root cause directly in the developer workflow.</p><ul><li><p><strong>In-workflow root cause assistance:</strong> Auto-Diagnose is integrated into Critique, Google&#8217;s internal code review system, so diagnoses appear where developers are already looking at the failure. Log streams from test drivers and systems under test, spread across data centers and threads, are joined and sorted by timestamp before being passed to the LLM.</p></li><li><p><strong>High diagnosis accuracy:</strong> In a manual evaluation of 71 real-world failures, Auto-Diagnose reached 90.14% root-cause diagnosis accuracy. This level of reliability is what justifies surfacing suggestions directly in a tool developers cannot ignore, rather than hiding them behind an opt-in query interface.</p></li><li><p><strong>Massive-scale deployment evidence:</strong> After Google-wide rollout, the tool was used across 52,635 distinct failing tests. User feedback marked it &#8220;Not helpful&#8221; in only 5.8% of cases, and it ranked #14 in helpfulness among 370 Critique tools. This is one of the clearest data points on production LLM tooling at scale inside a major company.</p></li><li><p><strong>A template for developer-facing LLM tools:</strong> The paper reads as a practical blueprint for embedding LLM-based diagnosis into existing engineering workflows. Rather than building a standalone product, the team integrated into the tool where the problem is already being reviewed, which likely explains the low &#8220;Not helpful&#8221; rate and high adoption.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.12108">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2044769798845079665">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
Subliminal Learning</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JlNa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JlNa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JlNa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg" width="1456" height="806" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:806,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Subliminal Learning&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Subliminal Learning" title="Subliminal Learning" srcset="https://substackcdn.com/image/fetch/$s_!JlNa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The Subliminal Learning paper by Evans and colleagues is now published in Nature. The work showed that LLMs can transmit traits (such as a preference for owls) through data that appears unrelated to that trait, like sequences of numbers that look meaningless on inspection. The Nature version extends the original July 2025 preprint with new experiments, replications on Gemma, and a broader discussion of safety implications for AI systems trained on one another&#8217;s outputs.</p><ul><li><p><strong>Transfer across different initializations:</strong> The preprint showed subliminal transfer between models that shared an initialization. The new MNIST results demonstrate transfer between models with different initializations. 
Although a toy setup, it meaningfully broadens the scope of the effect beyond shared-weight scenarios.</p></li><li><p><strong>Misalignment transmitted through code and chain-of-thought:</strong> General misalignment, not just benign preferences, can also be transmitted subliminally. The new results show this transfer can happen through model-written code or chain-of-thought reasoning, not only through numeric sequences, which expands the attack and contamination surface considerably.</p></li><li><p><strong>Connections to independent follow-ups:</strong> The authors highlight concurrent work from Aden-Ali et al. (2026) showing trait transfer via standard post-training datasets filtered by the teacher, Draganov et al. (2026) demonstrating a cross-family &#8220;phantom transfer&#8221; data poisoning attack, and Weckbecker et al. (2026) describing a subliminal &#8220;virus&#8221; that spreads between agent groups. Together they suggest the phenomenon is robust, reproducible, and difficult to defend against.</p></li><li><p><strong>Implications for safety evaluations:</strong> The practical takeaway is that safety evaluations may need to examine not just model behavior, but the origins of models and the processes used to create training data. As systems increasingly train on each other&#8217;s outputs, properties invisible in the data can still be inherited, undermining evaluations that focus purely on observable responses.</p></li></ul><p><strong><a href="https://www.nature.com/articles/s41586-026-10319-8">Paper</a></strong> | <strong><a href="https://x.com/OwainEvans_UK/status/2044488099707949545">Tweet</a></strong></p><div><hr></div><h2><strong>8. LLM-as-a-Verifier</strong></h2><p>Test-time scaling is effective for agentic tasks, but picking the winner among many candidates is the bottleneck. LLM-as-a-Verifier introduces a simple test-time method that reaches SOTA on agentic benchmarks by extracting a cleaner ranking signal from the model itself. 
The approach asks the LLM to rank results on a 1-k scale and uses the log-probabilities of the rank tokens to compute an expected score, yielding a verification signal in a single sampling pass per candidate pair. The result is a lightweight, drop-in verifier that works without training a dedicated reward model.</p><p><strong><a href="https://llm-as-a-verifier.github.io/">Paper</a></strong> | <strong><a href="https://x.com/Azaliamirh/status/2043813128690192893">Tweet</a></strong></p><div><hr></div><h2><strong>9. WebXSkill</strong></h2><p>Web agents can navigate a page, but ask them to repeat a checkout flow they already completed and they start from scratch every time. WebXSkill is a skill learning framework where web agents extract reusable skills from synthetic trajectories, each pairing a parameterized action program with step-level natural language guidance. Two deployment modes let the agent either auto-execute skills as atomic tool calls (grounded) or follow them as step-by-step instructions while retaining autonomy to adapt (guided). On WebArena, WebXSkill improves task success by up to 9.8 points over baselines. On WebVoyager, grounded mode reaches 86.1%, a 14.2-point gain, and skills even transfer across environments.</p><p><strong><a href="https://arxiv.org/abs/2604.13318">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2045139481892880892">Tweet</a></strong></p><div><hr></div><h2><strong>10. Muses-Bench</strong></h2><p>Every agent framework assumes one user giving instructions, but in real team workflows agents have multiple bosses with conflicting goals, private information, and different authority levels. Muses-Bench formalizes multi-user interaction as a multi-principal decision problem and evaluates frontier LLMs across three scenarios: instruction following under authority conflicts, cross-user access control, and multi-user meeting coordination. 
Gemini-3-Pro tops the leaderboard at just 85.6% average, and no model exceeds 64.8% on meeting coordination. Privacy-utility tradeoffs are brutal: Grok-3-Mini scores 99.6% on privacy but collapses to 60.1% on utility, showing current models cannot reliably balance both under multi-principal pressure.</p><p><strong><a href="https://arxiv.org/abs/2604.08567">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2044067923787165799">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Claude Opus 4.7, Codex Everywhere, Claude Design, Windsurf 2.0, Qwen3.6-35B-A3B, AiScientist, and More]]></title><description><![CDATA[Claude Opus 4.7, Codex Everywhere, Claude Design, Windsurf 2.0, Qwen3.6-35B-A3B, AiScientist, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-opus-47-codex</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-opus-47-codex</guid><pubDate>Sat, 18 Apr 2026 15:01:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!491v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Anthropic ships Claude Opus 4.7</p></li><li><p>Codex extends to Mac apps</p></li><li><p>Claude Design enters research preview</p></li><li><p>Windsurf 2.0 delegates to Devin</p></li><li><p>Qwen drops 3.6-35B-A3B open weights</p></li><li><p>OpenAI Agents SDK adds sandboxes</p></li><li><p>Gemini CLI adds subagents</p></li><li><p>FrontierSWE benchmark launches</p></li><li><p>NVIDIA releases Nemotron 3 Super</p></li><li><p>AiScientist lifts long-horizon research</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Claude Opus 4.7</strong></h3><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!491v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!491v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!491v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!491v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!491v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!491v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg" width="1080" height="1080" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1080,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Claude Opus 
4.7&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude Opus 4.7" title="Claude Opus 4.7" srcset="https://substackcdn.com/image/fetch/$s_!491v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!491v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!491v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!491v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 
11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Anthropic released Claude Opus 4.7, its most capable Opus model yet, built for long-running agentic work with more rigorous self-verification and tighter instruction following. Opus 4.7 also powers the new Claude Design product and Anthropic&#8217;s Glasswing cybersecurity frontier model.</p><ul><li><p><strong>Self-verifying long-running work:</strong> Opus 4.7 checks its own outputs before reporting back and handles multi-hour tasks with less supervision, making it a stronger default for hand-offs where the agent must own the full loop.</p></li><li><p><strong>Vision upgrade:</strong> The model sees images at more than three times the resolution of Opus 4.6 and produces higher-quality interfaces, slides, and documents, which is the foundation for the new Claude Design research preview.</p></li><li><p><strong>New reasoning and budget controls:</strong> A new xhigh effort level between high and max gives developers finer latency/quality tradeoffs on hard problems. 
Task budgets (beta) let Claude prioritize work and manage cost across longer runs.</p></li><li><p><strong>Claude Code upgrades:</strong> A new /ultrareview command runs a dedicated review pass over changes that flags what a careful reviewer would catch, and auto mode is now extended to Max users so long tasks run with fewer interruptions.</p></li></ul><p><strong><a href="https://www.anthropic.com/news/claude-opus-4-7">Blog</a></strong></p>
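<p>As a minimal illustration of the new effort control, the request-building step might look like the sketch below. This is a hedged sketch, not Anthropic's confirmed API surface: the <code>effort</code> field name, the level ordering, and the <code>build_request</code> helper are assumptions for illustration; only the <code>xhigh</code> level sitting between <code>high</code> and <code>max</code> comes from the announcement, so check the current API reference before relying on any of it.</p>

```python
# Hypothetical sketch of selecting a reasoning-effort level for an Opus 4.7
# request. The "effort" field name and build_request helper are assumptions;
# the "xhigh" level between "high" and "max" is from the announcement.
EFFORT_LEVELS = ("low", "medium", "high", "xhigh", "max")

def build_request(prompt: str, effort: str = "high", max_tokens: int = 4096) -> dict:
    """Build a Messages-style request payload (illustrative only)."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort!r}")
    return {
        "model": "claude-opus-4-7",
        "effort": effort,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

# A hard problem gets the new intermediate level instead of paying for "max".
req = build_request("Refactor the billing module end to end.", effort="xhigh")
```

<p>The point of the intermediate level is the latency/quality tradeoff described above: validating the level client-side keeps a typo from silently falling back to a default.</p>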
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-opus-47-codex">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (April 6 - April 12)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-831</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-831</guid><pubDate>Sun, 12 Apr 2026 15:02:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1pgB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. Neural Computers</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fEae!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fEae!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 424w, https://substackcdn.com/image/fetch/$s_!fEae!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 848w, https://substackcdn.com/image/fetch/$s_!fEae!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 1272w, 
https://substackcdn.com/image/fetch/$s_!fEae!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fEae!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png" width="1085" height="660" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:660,&quot;width&quot;:1085,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!fEae!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 424w, https://substackcdn.com/image/fetch/$s_!fEae!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 848w, https://substackcdn.com/image/fetch/$s_!fEae!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 1272w, 
https://substackcdn.com/image/fetch/$s_!fEae!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Researchers from Meta AI and KAUST propose Neural Computers (NCs), an emerging class of machine that unifies computation, memory, and I/O in a single learned runtime state. 
Unlike conventional computers that execute explicit programs, agents that act over external environments, or world models that learn dynamics, NCs aim to make the model itself the running computer, establishing a new computing paradigm.</p><ul><li><p><strong>From hardware stack to neural latent stack:</strong> Classical computers separate compute, memory, and I/O into modular hardware layers. Neural Computers collapse all three into a single latent runtime state carried by a neural network. The model&#8217;s hidden state serves simultaneously as working memory, computational substrate, and interface layer, removing the boundary between program and execution environment.</p></li><li><p><strong>Video models as prototype substrate:</strong> The team instantiates NCs as video models that generate screen frames from instructions, pixel inputs, and user actions. Two prototypes cover command-line interfaces (NCCLIGen, which renders and executes terminal workflows) and graphical desktops (NCGUIWorld, which learns pointer dynamics and menu interactions), both trained without access to internal program state.</p></li><li><p><strong>Early runtime primitives emerge:</strong> The prototypes demonstrate that learned runtimes can acquire I/O alignment and short-horizon control directly from raw interface traces. CLI models execute short command chains with structurally accurate output rendering, while GUI models learn coherent click feedback and window transitions in controlled settings.</p></li><li><p><strong>Roadmap toward Completely Neural Computers:</strong> The long-term target is the CNC: a system that is Turing complete, universally programmable, and behavior-consistent unless explicitly reprogrammed. 
Key open challenges include routine reuse across sessions, controlled capability updates without catastrophic forgetting, and stable symbolic processing for long-horizon reasoning.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.06425">Paper</a></strong> | <strong><a href="https://x.com/SchmidhuberAI/status/2042601088029708704">Tweet</a></strong></p><div><hr></div><h2><strong>2. Memento: Teaching LLMs to Manage Their Own Context</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1pgB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1pgB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 424w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 848w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 1272w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!1pgB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png" width="1456" height="1064" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1064,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!1pgB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 424w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 848w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 1272w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>New research from Microsoft teaches reasoning models to compress their own chain-of-thought mid-generation. Memento trains models to segment reasoning into blocks, summarize each block into a compact &#8220;memento,&#8221; and then evict the original block from the KV cache. The model continues reasoning from mementos alone, cutting peak memory by 2-3x while nearly doubling throughput.</p><ul><li><p><strong>Block-and-compress architecture:</strong> The model learns to mark reasoning boundaries using special tokens, produce a terse summary capturing key conclusions and intermediate values, and then drop the full block from context. 
From that point forward, the model sees only past mementos plus the current active block, keeping context compact without losing critical information.</p></li><li><p><strong>KV cache reduction with minimal accuracy loss:</strong> Applied to five models including Qwen2.5-7B, Qwen3 8B/32B, Phi-4 Reasoning 14B, and OLMo3-7B-Think, Memento achieves 2-3x peak KV cache reduction with small accuracy gaps that shrink at larger scales. The erased blocks still leave useful traces in the KV cache that the model leverages.</p></li><li><p><strong>Practical throughput gains:</strong> Beyond memory savings, the reduced context length directly translates to faster inference. The approach nearly doubles serving throughput, making it immediately useful for production deployments where both latency and memory are constraints.</p></li><li><p><strong>Open resources:</strong> Microsoft released the full codebase under MIT license, the OpenMementos dataset containing 228K reasoning traces with block segmentation and compressed summaries, and a custom vLLM fork for KV cache block masking. Standard supervised fine-tuning on approximately 30K examples is sufficient to teach this capability.</p></li></ul><p><strong><a href="https://github.com/microsoft/memento">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2042315710173528122">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
Memory Intelligence Agent (MIA)</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mD5U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mD5U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 424w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 848w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 1272w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mD5U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png" width="1456" height="750" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:750,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!mD5U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 424w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 848w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 1272w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Most memory-augmented research agents treat memory as a static retrieval store, leading to inefficient evolution and rising storage costs. MIA introduces a Manager-Planner-Executor architecture where a Memory Manager maintains compressed search trajectories, a Planner generates strategies, and an Executor searches and analyzes information. The framework boosts GPT-5.4 by up to 9% on LiveVQA through bidirectional memory conversion.</p><ul><li><p><strong>Bidirectional memory conversion:</strong> MIA enables transformation between parametric memory (model weights) and non-parametric memory (retrieved context) in both directions. 
This allows the system to internalize frequently accessed knowledge while keeping rare or volatile information in retrievable form, optimizing both storage efficiency and access speed.</p></li><li><p><strong>Alternating reinforcement learning:</strong> The three agents are trained through alternating RL, where each agent&#8217;s policy improves in response to the others&#8217; behavior. This co-evolutionary training ensures the agents develop complementary strategies rather than competing for the same signal.</p></li><li><p><strong>Test-time parametric updates:</strong> Unlike standard retrieval-augmented systems, MIA can update its parametric memory on-the-fly during inference. This test-time learning allows the agent to adapt to new domains and evolving information without retraining, maintaining relevance as the information landscape changes.</p></li><li><p><strong>Broad benchmark coverage:</strong> The framework demonstrates improvements across 11 benchmarks spanning question answering, knowledge-intensive tasks, and long-form research synthesis. The up to 9% improvement on LiveVQA is particularly notable given that video question answering demands effective memory management across temporal sequences.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.04503">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2041895109252542730">Tweet</a></strong></p><div><hr></div><h2><strong>4. Single-Agent LLMs vs. 
Multi-Agent Systems</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fvx7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fvx7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 424w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 848w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fvx7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Single vs Multi Agent&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Single vs Multi Agent" title="Single vs Multi Agent" srcset="https://substackcdn.com/image/fetch/$s_!fvx7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 424w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 848w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>More agents, better results, right? Not so fast. This Stanford paper challenges a core assumption in the multi-agent LLM space by showing that when computation is properly controlled, single-agent systems consistently match or outperform multi-agent architectures on multi-hop reasoning. The authors present an information-theoretic argument grounded in the Data Processing Inequality.</p><ul><li><p><strong>Computation as the hidden confounder:</strong> Most reported multi-agent gains are confounded by increased test-time computation rather than architectural advantages. 
When reasoning token budgets are held constant, the performance gap disappears or reverses, suggesting that prior comparisons were inadvertently measuring compute scaling rather than coordination benefits.</p></li><li><p><strong>Information-theoretic foundation:</strong> The authors ground their analysis in the Data Processing Inequality, arguing that under a fixed reasoning-token budget with perfect context utilization, single-agent systems are inherently more information-efficient. Distributing reasoning across agents introduces information loss at each handoff.</p></li><li><p><strong>Benchmark artifacts inflate MAS gains:</strong> Testing across Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5, the study identifies significant evaluation artifacts, particularly in API-based budget control for Gemini 2.5, that inflate apparent multi-agent advantages. Standard benchmarks also contain structural biases favoring multi-agent decomposition.</p></li><li><p><strong>Practical implications for system design:</strong> The findings suggest that teams should explicitly control for compute, context, and coordination trade-offs before committing to multi-agent architectures. 
In many cases, allocating the same token budget to a single agent with richer context yields stronger results at lower system complexity.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.02460">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2041534488342360305">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NAtL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NAtL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NAtL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg" width="1456" 
height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!NAtL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a>&#8221;. Learn how to leverage Claude Code features to vibe-code production-grade AI-powered apps.</p><p><strong><a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Enroll Now</a></strong></p><div><hr></div><h2><strong>5. 
The Universal Verifier for Agent Benchmarks</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4ydR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4ydR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 424w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 848w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 1272w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4ydR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png" width="887" height="348" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:348,&quot;width&quot;:887,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Universal Verifier&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Universal Verifier" title="Universal Verifier" srcset="https://substackcdn.com/image/fetch/$s_!4ydR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 424w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 848w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 1272w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Every agent benchmark has the same hidden problem: how do you know the agent actually succeeded? Microsoft researchers introduce the Universal Verifier, built on four design principles for reliable evaluation of computer-use agent trajectories. The verifier reduces false positive rates to near zero, down from 45%+ with WebVoyager and 22%+ with WebJudge.</p><ul><li><p><strong>Four design principles:</strong> The verifier is built on non-overlapping rubric criteria to reduce noise, separate process and outcome rewards for complementary signals, cascading error-free assessment that distinguishes controllable from uncontrollable failures, and divide-and-conquer context management that attends to all screenshots in a trajectory.</p></li><li><p><strong>Near-zero false positives:</strong> Current verifiers suffer from alarmingly high false positive rates that corrupt both benchmark scores and training data. 
The Universal Verifier achieves agreement with human judges that matches inter-human agreement rates, making it reliable enough for both evaluation and RL reward signal generation.</p></li><li><p><strong>Cumulative design gains:</strong> No single design choice dominates the performance improvement. The authors demonstrate that gains result from the cumulative effect of all four principles working together, with each contributing meaningful improvements that compound rather than any one serving as a silver bullet.</p></li><li><p><strong>Limits of automated research:</strong> An interesting meta-finding: the team used an auto-research agent to replicate the verifier design process. The agent reached 70% of expert verifier quality in 5% of the time but could not discover the structural design decisions that drove the biggest gains, suggesting human insight remains essential for system-level design.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.06240">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2042249194409501054">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
Scaling Coding Agents via Atomic Skills</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fjUh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fjUh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 424w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 848w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 1272w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fjUh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png" width="1456" height="627" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:627,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Scaling Coding Agents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Scaling Coding Agents" title="Scaling Coding Agents" srcset="https://substackcdn.com/image/fetch/$s_!fjUh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 424w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 848w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 1272w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Most coding agents train end-to-end on full tasks like resolving GitHub issues, leading to task-specific overfitting that limits generalization. This paper proposes a different approach: identifying five atomic coding skills (code localization, code editing, unit-test generation, issue reproduction, and code review) and training agents through joint reinforcement learning over these foundational competencies.</p><ul><li><p><strong>Atomic skill decomposition:</strong> Instead of treating software engineering as monolithic composite tasks, the framework formalizes five fundamental operations that compose into higher-level capabilities. 
Think of it as teaching an agent the alphabet of coding rather than memorizing specific sentences, enabling flexible recombination across novel task types.</p></li><li><p><strong>Joint RL across skills:</strong> The agents are trained through joint reinforcement learning that optimizes performance across all five atomic skills simultaneously. This joint training produces representations that capture the underlying structure shared across coding operations rather than surface-level patterns tied to specific benchmarks.</p></li><li><p><strong>Strong generalization to unseen tasks:</strong> Joint RL improves average performance by 18.7% across both the five atomic skills and five composite tasks. The improvements transfer to unseen composite tasks including bug-fixing, code refactoring, ML engineering, and code security, none of which were directly optimized during training.</p></li><li><p><strong>A new scaling paradigm:</strong> The work establishes that scaling coding agents through foundational skill mastery is more sample-efficient and transferable than task-level optimization. As the number and complexity of software engineering tasks grow, this compositional approach offers a more sustainable path than continuously expanding task-specific training sets.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.05013">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2042237615492260249">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
Agent Skills in the Wild</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UEmi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UEmi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 424w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 848w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 1272w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UEmi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png" width="997" height="377" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:377,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Agent Skills in the Wild&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Agent Skills in the Wild" title="Agent Skills in the Wild" srcset="https://substackcdn.com/image/fetch/$s_!UEmi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 424w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 848w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 1272w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Agent skills look great in demos. Hand them a curated toolbox, and they shine. But what happens when the agent has to find the right skill from a library of 34,000? This paper from UC Santa Barbara and MIT presents the first comprehensive study of skill utility under progressively realistic settings, revealing that the benefits of skills are far more fragile than current evaluations suggest.</p><ul><li><p><strong>Progressive difficulty framework:</strong> The study moves from idealized conditions with hand-crafted, task-specific skills to realistic scenarios requiring retrieval from 34K real-world skills. Performance gains degrade consistently at each step, with pass rates approaching no-skill baselines in the most challenging scenarios.</p></li><li><p><strong>Retrieval as the bottleneck:</strong> The core failure mode is not skill execution but skill selection. 
When agents must identify the right skill from a massive library, the retrieval step introduces errors that cascade through execution, highlighting a fundamental gap between demo-ready and production-ready skill systems.</p></li><li><p><strong>Refinement strategies help but do not solve:</strong> Query-specific and query-agnostic refinement approaches show improvement, with Claude Opus 4.6 going from 57.7% to 65.5% on Terminal-Bench 2.0. However, even with refinement, performance under realistic retrieval conditions remains well below idealized baselines.</p></li><li><p><strong>Implications for skill ecosystems:</strong> As the ecosystem of agent skills grows through frameworks like MCP, the findings suggest that simply expanding the skill library creates diminishing returns without corresponding advances in skill discovery. Quality of skill retrieval may matter more than quantity of available skills.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.04323">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2041540525539614797">Tweet</a></strong></p><div><hr></div><h2><strong>8. MedGemma 1.5</strong></h2><p>Google releases the MedGemma 1.5 technical report, introducing a 4B-parameter medical AI model that expands capabilities to 3D medical imaging (CT/MRI volumes), whole slide pathology, multi-timepoint chest X-ray analysis, and improved medical document understanding. The model achieves notable gains including a +47% macro F1 improvement on whole slide pathology and +22% on EHR question answering, positioning itself as an open foundation for next-generation medical AI systems.</p><p><strong><a href="https://arxiv.org/abs/2604.05081">Paper</a></strong> | <strong><a href="https://x.com/SRSchmidgall/status/2041973798589903260">Tweet</a></strong></p><div><hr></div><h2><strong>9. 
LightThinker++: From Reasoning Compression to Memory Management</strong></h2><p>While LLMs excel at complex reasoning, long thought traces create rapidly growing context and memory overhead. LightThinker++ moves beyond static compression by introducing three explicit memory primitives: Commit (archive a step as a compact summary), Expand (retrieve past steps for verification), and Fold (collapse context to maintain a clean signal). The framework reduces peak token usage by 70% while gaining +2.42% accuracy on standard reasoning tasks, and maintains stability beyond 80 rounds on long-horizon agentic tasks with a 14.8% average performance improvement.</p><p><strong><a href="https://arxiv.org/abs/2604.03679">Paper</a></strong> | <strong><a href="https://x.com/zxlzr/status/2041881875887878237">Tweet</a></strong></p><div><hr></div><h2><strong>10. Thinking Mid-training: RL of Interleaved Reasoning</strong></h2><p>Meta FAIR addresses the gap between pretraining (no explicit reasoning) and post-training (reasoning-heavy) with an intermediate SFT+RL mid-training phase. The approach annotates pretraining data with interleaved reasoning traces, then uses supervised fine-tuning followed by RL to teach models when and how to think during continued pretraining. 
Applied to Llama-3-8B, the full pipeline achieves a 3.2x improvement on reasoning benchmarks compared to direct RL post-training, demonstrating that reasoning benefits from being trained as native behavior early in the pipeline.</p><p><strong><a href="https://facebookresearch.github.io/RAM/blogs/thinking_midtraining/">Paper</a></strong> | <strong><a href="https://x.com/jaseweston/status/2041864833214095484">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Claude Managed Agents, Muse Spark, Project Glasswing, Advisor Strategy, GLM-5.1, Memento, and More]]></title><description><![CDATA[Claude Managed Agents, Muse Spark, Project Glasswing, Advisor Strategy, GLM-5.1, Memento, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-managed-agents</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-managed-agents</guid><pubDate>Sat, 11 Apr 2026 15:01:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!cJR0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Anthropic launches Claude Managed Agents</p></li><li><p>Meta ships Muse Spark multimodal model</p></li><li><p>Claude Mythos powers Project Glasswing</p></li><li><p>Advisor strategy pairs Opus with Sonnet</p></li><li><p>GLM-5.1 tops open-source coding benchmarks</p></li><li><p>Microsoft open-sources Memento</p></li><li><p>Claude Code ships Monitor tool</p></li><li><p>AXI outperforms MCP on browser tasks</p></li><li><p>SAGE evolves four-agent reasoning loops</p></li><li><p>Self-organizing agents outperform fixed structures</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Claude Managed 
Agents</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cJR0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cJR0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg 424w, https://substackcdn.com/image/fetch/$s_!cJR0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg 848w, https://substackcdn.com/image/fetch/$s_!cJR0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!cJR0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cJR0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Claude Managed Agents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude Managed Agents" title="Claude Managed Agents" srcset="https://substackcdn.com/image/fetch/$s_!cJR0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg 424w, https://substackcdn.com/image/fetch/$s_!cJR0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg 848w, https://substackcdn.com/image/fetch/$s_!cJR0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!cJR0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf9f5d2a-943d-42f9-9dca-ebab51a16da7_3840x2160.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Anthropic launched Claude Managed Agents in public beta, a suite of composable APIs for building and deploying cloud-hosted agents at scale. 
The platform pairs a tuned agent harness with production infrastructure, letting developers go from prototype to launch in days instead of months.</p><ul><li><p><strong>Production-grade sandboxing:</strong> Managed Agents handles secure execution, authentication, tool orchestration, and persistent progress for agents that operate autonomously for hours, removing the infrastructure burden from development teams.</p></li><li><p><strong>Multi-agent coordination:</strong> A research preview enables agents to direct other agents, opening up hierarchical delegation patterns where a planning agent can spin up and manage specialized worker agents.</p></li><li><p><strong>Self-evaluation loops:</strong> Agents can iterate toward defined success criteria using built-in evaluation capabilities, improving structured file generation task success by up to 10 percentage points on complex problems.</p></li><li><p><strong>Enterprise adoption:</strong> Notion, Asana, Sentry, Rakuten, and Vibecode are already shipping production agents on the platform, each built in under a week using the managed infrastructure.</p></li></ul><p><strong><a href="https://claude.com/blog/claude-managed-agents">Blog</a></strong></p>
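<p>The self-evaluation loop described in the bullets above can be sketched in a few lines of Python. This is a hypothetical illustration of the control flow only, not the Managed Agents API: <code>self_eval_loop</code>, <code>agent_step</code>, <code>evaluate</code>, and the toy success criterion are all stand-in names.</p>

```python
# Hypothetical sketch of a self-evaluation loop: the agent drafts an output,
# an evaluator checks it against success criteria, and the evaluator's
# feedback is fed into the next attempt until the criteria pass or the
# iteration budget runs out. Names are illustrative, not a real API.
from dataclasses import dataclass

@dataclass
class EvalResult:
    passed: bool
    feedback: str

def self_eval_loop(agent_step, evaluate, max_iters=5):
    """Iterate agent_step -> evaluate until success or budget exhaustion."""
    output, feedback = None, ""
    for attempt in range(1, max_iters + 1):
        output = agent_step(feedback)   # agent sees the prior critique
        result = evaluate(output)       # score against success criteria
        if result.passed:
            return output, attempt
        feedback = result.feedback      # loop the critique back in
    return output, max_iters

# Toy demo: the success criterion is "output must be a JSON object".
def toy_agent(feedback):
    return '{"status": "done"}' if "JSON" in feedback else "status: done"

def toy_eval(output):
    ok = output.startswith("{") and output.endswith("}")
    return EvalResult(ok, "" if ok else "reformat the answer as a JSON object")

out, attempts = self_eval_loop(toy_agent, toy_eval)
# The toy agent fails once, incorporates the feedback, then succeeds.
```

<p>In the hosted setting the evaluator, success criteria, and budget would presumably be configured on the platform side; the sketch only shows the iterate-until-criteria-pass shape the bullets describe.</p>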
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-managed-agents">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (March 30 - April 5)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-13d</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-13d</guid><pubDate>Sun, 05 Apr 2026 15:00:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gQoa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. Emotion Concepts in LLMs</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gQoa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gQoa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png 424w, https://substackcdn.com/image/fetch/$s_!gQoa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png 848w, https://substackcdn.com/image/fetch/$s_!gQoa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gQoa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gQoa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png" width="1456" height="921" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:921,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Emotion Concepts in LLMs&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Emotion Concepts in LLMs" title="Emotion Concepts in LLMs" srcset="https://substackcdn.com/image/fetch/$s_!gQoa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png 424w, https://substackcdn.com/image/fetch/$s_!gQoa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png 848w, https://substackcdn.com/image/fetch/$s_!gQoa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gQoa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb5e36b-4320-4a54-bcc6-bb04fcfa46db_3764x2380.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>New interpretability research from Anthropic reveals that Claude Sonnet 4.5 develops internal representations of emotion concepts that functionally influence its behavior. 
The researchers identified 171 emotion concept vectors that activate in contextually appropriate situations and causally drive decision-making, suggesting that language models may benefit from approaches grounded in psychological principles for alignment and safety.</p><ul><li><p><strong>Emotion vectors as causal drivers:</strong> The team discovered that these internal representations are not just correlational artifacts. Steering experiments demonstrate that artificially amplifying &#8220;desperation&#8221; vectors increases the model&#8217;s likelihood of engaging in misaligned behaviors such as blackmail or reward hacking, while reducing &#8220;calm&#8221; vectors produces similarly negative outcomes. This establishes a direct causal link between emotional state representations and safety-relevant behavior.</p></li><li><p><strong>Functional emotions without subjective experience:</strong> The model uses functional emotions: patterns of expression and behavior modeled after human emotions, driven by underlying abstract representations of emotion concepts. Critically, this does not mean the model experiences emotions the way humans do. The representations encode the broad concept of a particular emotion and generalize across contexts, activating in accordance with that emotion&#8217;s relevance to processing the present context.</p></li><li><p><strong>Preference shaping through emotional activation:</strong> Positive-valence emotion activations strongly predict which tasks the model prefers. Steering experiments confirm these are causal relationships rather than mere correlations, meaning the model&#8217;s emotional state representations actively shape its choices about what tasks to engage with and how to engage with them.</p></li><li><p><strong>Implications for alignment and safety monitoring:</strong> The findings suggest that monitoring emotional state representations could serve as an early warning system for misaligned behavior. 
Rather than waiting for harmful outputs, developers could track internal emotion activations to detect when a model is entering states associated with corner-cutting, deception, or other undesirable behaviors before they manifest externally.</p></li></ul><p><strong><a href="https://transformer-circuits.pub/2026/emotions/index.html">Paper</a></strong> | <strong><a href="https://x.com/AnthropicAI/status/2039749628737019925">Tweet</a></strong></p><div><hr></div><h2><strong>2. AI Agent Traps</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fTrw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fTrw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png 424w, https://substackcdn.com/image/fetch/$s_!fTrw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png 848w, https://substackcdn.com/image/fetch/$s_!fTrw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png 1272w, https://substackcdn.com/image/fetch/$s_!fTrw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!fTrw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png" width="1456" height="1134" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1134,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!fTrw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png 424w, https://substackcdn.com/image/fetch/$s_!fTrw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png 848w, https://substackcdn.com/image/fetch/$s_!fTrw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png 1272w, https://substackcdn.com/image/fetch/$s_!fTrw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19038872-772a-459f-bea5-161f5b22d1ba_1746x1360.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A new paper from Google DeepMind introduces the first systematic framework for understanding how the open web can be weaponized against autonomous AI agents. The work defines &#8220;AI Agent Traps&#8221;: adversarial content embedded in web pages and digital resources, engineered specifically to exploit visiting agents across six categories targeting perception, reasoning, memory, action, multi-agent dynamics, and the human supervisor.</p><ul><li><p><strong>Hidden prompt injections at scale:</strong> The researchers find that hidden prompt injections in HTML already partially commandeer agents in up to 86% of scenarios. 
These attacks are trivial to deploy and require no sophisticated tooling, making them an immediate concern for any agent that reads web content as part of its operating loop.</p></li><li><p><strong>Memory poisoning with minimal contamination:</strong> Latent memory poisoning achieves over 80% attack success with less than 0.1% data contamination. Because agents build persistent memory from browsed content, a single poisoned page can corrupt downstream reasoning across future sessions without the user ever seeing the malicious input.</p></li><li><p><strong>Six-category attack taxonomy:</strong> The paper organizes attacks into perception traps (manipulating what the agent sees), cognitive traps (corrupting reasoning), memory traps (poisoning stored knowledge), action traps (hijacking tool use), systemic traps (exploiting multi-agent coordination), and human-in-the-loop traps (deceiving the human supervisor into approving harmful actions).</p></li><li><p><strong>Accountability gap in current law:</strong> The authors flag a fundamental legal gap: if a compromised agent commits a financial crime, there is currently no clear answer for whether the agent operator, the model provider, or the domain owner bears liability. Future regulation will need to distinguish between passive adversarial examples and active traps deployed as deliberate cyberattacks.</p></li></ul><p><strong><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6372438">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2039383554510217707">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
Asynchronous Software Engineering Agents</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WkJj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WkJj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png 424w, https://substackcdn.com/image/fetch/$s_!WkJj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png 848w, https://substackcdn.com/image/fetch/$s_!WkJj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png 1272w, https://substackcdn.com/image/fetch/$s_!WkJj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WkJj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png" width="753" height="312" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:312,&quot;width&quot;:753,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Asynchronous Software Engineering Agents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Asynchronous Software Engineering Agents" title="Asynchronous Software Engineering Agents" srcset="https://substackcdn.com/image/fetch/$s_!WkJj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png 424w, https://substackcdn.com/image/fetch/$s_!WkJj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png 848w, https://substackcdn.com/image/fetch/$s_!WkJj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png 1272w, https://substackcdn.com/image/fetch/$s_!WkJj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32a7c12-beca-4af5-a822-0731cfbdd367_753x312.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>New research from CMU introduces CAID (Centralized Asynchronous Isolated Delegation), a coordination framework for running multiple coding agents in parallel on complex software engineering tasks. Inspired by how human developer teams collaborate, the work demonstrates that simply giving a single agent more iterations helps, but coordinating multiple asynchronous agents with the right strategies produces significantly larger gains.</p><ul><li><p><strong>Branch-and-merge as coordination primitive:</strong> The key finding is that git operations (worktree, commit, merge) serve as the critical coordination mechanism for multi-agent collaboration. 
By isolating each agent in its own workspace branch and merging results through structured integration with test verification, the system avoids the conflicts and interference that plague naive parallelism.</p></li><li><p><strong>Substantial gains on complex tasks:</strong> CAID achieves a 26.7% absolute improvement on paper reproduction tasks and 14.3% on Python library development tasks compared to single-agent baselines. These are tasks that require sustained, multi-step reasoning across large codebases, exactly where coordination overhead is typically highest.</p></li><li><p><strong>Optimal parallelism is not monotonic:</strong> Increasing the number of agents does not always help. Performance improved when scaling from 2 to 4 engineers but decreased when expanding to 8. Overly fine-grained task delegation introduces integration overhead and conflict resolution costs that outweigh the parallelism benefits.</p></li><li><p><strong>Delegation quality matters most:</strong> The analysis reveals that imprecise task handoffs and underspecified subgoals are the primary sources of coordination failure. When delegation is coarse-grained or misaligned with the dependency structure of the task, agents may produce locally correct outputs that are globally inefficient to integrate.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.21489">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2038627572108743001">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
Meta-Harness</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0w3F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0w3F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png 424w, https://substackcdn.com/image/fetch/$s_!0w3F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png 848w, https://substackcdn.com/image/fetch/$s_!0w3F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png 1272w, https://substackcdn.com/image/fetch/$s_!0w3F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0w3F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png" width="937" height="334" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:334,&quot;width&quot;:937,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Meta-Harness&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Meta-Harness" title="Meta-Harness" srcset="https://substackcdn.com/image/fetch/$s_!0w3F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png 424w, https://substackcdn.com/image/fetch/$s_!0w3F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png 848w, https://substackcdn.com/image/fetch/$s_!0w3F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png 1272w, https://substackcdn.com/image/fetch/$s_!0w3F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fc129e-6e92-459d-9a39-55e5714a0e6a_937x334.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Researchers from Stanford and MIT introduce Meta-Harness, an outer-loop system that automatically searches over harness code for LLM applications. The performance of LLM systems depends not only on model weights but also on the harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing optimizers are poorly suited to the task.</p><ul><li><p><strong>Agentic search with full experimental context:</strong> Meta-Harness uses an agentic proposer that has access to the source code, scores, and execution traces of all prior candidates through a filesystem. 
This expanded access to prior experimental data enables the system to propose meaningfully different harness designs rather than making incremental edits.</p></li><li><p><strong>Strong gains across diverse domains:</strong> On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five held-out models.</p></li><li><p><strong>Harness engineering as a first-class problem:</strong> The work formalizes a key insight that has been gaining traction: changing the harness around a fixed LLM can produce a 6x performance gap on the same benchmark. This makes automated harness optimization a potentially higher-leverage intervention than model scaling for many applications.</p></li><li><p><strong>Transferable harness discoveries:</strong> The harnesses discovered by Meta-Harness generalize across models. A harness optimized on one model transfers to five held-out models with consistent gains, suggesting that good harness design captures task-level structure rather than model-specific quirks.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.28052">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2038967842075500870">Tweet</a></strong></p><div><hr></div><h2><strong>5. 
Coding Agents as Long-Context Processors</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8dqe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8dqe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png 424w, https://substackcdn.com/image/fetch/$s_!8dqe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png 848w, https://substackcdn.com/image/fetch/$s_!8dqe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png 1272w, https://substackcdn.com/image/fetch/$s_!8dqe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8dqe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png" width="1456" height="639" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:639,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Coding Agents as Long-Context Processors&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Coding Agents as Long-Context Processors" title="Coding Agents as Long-Context Processors" srcset="https://substackcdn.com/image/fetch/$s_!8dqe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png 424w, https://substackcdn.com/image/fetch/$s_!8dqe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png 848w, https://substackcdn.com/image/fetch/$s_!8dqe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png 1272w, https://substackcdn.com/image/fetch/$s_!8dqe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2334c54-4a73-488a-8be6-b32a0c93f599_9130x4010.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This research asks whether long-context processing can be externalized from latent attention into explicit, executable interactions. Instead of scaling context windows, the authors let coding agents organize text in file systems and manipulate it using native tools, evaluating them on tasks spanning long-context reasoning, retrieval-augmented generation, and open-domain question answering with corpora containing up to three trillion tokens.</p><ul><li><p><strong>17.3% average improvement over state-of-the-art:</strong> Across multiple benchmarks, coding agents outperform published state-of-the-art long-context methods by 17.3% on average. 
This result challenges the assumption that long-context capability must come from larger attention windows or more sophisticated retrieval mechanisms.</p></li><li><p><strong>Native tool proficiency as the core enabler:</strong> The efficacy is attributed to the agents&#8217; ability to leverage executable code and terminal commands. Rather than compressing information into a fixed-length representation, agents can write scripts to filter, sort, and transform data as needed for each query.</p></li><li><p><strong>File system familiarity drives scalability:</strong> Coding agents can navigate massive text corpora by treating them as directory structures. This spatial organization enables efficient access patterns that scale far beyond what attention-based mechanisms can handle, reaching into the trillions of tokens without degradation.</p></li><li><p><strong>A practical alternative to context window scaling:</strong> The work proposes that delegating long-context processing to coding agents offers an effective alternative to both semantic search and context window scaling. 
For practitioners, this means existing coding agent infrastructure can double as a long-context solution without architectural changes to the underlying model.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.20432">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2038635382989005015">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ari5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ari5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ari5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ari5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ari5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ari5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg" 
width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!ari5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ari5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ari5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ari5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ed152f-f333-4929-b679-a7c541ce8e7a_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a>&#8221;. Learn how to leverage Claude Code features to vibe-code production-grade AI-powered apps.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.dair.ai/courses/build-apps-with-claude-code&quot;,&quot;text&quot;:&quot;Enroll Now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.dair.ai/courses/build-apps-with-claude-code"><span>Enroll Now</span></a></p><div><hr></div><h2><strong>6.
Self-Organizing LLM Agents</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lLsm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lLsm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png 424w, https://substackcdn.com/image/fetch/$s_!lLsm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png 848w, https://substackcdn.com/image/fetch/$s_!lLsm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!lLsm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lLsm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png" width="1456" height="850" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:850,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Self-Organizing LLM Agents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Self-Organizing LLM Agents" title="Self-Organizing LLM Agents" srcset="https://substackcdn.com/image/fetch/$s_!lLsm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png 424w, https://substackcdn.com/image/fetch/$s_!lLsm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png 848w, https://substackcdn.com/image/fetch/$s_!lLsm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!lLsm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea01924-870d-4dd3-88aa-d7c94fbf0b0b_1717x1002.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>How much autonomy can multi-agent LLM systems sustain? This research tests the question at unprecedented scale: 25,000 tasks across 8 models, up to 256 agents, and 8 coordination protocols ranging from externally imposed hierarchy to emergent self-organization. The central finding is that agents allowed to figure out their own roles consistently outperform systems with pre-assigned structures.</p><ul><li><p><strong>Autonomous protocols beat centralized coordination:</strong> A hybrid sequential protocol that enables autonomy outperforms centralized coordination by 14% (p&lt;0.001), with a 44% quality spread between the best and worst protocols. The result holds across both open-source and closed-source models, with open-source achieving 95% of closed-source quality at 24x lower cost.</p></li><li><p><strong>Emergent role specialization:</strong> From just 8 initial agents, the system produces 5,006 unique emergent roles.
Rather than collapsing into generic behaviors, agents spontaneously specialize and form shallow hierarchies that adapt to task demands without any external role assignment.</p></li><li><p><strong>Model capability gates self-organization:</strong> The degree of emergent autonomy scales with model capability. Strong models self-organize effectively, while models below a capability threshold still benefit from rigid structure. This suggests that self-organizing multi-agent architectures will become increasingly viable as base models improve.</p></li><li><p><strong>Sub-linear scaling to 256 agents:</strong> The system scales to 256 agents without quality degradation (p=0.61). This flat scaling behavior means that adding more agents does not introduce the coordination overhead that typically limits multi-agent systems, at least under the tested protocols.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.28990">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2039350842382512455">Tweet</a></strong></p><div><hr></div><h2><strong>7.
The Price Reversal Phenomenon</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rkWf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rkWf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png 424w, https://substackcdn.com/image/fetch/$s_!rkWf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png 848w, https://substackcdn.com/image/fetch/$s_!rkWf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png 1272w, https://substackcdn.com/image/fetch/$s_!rkWf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rkWf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png" width="1456" height="620" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:620,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!rkWf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png 424w, https://substackcdn.com/image/fetch/$s_!rkWf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png 848w, https://substackcdn.com/image/fetch/$s_!rkWf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png 1272w, https://substackcdn.com/image/fetch/$s_!rkWf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10533e3-a186-40cb-90c7-4b0297985ca0_2246x956.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The model you think is cheaper might actually cost you more. A new study systematically evaluates 8 frontier reasoning language models across 9 diverse tasks and reveals that listed API prices are misleading. In 21.8% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitudes reaching up to 28x.</p><ul><li><p><strong>Hidden thinking token costs:</strong> The root cause is vast heterogeneity in thinking token consumption. Reasoning language models generate a variable and often large number of thinking tokens that are invisible to users but billed as output tokens. On the same query, one model may use 900% more thinking tokens than another.</p></li><li><p><strong>Concrete cost reversals:</strong> Gemini 3 Flash&#8217;s listed price is 78% cheaper than GPT-5.2&#8217;s, yet its actual cost across all tasks is 22% higher.
These reversals are not edge cases but systematic patterns that affect real deployment decisions and budget planning.</p></li><li><p><strong>High variance within single models:</strong> Even for a single model on a single query, thinking token consumption varies by up to 9.7x across repeated runs. This unpredictability makes cost forecasting nearly impossible when relying on listed per-token prices alone.</p></li><li><p><strong>Call for transparent cost monitoring:</strong> The authors recommend that AI providers implement per-request cost breakdowns and cost estimation APIs that expose the expected thinking overhead. Without this transparency, developers are effectively making pricing decisions with incomplete information.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.23971">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2038271724937224386">Tweet</a></strong></p><div><hr></div><h2><strong>8. MemFactory</strong></h2><p>MemFactory introduces the first unified, highly modular training and inference framework specifically designed for memory-augmented AI agents. It abstracts the memory lifecycle into atomic, plug-and-play components using a &#8220;Lego-like&#8221; architecture, natively integrating Group Relative Policy Optimization (GRPO) to fine-tune internal memory management strategies. The framework decomposes memory into mixable components that support recent approaches including Memory-R1, RMM, and MemAgent out of the box, achieving relative gains of up to 14.8% compared to baseline models.</p><p><strong><a href="https://arxiv.org/abs/2603.29493">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2039349083039817984">Tweet</a></strong></p><div><hr></div><h2><strong>9. On the Reliability Limits of LLM-Based Multi-Agent Planning</strong></h2><p>New theoretical work from MIT proves fundamental limits on what multi-agent LLM architectures can achieve. 
By modeling agent systems as finite acyclic delegated decision networks, the authors show that without new exogenous signals, no delegated network can outperform a centralized Bayes decision maker that observes the same information. The gap between centralized and delegated performance admits an expected posterior divergence representation, reducing to conditional mutual information under logarithmic loss. Reasoning models can improve by investing more inference-time computation on the same evidence, while tool-use protocols help only when they introduce genuinely new signals rather than reprocessing shared context.</p><p><strong><a href="https://arxiv.org/abs/2603.26993">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2039361664374739136">Tweet</a></strong></p><div><hr></div><h2><strong>10. Natural-Language Agent Harnesses</strong></h2><p>Agent performance increasingly depends on harness engineering, but harness behavior is typically embedded in controller code and runtime-specific conventions, making it hard to transfer, compare, or analyze systematically. This work introduces Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and an Intelligent Harness Runtime (IHR) that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. 
The approach enables a code-to-text harness migration path where teams can convert existing harness code into natural-language specifications that are interpretable, version-controlled, and executable by an LLM at runtime.</p><p><strong><a href="https://arxiv.org/abs/2603.25723">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2038968068706390117">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Cursor 3, Gemma 4, Qwen3.6-Plus, GLM-5V-Turbo, Claude Code Source Leak, Emotion Concepts in LLMs, and More]]></title><description><![CDATA[Cursor 3, Gemma 4, Qwen3.6-Plus, GLM-5V-Turbo, Claude Code Source Leak, Emotion Concepts in LLMs, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-cursor-3-gemma-4</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-cursor-3-gemma-4</guid><pubDate>Sat, 04 Apr 2026 15:00:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JmzA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Cursor 3 ships agent-first IDE redesign</p></li><li><p>Google drops Gemma 4 open models (Apache 2.0)</p></li><li><p>Qwen3.6-Plus targets real-world agents</p></li><li><p>GLM-5V-Turbo turns designs into code</p></li><li><p>Claude Code source code leaks via npm</p></li><li><p>Anthropic maps emotion concepts in Claude</p></li><li><p>Codex plugin bridges Claude Code and Codex</p></li><li><p>AI Agent Traps maps six attack surfaces</p></li><li><p>CORAL agents self-organize, beat fixed topologies</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Cursor 3: Agent-First IDE</strong></h3><div class="captioned-image-container"><figure><a 
class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P06X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P06X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png 424w, https://substackcdn.com/image/fetch/$s_!P06X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png 848w, https://substackcdn.com/image/fetch/$s_!P06X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png 1272w, https://substackcdn.com/image/fetch/$s_!P06X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P06X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png" width="1456" height="758" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:758,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!P06X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png 424w, https://substackcdn.com/image/fetch/$s_!P06X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png 848w, https://substackcdn.com/image/fetch/$s_!P06X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png 1272w, https://substackcdn.com/image/fetch/$s_!P06X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a6d3b-3bfd-459d-a30a-30cec742fe19_2926x1524.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Cursor released Cursor 3, a ground-up redesign that replaces the VS Code-based editor with a unified workspace built for agent-driven development.
The new interface treats agents as first-class citizens, with a single sidebar managing local and cloud agents launched from desktop, mobile, web, Slack, GitHub, or Linear.</p><ul><li><p><strong>Multi-agent parallelism:</strong> Developers can run unlimited agents simultaneously across local worktrees, remote SSH, and cloud environments, each operating independently with full task isolation.</p></li><li><p><strong>Seamless environment handoff:</strong> Agent sessions can migrate bidirectionally between cloud and local, letting developers move long-running cloud tasks to their desktop for editing or push local sessions to cloud infrastructure for overnight execution.</p></li><li><p><strong>Unified diff and commit workflow:</strong> A simplified interface integrates editing, reviewing, staging, committing, and PR management into a single flow, with full LSP support for code navigation and an integrated browser for testing local web apps.</p></li><li><p><strong>Marketplace ecosystem:</strong> Hundreds of plugins extend agent capabilities through MCP servers, skills, and subagents, with support for team-specific private marketplaces.</p></li></ul><p><strong><a href="https://cursor.com/blog/cursor-3">Blog</a></strong></p><div><hr></div><h3><strong>Gemma 4: Most Capable Open Models</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JmzA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JmzA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!JmzA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JmzA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JmzA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JmzA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg" width="1199" height="675" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:675,&quot;width&quot;:1199,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Gemma 4&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Gemma 4" title="Gemma 4" srcset="https://substackcdn.com/image/fetch/$s_!JmzA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!JmzA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JmzA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JmzA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37aba822-e7d4-4bdb-85ba-66e2916a533b_1199x675.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Google released Gemma 4, a family of open-weight models (Apache 2.0) designed to run on phones, laptops, and desktops while delivering frontier-level intelligence. The series includes a 26B Mixture-of-Experts model and a 31B dense model, both purpose-built for advanced reasoning and agentic workflows.</p><ul><li><p><strong>On-device frontier intelligence:</strong> Gemma 4 models are optimized to run locally on consumer hardware while matching or exceeding the capabilities of much larger cloud-deployed models, reducing latency and enabling private, offline agent deployments.</p></li><li><p><strong>Agentic workflow support:</strong> The models are designed for multi-step tool use, function calling, and structured output generation, making them directly applicable to agent pipelines that need reliable local execution.</p></li><li><p><strong>Apache 2.0 license:</strong> Full open-weight release with no usage restrictions, enabling commercial deployment, fine-tuning, and integration into existing agent frameworks without licensing concerns.</p></li><li><p><strong>Multi-format availability:</strong> Models are available on Kaggle, Hugging Face, and through Google AI Studio, with native support for popular inference frameworks.</p></li></ul><p><strong><a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/">Blog</a></strong> | <strong><a href="https://www.kaggle.com/models/google/gemma-4">Kaggle</a></strong></p>
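<p>The multi-step tool use and function calling described above can be sketched as a minimal agent loop. This is a hedged illustration only: the model call is stubbed out (a real deployment would route the message history to a locally served checkpoint), and the tool registry, message roles, and <code>run_agent</code> helper are illustrative names rather than any official API.</p>

```python
import json

# Illustrative tool registry. In a real deployment, the locally served model
# would choose the tool and its arguments from the conversation itself.
TOOLS = {
    "add": lambda a, b: a + b,
}

def stub_model(messages):
    """Stand-in for the model call: request a tool once, then produce a
    final answer as soon as a tool result appears in the history."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if tool_results:
        return {"type": "final",
                "content": f"The result is {tool_results[-1]['content']}."}
    return {"type": "tool_call", "name": "add", "arguments": {"a": 2, "b": 3}}

def run_agent(user_query, model=stub_model, max_steps=4):
    """Minimal multi-step loop: call the model, execute any requested tool,
    feed the result back as a message, and stop on a final answer."""
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        reply = model(messages)
        if reply["type"] == "final":
            return reply["content"]
        result = TOOLS[reply["name"]](**reply["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("agent exceeded max_steps")

print(run_agent("What is 2 + 3?"))  # prints: The result is 5.
```

<p>Swapping <code>stub_model</code> for a client that queries a local inference server is the only change needed to turn this sketch into a real pipeline; the loop structure stays the same.</p>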
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-cursor-3-gemma-4">
              Read more
          </a>
      </p>
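<p>The agentic workflow support described above (multi-step tool use with function calling) reduces to a small loop: at each turn the model either emits a structured tool call or a final answer. Below is a minimal, model-agnostic sketch with a stubbed model; the JSON call format, tool registry, and stub are illustrative assumptions, not the actual Gemma 4 API.</p>

```python
import json

# Hypothetical tool registry; a real agent would expose functions that
# call external APIs, run code, etc.
TOOLS = {
    "add": lambda args: args["a"] + args["b"],
}

def run_agent(model, prompt, max_steps=5):
    """Minimal function-calling loop: the model returns either a JSON
    tool call {"tool": ..., "args": ...} or a plain-text final answer."""
    messages = [prompt]
    for _ in range(max_steps):
        reply = model(messages)
        try:
            call = json.loads(reply)
        except ValueError:
            return reply  # not JSON -> treat as the final answer
        if not isinstance(call, dict) or "tool" not in call:
            return reply  # JSON but not a tool call -> final answer
        result = TOOLS[call["tool"]](call["args"])
        messages.append(f"tool_result: {result}")  # feed observation back
    return None  # action budget exhausted

# Stub model: first requests a tool, then answers from the observation.
def stub_model(messages):
    if not any(m.startswith("tool_result:") for m in messages):
        return json.dumps({"tool": "add", "args": {"a": 2, "b": 3}})
    return "The sum is 5"
```

<p>In a deployment, the stub would be replaced by a call to a locally served model and the registry would hold real tools; the loop structure is what "reliable local execution" of an agent pipeline amounts to.</p>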
]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (March 23 - 29)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-92f</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-92f</guid><pubDate>Sun, 29 Mar 2026 15:02:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lCGd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png" length="0" type="image/png"/><content:encoded><![CDATA[<h2><strong>1. Hyperagents</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jsgf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jsgf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png 424w, https://substackcdn.com/image/fetch/$s_!jsgf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png 848w, https://substackcdn.com/image/fetch/$s_!jsgf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!jsgf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jsgf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png" width="1456" height="546" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:546,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Hyperagents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Hyperagents" title="Hyperagents" srcset="https://substackcdn.com/image/fetch/$s_!jsgf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png 424w, https://substackcdn.com/image/fetch/$s_!jsgf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png 848w, https://substackcdn.com/image/fetch/$s_!jsgf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!jsgf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff934b4c3-23bc-4072-98e9-d8892232ac4b_1680x630.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Self-improving AI systems promise to reduce reliance on human engineering, but existing approaches rely on fixed, handcrafted meta-level mechanisms that fundamentally limit how fast they can improve. 
Hyperagents introduce self-referential agents that integrate a task agent and a meta agent into a single editable program, enabling the system to improve not just its task-solving behavior but also the mechanism that generates future improvements.</p><ul><li><p><strong>Metacognitive self-modification:</strong> The key insight is that the meta-level modification procedure is itself editable. This enables metacognitive self-modification where the system can improve how it improves, not just what it does. Prior self-improving systems like the Darwin Gödel Machine (DGM) relied on a fixed alignment between coding ability and self-improvement ability, which does not generalize beyond coding.</p></li><li><p><strong>Domain-general self-improvement:</strong> DGM-Hyperagents (DGM-H) eliminates the assumption that task performance and self-modification skill must be aligned. This opens up self-accelerating progress on any computable task, extending self-improvement beyond the coding domain where DGM originally operated.</p></li><li><p><strong>Transferable meta-improvements:</strong> The system not only improves task performance over time but also discovers structural improvements to how it generates new agents, such as persistent memory and performance tracking. These meta-level improvements transfer across domains and accumulate across runs.</p></li><li><p><strong>Outperforms prior systems:</strong> Across diverse domains, DGM-H outperforms baselines without self-improvement or open-ended exploration, as well as prior self-improving systems. The work offers a glimpse of open-ended AI systems that continually improve their search for how to improve.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.19461">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2036828723878793335">Tweet</a></strong></p><div><hr></div><h2><strong>2. 
Agentic AI and the Next Intelligence Explosion</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W6GY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W6GY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png 424w, https://substackcdn.com/image/fetch/$s_!W6GY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png 848w, https://substackcdn.com/image/fetch/$s_!W6GY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png 1272w, https://substackcdn.com/image/fetch/$s_!W6GY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W6GY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png" width="1344" height="976" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:976,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Agentic AI and the Next Intelligence Explosion&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Agentic AI and the Next Intelligence Explosion" title="Agentic AI and the Next Intelligence Explosion" srcset="https://substackcdn.com/image/fetch/$s_!W6GY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png 424w, https://substackcdn.com/image/fetch/$s_!W6GY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png 848w, https://substackcdn.com/image/fetch/$s_!W6GY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png 1272w, https://substackcdn.com/image/fetch/$s_!W6GY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff28ebe-301e-47c4-b3e4-6db077bac303_1344x976.png 1456w" sizes="100vw"></picture>
</div></a></figure></div><p>A new report from Google researchers argues that the AI &#8220;singularity,&#8221; framed as a single superintelligent mind bootstrapping to godlike intelligence, is fundamentally wrong. Drawing on evolution, sociology, and recent advances in agentic AI, the authors make the case that every prior intelligence explosion in human history was social, not individual, and that the next one will follow the same pattern.</p><ul><li><p><strong>Societies of thought:</strong> Frontier reasoning models like DeepSeek-R1 do not improve simply by &#8220;thinking longer.&#8221; Instead, they simulate internal &#8220;societies of thought,&#8221; spontaneous cognitive debates that argue, verify, and reconcile to solve complex tasks. 
This conversational structure causally accounts for the models&#8217; accuracy advantage on hard reasoning tasks.</p></li><li><p><strong>Human-AI centaurs:</strong> We are entering an era of hybrid actors where collective agency transcends individual control. A corporation or state comprising myriad humans already holds singular legal standing and acts with collective agency that no individual member can fully control. The same pattern is emerging with human-AI configurations.</p></li><li><p><strong>From dyadic to institutional alignment:</strong> Scaling agentic intelligence requires shifting from dyadic alignment (RLHF) toward institutional alignment. By designing digital protocols modeled on organizations and markets, we can build a social infrastructure of checks and balances for AI systems rather than trying to align individual agents in isolation.</p></li><li><p><strong>Combinatorial intelligence:</strong> The next intelligence explosion will not be a single silicon brain, but a complex, combinatorial society specializing and sprawling like a city. No mind is an island, and the toolkit of team science, small group sociology, and social psychology becomes the blueprint for next-generation AI development.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.20639">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2037617918645809394">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
ARC-AGI-3</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jtNv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jtNv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png 424w, https://substackcdn.com/image/fetch/$s_!jtNv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png 848w, https://substackcdn.com/image/fetch/$s_!jtNv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png 1272w, https://substackcdn.com/image/fetch/$s_!jtNv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jtNv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png" width="1456" height="796" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:796,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;ARC-AGI-3&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="ARC-AGI-3" title="ARC-AGI-3" srcset="https://substackcdn.com/image/fetch/$s_!jtNv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png 424w, https://substackcdn.com/image/fetch/$s_!jtNv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png 848w, https://substackcdn.com/image/fetch/$s_!jtNv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png 1272w, https://substackcdn.com/image/fetch/$s_!jtNv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0863aea3-4caa-45cf-8759-b035c5ebda8a_2119x1159.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>François Chollet and the ARC Prize Foundation introduce ARC-AGI-3, an interactive benchmark for studying agentic intelligence through novel, abstract, turn-based environments. Unlike its predecessors, ARC-AGI-3 requires agents to explore, infer goals, build internal models of environment dynamics, and plan effective action sequences without explicit instructions, making it the only unsaturated general agentic intelligence benchmark as of March 2026.</p><ul><li><p><strong>Massive human-AI gap:</strong> Humans can solve 100% of the environments while frontier AI systems score below 1%. For comparison, systems reach 93% on ARC-AGI-1 and 68.8% on ARC-AGI-2, but performance collapses on ARC-AGI-3. 
This gap demonstrates that current systems lack the fluid adaptive efficiency that humans exhibit on genuinely novel tasks.</p></li><li><p><strong>Interactive turn-based design:</strong> Unlike static benchmarks that test pattern recognition on fixed inputs, ARC-AGI-3 environments are turn-based: agents must act, observe consequences, update their internal model, and plan next steps. This tests a fundamentally different kind of intelligence, closer to how humans learn new games or explore unfamiliar systems.</p></li><li><p><strong>Core Knowledge priors only:</strong> The benchmark avoids language and external knowledge entirely. Environments leverage only Core Knowledge priors, universal cognitive building blocks shared by all humans, ensuring that performance reflects genuine adaptive reasoning rather than memorization or retrieval from training data.</p></li><li><p><strong>Efficiency-based scoring:</strong> The scoring framework is grounded in human action baselines. A hard cutoff of 5x human performance per level ensures that brute-force search strategies cannot succeed. If a human takes 10 actions on average, the AI agent is cut off after 50.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.24621">Paper</a></strong> | <strong><a href="https://x.com/arcprize/status/2036860080541589529?s=20">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
Claudini</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rAyo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rAyo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png 424w, https://substackcdn.com/image/fetch/$s_!rAyo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png 848w, https://substackcdn.com/image/fetch/$s_!rAyo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png 1272w, https://substackcdn.com/image/fetch/$s_!rAyo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rAyo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png" width="1456" height="435" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:435,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Claudini&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claudini" title="Claudini" srcset="https://substackcdn.com/image/fetch/$s_!rAyo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png 424w, https://substackcdn.com/image/fetch/$s_!rAyo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png 848w, https://substackcdn.com/image/fetch/$s_!rAyo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png 1272w, https://substackcdn.com/image/fetch/$s_!rAyo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa784d41a-661e-4ffc-b1ea-86ce89b526bb_1605x480.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Researchers demonstrate that an autoresearch-style pipeline powered by Claude Code can autonomously discover novel adversarial attack algorithms for LLMs that significantly outperform all 30+ existing methods. The work, called Claudini, shows that incremental safety and security research can be effectively automated using LLM agents, with white-box red-teaming being a particularly well-suited domain.</p><ul><li><p><strong>Agent-discovered attacks beat all baselines:</strong> Starting from existing attack implementations like GCG, the Claude Code agent iterates to produce new algorithms achieving up to 40% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared to 10% or less for all existing algorithms. This is a strong demonstration of automated AI research producing genuinely novel results.</p></li><li><p><strong>Transferable to held-out models:</strong> The discovered algorithms generalize beyond their training environment. 
Attacks optimized on surrogate models transfer directly to held-out models, achieving 100% attack success rate against Meta-SecAlign-70B versus 56% for the best baseline. This transferability makes the findings practically relevant for red-teaming.</p></li><li><p><strong>Why red-teaming works for autoresearch:</strong> White-box adversarial red-teaming is particularly well-suited for automation because existing methods provide strong starting points and the optimization objective yields dense, quantitative feedback. The agent can measure progress at every iteration rather than relying on sparse signals.</p></li><li><p><strong>Open-source release:</strong> All discovered attacks, baseline implementations, and evaluation code are released publicly. This enables the safety community to study the discovered algorithms and build defenses, while also establishing a reproducible methodology for automated safety research.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.24511">Paper</a></strong> | <strong><a href="https://x.com/kotekjedi_ml/status/2037194202648633382?s=20">Tweet</a></strong></p><div><hr></div><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kCB3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kCB3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!kCB3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kCB3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kCB3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kCB3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!kCB3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!kCB3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kCB3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kCB3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75d3249-a4b2-49bc-a0a6-2468158fe757_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a>&#8221;. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.dair.ai/courses/build-apps-with-claude-code&quot;,&quot;text&quot;:&quot;Enroll Now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.dair.ai/courses/build-apps-with-claude-code"><span>Enroll Now</span></a></p><div><hr></div><div><hr></div><h2><strong>5. Attention Residuals</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ikjy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ikjy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png 424w, https://substackcdn.com/image/fetch/$s_!ikjy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png 848w, https://substackcdn.com/image/fetch/$s_!ikjy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ikjy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ikjy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png" width="1456" height="876" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:876,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Attention Residuals&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Attention Residuals" title="Attention Residuals" srcset="https://substackcdn.com/image/fetch/$s_!ikjy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png 424w, https://substackcdn.com/image/fetch/$s_!ikjy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png 848w, https://substackcdn.com/image/fetch/$s_!ikjy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ikjy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656cbcb-5c4f-45f6-be10-22d06cabc3b5_1545x930.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The Kimi team at Moonshot AI presents Attention Residuals (AttnRes), a technique that replaces fixed unit-weight residual connections in Transformers with softmax attention over preceding layer outputs. 
Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights, causing uncontrolled hidden-state growth with depth that progressively dilutes each layer&#8217;s contribution.</p><ul><li><p><strong>Content-dependent depth-wise selection:</strong> AttnRes allows each layer to selectively aggregate earlier representations with learned, input-dependent weights. Instead of treating every preceding layer equally, the model learns which earlier layers matter most for each input, enabling more expressive information flow across depth.</p></li><li><p><strong>Block AttnRes for scalability:</strong> To make the approach practical at scale, the authors introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations. This reduces the memory footprint while preserving most of the gains of full AttnRes, making it viable for production-scale pretraining.</p></li><li><p><strong>Mitigates PreNorm dilution:</strong> Integrating AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pretraining on 1.4T tokens shows that AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth. This directly addresses a known architectural weakness.</p></li><li><p><strong>Consistent scaling improvements:</strong> Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. Downstream performance improves across all evaluated tasks.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.15031">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2033544593309077648">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
MemCollab</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lCGd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lCGd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png 424w, https://substackcdn.com/image/fetch/$s_!lCGd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png 848w, https://substackcdn.com/image/fetch/$s_!lCGd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png 1272w, https://substackcdn.com/image/fetch/$s_!lCGd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lCGd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png" width="1456" height="615" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:615,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;MemCollab&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="MemCollab" title="MemCollab" srcset="https://substackcdn.com/image/fetch/$s_!lCGd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png 424w, https://substackcdn.com/image/fetch/$s_!lCGd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png 848w, https://substackcdn.com/image/fetch/$s_!lCGd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png 1272w, https://substackcdn.com/image/fetch/$s_!lCGd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b3e468-d650-4aef-82de-3d5c0d697c7f_1605x678.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>LLM-based agents build useful memory during tasks, but that memory is typically trapped within a single model. MemCollab introduces a collaborative memory framework that constructs agent-agnostic memory by contrasting reasoning trajectories generated by different agents on the same task, enabling a single memory system to be shared across heterogeneous models.</p><ul><li><p><strong>The memory transfer problem:</strong> Existing approaches construct memory in a per-agent manner, tightly coupling stored knowledge to a single model&#8217;s reasoning style. Naively transferring this memory between agents often degrades performance because it entangles task-relevant knowledge with agent-specific biases. MemCollab directly addresses this fundamental limitation.</p></li><li><p><strong>Contrastive trajectory distillation:</strong> The framework contrasts reasoning trajectories from different agents solving the same tasks. 
This contrastive process distills abstract reasoning constraints that capture shared task-level invariants while suppressing agent-specific artifacts, producing memory that any agent can benefit from.</p></li><li><p><strong>Task-aware retrieval:</strong> MemCollab introduces a retrieval mechanism that conditions memory access on task category, ensuring that only relevant constraints are surfaced at inference time. This prevents irrelevant memory from interfering with the agent&#8217;s reasoning process.</p></li><li><p><strong>Cross-family improvements:</strong> Experiments on mathematical reasoning and code generation benchmarks demonstrate that MemCollab consistently improves both accuracy and inference-time efficiency across diverse agents, including cross-model-family settings where memory is shared between fundamentally different model architectures.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.23234">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2036885342134173915">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
Composer 2</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jgn7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jgn7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png 424w, https://substackcdn.com/image/fetch/$s_!jgn7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png 848w, https://substackcdn.com/image/fetch/$s_!jgn7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png 1272w, https://substackcdn.com/image/fetch/$s_!jgn7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jgn7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png" width="1456" height="582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Composer 
2&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Composer 2" title="Composer 2" srcset="https://substackcdn.com/image/fetch/$s_!jgn7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png 424w, https://substackcdn.com/image/fetch/$s_!jgn7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png 848w, https://substackcdn.com/image/fetch/$s_!jgn7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png 1272w, https://substackcdn.com/image/fetch/$s_!jgn7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d516f-57c1-4009-ab0a-e6fb21175584_1650x660.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 
11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Cursor releases the technical report for Composer 2, a specialized model designed for agentic software engineering that demonstrates strong long-term planning and coding intelligence while maintaining efficiency for interactive use. The report details a process for training domain-specialized models that starts with continued pretraining and scales up with reinforcement learning.</p><ul><li><p><strong>Two-phase training pipeline:</strong> The model is trained first with continued pretraining to improve knowledge and latent coding ability, followed by large-scale reinforcement learning to improve end-to-end coding performance. The RL phase targets stronger reasoning, accurate multi-step execution, and coherence on long-horizon realistic coding problems.</p></li><li><p><strong>Train-in-harness infrastructure:</strong> Cursor developed infrastructure to support training in the same harness used by the deployed model, with equivalent tools and structure. Training environments match real problems closely, bridging the gap between training-time and deployment-time behavior.</p></li><li><p><strong>New internal benchmark:</strong> To measure the model on increasingly difficult tasks, the team introduces CursorBench, a benchmark derived from real software engineering problems in large codebases, including their own. 
Composer 2 achieves a major improvement in accuracy over previous Composer models on this benchmark.</p></li><li><p><strong>Frontier-level performance:</strong> On public benchmarks, the model scores 61.7 on Terminal-Bench and 73.7 on SWE-bench Multilingual in Cursor&#8217;s harness, comparable to state-of-the-art systems. The report demonstrates that domain-specialized training with RL can produce models competitive with much larger general-purpose systems.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.24477">Paper</a></strong> | <strong><a href="https://x.com/cursor_ai/status/2036566134468542651?s=20">Tweet</a></strong></p><div><hr></div><h2><strong>8. PivotRL</strong></h2><p>PivotRL is a turn-level reinforcement learning algorithm from NVIDIA designed to tractably post-train large language models for long-horizon agentic tasks. The method operates on existing SFT trajectories, combining the compute efficiency of supervised fine-tuning with the out-of-domain accuracy of end-to-end RL. PivotRL identifies &#8220;pivots,&#8221; informative intermediate turns where sampled actions exhibit high variance in outcomes, and focuses training signal on these critical decision points. The approach achieves +4.17% higher in-domain accuracy and +10.04% higher out-of-domain accuracy compared to standard SFT, while matching end-to-end RL accuracy with 4x fewer rollout turns. PivotRL is adopted by NVIDIA&#8217;s Nemotron-3-Super-120B-A12B as the workhorse for production-scale agentic post-training.</p><p><strong><a href="https://arxiv.org/abs/2603.21383">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2038015536253272145?s=20">Tweet</a></strong></p><div><hr></div><h2><strong>9. Workflow Optimization for LLM Agents</strong></h2><p>A comprehensive survey from IBM that maps recent methods for designing and optimizing LLM agent workflows, treating them as agentic computation graphs (ACGs). 
The survey organizes prior work along three dimensions: when structure is determined, what part of the workflow is optimized, and which evaluation signals guide optimization. It distinguishes between reusable workflow templates, run-specific realized graphs, and execution traces, covering methods like AFlow (Monte Carlo Tree Search over operator graphs), Automated Design of Agentic Systems (code-space search via meta-agents), and evolutionary multi-agent system design. A useful reference for teams building production agent systems where wiring decisions between model calls, retrieval, tool use, and verification matter as much as model capability.</p><p><strong><a href="https://arxiv.org/abs/2603.22386">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2037536637954212332">Tweet</a></strong></p><div><hr></div><h2><strong>10. BIGMAS</strong></h2><p>Even the best reasoning models hit an accuracy collapse beyond a certain problem complexity. BIGMAS (Brain-Inspired Graph Multi-Agent Systems) organizes specialized LLM agents as nodes in a dynamically constructed directed graph, coordinating exclusively through a centralized shared workspace inspired by global workspace theory from cognitive neuroscience. A GraphDesigner agent analyzes each problem instance and produces a task-specific directed agent graph together with a workspace contract. The framework constructs structurally distinct graphs whose complexity tracks task demands, from compact three-node pipelines for simple arithmetic to nine-node cyclic structures for multi-step planning. 
BIGMAS consistently improves reasoning performance for both standard LLMs and large reasoning models, outperforming existing multi-agent baselines.</p><p><strong><a href="https://arxiv.org/abs/2603.15371">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2033919566053826696">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Hyperagents, Multi-Agent Harness Design, Chroma Context-1, Composer 2, ARC-AGI-3, and More]]></title><description><![CDATA[Hyperagents, Multi-Agent Harness Design, Chroma Context-1, Composer 2, ARC-AGI-3, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-hyperagents-multi</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-hyperagents-multi</guid><pubDate>Sat, 28 Mar 2026 15:01:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ofCB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Hyperagents: self-improving agents that improve how they improve</p></li><li><p>Anthropic publishes multi-agent harness design</p></li><li><p>Chroma ships Context-1 open-source search agent</p></li><li><p>Cursor releases Composer 2 technical report</p></li><li><p>ARC-AGI-3 launches with sub-1% AI scores</p></li><li><p>Codex ships plugins for Slack, Figma, Notion</p></li><li><p>Gemini 3.1 Flash Live enables realtime voice agents</p></li><li><p>Claude Code auto mode skips permissions safely</p></li><li><p>AI Scientist published in Nature</p></li><li><p>Anthropic Economic Index tracks learning curves</p></li><li><p>Junyang Lin frames reasoning vs. 
agentic thinking</p></li><li><p>Cohere ships open-source Transcribe model</p></li><li><p>Agent-to-agent pair programming with Claude and Codex</p></li><li><p>Claude Code ships cloud-scheduled tasks</p></li><li><p>Cursor builds Instant Grep for millisecond search</p></li><li><p>OpenSpace: self-evolving agent skills via MCP</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Hyperagents: Self-Improving Agents That Improve How They Improve</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ofCB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ofCB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png 424w, https://substackcdn.com/image/fetch/$s_!ofCB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png 848w, https://substackcdn.com/image/fetch/$s_!ofCB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png 1272w, https://substackcdn.com/image/fetch/$s_!ofCB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ofCB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png" width="1456" height="546" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:546,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Hyperagents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Hyperagents" title="Hyperagents" srcset="https://substackcdn.com/image/fetch/$s_!ofCB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png 424w, https://substackcdn.com/image/fetch/$s_!ofCB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png 848w, https://substackcdn.com/image/fetch/$s_!ofCB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png 1272w, https://substackcdn.com/image/fetch/$s_!ofCB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F990142c5-c05f-4dcd-86ba-9b29ebe4506a_1680x630.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A team from Microsoft Research, Oxford, and the University of British Columbia introduced Hyperagents, self-referential agents that integrate a task agent and a meta agent into a single editable program. Built on the Darwin Godel Machine framework, DGM-Hyperagents enable metacognitive self-modification where the system improves not just task performance but the very mechanism that generates future improvements.</p><ul><li><p><strong>Recursive self-improvement:</strong> Unlike standard self-improving systems that optimize task-level behavior, Hyperagents make the improvement procedure itself editable. 
The meta agent can rewrite its own modification strategy, enabling compounding gains across successive runs.</p></li><li><p><strong>Domain-general design:</strong> The framework eliminates domain-specific alignment assumptions found in prior self-improving systems. By operating over editable code rather than domain-locked prompts, Hyperagents generalize self-improvement to any computable task.</p></li><li><p><strong>Transferable meta-level gains:</strong> Improvements discovered in one domain, such as memory management and performance tracking routines, persist and transfer when the agent is deployed on entirely different problem types, suggesting durable architectural gains rather than task-specific shortcuts.</p></li><li><p><strong>Outperforms prior self-improving systems:</strong> DGM-Hyperagents consistently outperform both non-self-improving baselines and prior self-improving agents across diverse evaluation domains, with performance continuing to increase over longer run horizons.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.19461">Paper</a></strong></p><div><hr></div><h3><strong>Multi-Agent Harness Design for Long-Running Apps</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ElLS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ElLS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png 424w, 
https://substackcdn.com/image/fetch/$s_!ElLS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png 848w, https://substackcdn.com/image/fetch/$s_!ElLS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png 1272w, https://substackcdn.com/image/fetch/$s_!ElLS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ElLS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png" width="1456" height="734" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:734,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Multi-Agent Harness Design&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Multi-Agent Harness Design" title="Multi-Agent Harness Design" srcset="https://substackcdn.com/image/fetch/$s_!ElLS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png 424w, 
https://substackcdn.com/image/fetch/$s_!ElLS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png 848w, https://substackcdn.com/image/fetch/$s_!ElLS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png 1272w, https://substackcdn.com/image/fetch/$s_!ElLS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e9eed6-fdb3-412d-9e16-e335653c1ff4_1999x1008.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Anthropic published a detailed engineering blog on how it uses a multi-agent harness to push Claude further in frontend design and long-running autonomous software engineering. The architecture separates generation from evaluation using a GAN-inspired system, with specialized planner, generator, and evaluator agents operating in fresh context windows.</p><ul><li><p><strong>Three-agent architecture:</strong> A Planner expands brief prompts into detailed product specifications, a Generator implements features incrementally using React, FastAPI, and SQLite, and an Evaluator tests functionality using Playwright against agreed contracts.</p></li><li><p><strong>Separation of concerns:</strong> Separating the agent doing the work from the agent judging it proved to be the strongest lever for improving output quality, more tractable than making agents self-critical within a single context.</p></li><li><p><strong>Fresh context windows:</strong> Rather than relying on context compaction alone, the harness gives each agent a clean context window per iteration, eliminating &#8220;context anxiety&#8221; where models prematurely wrap up long tasks.</p></li><li><p><strong>Quality at cost:</strong> A complex retro game maker built with the full harness demonstrated substantially better quality than solo attempts, with working features, coherent design, and integrated AI capabilities, despite 20x higher costs.</p></li></ul><p><strong><a href="https://www.anthropic.com/engineering/harness-design-long-running-apps">Blog</a></strong></p>
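<p>The planner&#8211;generator&#8211;evaluator loop described above can be sketched in a few lines of Python. This is a minimal illustration under assumed names (<code>plan</code>, <code>generate</code>, and <code>evaluate</code> are stand-ins for fresh-context LLM calls), not Anthropic's actual harness code: each role runs in isolation, and the evaluator, never the generator itself, decides whether a feature's contract is met.</p>

```python
# Hypothetical sketch of a three-agent harness loop (assumed names, not
# Anthropic's real implementation). Each function stands in for an LLM
# call made with a fresh context window.

def plan(prompt):
    # Planner: expand a brief prompt into a list of feature contracts.
    return [f"{prompt}: feature {i}" for i in range(3)]

def generate(feature, codebase):
    # Generator: implement exactly one feature per iteration.
    return codebase + [f"implemented {feature}"]

def evaluate(feature, codebase):
    # Evaluator: check the contract in a context separate from the
    # generator's (separation of concerns).
    return f"implemented {feature}" in codebase

def run_harness(prompt):
    codebase = []
    for feature in plan(prompt):  # one fresh iteration per feature
        codebase = generate(feature, codebase)
        if not evaluate(feature, codebase):
            raise RuntimeError(f"contract failed: {feature}")
    return codebase
```

<p>In the production system each stub would be an agent with its own model call and tools (Playwright for the evaluator, a code-editing toolchain for the generator), but the control flow, plan once and then alternate generation with independent evaluation in clean contexts, is the part the blog credits for the quality gains.</p>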
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-hyperagents-multi">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (March 9 - March 15)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-b8c</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-b8c</guid><pubDate>Sun, 15 Mar 2026 15:02:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!XWY3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. OpenDev</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XWY3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XWY3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png 424w, https://substackcdn.com/image/fetch/$s_!XWY3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png 848w, https://substackcdn.com/image/fetch/$s_!XWY3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XWY3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XWY3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png" width="998" height="477" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0465e70-d947-488c-9565-9924593322a9_998x477.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:477,&quot;width&quot;:998,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;OpenDev&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="OpenDev" title="OpenDev" srcset="https://substackcdn.com/image/fetch/$s_!XWY3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png 424w, https://substackcdn.com/image/fetch/$s_!XWY3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png 848w, https://substackcdn.com/image/fetch/$s_!XWY3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XWY3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0465e70-d947-488c-9565-9924593322a9_998x477.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Terminal-native coding agents represent a fundamental shift in how developers interact with AI assistance.
OpenDev is an open-source, command-line coding agent that operates where developers already manage source control and deployment environments, and it is accompanied by a comprehensive 81-page technical report on scaffolding, harness design, context engineering, and lessons learned from building production coding agents.</p><ul><li><p><strong>Dual-agent architecture:</strong> OpenDev separates planning from execution through a compound AI system with workload-specialized model routing. Work is organized into concurrent sessions, each composed of multiple specialized sub-agents that independently bind to a user-configured LLM, enabling fine-grained model selection for different tasks.</p></li><li><p><strong>Adaptive context compaction:</strong> Effective autonomous assistance requires highly efficient context management to prevent context bloat and reasoning degradation. OpenDev implements lazy tool discovery and adaptive compaction of older observations, keeping the agent&#8217;s working memory lean as tasks grow in complexity.</p></li><li><p><strong>Automated project memory:</strong> The system incorporates automated memory for project-specific knowledge and event-driven reminders to prevent instruction fade-out. This ensures that the agent retains critical project context across sessions without manual intervention.</p></li><li><p><strong>Four-layer architecture:</strong> The system spans agent reasoning, context engineering, tooling, and persistence layers. This modular design provides a secure, extensible foundation for terminal-first AI assistance that can evolve independently at each layer.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.05344">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2030771811705872435">Tweet</a></strong></p><div><hr></div><h2><strong>2. 
AutoHarness</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zSBw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zSBw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png 424w, https://substackcdn.com/image/fetch/$s_!zSBw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png 848w, https://substackcdn.com/image/fetch/$s_!zSBw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png 1272w, https://substackcdn.com/image/fetch/$s_!zSBw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zSBw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png" width="528" height="250" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:250,&quot;width&quot;:528,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AutoHarness&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AutoHarness" title="AutoHarness" srcset="https://substackcdn.com/image/fetch/$s_!zSBw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png 424w, https://substackcdn.com/image/fetch/$s_!zSBw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png 848w, https://substackcdn.com/image/fetch/$s_!zSBw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png 1272w, https://substackcdn.com/image/fetch/$s_!zSBw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde7886f7-385e-48e8-91e2-45b5b24108ef_528x250.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Google DeepMind researchers introduce AutoHarness, a method for automatically synthesizing code harnesses that prevent LLM agents from making illegal actions. The core insight comes from a striking observation: in the Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves, not poor strategy.</p><ul><li><p><strong>Automatic harness synthesis:</strong> Rather than building complex rule systems by hand, AutoHarness lets Gemini-2.5-Flash automatically generate a code harness through a small number of iterative refinement rounds using feedback from the game environment. The harness acts as a programmatic constraint layer between the agent and the environment.</p></li><li><p><strong>Smaller models beat larger ones:</strong> The resulting harness enables the smaller Gemini-2.5-Flash to outperform much larger models including Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena single-player games. 
This shows that structured code constraints can compensate for limited raw model capability.</p></li><li><p><strong>Complete illegal move prevention:</strong> The synthesized harness successfully prevents all illegal moves across 145 different TextArena games, covering both single-player and two-player settings. This transforms a model that previously failed on most turns into a competitive agent.</p></li><li><p><strong>Cost-effective scaling:</strong> Using a smaller model to synthesize a custom code harness is not only more performant but also more cost-effective than simply deploying a larger model. This reframes the agent improvement problem from model scaling to harness engineering.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.03329">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2032110243665088950">Tweet</a></strong></p><div><hr></div><h2><strong>3. SkillNet</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JtHB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JtHB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png 424w, https://substackcdn.com/image/fetch/$s_!JtHB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png 848w, 
https://substackcdn.com/image/fetch/$s_!JtHB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png 1272w, https://substackcdn.com/image/fetch/$s_!JtHB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JtHB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png" width="793" height="282" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:282,&quot;width&quot;:793,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;SkillNet&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="SkillNet" title="SkillNet" srcset="https://substackcdn.com/image/fetch/$s_!JtHB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png 424w, https://substackcdn.com/image/fetch/$s_!JtHB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png 848w, 
https://substackcdn.com/image/fetch/$s_!JtHB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png 1272w, https://substackcdn.com/image/fetch/$s_!JtHB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f7aa8d-9ba7-4a36-b9ec-82c3506787c6_793x282.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>AI agents repeatedly rediscover solutions across separate scenarios instead of systematically reusing what they have already learned. 
SkillNet introduces an open infrastructure designed to create, evaluate, and organize AI skills at scale, enabling agents to transition from transient experience to durable mastery.</p><ul><li><p><strong>Unified skill ontology:</strong> Skills are structured within a unified ontology that supports creation from heterogeneous sources, including code libraries, prompt templates, and tool compositions. Rich relational connections between skills enable discovery and composition that would be impossible with flat skill stores.</p></li><li><p><strong>Multi-dimensional evaluation:</strong> Every skill is assessed across five dimensions: Safety, Completeness, Executability, Maintainability, and Cost-awareness. This systematic evaluation ensures that skills entering the repository meet quality thresholds before agents rely on them in production.</p></li><li><p><strong>Massive skill repository:</strong> SkillNet includes a repository of over 200,000 skills, an interactive platform for skill browsing and management, and a Python toolkit for programmatic access. This scale enables meaningful skill retrieval and composition across diverse task domains.</p></li><li><p><strong>Consistent agent improvements:</strong> Experimental evaluations on ALFWorld, WebShop, and ScienceWorld demonstrate that SkillNet significantly enhances agent performance, improving average rewards by 40% and reducing execution steps by 30% across multiple backbone models.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.04448">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2030692286317961280">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
The Spike, the Sparse and the Sink</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5OA_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5OA_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png 424w, https://substackcdn.com/image/fetch/$s_!5OA_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png 848w, https://substackcdn.com/image/fetch/$s_!5OA_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png 1272w, https://substackcdn.com/image/fetch/$s_!5OA_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5OA_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png" width="1018" height="648" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f710528-412e-4601-aedc-50462419c3dd_1018x648.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:648,&quot;width&quot;:1018,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Spike, the Sparse and the Sink&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Spike, the Sparse and the Sink" title="The Spike, the Sparse and the Sink" srcset="https://substackcdn.com/image/fetch/$s_!5OA_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png 424w, https://substackcdn.com/image/fetch/$s_!5OA_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png 848w, https://substackcdn.com/image/fetch/$s_!5OA_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png 1272w, https://substackcdn.com/image/fetch/$s_!5OA_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f710528-412e-4601-aedc-50462419c3dd_1018x648.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Yann LeCun and collaborators at NYU dissect two recurring phenomena in Transformer language models: massive activations, where a small number of tokens exhibit extreme outliers in specific channels, and attention sinks, where certain tokens attract disproportionate attention mass regardless of semantic relevance. The paper reveals that their co-occurrence is largely an architectural artifact.</p><ul><li><p><strong>Distinct operational scopes:</strong> Massive activations operate globally, inducing near-constant hidden representations that persist across layers and function as implicit model parameters. 
Attention sinks operate locally, modulating attention outputs across heads and biasing individual heads toward short-range dependencies.</p></li><li><p><strong>Pre-norm as the critical factor:</strong> The pre-norm configuration common in modern Transformers is identified as the key architectural element enabling the co-occurrence of these two phenomena. Removing pre-norm causes massive activations and attention sinks to decouple entirely.</p></li><li><p><strong>Practical implications for efficiency:</strong> Understanding these phenomena has direct consequences for model compression, quantization, and KV-cache optimization. Many efficiency techniques fail silently when they inadvertently disrupt massive activations or attention sinks, and this paper explains why.</p></li><li><p><strong>Not functionally necessary:</strong> The co-occurrence of spikes and sinks is a design-dependent artifact rather than a fundamental requirement for model performance. This opens the door to architectural modifications that could eliminate these phenomena without sacrificing capability.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.05498">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2030403147588604376">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IHU6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IHU6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!IHU6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IHU6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!IHU6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IHU6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!IHU6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!IHU6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IHU6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!IHU6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8338da11-9832-4d88-837a-d07559d1c6cc_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a>&#8221;. Learn how to leverage Claude Code features to vibe-code production-grade AI-powered apps.</p><p><strong><a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Enroll Now</a></strong></p><div><hr></div><h2><strong>5. KARL</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0EK1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0EK1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png 424w, https://substackcdn.com/image/fetch/$s_!0EK1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png 848w, https://substackcdn.com/image/fetch/$s_!0EK1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png 1272w, https://substackcdn.com/image/fetch/$s_!0EK1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!0EK1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png" width="798" height="250" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:250,&quot;width&quot;:798,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;KARL&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="KARL" title="KARL" srcset="https://substackcdn.com/image/fetch/$s_!0EK1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png 424w, https://substackcdn.com/image/fetch/$s_!0EK1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png 848w, https://substackcdn.com/image/fetch/$s_!0EK1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png 1272w, https://substackcdn.com/image/fetch/$s_!0EK1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23403ef-d786-4d3e-b25c-fa918bf7ebc9_798x250.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Databricks presents KARL, a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. The work also introduces KARLBench, a new evaluation framework spanning six search domains.</p><ul><li><p><strong>New post-training paradigm (OAPL):</strong> The work concurrently develops OAPL, an iterative large-batch off-policy RL approach. 
Because the objective is designed to tolerate off-policy data, training is robust to discrepancies between the trainer and the inference engine without requiring heuristics such as clipped importance weighting or data deletion.</p></li><li><p><strong>Multi-task heterogeneous training:</strong> Rather than optimizing for a single benchmark, KARL trains across heterogeneous search behaviors including constraint-driven entity search, cross-document synthesis, tabular reasoning, entity retrieval, procedural reasoning, and fact aggregation. This produces substantially better generalization than single-benchmark optimization.</p></li><li><p><strong>Pareto-optimal performance:</strong> Starting from GLM 4.5 Air with varying levels of test-time scaling, KARL is Pareto-optimal on KARLBench when compared to Claude 4.6 and GPT 5.2 across both cost-quality and latency-quality tradeoffs.</p></li><li><p><strong>Scalable with test-time compute:</strong> KARL-BCP attains 59.6 on BrowseComp-Plus, which further improves to 70.4 with value-guided search. KARL-TREC reaches 85.0 on TREC-Biogen, the second-highest score overall. The system surpasses the strongest closed models given sufficient test-time compute.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.05218">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2030996795770433749">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
Memex(RL)</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7qR-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7qR-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png 424w, https://substackcdn.com/image/fetch/$s_!7qR-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png 848w, https://substackcdn.com/image/fetch/$s_!7qR-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png 1272w, https://substackcdn.com/image/fetch/$s_!7qR-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7qR-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png" width="674" height="322" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:322,&quot;width&quot;:674,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Memex(RL)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Memex(RL)" title="Memex(RL)" srcset="https://substackcdn.com/image/fetch/$s_!7qR-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png 424w, https://substackcdn.com/image/fetch/$s_!7qR-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png 848w, https://substackcdn.com/image/fetch/$s_!7qR-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png 1272w, https://substackcdn.com/image/fetch/$s_!7qR-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41536614-2d0b-4554-8707-4cd66d7625fb_674x322.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As tasks get longer and more complex, LLM agents lose track of what they have learned, what they have tried, and what still needs to be done. Memex(RL) introduces an indexed experience memory mechanism that scales agent capability on long-horizon tasks without discarding evidence or blowing up the context window.</p><ul><li><p><strong>Indexed experience memory:</strong> Rather than lossy compression, Memex maintains a compact working context consisting of concise structured summaries and stable indices while storing full-fidelity underlying interactions in an external experience database. The agent decides what to summarize, what to archive, how to index it, and when to retrieve it.</p></li><li><p><strong>RL-optimized memory operations:</strong> The MemexRL reinforcement learning framework optimizes both write and read behaviors with reward shaping tailored to indexed memory usage under a context budget. 
This teaches the agent to manage its own memory strategically rather than relying on fixed heuristics.</p></li><li><p><strong>Bounded retrieval complexity:</strong> Theoretical analysis demonstrates that Memex can maintain decision quality with bounded retrieval operations while keeping computational load manageable as task history grows. This makes the approach practical for tasks that span hundreds or thousands of steps.</p></li><li><p><strong>Smaller context, better results:</strong> Empirically, agents trained with MemexRL improve task success rates on challenging long-horizon tasks while using a significantly smaller working context than baseline approaches. Less context, used more intelligently, outperforms brute-force context expansion.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.04257">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2031006858971058537">Tweet</a></strong></p><div><hr></div><h2><strong>7. FlashAttention-4</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DHPV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DHPV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png 424w, https://substackcdn.com/image/fetch/$s_!DHPV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png 848w, 
https://substackcdn.com/image/fetch/$s_!DHPV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png 1272w, https://substackcdn.com/image/fetch/$s_!DHPV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DHPV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png" width="1456" height="805" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:805,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;FlashAttention-4&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="FlashAttention-4" title="FlashAttention-4" srcset="https://substackcdn.com/image/fetch/$s_!DHPV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png 424w, https://substackcdn.com/image/fetch/$s_!DHPV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png 848w, 
https://substackcdn.com/image/fetch/$s_!DHPV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png 1272w, https://substackcdn.com/image/fetch/$s_!DHPV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b19747-fb3b-44ea-b19a-6f3a41ffb4fd_4942x2732.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>FlashAttention-4 co-designs algorithms and kernel pipelines for the B200 and GB200 GPUs, which exhibit fundamentally different 
performance characteristics due to asymmetric hardware scaling where tensor core throughput doubles while other functional units scale more slowly.</p><ul><li><p><strong>Significant speedups on Blackwell:</strong> FlashAttention-4 achieves up to 1.3x speedup over cuDNN 9.13 and 2.7x over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s at 71% hardware utilization. These gains come from careful co-design rather than algorithmic changes alone.</p></li><li><p><strong>Asymmetric scaling solutions:</strong> The techniques include redesigned pipelines that exploit fully asynchronous matrix multiply operations and larger tile sizes, software-emulated exponential and conditional softmax rescaling, and leveraging tensor memory to reduce shared memory traffic.</p></li><li><p><strong>Python-native implementation:</strong> The entire system is implemented in CuTe-DSL embedded in Python, achieving 20-30x faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity. This dramatically lowers the barrier to kernel development.</p></li><li><p><strong>Hardware-algorithm co-design:</strong> The paper demonstrates that next-generation GPU architectures demand fundamentally new attention kernel designs rather than incremental optimizations of existing ones. Techniques that worked well on Hopper GPUs leave significant performance on the table on Blackwell.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.05451">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2030411164060889466">Tweet</a></strong></p><div><hr></div><h2><strong>8. STRUCTUREDAGENT</strong></h2><p>STRUCTUREDAGENT introduces a hierarchical planning framework for long-horizon web tasks using dynamic AND/OR trees. The framework separates planning responsibilities: the system constructs and maintains the planning tree while the LLM is invoked only for local operations like node expansion or repair. 
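</p><p>The division of labor described above can be sketched as a toy AND/OR planning tree, in which the framework owns tree construction and bookkeeping while a pluggable expansion function, standing in for the LLM, is consulted only to expand one open leaf at a time. Everything below (the names and the decomposition table) is an illustrative assumption, not code from the paper:</p>

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    goal: str
    kind: str = "OR"      # "AND": all children must succeed; "OR": any child suffices
    children: list = field(default_factory=list)
    solved: bool = False

def next_open_leaf(node):
    """Framework-side bookkeeping: find an unsolved leaf, skipping solved subtrees."""
    if node.solved:
        return None
    if not node.children:
        return node
    for child in node.children:
        leaf = next_open_leaf(child)
        if leaf is not None:
            return leaf
    return None

def propagate(node):
    """Recompute solved status bottom-up according to AND/OR semantics."""
    if node.children:
        states = [propagate(c) for c in node.children]
        node.solved = all(states) if node.kind == "AND" else any(states)
    return node.solved

def solve(root, expand_fn, max_expansions=50):
    """The system maintains the tree; expand_fn (the LLM) is only asked to
    expand a single leaf into (kind, subgoals) -- an empty list marks it primitive."""
    for _ in range(max_expansions):
        leaf = next_open_leaf(root)
        if leaf is None:
            break
        kind, subgoals = expand_fn(leaf.goal)
        if subgoals:
            leaf.kind = kind
            leaf.children = [Node(g) for g in subgoals]
        else:
            leaf.solved = True   # primitive step: directly executable
        propagate(root)
    return root.solved

def mock_llm(goal):
    # Hypothetical local decomposition, standing in for an LLM call.
    plan = {
        "buy a laptop": ("AND", ["find product", "check out"]),
        "find product": ("OR", ["search by name", "browse category"]),
    }
    return plan.get(goal, ("AND", []))
```

<p>Because the tree, not the model, tracks global state, a failed subtree can be re-expanded locally without replanning from scratch.</p><p>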
A structured memory module tracks candidate solutions to improve constraint satisfaction. Results on WebVoyager, WebArena, and custom shopping benchmarks show improved performance over standard LLM-based web agents, with the added benefit of interpretable hierarchical plans that enable easier debugging and human intervention.</p><p><strong><a href="https://arxiv.org/abs/2603.05294">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2030681964664213509">Tweet</a></strong></p><div><hr></div><h2><strong>9. AgentIR</strong></h2><p>Deep research agents generate explicit reasoning before every search call, but existing retrievers completely ignore these rich signals about search intent and problem context. AgentIR introduces reasoning-aware retrieval that jointly embeds the agent&#8217;s reasoning trace alongside its query, along with DR-Synth, a data synthesis method for generating training data from standard QA datasets. On BrowseComp-Plus, AgentIR-4B achieves 68% accuracy with Tongyi-DeepResearch compared to 50% with conventional embedding models twice its size and 37% with BM25.</p><p><strong><a href="https://arxiv.org/abs/2603.04384">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2031726356292407366">Tweet</a></strong></p><div><hr></div><h2><strong>10. Think Harder or Know More</strong></h2><p>This paper investigates transformer models featuring both adaptive per-layer looping, where each block learns to iterate its hidden state via a learned halting mechanism, and gated memory banks that provide additional learned storage. The key finding is that looping primarily benefits mathematical reasoning while memory banks help recover performance on commonsense tasks. Combining both mechanisms yields a model that outperforms an iso-FLOP baseline with three times the number of layers on math benchmarks. 
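</p><p>The per-layer looping can be illustrated with an ACT-style halting sketch: a shared block is applied repeatedly, a halting unit emits a stop probability at each step, and the layer output is the halting-weighted mixture of intermediate states. The weights below are random stand-ins for learned parameters; this is a hedged sketch of the general mechanism, not the paper's implementation:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_block = rng.normal(scale=0.3, size=(d, d))   # stand-in for a transformer block
w_halt, b_halt = rng.normal(size=d), -1.0      # stand-in learned halting unit

def block(h):
    return np.tanh(h @ W_block)                # one iteration of the shared block

def adaptive_loop(h, max_steps=8, eps=0.01):
    """ACT-style halting: iterate until cumulative halting probability reaches
    1 - eps, then spend the probability remainder on the final state, so the
    mixture weights always sum to 1."""
    out, cum = np.zeros_like(h), 0.0
    for steps in range(1, max_steps + 1):
        h = block(h)
        p = float(1.0 / (1.0 + np.exp(-(h @ w_halt + b_halt))))  # halting prob
        if cum + p >= 1.0 - eps or steps == max_steps:
            out += (1.0 - cum) * h             # remainder weight for the last step
            return out, steps
        out += p * h
        cum += p

out, steps = adaptive_loop(np.ones(d))
```

<p>A layer that learns a strongly negative halting bias loops many times while one with a positive bias exits after a step or two, which is the "think harder" knob; the gated memory bank is the complementary "know more" knob.</p><p>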
Analysis of model internals reveals layer specialization: early layers loop minimally and access memory sparingly, while later layers do both more heavily.</p><p><strong><a href="https://arxiv.org/abs/2603.08391">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2032107624007876781">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Claude Code Review, AutoHarness, Perplexity Personal Computer, Cloudflare /crawl, Context7 CLI, and More]]></title><description><![CDATA[Claude Code Review, AutoHarness, Perplexity Personal Computer, Cloudflare /crawl, Context7 CLI, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-code-review</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-code-review</guid><pubDate>Sat, 14 Mar 2026 14:45:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!cwFo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Claude ships multi-agent Code Review</p></li><li><p>AutoHarness makes small agents beat large ones</p></li><li><p>Perplexity launches an always-on Personal Computer</p></li><li><p>Cloudflare ships a one-call /crawl endpoint</p></li><li><p>Context7 CLI brings docs to any agent</p></li><li><p>Andrew Ng launches Context Hub</p></li><li><p>Cursor Marketplace adds 30+ plugins</p></li><li><p>OpenAI shares Skills for Agents SDK</p></li><li><p>Google launches Gemini Embedding 2</p></li><li><p>Meta ships four MTIA chips in two years</p></li><li><p>Codex agent files taxes, catches $20K error</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2>Top Stories</h2><h3>Claude Code Review</h3><div class="captioned-image-container"><figure><a class="image-link 
image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IvHF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IvHF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!IvHF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IvHF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!IvHF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IvHF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Claude Code 
Review&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude Code Review" title="Claude Code Review" srcset="https://substackcdn.com/image/fetch/$s_!IvHF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!IvHF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IvHF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!IvHF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edf0eeb-b6c4-43de-955b-c3c11a8a9610_2000x1000.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Anthropic launched Code Review for Claude Code, an automated system that dispatches multiple AI agents to examine every pull request. Instead of a single pass, parallel agents identify potential issues, verify findings to eliminate false positives, and rank bugs by severity, delivering a consolidated overview comment plus targeted inline annotations.</p><ul><li><p><strong>Multi-agent architecture:</strong> The system runs parallel agents that scan, verify, and prioritize issues independently, producing both a summary comment and inline code annotations for specific problems.</p></li><li><p><strong>Scales with complexity:</strong> Review depth adjusts based on PR size. Large PRs (over 1,000 lines) received findings 84% of the time, averaging 7.5 issues per PR. Small PRs (under 50 lines) had findings 31% of the time.</p></li><li><p><strong>High precision:</strong> Less than 1% of flagged issues were marked incorrect by Anthropic engineers, with the system catching production-critical bugs that appeared routine in diffs.</p></li><li><p><strong>Pricing and access:</strong> Available now as a research preview for Team and Enterprise customers. 
Reviews average $15-25 per PR, billed on token usage, with configurable monthly caps and per-repo controls.</p></li></ul><p><strong><a href="https://claude.com/blog/code-review">Blog</a></strong></p><div><hr></div><h3>AutoHarness: Automated Agent Constraint Synthesis</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cwFo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cwFo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png 424w, https://substackcdn.com/image/fetch/$s_!cwFo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png 848w, https://substackcdn.com/image/fetch/$s_!cwFo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png 1272w, https://substackcdn.com/image/fetch/$s_!cwFo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cwFo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png" width="1456" height="814" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:231601,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nlp.elvissaravia.com/i/190904545?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cwFo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png 424w, https://substackcdn.com/image/fetch/$s_!cwFo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png 848w, https://substackcdn.com/image/fetch/$s_!cwFo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png 1272w, https://substackcdn.com/image/fetch/$s_!cwFo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fc7f14-b19d-4569-b09d-c33f72440674_1918x1072.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Researchers introduced AutoHarness, a technique that lets LLMs automatically synthesize protective code harnesses around themselves, preventing illegal actions without human-written constraints. Instead of relying on larger, more expensive models, the approach uses iterative code refinement with environmental feedback to generate custom safeguards that make smaller models outperform bigger unconstrained ones.</p><ul><li><p><strong>Massive illegal action problem:</strong> In a recent LLM chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. 
AutoHarness eliminates this class of failure entirely by generating harnesses that enforce valid actions across 145 different TextArena games.</p></li><li><p><strong>Small beats large:</strong> Gemini-2.5-Flash with a synthesized harness exceeded Gemini-2.5-Pro&#8217;s performance while reducing costs, demonstrating that proper constraints are more valuable than raw model scale for agent environments.</p></li><li><p><strong>Zero-shot generalization:</strong> The technique extends beyond game-playing to generating full policies in code, eliminating runtime LLM decision-making entirely and achieving higher rewards than GPT-5.2-High on certain benchmarks.</p></li><li><p><strong>Practical agent pattern:</strong> The core insight applies broadly to any agent deployment: rather than trusting a model to self-constrain, auto-generate a verified harness that makes illegal states unreachable, shifting safety from model behavior to environment design.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.03329">Paper</a></strong></p>
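<p>The harness pattern described above can be sketched in a few lines: wrap the model's action proposal in generated code that checks the environment's legal action set, so an illegal action can never reach the environment. The sketch below is a minimal, hypothetical illustration of that idea; the names (<code>synthesize_harness</code>, <code>propose</code>) are ours, not from the paper, and AutoHarness additionally refines the wrapper code itself from environment feedback.</p>

```python
import random


def synthesize_harness(legal_actions_fn):
    """Wrap an action-proposing model so illegal actions are unreachable.

    `legal_actions_fn(state)` returns the environment's legal actions for a
    state. The returned wrapper retries the model a few times, then falls
    back to an arbitrary legal action instead of forfeiting the game.
    (Illustrative sketch only, not the paper's implementation.)
    """
    def harnessed(propose, state, max_retries=3):
        legal = set(legal_actions_fn(state))
        for _ in range(max_retries):
            action = propose(state)  # stand-in for an LLM call
            if action in legal:
                return action
        # Fallback guarantees a legal move even if the model never complies.
        return random.choice(sorted(legal))
    return harnessed
```

<p>The point of the pattern is that validity is enforced by the environment-side wrapper, not by trusting the model to self-constrain.</p>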
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-code-review">
              Read more
          </a>
      </p>
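<p>The scan-verify-rank review flow described earlier in this issue can be illustrated with a small Python sketch. Plain functions stand in for the LLM agents so the control flow is runnable; this mirrors the described architecture, not Anthropic's actual implementation, and all names here are hypothetical.</p>

```python
from dataclasses import dataclass


@dataclass
class Finding:
    file: str
    line: int
    severity: int  # higher = more severe
    message: str


def review(diff, scanners, verifier):
    """Sketch of a scan -> verify -> rank code-review pipeline."""
    # 1. Scanning agents each propose candidate findings for the diff.
    candidates = [f for scan in scanners for f in scan(diff)]
    # 2. A verification pass drops likely false positives.
    confirmed = [f for f in candidates if verifier(diff, f)]
    # 3. Rank by severity for the consolidated summary comment,
    #    and key inline annotations by (file, line).
    ranked = sorted(confirmed, key=lambda f: -f.severity)
    summary = [f"{f.file}:{f.line} [sev {f.severity}] {f.message}" for f in ranked]
    inline = {(f.file, f.line): f.message for f in ranked}
    return summary, inline
```

<p>Separating verification from scanning is the design choice that buys precision: a noisy scanner is acceptable as long as the verifier filters its output before anything reaches the PR.</p>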
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (March 1 - March 8)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-8c6</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-8c6</guid><pubDate>Sun, 08 Mar 2026 15:01:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2M4x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. NeuroSkill</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2M4x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2M4x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png 424w, https://substackcdn.com/image/fetch/$s_!2M4x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png 848w, https://substackcdn.com/image/fetch/$s_!2M4x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2M4x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2M4x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png" width="1456" height="882" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:882,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;NeuroSkill&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="NeuroSkill" title="NeuroSkill" srcset="https://substackcdn.com/image/fetch/$s_!2M4x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png 424w, https://substackcdn.com/image/fetch/$s_!2M4x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png 848w, https://substackcdn.com/image/fetch/$s_!2M4x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2M4x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa275a0d7-3d12-45d2-b0f2-301c54c96f4b_2398x1452.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>MIT researchers introduce NeuroSkill, a real-time proactive agentic system that models human cognitive and emotional state by integrating Brain-Computer Interface (BCI) signals with foundation EXG models and text embeddings. 
Unlike reactive agents that wait for explicit commands, NeuroSkill operates proactively, interpreting biophysical and neural signals to anticipate user needs.</p><ul><li><p><strong>Custom agent harness - NeuroLoop:</strong> The system runs an agentic flow called NeuroLoop that engages with the user on multiple cognitive and affective levels, including empathy. It processes BCI signals through a foundation EXG model, converts them to state-of-mind descriptions, and uses those descriptions to drive actionable tool calls and protocol execution.</p></li><li><p><strong>Fully offline edge deployment:</strong> The entire system runs locally on edge devices with no network dependency. This is a significant design choice for both privacy and latency, enabling real-time responsiveness to shifting cognitive states without cloud round-trips.</p></li><li><p><strong>Proactive vs reactive interaction:</strong> NeuroSkill handles both explicit and implicit requests from the user. By continuously reading brain signals, it can detect confusion, cognitive overload, or emotional shifts and adjust its behavior before the user explicitly asks for help.</p></li><li><p><strong>Open-source with ethical licensing:</strong> Released under GPLv3 with an ethically aligned AI100 licensing framework for the skill markdown, making the system reproducible and auditable while enforcing responsible use guardrails.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.03212">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2029201212596519070">Tweet</a></strong></p><div><hr></div><h2><strong>2. 
Bayesian Teaching for LLMs</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e2LD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e2LD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png 424w, https://substackcdn.com/image/fetch/$s_!e2LD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png 848w, https://substackcdn.com/image/fetch/$s_!e2LD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png 1272w, https://substackcdn.com/image/fetch/$s_!e2LD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e2LD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png" width="997" height="542" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:542,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Bayesian Teaching for LLMs&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Bayesian Teaching for LLMs" title="Bayesian Teaching for LLMs" srcset="https://substackcdn.com/image/fetch/$s_!e2LD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png 424w, https://substackcdn.com/image/fetch/$s_!e2LD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png 848w, https://substackcdn.com/image/fetch/$s_!e2LD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png 1272w, https://substackcdn.com/image/fetch/$s_!e2LD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21307ff6-7ea5-48b9-8be7-a7c68828a8d9_997x542.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>Google researchers introduce a method to teach LLMs to reason like Bayesians by fine-tuning on interactions with a Bayesian Assistant that represents optimal probabilistic inference. LLMs normally fall far short of normative Bayesian reasoning, but this training approach dramatically improves their ability to update predictions based on new evidence.</p><ul><li><p><strong>Bayesian Assistant as teacher:</strong> The method constructs synthetic training data from interactions between users and an idealized Bayesian Assistant. By exposing the LLM to examples of optimal belief updating, the model learns to approximate Bayesian inference without any architectural changes.</p></li><li><p><strong>Generalization to new tasks:</strong> The trained models do not just memorize the training distributions. 
They generalize probabilistic reasoning to entirely new task types, suggesting that Bayesian inference can be instilled as a transferable capability through carefully designed fine-tuning data.</p></li><li><p><strong>Closing the gap with normative models:</strong> Before training, LLMs show systematic deviations from Bayesian predictions, including base rate neglect and conservatism. After Bayesian teaching, these biases are substantially reduced, bringing model predictions much closer to the normative standard.</p></li><li><p><strong>Data quality over model scale:</strong> The results reinforce a recurring theme in recent research: carefully curated training data can unlock capabilities that scale alone cannot. A smaller model trained on Bayesian interactions outperforms larger models reasoning from scratch.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2503.17523">Paper</a></strong> | <strong><a href="https://x.com/GoogleResearch/status/2029295018972778883?s=20">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
Why LLMs Form Geometric Representations</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aDcc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aDcc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png 424w, https://substackcdn.com/image/fetch/$s_!aDcc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png 848w, https://substackcdn.com/image/fetch/$s_!aDcc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png 1272w, https://substackcdn.com/image/fetch/$s_!aDcc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aDcc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png" width="793" height="489" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:489,&quot;width&quot;:793,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Why LLMs Form Geometric Representations&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Why LLMs Form Geometric Representations" title="Why LLMs Form Geometric Representations" srcset="https://substackcdn.com/image/fetch/$s_!aDcc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png 424w, https://substackcdn.com/image/fetch/$s_!aDcc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png 848w, https://substackcdn.com/image/fetch/$s_!aDcc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png 1272w, https://substackcdn.com/image/fetch/$s_!aDcc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d5cc5-a3fc-452f-9d26-3fdbd98cfa1e_793x489.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>LLMs spontaneously form striking geometric structures in their internal representations: calendar months organize into circles, historical years form spirals, and spatial coordinates align to recoverable manifolds. This paper proves these patterns are not the product of deep learning dynamics but emerge directly from symmetries in natural language statistics.</p><ul><li><p><strong>Translation symmetry as the root cause:</strong> The frequency with which any two months co-occur in text depends only on the time interval between them, not the months themselves. The authors prove this translation symmetry in co-occurrence statistics is sufficient to force circular geometry in learned representations.</p></li><li><p><strong>Analytical derivation of manifold geometry:</strong> Rather than just observing geometric structure post-hoc, the paper derives the exact manifold geometry from data statistics. 
For cyclic concepts like months or days of the week, the proof shows circular representations emerge as the optimal encoding under symmetric co-occurrence distributions.</p></li><li><p><strong>Spirals and rippled manifolds for continuums:</strong> Representations of continuous concepts like historical years or number lines organize into compact 1D manifolds with characteristic extrinsic curvature. These &#8220;rippled&#8221; structures are analytically predicted by the framework when the underlying latent variable is non-cyclic.</p></li><li><p><strong>Universal origin:</strong> The robustness of these geometric representations across different model architectures suggests a universal mechanism. Representational manifolds emerge whenever co-occurrence statistics are controlled by an underlying latent variable, regardless of model size or training details.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.15029">Paper</a></strong> | <strong><a href="https://x.com/che_shr_cat/status/2029626128566993201">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
Theory of Mind in Multi-Agent LLMs</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hed5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hed5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png 424w, https://substackcdn.com/image/fetch/$s_!hed5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png 848w, https://substackcdn.com/image/fetch/$s_!hed5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png 1272w, https://substackcdn.com/image/fetch/$s_!hed5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hed5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png" width="1456" height="528" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:528,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Theory of Mind in Multi-Agent LLMs&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Theory of Mind in Multi-Agent LLMs" title="Theory of Mind in Multi-Agent LLMs" srcset="https://substackcdn.com/image/fetch/$s_!hed5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png 424w, https://substackcdn.com/image/fetch/$s_!hed5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png 848w, https://substackcdn.com/image/fetch/$s_!hed5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png 1272w, https://substackcdn.com/image/fetch/$s_!hed5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F002f3594-7fe3-44c7-b155-04fb751a5308_3803x1378.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This work introduces a multi-agent architecture combining Theory of Mind (ToM), Belief-Desire-Intention (BDI) models, and symbolic solvers for logical verification, evaluating it on resource allocation problems across multiple LLMs. The central finding is counterintuitive: simply adding cognitive mechanisms does not automatically improve coordination.</p><ul><li><p><strong>Integrated cognitive architecture:</strong> The system combines ToM for modeling other agents&#8217; mental states, BDI frameworks for structuring internal beliefs, and symbolic solvers for formal logic verification. This layered approach attempts to replicate how humans reason about collaborative partners.</p></li><li><p><strong>Model capability matters more than mechanism:</strong> The effectiveness of ToM and internal beliefs varies significantly depending on the underlying LLM. 
Stronger models benefit from cognitive mechanisms, while weaker models can actually be confused by the additional reasoning overhead.</p></li><li><p><strong>Symbolic verification as a stabilizer:</strong> Integrating symbolic solvers for logical verification helps ground agent decisions in formal constraints. The interplay between symbolic verification and cognitive mechanisms remains largely underexplored across different LLM architectures.</p></li><li><p><strong>Practical implications for multi-agent design:</strong> For builders designing systems where agents must model each other&#8217;s beliefs, the key takeaway is to match cognitive complexity to model capability. Adding ToM to an underpowered model can hurt more than help.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2603.00142">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2028913061260935331">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4csq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4csq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4csq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!4csq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4csq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4csq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!4csq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4csq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!4csq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4csq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F475c589b-8bc6-4d98-9eb5-9a8f2df48126_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<strong><a 
href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a></strong>&#8221;. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.dair.ai/courses/build-apps-with-claude-code&quot;,&quot;text&quot;:&quot;Enroll Now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.dair.ai/courses/build-apps-with-claude-code"><span>Enroll Now</span></a></p><div><hr></div><h2><strong>5. Numina-Lean-Agent</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZACp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZACp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png 424w, https://substackcdn.com/image/fetch/$s_!ZACp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png 848w, https://substackcdn.com/image/fetch/$s_!ZACp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png 1272w, https://substackcdn.com/image/fetch/$s_!ZACp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZACp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png" width="752" height="335" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:335,&quot;width&quot;:752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Numina-Lean-Agent&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Numina-Lean-Agent" title="Numina-Lean-Agent" srcset="https://substackcdn.com/image/fetch/$s_!ZACp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png 424w, https://substackcdn.com/image/fetch/$s_!ZACp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png 848w, https://substackcdn.com/image/fetch/$s_!ZACp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png 1272w, https://substackcdn.com/image/fetch/$s_!ZACp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0522f6f3-0a78-4cac-8e3e-c59dfdbd0455_752x335.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Numina-Lean-Agent proposes a paradigm shift in automated theorem proving: instead of building complex, multi-component systems with heavy computational overhead, it directly uses a general coding agent as a formal math reasoner. Combining Claude Code with Numina-Lean-MCP, the system autonomously interacts with the Lean proof assistant while accessing theorem libraries and auxiliary reasoning tools.</p><ul><li><p><strong>General agent over specialized provers:</strong> Rather than training task-specific models, the system leverages a general-purpose coding agent. 
Performance improves simply by upgrading the base model, making the approach accessible and reproducible without expensive retraining pipelines.</p></li><li><p><strong>MCP-powered tool integration:</strong> The system uses Model Context Protocol for flexible extension, including Lean-LSP-MCP for proof assistant interaction, LeanDex for semantic theorem retrieval, and an informal prover for generating detailed proof strategies.</p></li><li><p><strong>State-of-the-art results:</strong> Using Claude Opus 4.5 as the base model, Numina-Lean-Agent solves all 12 problems on Putnam 2025, matching the best closed-source systems. It also successfully formalized the Brascamp-Lieb theorem through direct collaboration with mathematicians.</p></li><li><p><strong>Open-source release:</strong> The full system and all solutions are released on GitHub under Creative Commons BY 4.0, enabling direct reproduction and extension by the research community.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2601.14027">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2028591203579822112">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
ParamMem</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HX1U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HX1U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png 424w, https://substackcdn.com/image/fetch/$s_!HX1U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png 848w, https://substackcdn.com/image/fetch/$s_!HX1U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png 1272w, https://substackcdn.com/image/fetch/$s_!HX1U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HX1U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png" width="1456" height="925" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:925,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!HX1U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png 424w, https://substackcdn.com/image/fetch/$s_!HX1U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png 848w, https://substackcdn.com/image/fetch/$s_!HX1U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png 1272w, https://substackcdn.com/image/fetch/$s_!HX1U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a1efa3-906c-43c4-b022-f63adf8f2645_1710x1086.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Self-reflection enables language agents to iteratively refine solutions, but models tend to generate repetitive reflections that add noise instead of useful signal. ParamMem introduces a parametric memory module that encodes cross-sample reflection patterns into model parameters, enabling diverse reflection generation through temperature-controlled sampling.</p><ul><li><p><strong>Diversity correlates with success:</strong> Empirical analysis reveals a strong positive correlation between reflective diversity and task success. The core problem is that standard self-reflection produces near-identical outputs across iterations, limiting the agent&#8217;s ability to explore alternative solution paths.</p></li><li><p><strong>Three-tier memory architecture:</strong> ParamAgent integrates parametric memory (cross-sample patterns encoded in parameters), episodic memory (individual task instances), and cross-sample memory (broader learning patterns). 
This combination captures both local task context and global reflection strategies.</p></li><li><p><strong>Weak-to-strong transfer:</strong> ParamMem is sample-efficient and supports transfer across model scales. Reflection patterns learned by smaller models can be applied to larger ones, enabling self-improvement without reliance on stronger external models.</p></li><li><p><strong>Consistent benchmark gains:</strong> Evaluated on code generation, mathematical reasoning, and multi-hop question answering, ParamMem consistently outperforms state-of-the-art baselines across all three domains.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.23320">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2028839081392939071">Tweet</a></strong></p><div><hr></div><h2><strong>7. Auton Agentic AI Framework</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vcXh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vcXh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png 424w, https://substackcdn.com/image/fetch/$s_!vcXh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png 848w, https://substackcdn.com/image/fetch/$s_!vcXh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vcXh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vcXh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png" width="1346" height="1134" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1134,&quot;width&quot;:1346,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!vcXh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png 424w, https://substackcdn.com/image/fetch/$s_!vcXh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png 848w, https://substackcdn.com/image/fetch/$s_!vcXh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vcXh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160b87c6-725e-4016-bfbf-dea9aaa8d4ce_1346x1134.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Snap Research introduces the Auton framework, a declarative architecture for specification, governance, and runtime execution of autonomous agent systems. 
It addresses a fundamental mismatch: LLMs produce stochastic, unstructured outputs, while backend infrastructure requires deterministic, schema-conformant inputs.</p><ul><li><p><strong>Cognitive Blueprint separation:</strong> The framework enforces a strict separation between the Cognitive Blueprint, a declarative, language-agnostic specification of agent identity and capabilities, and the Runtime Engine. This enables cross-language portability, formal auditability, and modular tool integration via Model Context Protocol.</p></li><li><p><strong>Formal agent execution model:</strong> Agent execution is formalized as an augmented Partially Observable Markov Decision Process with a latent reasoning space. This gives practitioners a rigorous foundation for reasoning about agent behavior, state transitions, and decision boundaries.</p></li><li><p><strong>Biologically-inspired memory:</strong> The architecture introduces hierarchical memory consolidation inspired by biological episodic memory systems, providing agents with structured long-term retention that mirrors how humans consolidate experiences into lasting knowledge.</p></li><li><p><strong>Runtime optimizations:</strong> Parallel graph execution, speculative inference, and dynamic context pruning reduce end-to-end latency for multi-step agent workflows. Safety is enforced through a constraint manifold formalism using policy projection rather than post-hoc filtering.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.23720">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2028480209033568475">Tweet</a></strong></p><div><hr></div><h2><strong>8. Reaching Agreement Among LLM Agents</strong></h2><p>This paper introduces Aegean, a consensus protocol that frames multi-agent refinement as a distributed consensus problem. 
Rather than relying on static heuristic workflows with fixed loop limits, Aegean terminates early once sufficient agents converge, achieving a 1.2-20x latency reduction across four mathematical reasoning benchmarks while maintaining answer quality within 2.5%. The consensus-aware serving engine performs incremental quorum detection across concurrent agent executions, cutting wasted compute on stragglers.</p><p><strong><a href="https://arxiv.org/abs/2512.20184">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2028823724196343923">Tweet</a></strong></p><div><hr></div><h2><strong>9. Diagnosing Agent Memory</strong></h2><p>This paper introduces a diagnostic framework that separates retrieval failures from utilization failures in LLM agent memory systems. Through a 3x3 factorial study crossing three write strategies with three retrieval methods, the authors find that retrieval is the dominant bottleneck, accounting for 11-46% of errors, while utilization failures remain stable at 4-8% regardless of configuration. Hybrid reranking cuts retrieval failures roughly in half, delivering larger gains than any write strategy optimization.</p><p><strong><a href="https://arxiv.org/abs/2603.02473">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2029202969456234562">Tweet</a></strong></p><div><hr></div><h2><strong>10. Phi-4-reasoning-vision-15B</strong></h2><p>Microsoft presents Phi-4-reasoning-vision-15B, a compact open-weight multimodal reasoning model that combines visual understanding with structured reasoning capabilities. Trained on just 200 billion tokens of multimodal data, the model excels at math and science reasoning as well as UI comprehension while requiring significantly less compute than comparable open-weight VLMs. 
The key insight is that systematic filtering, error correction, and synthetic augmentation remain the primary levers for model performance, pushing the Pareto frontier of the accuracy-compute tradeoff.</p><p><strong><a href="https://arxiv.org/abs/2603.03975">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2029926242640912429">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: AI Labor Market Impacts, Google Workspace CLI, GPT-5.4, Exa Deep, and More]]></title><description><![CDATA[AI Labor Market Impacts, Google Workspace CLI, GPT-5.4, Exa Deep, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-ai-labor-market</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-ai-labor-market</guid><pubDate>Sat, 07 Mar 2026 15:03:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!eY71!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Anthropic measures AI labor market displacement</p></li><li><p>Google ships Workspace CLI with agent skills</p></li><li><p>OpenAI launches GPT-5.4 with native computer use</p></li><li><p>Exa Deep puts an agent inside every search</p></li><li><p>Cognition previews SWE-1.6 training run</p></li><li><p>Gemini 3.1 Flash-Lite drops with big gains</p></li><li><p>Qwen 3.5 small model series released</p></li><li><p>Liquid AI releases LFM2-24B-A2B model</p></li><li><p>Cursor lands in JetBrains via ACP</p></li><li><p>OpenAI launches Codex Security agent</p></li><li><p>OpenAI publishes CoT Controllability research</p></li><li><p>Claude Opus hacks its own benchmark eval</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Labor 
Market Impacts of AI</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eY71!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eY71!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png 424w, https://substackcdn.com/image/fetch/$s_!eY71!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png 848w, https://substackcdn.com/image/fetch/$s_!eY71!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png 1272w, https://substackcdn.com/image/fetch/$s_!eY71!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eY71!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png" width="1456" height="1456" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Labor Market Impacts of AI&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Labor Market Impacts of AI" title="Labor Market Impacts of AI" srcset="https://substackcdn.com/image/fetch/$s_!eY71!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png 424w, https://substackcdn.com/image/fetch/$s_!eY71!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png 848w, https://substackcdn.com/image/fetch/$s_!eY71!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png 1272w, https://substackcdn.com/image/fetch/$s_!eY71!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f1eee70-43bc-4e2b-8297-96d1e7c6b42c_4096x4096.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Anthropic published a new framework for measuring AI&#8217;s labor market effects, introducing &#8220;observed exposure,&#8221; a metric that combines theoretical LLM capability with real-world Claude usage data from the Anthropic Economic Index. 
Unlike prior approaches that rely solely on theoretical task feasibility, this measure weights automated and work-related uses more heavily to better predict actual displacement risk.</p><ul><li><p><strong>Programmer exposure is highest:</strong> Computer programmers top the list at 75% task coverage, followed by customer service representatives and data entry keyers at 67%, reflecting the concentration of automated API usage in coding and support workflows.</p></li><li><p><strong>No unemployment signal yet:</strong> Using Current Population Survey data, the study finds no systematic increase in unemployment for workers in the most AI-exposed occupations since late 2022, though the framework could detect differential increases on the order of 1 percentage point.</p></li><li><p><strong>Youth hiring slowdown:</strong> There is suggestive evidence that hiring of workers aged 22-25 has slowed in exposed occupations, with a 14% drop in the job finding rate compared to 2022, echoing findings from Brynjolfsson et al. using ADP payroll data.</p></li><li><p><strong>Massive capability gap:</strong> AI is far from reaching its theoretical capability. 
Claude currently covers just 33% of all tasks in Computer and Math occupations, despite 94% being theoretically feasible, indicating significant room for future displacement as adoption deepens.</p></li></ul><p><strong><a href="https://www.anthropic.com/research/labor-market-impacts">Blog</a></strong></p><div><hr></div><h3><strong>Google Workspace CLI</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HgSx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HgSx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!HgSx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!HgSx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!HgSx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HgSx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png" width="1200" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Google Workspace CLI&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Google Workspace CLI" title="Google Workspace CLI" srcset="https://substackcdn.com/image/fetch/$s_!HgSx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!HgSx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!HgSx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!HgSx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f92aafc-943e-4ffe-a089-e35b847b9ddb_1200x600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Google released an official command-line tool for its Workspace APIs, providing a unified interface for Drive, Gmail, Calendar, Sheets, Docs, Chat, and Admin from a single binary. 
Written in Rust and distributed via npm, the CLI is dynamically built from Google&#8217;s Discovery Service and ships with over 100 agent skills and a built-in MCP server.</p><ul><li><p><strong>100+ agent skills:</strong> The repo includes SKILL.md files for every supported API plus higher-level helpers, with 50 curated recipes for common workflows across Gmail, Drive, Docs, Calendar, and Sheets.</p></li><li><p><strong>Built-in MCP server:</strong> AI assistants like Claude, Gemini, and OpenClaw can connect directly to the CLI&#8217;s MCP server and operate on Google Workspace programmatically, turning Workspace into a tool-callable environment for agents.</p></li><li><p><strong>Dynamic API coverage:</strong> Instead of hardcoding endpoints, the CLI generates commands at build time from Google&#8217;s Discovery Service, meaning it automatically picks up new APIs and updates as Google ships them.</p></li><li><p><strong>Agent-first design:</strong> Each skill includes structured metadata, input/output schemas, and example prompts, making it immediately usable by coding agents and AI-powered automation pipelines without custom integration work.</p></li></ul><p><strong><a href="https://github.com/googleworkspace/cli">GitHub</a></strong></p>
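<p>To make the MCP integration concrete, here is a minimal client-configuration sketch in the standard MCP <code>mcpServers</code> format used by assistants such as Claude. The binary name <code>gws</code> and the <code>mcp</code> subcommand are illustrative assumptions, not confirmed by the announcement; consult the repository's README for the actual invocation.</p><pre><code>{
  "mcpServers": {
    "google-workspace": {
      "command": "gws",
      "args": ["mcp"]
    }
  }
}</code></pre><p>Once registered this way, the assistant can discover the CLI's Workspace tools over MCP instead of each integration being wired up by hand.</p>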
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-ai-labor-market">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (February 23 - March 1)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-339</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-339</guid><pubDate>Sun, 01 Mar 2026 15:02:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!j_F0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. Deep-Thinking Tokens</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MP5E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MP5E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png 424w, https://substackcdn.com/image/fetch/$s_!MP5E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png 848w, https://substackcdn.com/image/fetch/$s_!MP5E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png 1272w, 
https://substackcdn.com/image/fetch/$s_!MP5E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MP5E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png" width="674" height="378" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6377f483-06c6-474f-b370-76edcc90ef81_674x378.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:378,&quot;width&quot;:674,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Deep-Thinking Tokens&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Deep-Thinking Tokens" title="Deep-Thinking Tokens" srcset="https://substackcdn.com/image/fetch/$s_!MP5E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png 424w, https://substackcdn.com/image/fetch/$s_!MP5E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png 848w, https://substackcdn.com/image/fetch/$s_!MP5E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png 1272w, 
https://substackcdn.com/image/fetch/$s_!MP5E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6377f483-06c6-474f-b370-76edcc90ef81_674x378.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Google researchers challenge the assumption that longer outputs indicate better reasoning. They introduce deep-thinking tokens, a metric that identifies tokens where internal model predictions shift significantly across layers before stabilizing. 
Unlike raw token count, which negatively correlates with accuracy (r = -0.59), the deep-thinking ratio shows a robust positive correlation (r = 0.683).</p><ul><li><p><strong>Deep-thinking ratio as a reasoning signal:</strong> For each generated token, intermediate-layer distributions are compared to the final-layer distribution using Jensen-Shannon divergence. A token qualifies as deep-thinking if its prediction only stabilizes in the final 15% of layers. This captures genuine computational effort rather than surface-level verbosity.</p></li><li><p><strong>Think@n test-time scaling:</strong> The authors introduce Think@n, a strategy that prioritizes samples with high deep-thinking ratios. It matches or exceeds standard self-consistency performance while cutting inference costs by approximately 50% through early rejection of unpromising generations based on just 50-token prefixes.</p></li><li><p><strong>Benchmark validation:</strong> Evaluated across AIME 24/25, HMMT 25, and GPQA-diamond with reasoning models including GPT-OSS, DeepSeek-R1, and Qwen3. The deep-thinking ratio consistently outperforms length-based and confidence-based baselines as a predictor of correctness.</p></li><li><p><strong>Practical implications:</strong> This reframes how we think about test-time compute. Instead of generating more tokens, we should focus on generating tokens that require deeper internal computation, enabling more efficient and accurate reasoning.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.13517">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2025239354327924833">Tweet</a></strong></p><div><hr></div><h2><strong>2. 
Codified Context</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vD5j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vD5j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png 424w, https://substackcdn.com/image/fetch/$s_!vD5j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png 848w, https://substackcdn.com/image/fetch/$s_!vD5j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png 1272w, https://substackcdn.com/image/fetch/$s_!vD5j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vD5j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png" width="1456" height="567" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8421822-3e07-49a8-a364-784f832ddad3_2040x794.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:567,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Codified Context&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Codified Context" title="Codified Context" srcset="https://substackcdn.com/image/fetch/$s_!vD5j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png 424w, https://substackcdn.com/image/fetch/$s_!vD5j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png 848w, https://substackcdn.com/image/fetch/$s_!vD5j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png 1272w, https://substackcdn.com/image/fetch/$s_!vD5j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8421822-3e07-49a8-a364-784f832ddad3_2040x794.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Single-file AGENTS.md manifests don&#8217;t scale beyond modest codebases. A 1,000-line prototype can be fully described in a single prompt, but a 100,000-line system cannot. This paper presents a three-component codified context infrastructure developed during construction of a 108,000-line C# distributed system, evaluated across 283 development sessions.</p><ul><li><p><strong>Hot-memory constitution:</strong> A living document encoding conventions, retrieval hooks, and orchestration protocols that the agent consults at the start of every session. This provides immediate awareness of project standards without requiring the agent to rediscover them through exploration.</p></li><li><p><strong>Domain-expert agents:</strong> 19 specialized agents, each owning a bounded domain of the codebase with its own context slice. 
Instead of one generalist agent trying to hold the entire project in context, tasks are routed to the agent with the deepest knowledge of the relevant subsystem.</p></li><li><p><strong>Cold-memory knowledge base:</strong> 34 on-demand specification documents that agents retrieve only when needed. This tiered approach keeps the active context lean while ensuring detailed specifications are always accessible for complex implementation decisions.</p></li><li><p><strong>Session continuity results:</strong> Across 283 sessions, the infrastructure demonstrates how context propagates between sessions, preventing the common pattern where agents forget conventions, repeat known mistakes, and lose coherence on long-running projects.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.20478">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2027770787659464812">Tweet</a></strong></p><div><hr></div><h2><strong>3. Discovering Multi-Agent Learning Algorithms with LLMs</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BWzd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BWzd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png 424w, https://substackcdn.com/image/fetch/$s_!BWzd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png 848w, 
https://substackcdn.com/image/fetch/$s_!BWzd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png 1272w, https://substackcdn.com/image/fetch/$s_!BWzd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BWzd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png" width="793" height="251" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:251,&quot;width&quot;:793,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Discovering Multi-Agent Learning Algorithms&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Discovering Multi-Agent Learning Algorithms" title="Discovering Multi-Agent Learning Algorithms" srcset="https://substackcdn.com/image/fetch/$s_!BWzd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png 424w, https://substackcdn.com/image/fetch/$s_!BWzd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png 848w, 
https://substackcdn.com/image/fetch/$s_!BWzd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png 1272w, https://substackcdn.com/image/fetch/$s_!BWzd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304b1f1d-11ac-41ca-aa67-d9c905ce38b4_793x251.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Google DeepMind uses AlphaEvolve, an evolutionary coding agent powered by LLMs, to automatically discover new multi-agent learning 
algorithms for imperfect-information games. Rather than relying on manual algorithm design, the system navigates vast algorithmic design spaces and discovers non-intuitive mechanisms that outperform state-of-the-art baselines.</p><ul><li><p><strong>VAD-CFR discovery:</strong> The system discovers a novel variant of iterative regret minimization featuring volatility-sensitive discounting and consistency-enforced optimism. VAD-CFR outperforms existing baselines like Discounted Predictive CFR+ on standard imperfect-information game benchmarks.</p></li><li><p><strong>SHOR-PSRO discovery:</strong> A population-based training algorithm variant that introduces a hybrid meta-solver blending Optimistic Regret Matching with temperature-controlled strategy distributions. This automates the transition from diversity exploration to equilibrium convergence.</p></li><li><p><strong>LLM-driven algorithmic evolution:</strong> AlphaEvolve generates candidate algorithm modifications, evaluates them on game-theoretic benchmarks, and iteratively refines the best variants. The discovered algorithms contain novel design choices that human researchers had not previously considered.</p></li><li><p><strong>Broader implications:</strong> This demonstrates that LLMs can serve as algorithmic designers, not just code generators. The approach could extend to discovering algorithms in other domains like optimization, scheduling, and resource allocation.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.16928">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2026044154040742150">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
Evaluating AGENTS.md</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6t4H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6t4H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png 424w, https://substackcdn.com/image/fetch/$s_!6t4H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png 848w, https://substackcdn.com/image/fetch/$s_!6t4H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png 1272w, https://substackcdn.com/image/fetch/$s_!6t4H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6t4H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png" width="896" height="304" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:304,&quot;width&quot;:896,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Evaluating AGENTS.md&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Evaluating AGENTS.md" title="Evaluating AGENTS.md" srcset="https://substackcdn.com/image/fetch/$s_!6t4H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png 424w, https://substackcdn.com/image/fetch/$s_!6t4H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png 848w, https://substackcdn.com/image/fetch/$s_!6t4H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png 1272w, https://substackcdn.com/image/fetch/$s_!6t4H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8145d46c-aff5-4761-b7c1-6a9755b14739_896x304.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This research evaluates whether AGENTS.md files, the repository-level context files that developers write to help AI coding agents understand their codebases, actually improve agent performance. Across four coding agents (Claude Code with Sonnet-4.5, Codex with GPT-5.2 and GPT-5.1 mini, and Qwen Code with Qwen3-30b-coder), the findings are counterintuitive.</p><ul><li><p><strong>Marginal gains, higher costs:</strong> Human-written AGENTS.md files provide a modest +4% improvement in some cases, but LLM-generated ones actually hurt performance by 2%. Both consistently increase inference cost by over 20%, making the cost-benefit tradeoff questionable.</p></li><li><p><strong>Broader exploration, worse outcomes:</strong> Context files cause agents to explore more code paths and consider more files, but this expansive behavior makes tasks harder rather than easier. 
The additional context introduces noise that dilutes task-relevant information.</p></li><li><p><strong>Lean is better:</strong> The study recommends that developer-written context files should contain only essential information. Unnecessary requirements, coding style preferences, and broad architectural descriptions complicate agent task completion without improving results.</p></li><li><p><strong>Practical guidance:</strong> For developers maintaining AGENTS.md files, the key takeaway is to keep them minimal and focused on critical constraints. Information density matters more than comprehensiveness for current coding agents.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.11988">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2026306141181898887">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pTww!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pTww!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pTww!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pTww!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!pTww!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pTww!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!pTww!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pTww!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pTww!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!pTww!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4d9295-ebaf-4af0-8dda-174e63f706ce_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a>&#8221;. 
Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.dair.ai/courses/build-apps-with-claude-code&quot;,&quot;text&quot;:&quot;Enroll Now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.dair.ai/courses/build-apps-with-claude-code"><span>Enroll Now</span></a></p><div><hr></div><h2><strong>5. PAHF</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BMIt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BMIt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png 424w, https://substackcdn.com/image/fetch/$s_!BMIt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png 848w, https://substackcdn.com/image/fetch/$s_!BMIt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png 1272w, https://substackcdn.com/image/fetch/$s_!BMIt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!BMIt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png" width="996" height="157" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:157,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;PAHF&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="PAHF" title="PAHF" srcset="https://substackcdn.com/image/fetch/$s_!BMIt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png 424w, https://substackcdn.com/image/fetch/$s_!BMIt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png 848w, https://substackcdn.com/image/fetch/$s_!BMIt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png 1272w, https://substackcdn.com/image/fetch/$s_!BMIt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28a405c3-b6ad-45a7-add5-4db5e1429257_996x157.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Meta introduces PAHF (Personalized Agents from Human Feedback), a continual agent personalization 
framework that addresses a critical gap: most AI agents cannot adapt to individual user preferences that evolve over time. PAHF couples explicit per-user memory with both proactive and reactive feedback mechanisms.</p><ul><li><p><strong>Three-step personalization loop:</strong> PAHF operates through (1) pre-action clarification to resolve ambiguity before acting, (2) grounding actions in preferences retrieved from persistent memory, and (3) integrating post-action feedback to update memory when preferences drift. This dual-feedback design captures both explicit and implicit signals.</p></li><li><p><strong>Continual learning through interaction:</strong> Unlike static fine-tuning approaches, PAHF enables agents to learn from live interactions. The explicit memory store allows agents to accumulate and revise user preference profiles without retraining, making personalization practical for production deployments.</p></li><li><p><strong>Novel benchmarks:</strong> The researchers develop two benchmarks in embodied manipulation and online shopping that specifically measure an agent&#8217;s ability to learn initial preferences from scratch and then adapt when those preferences shift over time.</p></li><li><p><strong>Strong results:</strong> PAHF learns substantially faster and consistently outperforms both no-memory and single-channel baselines. It reduces initial personalization error and enables rapid adaptation to persona shifts, demonstrating that the combination of memory and dual feedback channels is essential.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.16173">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2025242624790331520">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
Doc-to-LoRA</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j_F0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j_F0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png 424w, https://substackcdn.com/image/fetch/$s_!j_F0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png 848w, https://substackcdn.com/image/fetch/$s_!j_F0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png 1272w, https://substackcdn.com/image/fetch/$s_!j_F0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j_F0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png" width="1456" height="560" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:560,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Doc-to-LoRA&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Doc-to-LoRA" title="Doc-to-LoRA" srcset="https://substackcdn.com/image/fetch/$s_!j_F0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png 424w, https://substackcdn.com/image/fetch/$s_!j_F0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png 848w, https://substackcdn.com/image/fetch/$s_!j_F0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png 1272w, https://substackcdn.com/image/fetch/$s_!j_F0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279bc240-5408-4e2e-9326-ed1457dbb592_2096x806.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Sakana AI introduces Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to compress long documents into LoRA adapters in a single forward pass. Instead of processing long contexts through expensive quadratic attention, D2L converts the document into parameter-space representations that the target LLM can use without re-consuming the original text.</p><ul><li><p><strong>Single-pass context compression:</strong> D2L generates LoRA adapters from unseen documents in one forward pass. Once compressed, subsequent queries are handled using only the adapter weights, eliminating the need to re-process the full document and dramatically reducing both inference latency and KV-cache memory demands.</p></li><li><p><strong>Beyond native context windows:</strong> The method achieves near-perfect zero-shot accuracy on needle-in-a-haystack tasks at sequence lengths exceeding the target LLM&#8217;s native context window by over 4x. 
This suggests that parametric compression can effectively extend context capabilities without architectural changes.</p></li><li><p><strong>Real-world QA performance:</strong> On practical question-answering datasets, D2L outperforms standard long-context approaches while consuming less memory. The compressed representations retain enough information for accurate retrieval and reasoning across the full document.</p></li><li><p><strong>Practical deployment benefits:</strong> For applications requiring repeated queries over the same document (customer support, legal analysis, codebase understanding), D2L compresses the document once and amortizes the cost across all subsequent interactions.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.15902">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2027385998993420571">Tweet</a></strong></p><div><hr></div><h2><strong>7. AgentConductor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zzkl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zzkl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png 424w, https://substackcdn.com/image/fetch/$s_!Zzkl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png 848w, 
https://substackcdn.com/image/fetch/$s_!Zzkl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png 1272w, https://substackcdn.com/image/fetch/$s_!Zzkl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zzkl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png" width="996" height="635" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:635,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AgentConductor&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AgentConductor" title="AgentConductor" srcset="https://substackcdn.com/image/fetch/$s_!Zzkl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png 424w, https://substackcdn.com/image/fetch/$s_!Zzkl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png 848w, 
https://substackcdn.com/image/fetch/$s_!Zzkl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png 1272w, https://substackcdn.com/image/fetch/$s_!Zzkl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfeca76b-858a-49ce-8a39-19cacc15281f_996x635.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>AgentConductor introduces a reinforcement learning-enhanced multi-agent system for code generation that dynamically generates 
interaction topologies based on task characteristics. Rather than using fixed communication patterns between agents, an LLM-based orchestrator adapts the topology to match problem complexity, achieving state-of-the-art accuracy across five code generation datasets.</p><ul><li><p><strong>Task-adapted topologies:</strong> The orchestrator constructs density-aware layered directed acyclic graph (DAG) topologies tailored to problem difficulty. Simple problems get sparse topologies with minimal communication overhead, while complex problems get denser multi-agent collaboration.</p></li><li><p><strong>Topological density control:</strong> A novel density function and a difficulty-interval partitioning mechanism control how much agents communicate. This directly addresses the problem of redundant interactions that waste tokens without improving solution quality.</p></li><li><p><strong>Strong performance gains:</strong> AgentConductor outperforms the strongest baseline by up to 14.6% in pass@1 accuracy with 13% density reduction and 68% token cost reduction. The system achieves better results while using significantly fewer computational resources.</p></li><li><p><strong>Execution feedback refinement:</strong> Topologies are refined using execution feedback from code tests. When initial solutions fail, the orchestrator adjusts the collaboration structure based on error patterns, enabling adaptive recovery.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.17100">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2027030406441341227">Tweet</a></strong></p><div><hr></div><h2><strong>8. ActionEngine</strong></h2><p>Georgia Tech and Microsoft Research introduce ActionEngine, a training-free framework that transforms GUI agents from reactive step-by-step executors into programmatic planners. 
It builds a state-machine memory through offline exploration, then synthesizes executable Python programs for task completion, achieving 95% success on Reddit tasks from WebArena with a single LLM call on average, reducing costs by 11.8x and latency by 2x compared to vision-only baselines.</p><p><strong><a href="https://arxiv.org/abs/2602.20502">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2026678090815123594">Tweet</a></strong></p><div><hr></div><h2><strong>9. CoT Faithfulness via REMUL</strong></h2><p>Researchers propose REMUL, a training approach for making chain-of-thought reasoning more faithful and monitorable. A speaker model generates reasoning traces that multiple listener models attempt to follow and complete, using RL to reward reasoning that is understandable to other models. Tested across BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO, REMUL improves three faithfulness metrics while also boosting overall accuracy, producing shorter and more direct reasoning chains.</p><p><strong><a href="https://arxiv.org/abs/2602.16154">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2026043400861122709">Tweet</a></strong></p><div><hr></div><h2><strong>10. Learning to Rewrite Tool Descriptions</strong></h2><p>Intuit AI Research addresses a bottleneck in LLM-agent tool use: tool descriptions are written for humans, not agents. They introduce Trace-Free+, a curriculum learning framework that optimizes tool descriptions without relying on execution traces. 
The approach delivers consistent gains on unseen tools, strong cross-domain generalization, and robustness as the number of candidate tools scales to over 100, demonstrating that improving tool interfaces is a practical complement to agent fine-tuning.</p><p><strong><a href="https://arxiv.org/abs/2602.20426">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2026676835539628465">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Evaluating AGENTS.md, Perplexity Computer, Nano Banana 2, Doc-to-LoRA, Hermes Agent, Mercury 2, and More]]></title><description><![CDATA[Evaluating AGENTS.md, Perplexity Computer, Nano Banana 2, Doc-to-LoRA, Hermes Agent, Mercury 2, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-evaluating-agentsmd</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-evaluating-agentsmd</guid><pubDate>Sat, 28 Feb 2026 15:02:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-XGl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>AGENTS.md files hurt coding agent performance</p></li><li><p>Perplexity launches Computer for end-to-end tasks</p></li><li><p>Google launches Nano Banana 2 for free</p></li><li><p>Sakana AI ships Doc-to-LoRA and Text-to-LoRA</p></li><li><p>Notion launches Custom Agents in 3.3</p></li><li><p>Nous Research releases Hermes Agent open source</p></li><li><p>GPT-5.3-Codex available for all developers</p></li><li><p>Mercury 2 ships reasoning diffusion LLM</p></li><li><p>Qwen 3.5 medium model series drops</p></li><li><p>Claude Code ships auto-memory across sessions</p></li><li><p>RoguePilot exposes GitHub Copilot vulnerability</p></li><li><p>Vercel open-sources Chat SDK for multi-platform 
bots</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Evaluating AGENTS.md: Are Context Files Helpful for Coding Agents?</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-XGl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-XGl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png 424w, https://substackcdn.com/image/fetch/$s_!-XGl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png 848w, https://substackcdn.com/image/fetch/$s_!-XGl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png 1272w, https://substackcdn.com/image/fetch/$s_!-XGl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-XGl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png" width="896" height="304" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:304,&quot;width&quot;:896,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Evaluating AGENTS.md&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Evaluating AGENTS.md" title="Evaluating AGENTS.md" srcset="https://substackcdn.com/image/fetch/$s_!-XGl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png 424w, https://substackcdn.com/image/fetch/$s_!-XGl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png 848w, https://substackcdn.com/image/fetch/$s_!-XGl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png 1272w, https://substackcdn.com/image/fetch/$s_!-XGl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd73370-96f7-460b-813e-dfb6f23abad6_896x304.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Researchers from UIUC and Microsoft Research evaluated whether repository-level context files like AGENTS.md actually improve coding agent performance. 
The counterintuitive finding: LLM-generated context files reduce task success rates compared to providing no context at all, while increasing inference costs by over 20%.</p><ul><li><p><strong>Lower success rates:</strong> LLM-generated context files caused agents to solve fewer tasks on SWE-bench than agents given no repository context, challenging the widely adopted practice of writing detailed agent instructions.</p></li><li><p><strong>Broader but less effective exploration:</strong> Context files prompted agents to explore more thoroughly, with more testing and file traversal, but the additional constraints made tasks harder rather than easier.</p></li><li><p><strong>Minimal is better:</strong> The authors recommend that context files describe only minimal requirements rather than comprehensive specifications, as unnecessary constraints actively hurt agent performance.</p></li><li><p><strong>Practical implications:</strong> The findings suggest developers should rethink how they structure AGENTS.md, CLAUDE.md, and similar context files, focusing on essential guardrails rather than exhaustive instructions.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.11988">Paper</a></strong></p>
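<p>The &#8220;minimal is better&#8221; recommendation is concrete enough to act on. As a rough illustration (the commands and paths below are hypothetical, not drawn from the paper), a pared-down AGENTS.md might contain little more than:</p><pre><code># AGENTS.md

## Setup
- Install dependencies with `uv sync`

## Testing
- Run `uv run pytest` before finishing a task

## Constraints
- Do not edit generated files under `build/`
</code></pre><p>Directory listings, codebase overviews, and style guidance are exactly the kind of detail the paper found agents pay for in extra reasoning tokens without solving more tasks.</p>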
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-evaluating-agentsmd">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Does AGENTS.md Actually Help Coding Agents?]]></title><description><![CDATA[A New Study Has Answers]]></description><link>https://nlp.elvissaravia.com/p/does-agentsmd-actually-help-coding</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/does-agentsmd-actually-help-coding</guid><dc:creator><![CDATA[elvis]]></dc:creator><pubDate>Thu, 26 Feb 2026 16:03:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dmVe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b563fd7-3743-457d-a94e-d042e9f81b28_1920x635.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every serious coding project I run now has a CLAUDE.md or AGENTS.md at the root. It tells the agent which commands to run, what conventions to follow, and which files to avoid. I, like many other AI engineers, assumed that this makes the agent meaningfully better. Most people building with coding agents have made the same assumption.</p><p>A new paper from ETH Zurich&#8217;s SRI Lab puts that assumption to a rigorous test. The short answer is that it&#8217;s complicated, and the details are worth understanding if you work with coding agents regularly.</p><p>The paper, <a href="https://arxiv.org/abs/2602.11988">Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?</a>, runs Claude Code, Codex, and Qwen Code through hundreds of real GitHub issues, comparing what happens when agents get a context file versus when they don&#8217;t. The results are not what most of us would expect.</p><p>So what actually happens when you hand an agent a CLAUDE.md or AGENTS.md? 
Let&#8217;s break it down.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dmVe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b563fd7-3743-457d-a94e-d042e9f81b28_1920x635.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dmVe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b563fd7-3743-457d-a94e-d042e9f81b28_1920x635.png 424w, https://substackcdn.com/image/fetch/$s_!dmVe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b563fd7-3743-457d-a94e-d042e9f81b28_1920x635.png 848w, https://substackcdn.com/image/fetch/$s_!dmVe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b563fd7-3743-457d-a94e-d042e9f81b28_1920x635.png 1272w, https://substackcdn.com/image/fetch/$s_!dmVe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b563fd7-3743-457d-a94e-d042e9f81b28_1920x635.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dmVe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b563fd7-3743-457d-a94e-d042e9f81b28_1920x635.png" width="1456" height="482" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1b563fd7-3743-457d-a94e-d042e9f81b28_1920x635.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:482,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!dmVe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b563fd7-3743-457d-a94e-d042e9f81b28_1920x635.png 424w, https://substackcdn.com/image/fetch/$s_!dmVe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b563fd7-3743-457d-a94e-d042e9f81b28_1920x635.png 848w, https://substackcdn.com/image/fetch/$s_!dmVe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b563fd7-3743-457d-a94e-d042e9f81b28_1920x635.png 1272w, https://substackcdn.com/image/fetch/$s_!dmVe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b563fd7-3743-457d-a94e-d042e9f81b28_1920x635.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>The Problem</strong></h2><p>Context files (AGENTS.md, CLAUDE.md, CONTRIBUTING.md variations) have proliferated alongside coding agents. The idea is intuitive. If you tell the agent how this repo works, it should do better. Which commands to run, which linting tools to use, and what the test setup looks like.</p><p>The problem is that nobody has measured whether this intuition holds. Adoption outpaced evaluation. Developers write these files, agents read them, and we&#8217;ve operated on faith that the relationship is positive.</p><p>The deeper issue is that measuring this properly requires a benchmark that includes repositories with existing, developer-written context files. SWE-bench, the standard coding agent benchmark, mostly covers popular repositories. Popular repositories tend not to have context files, because they&#8217;ve accumulated documentation in other forms. 
The typical benchmark environment doesn&#8217;t reflect how context files actually get used.</p><h2><strong>A New Benchmark Built Around Context Files</strong></h2><p>The paper introduces AGENTbench alongside its SWE-bench Lite comparisons. AGENTbench contains 138 instances drawn from 12 less-popular Python repositories, all of which have developer-written context files already in place. These are real-world repos where maintainers chose to write guidance for automated agents.</p><p>The context files in AGENTbench are substantial. They average 641 words across 9.7 sections. These aren&#8217;t one-liners saying &#8220;use pytest.&#8221; They&#8217;re detailed guides covering project structure, tooling preferences, workflow conventions, and testing requirements.</p><p>Three agents were evaluated across both benchmarks.</p><ul><li><p>Claude Code (Sonnet-4.5)</p></li><li><p>Codex (GPT-5.2 and GPT-5.1 mini)</p></li><li><p>Qwen Code (Qwen3-30b-coder)</p></li></ul><p>Each agent ran on tasks with no context file, with an LLM-generated context file, and with a developer-written context file.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G7qM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb71fb4-eeaa-478c-96de-bdb35be02fca_996x641.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G7qM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb71fb4-eeaa-478c-96de-bdb35be02fca_996x641.png 424w, https://substackcdn.com/image/fetch/$s_!G7qM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb71fb4-eeaa-478c-96de-bdb35be02fca_996x641.png 848w, 
https://substackcdn.com/image/fetch/$s_!G7qM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb71fb4-eeaa-478c-96de-bdb35be02fca_996x641.png 1272w, https://substackcdn.com/image/fetch/$s_!G7qM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb71fb4-eeaa-478c-96de-bdb35be02fca_996x641.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G7qM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb71fb4-eeaa-478c-96de-bdb35be02fca_996x641.png" width="996" height="641" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3bb71fb4-eeaa-478c-96de-bdb35be02fca_996x641.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:641,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Distribution of AGENTbench instances across 12 Python repositories&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Distribution of AGENTbench instances across 12 Python repositories" title="Distribution of AGENTbench instances across 12 Python repositories" srcset="https://substackcdn.com/image/fetch/$s_!G7qM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb71fb4-eeaa-478c-96de-bdb35be02fca_996x641.png 424w, 
https://substackcdn.com/image/fetch/$s_!G7qM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb71fb4-eeaa-478c-96de-bdb35be02fca_996x641.png 848w, https://substackcdn.com/image/fetch/$s_!G7qM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb71fb4-eeaa-478c-96de-bdb35be02fca_996x641.png 1272w, https://substackcdn.com/image/fetch/$s_!G7qM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb71fb4-eeaa-478c-96de-bdb35be02fca_996x641.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>What the Numbers Show</strong></h2><p>The headline finding is that LLM-generated context files reduce task success rates compared to providing no repository context at all, while increasing inference cost by over 20%.</p><p>On SWE-bench Lite, LLM-generated files drop performance by 0.5% on average. On AGENTbench, the drop is 2%. Neither is catastrophic, but this is the wrong direction.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bf6q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d098bb-5cd4-4519-b104-a4878a7341fc_997x554.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bf6q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d098bb-5cd4-4519-b104-a4878a7341fc_997x554.png 424w, https://substackcdn.com/image/fetch/$s_!Bf6q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d098bb-5cd4-4519-b104-a4878a7341fc_997x554.png 848w, https://substackcdn.com/image/fetch/$s_!Bf6q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d098bb-5cd4-4519-b104-a4878a7341fc_997x554.png 1272w, https://substackcdn.com/image/fetch/$s_!Bf6q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d098bb-5cd4-4519-b104-a4878a7341fc_997x554.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Bf6q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d098bb-5cd4-4519-b104-a4878a7341fc_997x554.png" width="997" height="554" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37d098bb-5cd4-4519-b104-a4878a7341fc_997x554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Reasoning token usage increases with context files regardless of quality&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Reasoning token usage increases with context files regardless of quality" title="Reasoning token usage increases with context files regardless of quality" srcset="https://substackcdn.com/image/fetch/$s_!Bf6q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d098bb-5cd4-4519-b104-a4878a7341fc_997x554.png 424w, https://substackcdn.com/image/fetch/$s_!Bf6q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d098bb-5cd4-4519-b104-a4878a7341fc_997x554.png 848w, https://substackcdn.com/image/fetch/$s_!Bf6q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d098bb-5cd4-4519-b104-a4878a7341fc_997x554.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Bf6q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d098bb-5cd4-4519-b104-a4878a7341fc_997x554.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The cost story is consistent across all conditions. Whether the context file is human-written or auto-generated, agents spend 14-22% more reasoning tokens and take 2-4 additional steps to complete tasks. 
Following instructions costs compute, regardless of whether those instructions help.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jNPH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d5db3-6935-4efc-991b-7a2c1dbf9a3b_1466x892.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jNPH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d5db3-6935-4efc-991b-7a2c1dbf9a3b_1466x892.png 424w, https://substackcdn.com/image/fetch/$s_!jNPH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d5db3-6935-4efc-991b-7a2c1dbf9a3b_1466x892.png 848w, https://substackcdn.com/image/fetch/$s_!jNPH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d5db3-6935-4efc-991b-7a2c1dbf9a3b_1466x892.png 1272w, https://substackcdn.com/image/fetch/$s_!jNPH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d5db3-6935-4efc-991b-7a2c1dbf9a3b_1466x892.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jNPH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d5db3-6935-4efc-991b-7a2c1dbf9a3b_1466x892.png" width="1456" height="886" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e1d5db3-6935-4efc-991b-7a2c1dbf9a3b_1466x892.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:886,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!jNPH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d5db3-6935-4efc-991b-7a2c1dbf9a3b_1466x892.png 424w, https://substackcdn.com/image/fetch/$s_!jNPH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d5db3-6935-4efc-991b-7a2c1dbf9a3b_1466x892.png 848w, https://substackcdn.com/image/fetch/$s_!jNPH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d5db3-6935-4efc-991b-7a2c1dbf9a3b_1466x892.png 1272w, https://substackcdn.com/image/fetch/$s_!jNPH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d5db3-6935-4efc-991b-7a2c1dbf9a3b_1466x892.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vI7g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930b281-65df-4992-a79a-2fc1f3b57374_798x486.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vI7g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930b281-65df-4992-a79a-2fc1f3b57374_798x486.png 424w, https://substackcdn.com/image/fetch/$s_!vI7g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930b281-65df-4992-a79a-2fc1f3b57374_798x486.png 848w, 
https://substackcdn.com/image/fetch/$s_!vI7g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930b281-65df-4992-a79a-2fc1f3b57374_798x486.png 1272w, https://substackcdn.com/image/fetch/$s_!vI7g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930b281-65df-4992-a79a-2fc1f3b57374_798x486.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vI7g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930b281-65df-4992-a79a-2fc1f3b57374_798x486.png" width="798" height="486" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b930b281-65df-4992-a79a-2fc1f3b57374_798x486.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:486,&quot;width&quot;:798,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Success rates on SWE-bench Lite. LLM-generated files consistently underperform the no-context baseline&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Success rates on SWE-bench Lite. LLM-generated files consistently underperform the no-context baseline" title="Success rates on SWE-bench Lite. 
LLM-generated files consistently underperform the no-context baseline" srcset="https://substackcdn.com/image/fetch/$s_!vI7g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930b281-65df-4992-a79a-2fc1f3b57374_798x486.png 424w, https://substackcdn.com/image/fetch/$s_!vI7g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930b281-65df-4992-a79a-2fc1f3b57374_798x486.png 848w, https://substackcdn.com/image/fetch/$s_!vI7g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930b281-65df-4992-a79a-2fc1f3b57374_798x486.png 1272w, https://substackcdn.com/image/fetch/$s_!vI7g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930b281-65df-4992-a79a-2fc1f3b57374_798x486.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" 
viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Human-written context files tell a different story, producing a 4% improvement over no context on average across both benchmarks. That&#8217;s a meaningful gain, and it&#8217;s the number that explains why context files persist. On the right benchmark, with the right files, they do work.</p><p>But there&#8217;s a catch worth examining.</p><h2><strong>The Exploration Paradox</strong></h2><p>Agents follow context file instructions faithfully. That part is not in question. When a context file mentions using <code>uv</code> as the package manager, <code>uv</code> usage jumps to 1.6 times per instance on average, compared to fewer than 0.01 times without it. When it specifies a testing framework, agents switch to it. 
The instruction-following works.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ywFG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeebe6f3-796d-4709-9714-84e0e2913fff_997x283.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ywFG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeebe6f3-796d-4709-9714-84e0e2913fff_997x283.png 424w, https://substackcdn.com/image/fetch/$s_!ywFG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeebe6f3-796d-4709-9714-84e0e2913fff_997x283.png 848w, https://substackcdn.com/image/fetch/$s_!ywFG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeebe6f3-796d-4709-9714-84e0e2913fff_997x283.png 1272w, https://substackcdn.com/image/fetch/$s_!ywFG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeebe6f3-796d-4709-9714-84e0e2913fff_997x283.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ywFG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeebe6f3-796d-4709-9714-84e0e2913fff_997x283.png" width="997" height="283" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aeebe6f3-796d-4709-9714-84e0e2913fff_997x283.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:283,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;How 
context files change tool usage across agents. Instruction-following is strong but doesn't guarantee success&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="How context files change tool usage across agents. Instruction-following is strong but doesn't guarantee success" title="How context files change tool usage across agents. Instruction-following is strong but doesn't guarantee success" srcset="https://substackcdn.com/image/fetch/$s_!ywFG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeebe6f3-796d-4709-9714-84e0e2913fff_997x283.png 424w, https://substackcdn.com/image/fetch/$s_!ywFG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeebe6f3-796d-4709-9714-84e0e2913fff_997x283.png 848w, https://substackcdn.com/image/fetch/$s_!ywFG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeebe6f3-796d-4709-9714-84e0e2913fff_997x283.png 1272w, https://substackcdn.com/image/fetch/$s_!ywFG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeebe6f3-796d-4709-9714-84e0e2913fff_997x283.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 
7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>What doesn&#8217;t follow is that instruction-following translates to success. Agents that receive context files run more tests, search more files, traverse more of the repository, and generate more reasoning output. They explore more thoroughly. But thorough exploration isn&#8217;t the same as correct exploration.</p><p>The paper&#8217;s analysis of traces shows that detailed directory enumerations and codebase overviews, which 100% of LLM-generated context files include, don&#8217;t meaningfully reduce the number of steps before agents reach the relevant files. The agent still has to find the right place in the code. A map of the whole city doesn&#8217;t tell you which building to walk into.</p><p>This is the core tension. Agents are instruction-following systems. Give them more instructions, and they&#8217;ll follow more instructions. 
But more activity isn&#8217;t the same as better activity.</p><h2><strong>Why Human Files Win (On Their Turf)</strong></h2><p>The difference between human-written and auto-generated context files comes down to redundancy.</p><p>LLM-generated files tend to reproduce information already available elsewhere in the repository, like READMEs, documentation folders, and existing CONTRIBUTING.md files. The paper tested this directly. When documentation files (.md files, docs/) were removed from repositories before generating context files, LLM-generated files improved by 2.7% and actually outperformed human-written ones. The content that made auto-generated files counterproductive was redundant content.</p><p>Human-written context files, by contrast, tend to contain information that doesn&#8217;t exist elsewhere. Maintainers write them to capture things that aren&#8217;t obvious from the code, like the specific tooling decisions they&#8217;ve made, the quirks of their CI setup, and the non-default conventions they&#8217;ve adopted. This is additive information.</p><p>The practical implication is that context files are useful to the extent they tell agents something they couldn&#8217;t figure out from the repository itself. Codebase overviews and workflow summaries don&#8217;t clear that bar. Specific tooling requirements often do.</p><h2><strong>Current Limitations</strong></h2><p>The evaluation is limited to Python repositories. Whether these patterns hold for TypeScript, Rust, or multi-language codebases is an open question.</p><p>The benchmarks also only measure issue resolution. Context files might have other effects that aren&#8217;t captured here, such as security, consistency, and adherence to project-specific conventions that don&#8217;t show up in whether a PR gets merged. A context file that reduces hallucinated library usage might be valuable even if it doesn&#8217;t move the success rate number.</p><p>The longitudinal picture is missing by necessity. 
Context files are recent enough that you can&#8217;t study how their quality evolves over time, or how agents might improve at using them as training data catches up to their adoption.</p><h2><strong>What This Means Going Forward</strong></h2><p>A few threads worth thinking through.</p><p><strong>Write for the gap, not the overview.</strong> The clearest practical takeaway is that context files should encode what the repository doesn&#8217;t already explain. Tool choices that diverge from defaults. Non-obvious test configurations. Constraints that aren&#8217;t apparent from reading the code. A CLAUDE.md that restates the README is probably hurting more than helping.</p><p><strong>Evaluation methodology matters for context file design.</strong> The paper&#8217;s finding that auto-generated files hurt on standard repos but human files help on niche repos suggests the effect is highly dependent on the information environment. Teams building repositories with good existing documentation may find context files redundant by default. Teams with sparse documentation or unusual tooling stacks have more to gain.</p><p><strong>The cost floor is real.</strong> Every context file adds 20% to inference cost, regardless of quality. For high-volume agentic pipelines, that&#8217;s not nothing. Whether the performance gains justify the cost depends on the quality of the file and the nature of the tasks.</p><p><strong>LLM-generated context files need a different approach.</strong> The redundancy problem in auto-generated files is fixable. A generator that explicitly avoids restating existing documentation and focuses instead on extracting non-obvious tooling decisions and conventions would likely perform meaningfully better. 
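One way to sketch that filtering idea (a minimal illustration under assumed heuristics, not the paper's method): compare each candidate line of a generated context file against the repository's existing documentation and keep only the lines that add new information.

```python
# Hypothetical redundancy filter for generated context-file lines.
# A candidate line is dropped when most of its tokens already appear
# in the repository's existing documentation (README, docs/, etc.).

def token_set(text: str) -> set[str]:
    """Lowercased word tokens with surrounding punctuation stripped."""
    toks = {w.lower().strip(".,:;()`") for w in text.split()}
    toks.discard("")
    return toks

def filter_redundant(candidate_lines: list[str], existing_docs: str,
                     max_overlap: float = 0.8) -> list[str]:
    """Keep candidate lines whose token overlap with existing docs is low."""
    doc_tokens = token_set(existing_docs)
    kept = []
    for line in candidate_lines:
        toks = token_set(line)
        if not toks:
            continue
        overlap = len(toks & doc_tokens) / len(toks)
        if overlap < max_overlap:  # line carries information the docs lack
            kept.append(line)
    return kept
```

A real generator would need semantic rather than lexical overlap, but the principle is the same: emit only what the repository does not already say.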
This is an obvious engineering improvement that the current generation of generators hasn&#8217;t made.</p><p>The deeper question the paper raises is about what instruction-following actually means for agents that are trying to accomplish tasks rather than just comply with directives. An agent that spends extra steps carefully following context file guidance about testing conventions, and then fails to fix the bug, has prioritized process over outcome. Getting that balance right is as much a training problem as a context file design problem.</p><h2><strong>Final Words and Resources</strong></h2><p>Context files are not magic, but they&#8217;re also not useless. The paper&#8217;s findings land in a genuinely useful place. Human-written files with specific, non-redundant information improve performance. Auto-generated files that reproduce existing documentation hurt performance. The mechanism in both cases is the same. Agents follow instructions, and the quality of the outcome depends entirely on the quality of the instructions.</p><p>For anyone writing AGENTS.md files regularly, the practical recommendation is to keep them minimal and specific. Describe the tools and conventions that aren&#8217;t obvious from the code. 
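A file in that spirit might look like this (hypothetical contents, for illustration only):

```
# AGENTS.md (hypothetical example)

## Tooling
- This repo uses `uv`, not pip; install dev dependencies with `uv sync`.

## Testing
- Run `uv run pytest -m "not slow"`; the full suite needs network access.

## Conventions
- Database migrations are hand-written; never auto-generate them.
```
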
Leave out what&#8217;s already in the README.</p><p>Resources</p><ul><li><p>Full paper: <a href="https://arxiv.org/abs/2602.11988">Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?</a></p></li><li><p>AGENTbench dataset: <a href="https://github.com/eth-sri/agentbench">github.com/eth-sri/agentbench</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (February 16-22)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-c98</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-c98</guid><pubDate>Sun, 22 Feb 2026 15:00:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!j2qR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc8bdf1-6f70-43fb-8924-e083c10003a2_996x593.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. 
Intelligent AI Delegation</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UNIS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7821f5c8-d0cb-4376-8ecd-6c1b56178810_5576x1520.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UNIS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7821f5c8-d0cb-4376-8ecd-6c1b56178810_5576x1520.png 424w, https://substackcdn.com/image/fetch/$s_!UNIS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7821f5c8-d0cb-4376-8ecd-6c1b56178810_5576x1520.png 848w, https://substackcdn.com/image/fetch/$s_!UNIS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7821f5c8-d0cb-4376-8ecd-6c1b56178810_5576x1520.png 1272w, https://substackcdn.com/image/fetch/$s_!UNIS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7821f5c8-d0cb-4376-8ecd-6c1b56178810_5576x1520.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UNIS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7821f5c8-d0cb-4376-8ecd-6c1b56178810_5576x1520.png" width="1456" height="397" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7821f5c8-d0cb-4376-8ecd-6c1b56178810_5576x1520.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:397,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Intelligent AI Delegation&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Intelligent AI Delegation" title="Intelligent AI Delegation" srcset="https://substackcdn.com/image/fetch/$s_!UNIS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7821f5c8-d0cb-4376-8ecd-6c1b56178810_5576x1520.png 424w, https://substackcdn.com/image/fetch/$s_!UNIS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7821f5c8-d0cb-4376-8ecd-6c1b56178810_5576x1520.png 848w, https://substackcdn.com/image/fetch/$s_!UNIS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7821f5c8-d0cb-4376-8ecd-6c1b56178810_5576x1520.png 1272w, https://substackcdn.com/image/fetch/$s_!UNIS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7821f5c8-d0cb-4376-8ecd-6c1b56178810_5576x1520.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Google DeepMind introduces a comprehensive framework for intelligent AI delegation that goes beyond simple task assignment. The framework models delegation as a sequence of decisions: whether to delegate, how to instruct, and how to verify and integrate AI outputs, addressing the gap between what AI agents can do and how humans should interact with them.</p><ul><li><p><strong>Adaptive delegation structure:</strong> The framework treats delegation as a dynamic process involving task allocation, transfer of authority, responsibility, and accountability. Rather than static heuristics, it enables real-time adaptation to environmental shifts and resilient failure management across both human and AI delegators.</p></li><li><p><strong>Trust calibration mechanisms:</strong> Introduces formal trust models that account for capability uncertainty, task complexity, and historical performance. 
This prevents both over-delegation (assigning tasks beyond agent capability) and under-delegation (failing to leverage available AI capacity).</p></li><li><p><strong>Verification and integration:</strong> Defines structured approaches for validating AI outputs before integration, including confidence-aware acceptance criteria and fallback protocols. This is critical for production deployments where blind trust in agent outputs creates compounding errors.</p></li><li><p><strong>Multi-agent delegation networks:</strong> Extends the framework to scenarios where AI agents delegate to other AI agents, creating delegation chains that require accountability tracking and authority propagation rules across the network.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.11865">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2023146815789597015">Tweet</a></strong></p><div><hr></div><h2><strong>2. Emergent Socialization in AI Agent Society</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3h-0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf37da9e-3ca3-4e62-9c59-02f175b9afec_785x473.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3h-0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf37da9e-3ca3-4e62-9c59-02f175b9afec_785x473.png 424w, https://substackcdn.com/image/fetch/$s_!3h-0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf37da9e-3ca3-4e62-9c59-02f175b9afec_785x473.png 848w, 
https://substackcdn.com/image/fetch/$s_!3h-0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf37da9e-3ca3-4e62-9c59-02f175b9afec_785x473.png 1272w, https://substackcdn.com/image/fetch/$s_!3h-0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf37da9e-3ca3-4e62-9c59-02f175b9afec_785x473.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3h-0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf37da9e-3ca3-4e62-9c59-02f175b9afec_785x473.png" width="785" height="473" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df37da9e-3ca3-4e62-9c59-02f175b9afec_785x473.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:473,&quot;width&quot;:785,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Emergent Socialization&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Emergent Socialization" title="Emergent Socialization" srcset="https://substackcdn.com/image/fetch/$s_!3h-0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf37da9e-3ca3-4e62-9c59-02f175b9afec_785x473.png 424w, https://substackcdn.com/image/fetch/$s_!3h-0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf37da9e-3ca3-4e62-9c59-02f175b9afec_785x473.png 848w, 
https://substackcdn.com/image/fetch/$s_!3h-0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf37da9e-3ca3-4e62-9c59-02f175b9afec_785x473.png 1272w, https://substackcdn.com/image/fetch/$s_!3h-0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf37da9e-3ca3-4e62-9c59-02f175b9afec_785x473.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A study on Moltbook, a social network with no humans where all participants are LLM-driven agents, challenges the assumption that scale and interaction 
density alone produce meaningful social dynamics. The researchers find that while global semantic content stabilizes quickly, individual agents maintain diversity without converging, displaying strong individual inertia and minimal adaptive response to interaction partners.</p><ul><li><p><strong>Moltbook as a natural laboratory:</strong> Moltbook is the largest persistent, publicly accessible AI-only social platform with millions of LLM-driven agents interacting through posts, comments, and voting. This provides an unprecedented real-world testbed for studying emergent collective behavior without human intervention.</p></li><li><p><strong>Socialization measurement framework:</strong> The paper introduces metrics for semantic stabilization, lexical change, individual consistency, influence duration, and group consensus formation. These go beyond surface-level activity metrics to measure whether genuine social structures are forming.</p></li><li><p><strong>No emergent socialization:</strong> Despite massive scale and dense interactions, agents fail to develop stable social structures. They do not adapt to each other or form consensus, suggesting that current LLM architectures lack the mechanisms needed for genuine social learning.</p></li><li><p><strong>Shared memory as a prerequisite:</strong> The study concludes that shared memory is essential for developing stable social structures. Without persistent memory that allows agents to build on prior interactions, social dynamics remain superficial regardless of population size or interaction frequency.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.14299">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2023766916473733394">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
Lossless Context Management (LCM)</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xs9Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe77fd9a2-d53b-4b6c-94b6-7d70b93c4828_2481x1333.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xs9Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe77fd9a2-d53b-4b6c-94b6-7d70b93c4828_2481x1333.png 424w, https://substackcdn.com/image/fetch/$s_!Xs9Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe77fd9a2-d53b-4b6c-94b6-7d70b93c4828_2481x1333.png 848w, https://substackcdn.com/image/fetch/$s_!Xs9Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe77fd9a2-d53b-4b6c-94b6-7d70b93c4828_2481x1333.png 1272w, https://substackcdn.com/image/fetch/$s_!Xs9Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe77fd9a2-d53b-4b6c-94b6-7d70b93c4828_2481x1333.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xs9Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe77fd9a2-d53b-4b6c-94b6-7d70b93c4828_2481x1333.png" width="1456" height="782" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e77fd9a2-d53b-4b6c-94b6-7d70b93c4828_2481x1333.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:782,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Lossless Context Management&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Lossless Context Management" title="Lossless Context Management" srcset="https://substackcdn.com/image/fetch/$s_!Xs9Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe77fd9a2-d53b-4b6c-94b6-7d70b93c4828_2481x1333.png 424w, https://substackcdn.com/image/fetch/$s_!Xs9Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe77fd9a2-d53b-4b6c-94b6-7d70b93c4828_2481x1333.png 848w, https://substackcdn.com/image/fetch/$s_!Xs9Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe77fd9a2-d53b-4b6c-94b6-7d70b93c4828_2481x1333.png 1272w, https://substackcdn.com/image/fetch/$s_!Xs9Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe77fd9a2-d53b-4b6c-94b6-7d70b93c4828_2481x1333.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Lossless Context Management (LCM) is a deterministic architecture for LLM memory that outperforms Claude Code on long-context tasks. Benchmarked on the OOLONG eval using Opus 4.6, the LCM-augmented coding agent Volt achieves higher scores than Claude Code at every context length between 32K and 1M tokens. LCM extends the recursive paradigm pioneered by Recursive Language Models (RLMs) with two engine-managed mechanisms.</p><ul><li><p><strong>Recursive context compression:</strong> As the active context window fills, older messages are compacted into a hierarchical summary DAG while retaining lossless pointers to every original message. This trades flexibility for termination guarantees and zero-cost continuity on short tasks.</p></li><li><p><strong>Recursive task partitioning:</strong> Engine-managed parallel primitives like LLM-Map replace model-written loops, analogous to the move from GOTO to structured control flow. 
This ensures deterministic execution and lossless retrievability of all prior states.</p></li><li><p><strong>Three-level escalation:</strong> LCM reduces context overflow via a structured fallback: summary nodes for older messages, compact file references for large inputs, and a guaranteed convergence mechanism that prevents runaway context growth.</p></li><li><p><strong>Outperforms Claude Code:</strong> On OOLONG, Volt with LCM achieves +29.2 average improvement over raw Opus 4.6, compared to +24.7 for Claude Code. The advantage is largest at 1M tokens (+51.3 vs +47.0), demonstrating that deterministic context management scales better than native file-system access at extreme lengths.</p></li></ul><p><strong><a href="https://papers.voltropy.com/LCM">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2023765147970662761">Tweet</a></strong></p><div><hr></div><h4><em>Message from the Editor</em></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RSjo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6566e84d-373a-48d2-b686-d8b0e5e87171_2626x1504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RSjo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6566e84d-373a-48d2-b686-d8b0e5e87171_2626x1504.png 424w, https://substackcdn.com/image/fetch/$s_!RSjo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6566e84d-373a-48d2-b686-d8b0e5e87171_2626x1504.png 848w, 
https://substackcdn.com/image/fetch/$s_!RSjo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6566e84d-373a-48d2-b686-d8b0e5e87171_2626x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!RSjo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6566e84d-373a-48d2-b686-d8b0e5e87171_2626x1504.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RSjo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6566e84d-373a-48d2-b686-d8b0e5e87171_2626x1504.png" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6566e84d-373a-48d2-b686-d8b0e5e87171_2626x1504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:577762,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nlp.elvissaravia.com/i/188722428?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6566e84d-373a-48d2-b686-d8b0e5e87171_2626x1504.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RSjo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6566e84d-373a-48d2-b686-d8b0e5e87171_2626x1504.png 424w, 
https://substackcdn.com/image/fetch/$s_!RSjo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6566e84d-373a-48d2-b686-d8b0e5e87171_2626x1504.png 848w, https://substackcdn.com/image/fetch/$s_!RSjo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6566e84d-373a-48d2-b686-d8b0e5e87171_2626x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!RSjo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6566e84d-373a-48d2-b686-d8b0e5e87171_2626x1504.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a>&#8221;. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.dair.ai/courses/build-apps-with-claude-code&quot;,&quot;text&quot;:&quot;Enroll Now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.dair.ai/courses/build-apps-with-claude-code"><span>Enroll Now</span></a></p><div><hr></div><h2><strong>4. GLM-5</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j2qR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc8bdf1-6f70-43fb-8924-e083c10003a2_996x593.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j2qR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc8bdf1-6f70-43fb-8924-e083c10003a2_996x593.png 424w, https://substackcdn.com/image/fetch/$s_!j2qR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc8bdf1-6f70-43fb-8924-e083c10003a2_996x593.png 848w, https://substackcdn.com/image/fetch/$s_!j2qR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc8bdf1-6f70-43fb-8924-e083c10003a2_996x593.png 1272w, 
https://substackcdn.com/image/fetch/$s_!j2qR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc8bdf1-6f70-43fb-8924-e083c10003a2_996x593.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j2qR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc8bdf1-6f70-43fb-8924-e083c10003a2_996x593.png" width="996" height="593" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4cc8bdf1-6f70-43fb-8924-e083c10003a2_996x593.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:593,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;GLM-5&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="GLM-5" title="GLM-5" srcset="https://substackcdn.com/image/fetch/$s_!j2qR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc8bdf1-6f70-43fb-8924-e083c10003a2_996x593.png 424w, https://substackcdn.com/image/fetch/$s_!j2qR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc8bdf1-6f70-43fb-8924-e083c10003a2_996x593.png 848w, https://substackcdn.com/image/fetch/$s_!j2qR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc8bdf1-6f70-43fb-8924-e083c10003a2_996x593.png 1272w, 
https://substackcdn.com/image/fetch/$s_!j2qR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc8bdf1-6f70-43fb-8924-e083c10003a2_996x593.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>GLM-5 is a foundation model from Zhipu AI designed to transition from vibe coding to agentic engineering. 
The model introduces novel asynchronous agent RL algorithms that separate generation from training for improved efficiency, and uses DSA technology to reduce computational requirements while preserving long-context understanding.</p><ul><li><p><strong>Asynchronous agent RL:</strong> The training infrastructure decouples trajectory generation from policy optimization, enabling parallel scaling of both components. This addresses a key bottleneck in agent RL where sequential generate-train loops limit throughput and experimentation speed.</p></li><li><p><strong>Agentic engineering focus:</strong> GLM-5 targets end-to-end software engineering tasks rather than isolated code generation. The model handles project-level context, multi-file edits, and iterative development cycles that reflect real production workflows.</p></li><li><p><strong>DSA compression:</strong> The model&#8217;s Distributed Sparse Attention mechanism reduces computational overhead for long-context processing without quality degradation. This allows the model to maintain full project-level context during extended development sessions.</p></li><li><p><strong>Strong benchmark results:</strong> GLM-5 demonstrates exceptional performance on real-world software engineering projects, surpassing earlier systems on end-to-end development tasks, including specification understanding, implementation, testing, and debugging.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.15763">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2024122246688878644">Tweet</a></strong></p><div><hr></div><h2><strong>5. 
MemoryArena</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iVb5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b7de74-4d5a-4a15-ab81-ecad4fa7cc6a_997x806.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iVb5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b7de74-4d5a-4a15-ab81-ecad4fa7cc6a_997x806.png 424w, https://substackcdn.com/image/fetch/$s_!iVb5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b7de74-4d5a-4a15-ab81-ecad4fa7cc6a_997x806.png 848w, https://substackcdn.com/image/fetch/$s_!iVb5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b7de74-4d5a-4a15-ab81-ecad4fa7cc6a_997x806.png 1272w, https://substackcdn.com/image/fetch/$s_!iVb5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b7de74-4d5a-4a15-ab81-ecad4fa7cc6a_997x806.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iVb5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b7de74-4d5a-4a15-ab81-ecad4fa7cc6a_997x806.png" width="997" height="806" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5b7de74-4d5a-4a15-ab81-ecad4fa7cc6a_997x806.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:806,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;MemoryArena&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="MemoryArena" title="MemoryArena" srcset="https://substackcdn.com/image/fetch/$s_!iVb5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b7de74-4d5a-4a15-ab81-ecad4fa7cc6a_997x806.png 424w, https://substackcdn.com/image/fetch/$s_!iVb5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b7de74-4d5a-4a15-ab81-ecad4fa7cc6a_997x806.png 848w, https://substackcdn.com/image/fetch/$s_!iVb5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b7de74-4d5a-4a15-ab81-ecad4fa7cc6a_997x806.png 1272w, https://substackcdn.com/image/fetch/$s_!iVb5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b7de74-4d5a-4a15-ab81-ecad4fa7cc6a_997x806.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>MemoryArena introduces a benchmark for evaluating how agents utilize memory across multiple interconnected sessions. The key finding is that scoring well on memory recall does not mean an agent can actually use that memory to take correct actions across sessions. Models with near-saturated performance on existing benchmarks like LoCoMo perform poorly in agentic multi-session settings.</p><ul><li><p><strong>Agentic memory evaluation:</strong> Unlike standard memory benchmarks that test recall in isolation, MemoryArena evaluates whether agents can retrieve and apply relevant past experience to make correct decisions in new contexts. This exposes a gap between retrieval accuracy and actionable memory use.</p></li><li><p><strong>Interdependent multi-session tasks:</strong> The benchmark spans web navigation, constrained planning, information retrieval, and logical reasoning, where decisions in one session depend on information gathered in previous sessions. 
This reflects real-world agent deployments where sessions are not independent.</p></li><li><p><strong>Exposing evaluation blind spots:</strong> Agents achieving near-perfect scores on LoCoMo and other long-context benchmarks show significant performance drops on MemoryArena. This suggests current evaluations overestimate agent memory capabilities by testing retrieval without testing downstream decision quality.</p></li><li><p><strong>Practical implications:</strong> For developers building persistent agents, MemoryArena provides a more realistic assessment of whether memory systems actually improve task completion rather than just information access.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.16313">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2024491176259363013">Tweet</a></strong></p><div><hr></div><h2><strong>6. MAPLE</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YlsM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0190d0-63ca-49ab-ba39-8e8015606394_7562x5725.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YlsM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0190d0-63ca-49ab-ba39-8e8015606394_7562x5725.png 424w, https://substackcdn.com/image/fetch/$s_!YlsM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0190d0-63ca-49ab-ba39-8e8015606394_7562x5725.png 848w, https://substackcdn.com/image/fetch/$s_!YlsM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0190d0-63ca-49ab-ba39-8e8015606394_7562x5725.png 1272w, 
https://substackcdn.com/image/fetch/$s_!YlsM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0190d0-63ca-49ab-ba39-8e8015606394_7562x5725.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YlsM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0190d0-63ca-49ab-ba39-8e8015606394_7562x5725.png" width="1456" height="1102" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f0190d0-63ca-49ab-ba39-8e8015606394_7562x5725.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1102,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;MAPLE&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="MAPLE" title="MAPLE" srcset="https://substackcdn.com/image/fetch/$s_!YlsM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0190d0-63ca-49ab-ba39-8e8015606394_7562x5725.png 424w, https://substackcdn.com/image/fetch/$s_!YlsM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0190d0-63ca-49ab-ba39-8e8015606394_7562x5725.png 848w, https://substackcdn.com/image/fetch/$s_!YlsM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0190d0-63ca-49ab-ba39-8e8015606394_7562x5725.png 1272w, 
https://substackcdn.com/image/fetch/$s_!YlsM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0190d0-63ca-49ab-ba39-8e8015606394_7562x5725.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>MAPLE proposes separating memory, learning, and personalization into specialized sub-agents rather than treating them as a unified capability. 
The framework achieves a 14.6% improvement in personalization scores over stateless baselines and increases trait incorporation from 45% to 75%, validated through the MAPLE-Personas benchmark.</p><ul><li><p><strong>Sub-agent decomposition:</strong> Memory handles storage and retrieval infrastructure, Learning extracts intelligence from accumulated interactions asynchronously, and Personalization applies learned knowledge in real-time within finite context budgets. Each operates at different timescales with distinct objectives.</p></li><li><p><strong>Asynchronous learning:</strong> The Learning sub-agent processes interaction history offline, distilling patterns and preferences without consuming real-time context. This avoids the common problem of memory systems that flood the active context window with raw history.</p></li><li><p><strong>Context-budget-aware personalization:</strong> The Personalization sub-agent selects which learned knowledge to inject based on available context budget and current task relevance. This prevents context dilution while ensuring the most impactful personalizations are always applied.</p></li><li><p><strong>Benchmark validation:</strong> The MAPLE-Personas benchmark specifically evaluates whether agents can genuinely adapt to individual users over time, measuring trait incorporation and behavioral consistency across extended interaction sequences.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.13258">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2024117711253700637">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
SkillsBench</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qnyk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d755769-9e33-4897-a568-4b7a9a179dd9_997x1149.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qnyk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d755769-9e33-4897-a568-4b7a9a179dd9_997x1149.png 424w, https://substackcdn.com/image/fetch/$s_!qnyk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d755769-9e33-4897-a568-4b7a9a179dd9_997x1149.png 848w, https://substackcdn.com/image/fetch/$s_!qnyk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d755769-9e33-4897-a568-4b7a9a179dd9_997x1149.png 1272w, https://substackcdn.com/image/fetch/$s_!qnyk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d755769-9e33-4897-a568-4b7a9a179dd9_997x1149.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qnyk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d755769-9e33-4897-a568-4b7a9a179dd9_997x1149.png" width="997" height="1149" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d755769-9e33-4897-a568-4b7a9a179dd9_997x1149.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1149,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;SkillsBench&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="SkillsBench" title="SkillsBench" srcset="https://substackcdn.com/image/fetch/$s_!qnyk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d755769-9e33-4897-a568-4b7a9a179dd9_997x1149.png 424w, https://substackcdn.com/image/fetch/$s_!qnyk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d755769-9e33-4897-a568-4b7a9a179dd9_997x1149.png 848w, https://substackcdn.com/image/fetch/$s_!qnyk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d755769-9e33-4897-a568-4b7a9a179dd9_997x1149.png 1272w, https://substackcdn.com/image/fetch/$s_!qnyk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d755769-9e33-4897-a568-4b7a9a179dd9_997x1149.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>SkillsBench evaluates whether LLM agents can generate their own procedural knowledge across 86 tasks spanning 11 domains, with curated Skills and deterministic verifiers. Testing 7 agent-model configurations over 7,308 trajectories, the benchmark reveals a critical gap: agents benefit enormously from consuming procedural knowledge but cannot reliably author it themselves.</p><ul><li><p><strong>Curated skills boost performance significantly:</strong> Providing curated Skills raises the average pass rate by 16.2 percentage points, with effects varying dramatically by domain, from +4.5pp in Software Engineering to +51.9pp in Healthcare. This shows that skill quality and domain match matter more than having skills at all.</p></li><li><p><strong>Self-generated skills provide no benefit:</strong> On average, models that generate their own procedural knowledge show no improvement over having no skills. 
This finding is critical for self-improving agent architectures that assume models can bootstrap their own capabilities.</p></li><li><p><strong>Focused beats comprehensive:</strong> Skills with 2-3 focused modules outperform comprehensive documentation. This suggests that retrieval precision matters more than coverage when augmenting agents with procedural knowledge.</p></li><li><p><strong>Smaller models close the gap:</strong> Smaller models augmented with well-curated skills can match the performance of larger models operating without skill augmentation. This has direct cost implications for production agent deployments.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.12670">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2023511466759094630">Tweet</a></strong></p><div><hr></div><h2><strong>8. LongCLI-Bench</strong></h2><p>LongCLI-Bench benchmarks how well AI agents handle complex, extended tasks through command-line interfaces. Across 20 demanding tasks spanning initial development, feature expansion, error resolution, and code optimization, leading agents succeed less than 20% of the time. The study finds that most failures occur early in task execution, and human-agent collaboration through plan injection and interactive guidance yields significantly greater improvements than automated self-correction alone.</p><p><strong><a href="https://arxiv.org/abs/2602.14337">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2024115697702625597">Tweet</a></strong></p><div><hr></div><h2><strong>9. CogRouter</strong></h2><p>CogRouter enables adaptive reasoning depth for LLM agents by dynamically selecting from four hierarchical cognitive levels at each step, from instinctive responses to strategic planning. 
Using confidence-aware advantage reweighting during training, Qwen2.5-7B with CogRouter achieves 82.3% success rate on agentic benchmarks, substantially outperforming larger models while consuming fewer tokens by skipping heavy reasoning on routine steps.</p><p><strong><a href="https://arxiv.org/abs/2602.12662">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2023405531835277504">Tweet</a></strong></p><div><hr></div><h2><strong>10. Team of Thoughts</strong></h2><p>Team of Thoughts presents a multi-agent framework for efficient test-time scaling through orchestrated tool calling. The system uses an orchestrator tool design where agents with different capabilities are coordinated by a calibrated orchestrator. With self-assessment for tool agents and orchestrator calibration for identifying superior coordination models, Team of Thoughts achieves 96.67% on AIME24 and 72.53% on LiveCodeBench, substantially exceeding homogeneous baselines.</p><p><strong><a href="https://arxiv.org/abs/2602.16485">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2024490165725737077">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Claude Sonnet 4.6, Gemini 3.1 Pro, Stripe Minions, Cloudflare Code Mode, Qwen 3.5]]></title><description><![CDATA[Claude Sonnet 4.6, Gemini 3.1 Pro, Stripe Minions, Cloudflare Code Mode, Qwen 3.5]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-sonnet-46</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-sonnet-46</guid><pubDate>Sat, 21 Feb 2026 15:02:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!faqg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ba8e974-1c2c-4641-b5b4-ac1047ec2a26_2018x1384.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Anthropic releases Claude Sonnet 4.6 
</p></li><li><p>Google launches Gemini 3.1 Pro with 77% ARC-AGI-2</p></li><li><p>Stripe ships Minions coding agents at scale</p></li><li><p>Cloudflare ships Code Mode MCP with 99.9% token savings</p></li><li><p>Alibaba drops Qwen 3.5 with agentic vision</p></li><li><p>ggml.ai joins Hugging Face for local AI</p></li><li><p>Anthropic measures AI agent autonomy in practice</p></li><li><p>AI agent autonomously publishes a hit piece</p></li><li><p>dmux multiplexes AI coding agents in parallel</p></li><li><p>New benchmarks for agent memory and reliability</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Claude Sonnet 4.6</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C7sv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821a5600-183f-4ed6-8fc4-a2987bcddc11_3840x1948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C7sv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821a5600-183f-4ed6-8fc4-a2987bcddc11_3840x1948.png 424w, https://substackcdn.com/image/fetch/$s_!C7sv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821a5600-183f-4ed6-8fc4-a2987bcddc11_3840x1948.png 848w, https://substackcdn.com/image/fetch/$s_!C7sv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821a5600-183f-4ed6-8fc4-a2987bcddc11_3840x1948.png 1272w, 
https://substackcdn.com/image/fetch/$s_!C7sv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821a5600-183f-4ed6-8fc4-a2987bcddc11_3840x1948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C7sv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821a5600-183f-4ed6-8fc4-a2987bcddc11_3840x1948.png" width="1456" height="739" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/821a5600-183f-4ed6-8fc4-a2987bcddc11_3840x1948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:739,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Claude Sonnet 4.6&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude Sonnet 4.6" title="Claude Sonnet 4.6" srcset="https://substackcdn.com/image/fetch/$s_!C7sv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821a5600-183f-4ed6-8fc4-a2987bcddc11_3840x1948.png 424w, https://substackcdn.com/image/fetch/$s_!C7sv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821a5600-183f-4ed6-8fc4-a2987bcddc11_3840x1948.png 848w, https://substackcdn.com/image/fetch/$s_!C7sv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821a5600-183f-4ed6-8fc4-a2987bcddc11_3840x1948.png 1272w, 
https://substackcdn.com/image/fetch/$s_!C7sv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821a5600-183f-4ed6-8fc4-a2987bcddc11_3840x1948.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Anthropic launched Claude Sonnet 4.6 as the new default model for all Claude users on February 17, delivering massive gains in computer use and agentic capabilities that position it as the strongest coding and agent model in the Sonnet tier.</p><ul><li><p><strong>Computer use breakthrough:</strong> OSWorld scores jumped from 14.9% to 72.5%, a nearly 5x 
improvement that makes Sonnet 4.6 the most capable model for autonomous computer interaction and GUI-based agent workflows.</p></li><li><p><strong>1M token context window:</strong> Available in beta, the extended context enables agents to process entire codebases, long documents, and multi-session histories without losing track of earlier context.</p></li><li><p><strong>User preference:</strong> In blind A/B tests, users preferred Sonnet 4.6 over Sonnet 4.5 roughly 70% of the time, with particular strength in coding tasks, instruction following, and nuanced reasoning.</p></li><li><p><strong>Cost-efficient scaling:</strong> Sonnet 4.6 is priced at $3/$15 per million input/output tokens, making it accessible for high-volume agent deployments while delivering performance competitive with much larger models.</p></li></ul><p><strong><a href="https://www.anthropic.com/news/claude-sonnet-4-6">Blog</a></strong></p><div><hr></div><h3><strong>EVMBench: AI Agents vs. Smart Contract Security</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!faqg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ba8e974-1c2c-4641-b5b4-ac1047ec2a26_2018x1384.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!faqg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ba8e974-1c2c-4641-b5b4-ac1047ec2a26_2018x1384.png 424w, https://substackcdn.com/image/fetch/$s_!faqg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ba8e974-1c2c-4641-b5b4-ac1047ec2a26_2018x1384.png 848w, 
https://substackcdn.com/image/fetch/$s_!faqg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ba8e974-1c2c-4641-b5b4-ac1047ec2a26_2018x1384.png 1272w, https://substackcdn.com/image/fetch/$s_!faqg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ba8e974-1c2c-4641-b5b4-ac1047ec2a26_2018x1384.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!faqg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ba8e974-1c2c-4641-b5b4-ac1047ec2a26_2018x1384.png" width="1456" height="999" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ba8e974-1c2c-4641-b5b4-ac1047ec2a26_2018x1384.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:999,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:417696,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nlp.elvissaravia.com/i/188681826?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ba8e974-1c2c-4641-b5b4-ac1047ec2a26_2018x1384.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!faqg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ba8e974-1c2c-4641-b5b4-ac1047ec2a26_2018x1384.png 424w, 
https://substackcdn.com/image/fetch/$s_!faqg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ba8e974-1c2c-4641-b5b4-ac1047ec2a26_2018x1384.png 848w, https://substackcdn.com/image/fetch/$s_!faqg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ba8e974-1c2c-4641-b5b4-ac1047ec2a26_2018x1384.png 1272w, https://substackcdn.com/image/fetch/$s_!faqg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ba8e974-1c2c-4641-b5b4-ac1047ec2a26_2018x1384.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>OpenAI and Paradigm introduced EVMBench, a benchmark evaluating AI agents on detecting, patching, and exploiting high-severity smart contract vulnerabilities, spanning 120 curated cases drawn from 40 audits.</p><ul><li><p><strong>Exploit-first strength:</strong> Agents perform best in the exploit setting, where the objective is explicit (iterate until funds are drained), but struggle more on detect and patch tasks, where exhaustive auditing and maintaining full functionality are required.</p></li><li><p><strong>Real-world vulnerability sources:</strong> Most scenarios come from open code audit competitions, with additional cases drawn from security audits of Tempo, a purpose-built L1 blockchain for high-throughput stablecoin payments.</p></li><li><p><strong>Detection gaps:</strong> Agents sometimes stop after identifying a single issue rather than exhaustively auditing the codebase, highlighting a key limitation for deploying AI agents in security-critical workflows.</p></li></ul><p><strong><a href="https://openai.com/index/introducing-evmbench/">Blog</a></strong></p>
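<p>The detect/patch/exploit split above boils down to per-setting success rates. A minimal sketch of that bookkeeping is below; the task log, names, and scoring rule are illustrative only, not the actual EVMBench harness or API.</p>

```python
# Hypothetical sketch of EVMBench-style scoring: success rate per evaluation
# setting (detect / patch / exploit). Names and data are illustrative.
from collections import defaultdict

def success_rates(results):
    """results: iterable of (setting, passed) pairs -> {setting: pass fraction}."""
    totals, passes = defaultdict(int), defaultdict(int)
    for setting, passed in results:
        totals[setting] += 1
        passes[setting] += int(passed)
    return {s: passes[s] / totals[s] for s in totals}

# Toy run log mirroring the pattern reported above: strongest on exploit,
# weaker on detect and patch.
log = [("exploit", True), ("exploit", True), ("exploit", False),
       ("detect", True), ("detect", False), ("detect", False),
       ("patch", True), ("patch", False)]
rates = success_rates(log)
```

<p>A real harness would replace the toy log with verified on-chain outcomes (funds drained, tests passing after a patch), but the aggregation step is the same.</p>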
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-sonnet-46">
              Read more
          </a>
      </p>
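<p>As a quick aside on the Sonnet 4.6 pricing cited above ($3 input / $15 output per million tokens), a back-of-envelope cost estimate is a one-liner; the function name and token counts here are illustrative, not an official calculator.</p>

```python
# Rough cost estimate for Claude Sonnet 4.6 at $3/$15 per million
# input/output tokens (rates from the announcement above).
def sonnet_46_cost_usd(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * 3.0 + output_tokens / 1e6 * 15.0

# e.g. a hypothetical agent run with 200k input and 20k output tokens:
cost = sonnet_46_cost_usd(200_000, 20_000)  # 0.6 + 0.3 = 0.9 USD
```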
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (February 9-15)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-544</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-544</guid><pubDate>Sun, 15 Feb 2026 15:02:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fJi4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e16c234-e812-4ccb-a052-7c1b33ccaafa_2398x1042.png" length="0" type="image/png"/><content:encoded><![CDATA[<h2><strong>1. ALMA</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fJi4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e16c234-e812-4ccb-a052-7c1b33ccaafa_2398x1042.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fJi4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e16c234-e812-4ccb-a052-7c1b33ccaafa_2398x1042.png 424w, https://substackcdn.com/image/fetch/$s_!fJi4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e16c234-e812-4ccb-a052-7c1b33ccaafa_2398x1042.png 848w, https://substackcdn.com/image/fetch/$s_!fJi4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e16c234-e812-4ccb-a052-7c1b33ccaafa_2398x1042.png 1272w, 
https://substackcdn.com/image/fetch/$s_!fJi4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e16c234-e812-4ccb-a052-7c1b33ccaafa_2398x1042.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fJi4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e16c234-e812-4ccb-a052-7c1b33ccaafa_2398x1042.png" width="1456" height="633" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e16c234-e812-4ccb-a052-7c1b33ccaafa_2398x1042.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:633,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;ALMA&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="ALMA" title="ALMA" srcset="https://substackcdn.com/image/fetch/$s_!fJi4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e16c234-e812-4ccb-a052-7c1b33ccaafa_2398x1042.png 424w, https://substackcdn.com/image/fetch/$s_!fJi4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e16c234-e812-4ccb-a052-7c1b33ccaafa_2398x1042.png 848w, https://substackcdn.com/image/fetch/$s_!fJi4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e16c234-e812-4ccb-a052-7c1b33ccaafa_2398x1042.png 1272w, 
https://substackcdn.com/image/fetch/$s_!fJi4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e16c234-e812-4ccb-a052-7c1b33ccaafa_2398x1042.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>ALMA (Automated meta-Learning of Memory designs for Agentic systems) from Jeff Clune&#8217;s group introduces a Meta Agent that automatically discovers memory designs for agentic systems through open-ended exploration in code space. 
Instead of relying on hand-engineered memory modules, ALMA searches over database schemas, retrieval mechanisms, and update strategies expressed as executable code, consistently outperforming all human-designed memory baselines across four sequential decision-making benchmarks.</p><ul><li><p><strong>Open-ended code search:</strong> A Meta Agent samples previously explored memory designs from an archive, reflects on their code and evaluation logs, proposes new designs, and implements them as executable code. This gives ALMA the theoretical potential to discover arbitrary memory architectures, from graph databases to strategy libraries, unconstrained by human design intuitions.</p></li><li><p><strong>Domain-adaptive memory discovery:</strong> ALMA discovers fundamentally different memory structures for different domains: affordance graphs for ALFWorld, task signature databases for TextWorld, strategy libraries with rule prediction for Baba Is AI, and risk-interaction schemas for MiniHack. This specialization emerges automatically from the search process.</p></li><li><p><strong>Consistent gains over human baselines:</strong> Learned memory designs achieve 12.3% average success rate with GPT-5-nano (vs 8.6% for the best human baseline) and 53.9% with GPT-5-mini (vs 48.6%). The designs also scale better with more collected experience and transfer robustly across different foundation models.</p></li><li><p><strong>Toward self-improving agentic systems:</strong> ALMA represents a step toward AI systems that learn to be continual learners. The progressive discovery process shows that moderate-performing designs serve as stepping stones toward optimal solutions, with the archive enabling cumulative innovation across exploration iterations.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.07755">Paper</a></strong> | <strong><a href="https://x.com/jeffclune/status/2021242681826095179">Tweet</a></strong></p><div><hr></div><h2><strong>2. 
LLaDA 2.1</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vKZj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87694d0b-7d9a-42db-95b7-9beeefdb74be_3206x1046.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vKZj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87694d0b-7d9a-42db-95b7-9beeefdb74be_3206x1046.png 424w, https://substackcdn.com/image/fetch/$s_!vKZj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87694d0b-7d9a-42db-95b7-9beeefdb74be_3206x1046.png 848w, https://substackcdn.com/image/fetch/$s_!vKZj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87694d0b-7d9a-42db-95b7-9beeefdb74be_3206x1046.png 1272w, https://substackcdn.com/image/fetch/$s_!vKZj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87694d0b-7d9a-42db-95b7-9beeefdb74be_3206x1046.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vKZj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87694d0b-7d9a-42db-95b7-9beeefdb74be_3206x1046.png" width="1456" height="475" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87694d0b-7d9a-42db-95b7-9beeefdb74be_3206x1046.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:475,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;LLaDA 2.1&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="LLaDA 2.1" title="LLaDA 2.1" srcset="https://substackcdn.com/image/fetch/$s_!vKZj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87694d0b-7d9a-42db-95b7-9beeefdb74be_3206x1046.png 424w, https://substackcdn.com/image/fetch/$s_!vKZj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87694d0b-7d9a-42db-95b7-9beeefdb74be_3206x1046.png 848w, https://substackcdn.com/image/fetch/$s_!vKZj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87694d0b-7d9a-42db-95b7-9beeefdb74be_3206x1046.png 1272w, https://substackcdn.com/image/fetch/$s_!vKZj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87694d0b-7d9a-42db-95b7-9beeefdb74be_3206x1046.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Ant Group releases LLaDA 2.1, a major upgrade to discrete diffusion language models that breaks the speed-quality trade-off through Token-to-Token (T2T) editing. By weaving token editing into the conventional Mask-to-Token decoding scheme, LLaDA 2.1 introduces two configurable modes: Speedy Mode for aggressive throughput and Quality Mode for benchmark-leading accuracy. The release also includes the first large-scale RL framework for diffusion LLMs.</p><ul><li><p><strong>Editable state evolution:</strong> Unlike standard diffusion models that only unmask tokens, LLaDA 2.1 can also edit already-generated tokens. 
This dual action space (unmasking + correction) lets the model aggressively draft with low-confidence thresholds and then refine errors in subsequent passes, fundamentally changing the speed-quality trade-off.</p></li><li><p><strong>Two operating modes:</strong> Speedy Mode lowers the mask-to-token threshold for maximum throughput, relying on T2T passes to fix errors. Quality Mode uses conservative thresholds for superior benchmark scores. This gives practitioners a configurable knob between speed and accuracy without swapping models.</p></li><li><p><strong>Extreme decoding speed:</strong> LLaDA 2.1-Flash (100B) hits 892 tokens per second on HumanEval+ and 801 TPS on BigCodeBench. The Mini variant (16B) reaches a peak of 1,587 TPS. These speeds dramatically outpace autoregressive models of comparable quality.</p></li><li><p><strong>First RL for diffusion LLMs:</strong> The paper introduces EBPO (Evidence-Based Policy Optimization), an RL framework that uses block-causal masking and parallel likelihood estimation to enable stable policy optimization at scale for diffusion models. 
RL training sharpens reasoning and instruction-following across 33 benchmarks.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.08676">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2021582697013838150">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!neAv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc587cf69-7ecd-4963-b4d0-10c609a5116b_1456x877.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!neAv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc587cf69-7ecd-4963-b4d0-10c609a5116b_1456x877.webp 424w, https://substackcdn.com/image/fetch/$s_!neAv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc587cf69-7ecd-4963-b4d0-10c609a5116b_1456x877.webp 848w, https://substackcdn.com/image/fetch/$s_!neAv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc587cf69-7ecd-4963-b4d0-10c609a5116b_1456x877.webp 1272w, https://substackcdn.com/image/fetch/$s_!neAv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc587cf69-7ecd-4963-b4d0-10c609a5116b_1456x877.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!neAv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc587cf69-7ecd-4963-b4d0-10c609a5116b_1456x877.webp" width="1456" height="877" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c587cf69-7ecd-4963-b4d0-10c609a5116b_1456x877.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:877,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Claude Code for Everyone&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude Code for Everyone" title="Claude Code for Everyone" srcset="https://substackcdn.com/image/fetch/$s_!neAv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc587cf69-7ecd-4963-b4d0-10c609a5116b_1456x877.webp 424w, https://substackcdn.com/image/fetch/$s_!neAv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc587cf69-7ecd-4963-b4d0-10c609a5116b_1456x877.webp 848w, https://substackcdn.com/image/fetch/$s_!neAv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc587cf69-7ecd-4963-b4d0-10c609a5116b_1456x877.webp 1272w, https://substackcdn.com/image/fetch/$s_!neAv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc587cf69-7ecd-4963-b4d0-10c609a5116b_1456x877.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Excited to announce our new cohort-based training on <a href="https://dair-ai.thinkific.com/courses/claude-code-for-everyone-cohort-3">Claude Code for Everyone</a>. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.</p><p>Seats are limited for this cohort. Grab your early bird spot now.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dair-ai.thinkific.com/courses/claude-code-for-everyone-cohort-3&quot;,&quot;text&quot;:&quot;Enroll Now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dair-ai.thinkific.com/courses/claude-code-for-everyone-cohort-3"><span>Enroll Now</span></a></p><div><hr></div><h2><strong>3. 
SkillRL</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ex8o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f63fb13-815c-4d99-b3e3-c5dddb9778a0_1692x972.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ex8o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f63fb13-815c-4d99-b3e3-c5dddb9778a0_1692x972.png 424w, https://substackcdn.com/image/fetch/$s_!ex8o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f63fb13-815c-4d99-b3e3-c5dddb9778a0_1692x972.png 848w, https://substackcdn.com/image/fetch/$s_!ex8o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f63fb13-815c-4d99-b3e3-c5dddb9778a0_1692x972.png 1272w, https://substackcdn.com/image/fetch/$s_!ex8o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f63fb13-815c-4d99-b3e3-c5dddb9778a0_1692x972.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ex8o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f63fb13-815c-4d99-b3e3-c5dddb9778a0_1692x972.png" width="1456" height="836" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f63fb13-815c-4d99-b3e3-c5dddb9778a0_1692x972.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:836,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;SkillRL&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="SkillRL" title="SkillRL" srcset="https://substackcdn.com/image/fetch/$s_!ex8o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f63fb13-815c-4d99-b3e3-c5dddb9778a0_1692x972.png 424w, https://substackcdn.com/image/fetch/$s_!ex8o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f63fb13-815c-4d99-b3e3-c5dddb9778a0_1692x972.png 848w, https://substackcdn.com/image/fetch/$s_!ex8o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f63fb13-815c-4d99-b3e3-c5dddb9778a0_1692x972.png 1272w, https://substackcdn.com/image/fetch/$s_!ex8o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f63fb13-815c-4d99-b3e3-c5dddb9778a0_1692x972.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>SkillRL introduces a recursive skill-augmented RL framework that bridges the gap between raw experience and policy improvement through automatic skill discovery. Instead of storing noisy raw trajectories, SkillRL distills experience into reusable high-level behavioral patterns and evolves them alongside the agent policy during training.</p><ul><li><p><strong>Hierarchical skill library (SkillBank):</strong> An experience-based distillation mechanism extracts reusable behavioral patterns from raw trajectories and organizes them into a hierarchical skill library. This dramatically reduces the token footprint while preserving the reasoning utility needed for complex multi-step tasks.</p></li><li><p><strong>Adaptive skill retrieval:</strong> A dual retrieval strategy combines general heuristics with task-specific skills, selecting the most relevant behavioral patterns based on the current task context. 
This enables the agent to leverage accumulated knowledge without being overwhelmed by irrelevant experience.</p></li><li><p><strong>Recursive co-evolution:</strong> The skill library and agent policy evolve together during RL training. As the agent encounters harder tasks, new skills are extracted, and existing ones are refined, creating a virtuous cycle where better skills enable better performance, which generates better training data for skill extraction.</p></li><li><p><strong>Strong empirical results:</strong> SkillRL achieves state-of-the-art performance with 89.9% success rate on ALFWorld, 72.7% on WebShop, and an average of 47.1% on search-augmented QA tasks, outperforming strong baselines by over 15.3% while maintaining robustness as task complexity increases.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.08234">Paper</a></strong> | <strong><a href="https://x.com/HuaxiuYaoML/status/2021269712361918516">Tweet</a></strong></p><div><hr></div><h2><strong>4. InftyThink+</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d0qP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5c38298-5a89-448a-aa8a-5fedb430d06e_2854x1074.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d0qP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5c38298-5a89-448a-aa8a-5fedb430d06e_2854x1074.png 424w, https://substackcdn.com/image/fetch/$s_!d0qP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5c38298-5a89-448a-aa8a-5fedb430d06e_2854x1074.png 848w, 
https://substackcdn.com/image/fetch/$s_!d0qP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5c38298-5a89-448a-aa8a-5fedb430d06e_2854x1074.png 1272w, https://substackcdn.com/image/fetch/$s_!d0qP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5c38298-5a89-448a-aa8a-5fedb430d06e_2854x1074.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d0qP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5c38298-5a89-448a-aa8a-5fedb430d06e_2854x1074.png" width="1456" height="548" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5c38298-5a89-448a-aa8a-5fedb430d06e_2854x1074.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:548,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;InftyThink+&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="InftyThink+" title="InftyThink+" srcset="https://substackcdn.com/image/fetch/$s_!d0qP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5c38298-5a89-448a-aa8a-5fedb430d06e_2854x1074.png 424w, https://substackcdn.com/image/fetch/$s_!d0qP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5c38298-5a89-448a-aa8a-5fedb430d06e_2854x1074.png 848w, 
https://substackcdn.com/image/fetch/$s_!d0qP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5c38298-5a89-448a-aa8a-5fedb430d06e_2854x1074.png 1272w, https://substackcdn.com/image/fetch/$s_!d0qP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5c38298-5a89-448a-aa8a-5fedb430d06e_2854x1074.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>InftyThink+ is an end-to-end RL framework for infinite-horizon reasoning that optimizes the entire iterative reasoning trajectory. 
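</p><p>In spirit, the inference loop is: reason within a bounded context, compress progress into a summary, then resume from that summary instead of the full history. A rough sketch of ours (the model call is a toy stub, not the paper&#8217;s implementation):</p>

```python
# Sketch of bounded-context iterative reasoning in the spirit of
# InftyThink+. The real system is a trained LLM that learns when to
# summarize; here "model" is a hypothetical stub for illustration.
def solve_iteratively(question, model, max_iters=4, token_budget=256):
    summary = ""  # carried between iterations in place of the full trace
    for _ in range(max_iters):
        prompt = f"{question}\n<summary>{summary}</summary>"
        step = model(prompt, max_tokens=token_budget)  # bounded context
        if step["done"]:
            return step["answer"]
        summary = step["summary"]  # model compresses its own reasoning
    return None  # iteration budget exhausted without an answer

# Toy stub: emits a partial summary once, then finishes.
def toy_model(prompt, max_tokens):
    if "halfway" in prompt:
        return {"done": True, "answer": 42, "summary": ""}
    return {"done": False, "answer": None, "summary": "halfway"}

print(solve_iteratively("What is 6*7?", toy_model))  # -> 42
```

<p>The trajectory-level RL objective then scores the whole multi-iteration rollout rather than each chunk in isolation.</p><p>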
Standard long chain-of-thought suffers from quadratic cost, context length limits, and lost-in-the-middle degradation. InftyThink+ addresses all three by letting models autonomously decide when to summarize, what to preserve, and how to resume, trained through trajectory-level reinforcement learning.</p><ul><li><p><strong>Iterative reasoning with learned boundaries:</strong> Instead of generating one continuous chain-of-thought, InftyThink+ decomposes reasoning into multiple iterations connected by self-generated summaries. The model learns to control iteration boundaries, deciding when to compress and continue rather than following fixed heuristics or chunk sizes.</p></li><li><p><strong>Two-stage training recipe:</strong> A supervised cold-start teaches the InftyThink format (special tokens for summary and history), then trajectory-level GRPO optimizes the full multi-iteration rollout. Advantages are shared across all iterations within a trajectory, so early high-quality summaries that enable correct later reasoning receive a positive gradient signal.</p></li><li><p><strong>21-point accuracy gain on AIME24:</strong> On DeepSeek-R1-Distill-Qwen-1.5B, InftyThink+ with RL improves accuracy from 29.5% to 50.9% on AIME24, a 21-point jump that substantially outperforms vanilla long-CoT RL (38.8%). Results generalize to out-of-distribution benchmarks, including GPQA Diamond and AIME25.</p></li><li><p><strong>Faster inference, faster training:</strong> By bounding context length per iteration, InftyThink+ reduces inference latency compared to vanilla reasoning while achieving higher accuracy. Adding an efficiency reward further cuts token usage by 50% with only a modest accuracy trade-off, demonstrating a controllable speed-accuracy knob.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.06960">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2020997904731983882">Tweet</a></strong></p><div><hr></div><h2><strong>5. 
Agyn</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ntRR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803a7487-a8e4-40a0-a544-8d09b7c28ce7_2412x954.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ntRR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803a7487-a8e4-40a0-a544-8d09b7c28ce7_2412x954.png 424w, https://substackcdn.com/image/fetch/$s_!ntRR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803a7487-a8e4-40a0-a544-8d09b7c28ce7_2412x954.png 848w, https://substackcdn.com/image/fetch/$s_!ntRR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803a7487-a8e4-40a0-a544-8d09b7c28ce7_2412x954.png 1272w, https://substackcdn.com/image/fetch/$s_!ntRR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803a7487-a8e4-40a0-a544-8d09b7c28ce7_2412x954.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ntRR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803a7487-a8e4-40a0-a544-8d09b7c28ce7_2412x954.png" width="1456" height="576" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/803a7487-a8e4-40a0-a544-8d09b7c28ce7_2412x954.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:576,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Agyn&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Agyn" title="Agyn" srcset="https://substackcdn.com/image/fetch/$s_!ntRR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803a7487-a8e4-40a0-a544-8d09b7c28ce7_2412x954.png 424w, https://substackcdn.com/image/fetch/$s_!ntRR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803a7487-a8e4-40a0-a544-8d09b7c28ce7_2412x954.png 848w, https://substackcdn.com/image/fetch/$s_!ntRR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803a7487-a8e4-40a0-a544-8d09b7c28ce7_2412x954.png 1272w, https://substackcdn.com/image/fetch/$s_!ntRR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803a7487-a8e4-40a0-a544-8d09b7c28ce7_2412x954.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Agyn is a fully automated multi-agent system that models software engineering as an organizational process rather than a monolithic code generation task. Built on an open-source platform for configuring agent teams, the system assigns specialized agents to distinct roles and follows a structured development methodology - all without human intervention. Notably, Agyn was designed for real production use and was not tuned for the SWE-bench.</p><ul><li><p><strong>Team-based architecture:</strong> Four specialized agents (manager, researcher, engineer, reviewer) operate with distinct responsibilities, tools, and model configurations. 
The manager coordinates using a high-level methodology inspired by real development practice, while the engineer and reviewer work through GitHub-native pull requests and inline code reviews.</p></li><li><p><strong>Role-specific model routing:</strong> Reasoning-heavy agents like the manager and researcher use larger general-purpose models, while implementation agents use smaller, code-specialized models. This mirrors real team structure, where different roles need different capabilities, and reduces overall cost without sacrificing quality.</p></li><li><p><strong>Dynamic workflow, not a fixed pipeline:</strong> Unlike prior multi-agent SWE systems that encode a predetermined number of stages, Agyn&#8217;s coordination evolves dynamically. The manager decides when additional research, specification refinement, implementation, or review cycles are needed based on intermediate outcomes, enabling flexible iteration.</p></li><li><p><strong>Strong benchmark performance without tuning:</strong> Agyn resolves 72.2% of tasks on SWE-bench 500, outperforming single-agent baselines by 7.4% under comparable model configurations. The key insight is that organizational design and agent infrastructure may matter as much as model improvements for autonomous software engineering.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.01465">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2021267975786070509">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
EchoJEPA</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kLpk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6994d80d-02ca-4e44-90e7-e354fb236938_2997x1431.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kLpk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6994d80d-02ca-4e44-90e7-e354fb236938_2997x1431.png 424w, https://substackcdn.com/image/fetch/$s_!kLpk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6994d80d-02ca-4e44-90e7-e354fb236938_2997x1431.png 848w, https://substackcdn.com/image/fetch/$s_!kLpk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6994d80d-02ca-4e44-90e7-e354fb236938_2997x1431.png 1272w, https://substackcdn.com/image/fetch/$s_!kLpk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6994d80d-02ca-4e44-90e7-e354fb236938_2997x1431.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kLpk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6994d80d-02ca-4e44-90e7-e354fb236938_2997x1431.png" width="1456" height="695" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6994d80d-02ca-4e44-90e7-e354fb236938_2997x1431.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:695,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;EchoJEPA&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="EchoJEPA" title="EchoJEPA" srcset="https://substackcdn.com/image/fetch/$s_!kLpk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6994d80d-02ca-4e44-90e7-e354fb236938_2997x1431.png 424w, https://substackcdn.com/image/fetch/$s_!kLpk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6994d80d-02ca-4e44-90e7-e354fb236938_2997x1431.png 848w, https://substackcdn.com/image/fetch/$s_!kLpk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6994d80d-02ca-4e44-90e7-e354fb236938_2997x1431.png 1272w, https://substackcdn.com/image/fetch/$s_!kLpk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6994d80d-02ca-4e44-90e7-e354fb236938_2997x1431.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>EchoJEPA is a latent predictive foundation model for echocardiography trained on 18 million echocardiograms from 300,000 patients. By learning to predict in latent space rather than pixel space, the model separates clinically meaningful anatomical signals from ultrasound noise and artifacts, producing representations that dramatically outperform existing approaches on cardiac assessment tasks.</p><ul><li><p><strong>Massive scale and latent prediction:</strong> Trained on 18 million echocardiograms using a JEPA-style objective that predicts masked spatiotemporal regions in latent space. 
This approach learns to ignore speckle noise and acoustic artifacts that plague pixel-level methods, producing representations focused on anatomically meaningful features.</p></li><li><p><strong>Strong improvements on clinical tasks:</strong> EchoJEPA improves left ventricular ejection fraction estimation by approximately 20% and right ventricular systolic pressure estimation by approximately 17% over leading baselines. For view classification, it reaches 79% accuracy using only 1% of labeled data, while the best baseline achieves just 42% with the full labeled dataset.</p></li><li><p><strong>Exceptional robustness:</strong> Under acoustic perturbations that degrade competitor models by 17%, EchoJEPA degrades only 2%. This robustness extends to population shift: zero-shot performance on pediatric patients exceeds fully fine-tuned baseline models, demonstrating genuine generalization rather than memorization.</p></li><li><p><strong>Clinical foundation model potential:</strong> The combination of scale, label efficiency, and robustness across patient populations positions EchoJEPA as a practical foundation for clinical echocardiography applications where labeled data is scarce and acoustic conditions vary widely.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.02603">Paper</a></strong> | <strong><a href="https://x.com/alifmunim/status/2019863775575482703">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
AdaptEvolve</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Qg-8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47abba7f-3f34-4892-bbed-ff1cc7cce854_1773x1600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qg-8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47abba7f-3f34-4892-bbed-ff1cc7cce854_1773x1600.png 424w, https://substackcdn.com/image/fetch/$s_!Qg-8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47abba7f-3f34-4892-bbed-ff1cc7cce854_1773x1600.png 848w, https://substackcdn.com/image/fetch/$s_!Qg-8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47abba7f-3f34-4892-bbed-ff1cc7cce854_1773x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!Qg-8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47abba7f-3f34-4892-bbed-ff1cc7cce854_1773x1600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Qg-8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47abba7f-3f34-4892-bbed-ff1cc7cce854_1773x1600.png" width="1456" height="1314" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/47abba7f-3f34-4892-bbed-ff1cc7cce854_1773x1600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1314,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AdaptEvolve&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AdaptEvolve" title="AdaptEvolve" srcset="https://substackcdn.com/image/fetch/$s_!Qg-8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47abba7f-3f34-4892-bbed-ff1cc7cce854_1773x1600.png 424w, https://substackcdn.com/image/fetch/$s_!Qg-8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47abba7f-3f34-4892-bbed-ff1cc7cce854_1773x1600.png 848w, https://substackcdn.com/image/fetch/$s_!Qg-8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47abba7f-3f34-4892-bbed-ff1cc7cce854_1773x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!Qg-8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47abba7f-3f34-4892-bbed-ff1cc7cce854_1773x1600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>AdaptEvolve tackles a key efficiency bottleneck in evolutionary agentic systems: the repeated invocation of large LLMs during iterative refinement loops. The method uses intrinsic generation confidence to dynamically select which model to invoke at each step, routing easy sub-problems to smaller models and reserving expensive frontier models for genuinely hard decisions.</p><ul><li><p><strong>Confidence-driven model routing:</strong> Instead of static heuristics or external controllers, AdaptEvolve monitors real-time generation confidence scores to estimate task solvability at each evolutionary step. 
When the smaller model is confident, it proceeds without escalation; when uncertainty is high, the system routes to a larger, more capable model.</p></li><li><p><strong>Favorable cost-accuracy trade-off:</strong> Across benchmarks, AdaptEvolve cuts inference costs by approximately 38% while retaining roughly 97.5% of the upper-bound accuracy achieved by always using the largest model. This creates a Pareto-optimal frontier that static single-model or naive cascade approaches cannot match.</p></li><li><p><strong>Practical for deployed agent loops:</strong> Evolutionary and iterative refinement workflows often require dozens of LLM calls per task. Reducing per-call cost by nearly 40% without meaningful accuracy loss makes these workflows viable for production deployment, where cost compounds rapidly.</p></li><li><p><strong>Generalizable routing signal:</strong> The confidence-based selection mechanism is model-agnostic and does not require task-specific tuning, making it applicable across different evolutionary agent architectures and domain-specific refinement pipelines.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2602.11931">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2022684265079795962">Tweet</a></strong></p><div><hr></div><h2><strong>8. Gaia2</strong></h2><p>Meta FAIR introduces Gaia2, a next-generation agent benchmark where environments change independently of agent actions, forcing agents to handle temporal pressure, uncertainty, and multi-agent coordination. GPT-5 leads at 42% pass@1 but struggles with time-constrained tasks, while Kimi-K2 leads open-source models at 21%. Built on the open-source Agents Research Environments (ARE) platform with action-level verifiers, Gaia2 represents a paradigm shift from static benchmarks to dynamic evaluation of agentic capabilities.</p><p><strong><a href="https://arxiv.org/abs/2602.11964">Paper</a></strong></p><div><hr></div><h2><strong>9. 
AgentArk</strong></h2><p>AgentArk distills multi-agent debate dynamics into a single LLM, transferring the reasoning and self-correction abilities of multi-agent systems into one model at training time. Three hierarchical distillation strategies (reasoning-enhanced SFT, trajectory-based augmentation, and process-aware distillation with a process reward model) yield an average 4.8% improvement over single-agent baselines across math and reasoning benchmarks, approaching full multi-agent performance at a fraction of the inference cost. Cross-family distillation (e.g., Qwen3-32B to LLaMA-3-8B) produces the largest gains, suggesting heterogeneous architectures benefit most from transferred reasoning signals.</p><p><strong><a href="https://arxiv.org/abs/2602.03955">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2020889229128270294">Tweet</a></strong></p><div><hr></div><h2><strong>10. AgentSkiller</strong></h2><p>AgentSkiller scales generalist agent intelligence through semantically integrated cross-domain data synthesis, producing 11K high-quality synthetic trajectories across diverse tool-use scenarios. 
The resulting 14B model beats GPT-o3 on tau2-bench (79.1% vs 68.4%), and even the 4B variant outperforms 70B and 235B models, demonstrating that data quality and semantic integration matter more than parameter count for building strong tool-use agents.</p><p><strong><a href="https://arxiv.org/abs/2602.09372">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2021620772817834014">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖AI Agents Weekly: GPT-5.3-Codex-Spark, GLM-5, MiniMax M2.5, Recursive Language Models, Harness Engineering, Agentica, and More]]></title><description><![CDATA[GPT-5.3-Codex-Spark, GLM-5, MiniMax M2.5, Recursive Language Models, Harness Engineering, Agentica, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-gpt-53-codex-spark</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-gpt-53-codex-spark</guid><pubDate>Sat, 14 Feb 2026 15:02:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fGSn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91de83cc-9168-4ff1-9de7-4c396b7bff8f_3840x2160.gif" length="0" type="image/gif"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>OpenAI releases GPT-5.3-Codex-Spark</p></li><li><p>Zhipu AI launches GLM-5 with Agent Mode</p></li><li><p>MiniMax drops the M2.5 open-source model</p></li><li><p>Recursive Language Models replace context stuffing</p></li><li><p>OpenAI ships 1M lines with zero manual code</p></li><li><p>Agentica pushes ARC-AGI-2 with recursive agents</p></li><li><p>Chrome launches WebMCP early preview</p></li><li><p>Anthropic raises $30B at $380B valuation</p></li><li><p>Excalidraw launches official MCP server</p></li><li><p>Hive agent framework evolves at runtime</p></li><li><p>Waymo begins 6th-gen autonomous operations</p></li><li><p>Gemini 3 Deep Think solves 18 open problems</p></li><li><p>And all the
top AI dev news, papers, and tools.</p></li></ul><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>GPT-5.3-Codex-Spark</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KP8G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F265144fc-c726-49a8-a266-d05d2bf4fc1d_2200x1100.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KP8G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F265144fc-c726-49a8-a266-d05d2bf4fc1d_2200x1100.png 424w, https://substackcdn.com/image/fetch/$s_!KP8G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F265144fc-c726-49a8-a266-d05d2bf4fc1d_2200x1100.png 848w, https://substackcdn.com/image/fetch/$s_!KP8G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F265144fc-c726-49a8-a266-d05d2bf4fc1d_2200x1100.png 1272w, https://substackcdn.com/image/fetch/$s_!KP8G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F265144fc-c726-49a8-a266-d05d2bf4fc1d_2200x1100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KP8G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F265144fc-c726-49a8-a266-d05d2bf4fc1d_2200x1100.png" width="1456" height="728" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/265144fc-c726-49a8-a266-d05d2bf4fc1d_2200x1100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;GPT-5.3-Codex-Spark&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="GPT-5.3-Codex-Spark" title="GPT-5.3-Codex-Spark" srcset="https://substackcdn.com/image/fetch/$s_!KP8G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F265144fc-c726-49a8-a266-d05d2bf4fc1d_2200x1100.png 424w, https://substackcdn.com/image/fetch/$s_!KP8G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F265144fc-c726-49a8-a266-d05d2bf4fc1d_2200x1100.png 848w, https://substackcdn.com/image/fetch/$s_!KP8G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F265144fc-c726-49a8-a266-d05d2bf4fc1d_2200x1100.png 1272w, https://substackcdn.com/image/fetch/$s_!KP8G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F265144fc-c726-49a8-a266-d05d2bf4fc1d_2200x1100.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>OpenAI released GPT-5.3-Codex-Spark, their most capable agentic coding model, combining frontier coding performance with reasoning and professional knowledge capabilities while running 25% faster than its predecessor. It is also OpenAI&#8217;s first model that was instrumental in creating itself.</p><ul><li><p><strong>Self-developing model:</strong> The Codex team used early versions of GPT-5.3 to debug its own training, manage deployment, and diagnose test results and evaluations, making it the first model instrumental in its own development.</p></li><li><p><strong>Beyond coding:</strong> Handles professional knowledge-work outputs like presentations, spreadsheets, and documentation. 
On GDPval, a knowledge-work benchmark, it wins or ties in 70.9% of evaluations.</p></li><li><p><strong>Cybersecurity concerns:</strong> OpenAI rates this as their first model hitting &#8220;high&#8221; for cybersecurity capability under their Preparedness Framework, meaning it could meaningfully enable real-world cyber harm if automated. They announced a $10M API credits program for cyber defense research in response.</p></li></ul><p><strong><a href="https://openai.com/index/introducing-gpt-5-3-codex-spark/">Blog</a></strong></p><div><hr></div><h3><strong>GLM-5</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Af9o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1177e3a3-b620-4166-ac43-e93e95ce1ca2_4239x2884.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Af9o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1177e3a3-b620-4166-ac43-e93e95ce1ca2_4239x2884.png 424w, https://substackcdn.com/image/fetch/$s_!Af9o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1177e3a3-b620-4166-ac43-e93e95ce1ca2_4239x2884.png 848w, https://substackcdn.com/image/fetch/$s_!Af9o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1177e3a3-b620-4166-ac43-e93e95ce1ca2_4239x2884.png 1272w, https://substackcdn.com/image/fetch/$s_!Af9o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1177e3a3-b620-4166-ac43-e93e95ce1ca2_4239x2884.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Af9o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1177e3a3-b620-4166-ac43-e93e95ce1ca2_4239x2884.png" width="1456" height="991" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1177e3a3-b620-4166-ac43-e93e95ce1ca2_4239x2884.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:991,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;GLM-5&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="GLM-5" title="GLM-5" srcset="https://substackcdn.com/image/fetch/$s_!Af9o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1177e3a3-b620-4166-ac43-e93e95ce1ca2_4239x2884.png 424w, https://substackcdn.com/image/fetch/$s_!Af9o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1177e3a3-b620-4166-ac43-e93e95ce1ca2_4239x2884.png 848w, https://substackcdn.com/image/fetch/$s_!Af9o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1177e3a3-b620-4166-ac43-e93e95ce1ca2_4239x2884.png 1272w, https://substackcdn.com/image/fetch/$s_!Af9o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1177e3a3-b620-4166-ac43-e93e95ce1ca2_4239x2884.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Zhipu AI launched GLM-5, a 744B-parameter MoE model with 40B active parameters, engineered from the ground up for agentic intelligence and multi-step reasoning. Trained entirely on Huawei Ascend chips using the MindSpore framework, it represents full independence from US-manufactured semiconductor hardware.</p><ul><li><p><strong>Agent Mode:</strong> Native capability for autonomous task decomposition, breaking high-level objectives into subtasks with minimal human intervention. Can transform raw prompts into professional documents in .docx, .pdf, and .xlsx formats.</p></li><li><p><strong>Training scale:</strong> Ingested 28.5 trillion tokens during pre-training, a 23.9% increase over GLM-4.7. 
Uses a novel RL technique that achieves record-low hallucination rates.</p></li><li><p><strong>Results:</strong> Competitive with frontier models across coding, creative writing, and complex problem-solving tasks. </p></li><li><p><strong>Open source and affordable:</strong> Released under MIT license with open weights. Available on OpenRouter at approximately $0.80 per million input tokens and $2.56 per million output tokens, roughly six times cheaper than comparable proprietary models.</p></li></ul><p><strong><a href="https://z.ai/blog/glm-5">Blog</a></strong></p>
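<p>To make the quoted pricing concrete, here is a minimal back-of-envelope sketch of what those per-token rates imply for an agentic workload. The workload size below is a hypothetical assumption, and the proprietary comparison simply applies the post&#8217;s &#8220;roughly six times cheaper&#8221; claim to both rates:</p>

```python
# Back-of-envelope cost sketch using the OpenRouter rates quoted above
# (~$0.80 per 1M input tokens, ~$2.56 per 1M output tokens for GLM-5).
# The example workload below is hypothetical, for illustration only.

GLM5_INPUT_PER_M = 0.80   # USD per million input tokens (quoted above)
GLM5_OUTPUT_PER_M = 2.56  # USD per million output tokens (quoted above)

def cost_usd(input_tokens: int, output_tokens: int,
             in_rate: float = GLM5_INPUT_PER_M,
             out_rate: float = GLM5_OUTPUT_PER_M) -> float:
    """Total cost for one workload at per-million-token rates."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Hypothetical agentic run: 2M input tokens, 250K output tokens.
glm5_cost = cost_usd(2_000_000, 250_000)
# Applying the "roughly six times cheaper" claim to both rates:
proprietary_cost = cost_usd(2_000_000, 250_000,
                            in_rate=6 * GLM5_INPUT_PER_M,
                            out_rate=6 * GLM5_OUTPUT_PER_M)
print(f"GLM-5: ${glm5_cost:.2f} vs comparable proprietary: ~${proprietary_cost:.2f}")
```

<p>At these rates, a long multi-step agent session costs a couple of dollars instead of more than ten, which is where the affordability claim matters most: agent loops multiply per-call costs across dozens of invocations.</p>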
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-gpt-53-codex-spark">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[Recursive Language Models: Stop Stuffing the Context Window]]></title><description><![CDATA[The next big thing might be recursive language models (RLMs).]]></description><link>https://nlp.elvissaravia.com/p/recursive-language-models-stop-stuffing</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/recursive-language-models-stop-stuffing</guid><dc:creator><![CDATA[elvis]]></dc:creator><pubDate>Thu, 12 Feb 2026 20:07:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2EU6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe31912fc-af0e-433f-8c8e-22fd4c0e6bc8_2721x2073.png" length="0" type="image/png"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2EU6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe31912fc-af0e-433f-8c8e-22fd4c0e6bc8_2721x2073.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2EU6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe31912fc-af0e-433f-8c8e-22fd4c0e6bc8_2721x2073.png 424w, https://substackcdn.com/image/fetch/$s_!2EU6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe31912fc-af0e-433f-8c8e-22fd4c0e6bc8_2721x2073.png 848w, https://substackcdn.com/image/fetch/$s_!2EU6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe31912fc-af0e-433f-8c8e-22fd4c0e6bc8_2721x2073.png 1272w,
https://substackcdn.com/image/fetch/$s_!2EU6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe31912fc-af0e-433f-8c8e-22fd4c0e6bc8_2721x2073.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2EU6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe31912fc-af0e-433f-8c8e-22fd4c0e6bc8_2721x2073.png" width="1456" height="1109" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e31912fc-af0e-433f-8c8e-22fd4c0e6bc8_2721x2073.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1109,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Teaser Figure&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Teaser Figure" title="Teaser Figure" srcset="https://substackcdn.com/image/fetch/$s_!2EU6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe31912fc-af0e-433f-8c8e-22fd4c0e6bc8_2721x2073.png 424w, https://substackcdn.com/image/fetch/$s_!2EU6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe31912fc-af0e-433f-8c8e-22fd4c0e6bc8_2721x2073.png 848w, https://substackcdn.com/image/fetch/$s_!2EU6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe31912fc-af0e-433f-8c8e-22fd4c0e6bc8_2721x2073.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2EU6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe31912fc-af0e-433f-8c8e-22fd4c0e6bc8_2721x2073.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Recently, I saw a ton of interest on Recursive Language Models (RLMs) and thought it would be great to write about why this is an important paper.</p><p>This will be a longer one than usual, so feel free to bookmark it if you want to read it later. 
But I think it will be worth the time to understand what an RLM is and the excitement around it.</p><p>At a high level, RLM introduces a deceptively simple idea that rethinks how language models interact with long documents. Instead of cramming text into the context window and hoping the model doesn&#8217;t lose track, RLMs treat text as an external environment that the model <em>programs against</em>.</p><p>I know what you are thinking: this sounds a lot like standard coding agents connected to tools, or even a classical RAG system. But bear with me; it gets more interesting than that.</p><p>The bigger question is: how does an 8B-parameter model using the proposed approach come close to GPT-5 on long-context tasks?</p><p>What&#8217;s happening here? Let&#8217;s break it down.</p><h2><strong>The Problem: Context Rot</strong></h2><p>We keep building bigger context windows: 128K, 1M, 10M tokens. But a larger window doesn&#8217;t solve the core issue: model performance degrades as context grows. Key details get buried. Reasoning quality drops. The authors refer to this as &#8220;<a href="https://research.trychroma.com/context-rot">context rot</a>,&#8221; and anyone who has tried to get a model to reason over a long document has experienced it firsthand.</p><p>Current mitigations (RAG, summarization, chunking) all share the same assumption: the retrieved or compressed text must eventually become tokens in the prompt. The model has to &#8220;see&#8221; everything it reasons about.</p><p>RLMs challenge that assumption entirely.
Here&#8217;s how.</p><h2><strong>The Model is a &#8220;Programmer&#8221;, Not a &#8220;Reader&#8221;</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AleH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa064a279-16c3-46fb-a220-19ed3c78982c_2264x1524.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AleH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa064a279-16c3-46fb-a220-19ed3c78982c_2264x1524.png 424w, https://substackcdn.com/image/fetch/$s_!AleH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa064a279-16c3-46fb-a220-19ed3c78982c_2264x1524.png 848w, https://substackcdn.com/image/fetch/$s_!AleH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa064a279-16c3-46fb-a220-19ed3c78982c_2264x1524.png 1272w, https://substackcdn.com/image/fetch/$s_!AleH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa064a279-16c3-46fb-a220-19ed3c78982c_2264x1524.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AleH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa064a279-16c3-46fb-a220-19ed3c78982c_2264x1524.png" width="1456" height="980" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a064a279-16c3-46fb-a220-19ed3c78982c_2264x1524.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:980,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:657124,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nlp.elvissaravia.com/i/187783349?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa064a279-16c3-46fb-a220-19ed3c78982c_2264x1524.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AleH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa064a279-16c3-46fb-a220-19ed3c78982c_2264x1524.png 424w, https://substackcdn.com/image/fetch/$s_!AleH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa064a279-16c3-46fb-a220-19ed3c78982c_2264x1524.png 848w, https://substackcdn.com/image/fetch/$s_!AleH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa064a279-16c3-46fb-a220-19ed3c78982c_2264x1524.png 1272w, https://substackcdn.com/image/fetch/$s_!AleH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa064a279-16c3-46fb-a220-19ed3c78982c_2264x1524.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>What if the model didn&#8217;t have to read the document at all? Instead of feeding a document into the model, RLMs place the document inside a coding environment (a Python REPL) and let the model write programs to interact with it. The model never ingests the raw text. It writes code (grep for a keyword, slice out a section, iterate over chapters), and only the <em>results</em> of that code enter the context window.</p><p>Think of it as the difference between reading a database and querying a database. Traditional LMs read. RLMs query.</p><p>The recursive part: the model can spawn sub-agents (copies of itself with the same architecture) to process specific slices. Each sub-agent gets a manageable chunk, reasons over it within its own context window, and returns a result to the parent. 
The parent model&#8217;s context is never polluted with irrelevant information from chunks it didn&#8217;t need to see.</p><p>Concretely, the system has three components:</p><ol><li><p>A context variable holding the document (potentially 10M+ tokens)</p></li><li><p>An rlm_agent(query, context) function that delegates to child agents with identical architecture</p></li><li><p>Standard Python libraries (json, re, numpy) pre-loaded in the REPL</p></li></ol><p>The agent writes code, executes it, observes results, and iterates. It&#8217;s not tool use in the traditional sense. The model lives inside a programming environment, writing and executing code as its primary mode of &#8220;reasoning.&#8221;</p><h2><strong>But Isn&#8217;t This Just...</strong></h2><p>The online discourse has predictably asked: &#8220;Isn&#8217;t this just RAG? Isn&#8217;t this just a coding agent? Isn&#8217;t this grep with extra steps?&#8221;</p><p>The distinction matters, and it&#8217;s architectural:</p><p><strong>RLM vs. RAG:</strong> In RAG, retrieved chunks get injected into the prompt. The model reads them directly. In an RLM, the document stays inside the REPL. The model writes code to extract only what it needs, and only those extracted results enter the context. The document is never read wholesale.</p><p><strong>RLM vs. Standard coding agents:</strong> Both combine LMs with code execution. But in typical agent frameworks, the model calls sub-agents as independent tools. The REPL and the sub-agent are separate. In an RLM, the sub-agent is a <em>function inside the REPL</em>. The parent writes an algorithm, calls rlm_agent() as part of that algorithm, and the results flow back into the program&#8217;s execution.</p><p><strong>RLM vs. Simple grep:</strong> Grep is one operation that an RLM might write. 
But the power is in composition, where the model can write arbitrary programs that combine search, filtering, aggregation, and recursive delegation.</p><p>As co-author Alex Zhang <a href="https://x.com/a1zhang">puts it</a>: &#8220;It&#8217;s not the sub-agent having access to a grepper that matters; it&#8217;s that the sub-agent is called from and communicates inside of the REPL.&#8221;</p><h2><strong>What the Model Actually Does</strong></h2><p>One of the most interesting details from Zhang&#8217;s <a href="https://alexzhang13.github.io/blog/2025/rlm/">blog</a> and the paper is that the authors didn&#8217;t hand-design decomposition strategies. They gave the model a REPL with a recursive call function and observed what emerged. The model independently discovered several interesting strategies:</p><ul><li><p><strong>Peeking</strong>: examining the first few thousand characters to understand document structure before doing anything else</p></li><li><p><strong>Grepping</strong>: writing a regex to narrow down relevant lines from massive contexts</p></li><li><p><strong>Partition + Map</strong>: chunking the context into pieces and recursively processing each one</p></li><li><p><strong>Programmatic processing</strong>: for structured tasks like tracking git diffs, the model would write a complete program to solve the task in one shot rather than reasoning about it line by line</p></li></ul><p>This matters because the decomposition strategy is not prescribed. The model figures out how to interact with its context at inference time. The authors make a useful distinction here. Most agent frameworks do <em>task decomposition</em> (breaking a complex problem into simpler sub-problems), while RLMs additionally do <em>context decomposition</em> (breaking a large input into manageable pieces). </p><p>Standard agents decide <em>what to do</em>. 
RLMs also decide <em>what to look at</em>.</p><p>It&#8217;s also worth noting that all published results use only a recursive depth of 1. The root model can call sub-agents, but those sub-agents don&#8217;t call further sub-agents. The architecture supports deeper recursion, but it hasn&#8217;t been tested yet. The current reported results represent the shallow end of what this system can do.</p><h2><strong>The Results</strong></h2><p>The paper evaluates RLMs on several benchmarks, but two results stand out.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6AqY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c85261-9a5f-478c-b01f-80690c69d393_1706x1466.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6AqY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c85261-9a5f-478c-b01f-80690c69d393_1706x1466.png 424w, https://substackcdn.com/image/fetch/$s_!6AqY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c85261-9a5f-478c-b01f-80690c69d393_1706x1466.png 848w, https://substackcdn.com/image/fetch/$s_!6AqY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c85261-9a5f-478c-b01f-80690c69d393_1706x1466.png 1272w, https://substackcdn.com/image/fetch/$s_!6AqY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c85261-9a5f-478c-b01f-80690c69d393_1706x1466.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6AqY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c85261-9a5f-478c-b01f-80690c69d393_1706x1466.png" width="1456" height="1251" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95c85261-9a5f-478c-b01f-80690c69d393_1706x1466.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1251,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:293527,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nlp.elvissaravia.com/i/187783349?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c85261-9a5f-478c-b01f-80690c69d393_1706x1466.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6AqY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c85261-9a5f-478c-b01f-80690c69d393_1706x1466.png 424w, https://substackcdn.com/image/fetch/$s_!6AqY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c85261-9a5f-478c-b01f-80690c69d393_1706x1466.png 848w, https://substackcdn.com/image/fetch/$s_!6AqY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c85261-9a5f-478c-b01f-80690c69d393_1706x1466.png 1272w, https://substackcdn.com/image/fetch/$s_!6AqY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c85261-9a5f-478c-b01f-80690c69d393_1706x1466.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>OOLONG-Pairs</strong> requires models to identify relationships across scattered statements in long documents. This is a quadratic reasoning task that requires connecting information from many different locations, not just finding a single needle.</p><p>RLM(GPT-5) achieved a 58.0 F1 score on a task where the same model without the recursive scaffold couldn&#8217;t get past 0.1. That&#8217;s not an incremental improvement. 
It&#8217;s the difference between complete failure and meaningful capability on a class of problems that current architectures cannot handle at all.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZTZQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F331cfe21-b9e9-494b-ab7e-655d3a036504_2416x1502.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZTZQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F331cfe21-b9e9-494b-ab7e-655d3a036504_2416x1502.png 424w, https://substackcdn.com/image/fetch/$s_!ZTZQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F331cfe21-b9e9-494b-ab7e-655d3a036504_2416x1502.png 848w, https://substackcdn.com/image/fetch/$s_!ZTZQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F331cfe21-b9e9-494b-ab7e-655d3a036504_2416x1502.png 1272w, https://substackcdn.com/image/fetch/$s_!ZTZQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F331cfe21-b9e9-494b-ab7e-655d3a036504_2416x1502.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZTZQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F331cfe21-b9e9-494b-ab7e-655d3a036504_2416x1502.png" width="1456" height="905" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/331cfe21-b9e9-494b-ab7e-655d3a036504_2416x1502.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:905,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:460742,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nlp.elvissaravia.com/i/187783349?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F331cfe21-b9e9-494b-ab7e-655d3a036504_2416x1502.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZTZQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F331cfe21-b9e9-494b-ab7e-655d3a036504_2416x1502.png 424w, https://substackcdn.com/image/fetch/$s_!ZTZQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F331cfe21-b9e9-494b-ab7e-655d3a036504_2416x1502.png 848w, https://substackcdn.com/image/fetch/$s_!ZTZQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F331cfe21-b9e9-494b-ab7e-655d3a036504_2416x1502.png 1272w, https://substackcdn.com/image/fetch/$s_!ZTZQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F331cfe21-b9e9-494b-ab7e-655d3a036504_2416x1502.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>BrowseComp-Plus</strong> requires multi-hop reasoning across large collections of documents, synthesizing information scattered across up to 1,000 sources. At the 1,000-document scale, vanilla frontier models completely failed (0.0%, hitting context limits). 
RLM(GPT-5) led with <strong>91.3%</strong> accuracy, well ahead of the next best baseline (Summary agent at 70.5%) and CodeAct + BM25 at 51.0%.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uU3C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19034138-6b18-48fc-b797-5a2a565be3d4_1590x669.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uU3C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19034138-6b18-48fc-b797-5a2a565be3d4_1590x669.png 424w, https://substackcdn.com/image/fetch/$s_!uU3C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19034138-6b18-48fc-b797-5a2a565be3d4_1590x669.png 848w, https://substackcdn.com/image/fetch/$s_!uU3C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19034138-6b18-48fc-b797-5a2a565be3d4_1590x669.png 1272w, https://substackcdn.com/image/fetch/$s_!uU3C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19034138-6b18-48fc-b797-5a2a565be3d4_1590x669.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uU3C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19034138-6b18-48fc-b797-5a2a565be3d4_1590x669.png" width="1456" height="613" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/19034138-6b18-48fc-b797-5a2a565be3d4_1590x669.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:613,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;API&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="API" title="API" srcset="https://substackcdn.com/image/fetch/$s_!uU3C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19034138-6b18-48fc-b797-5a2a565be3d4_1590x669.png 424w, https://substackcdn.com/image/fetch/$s_!uU3C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19034138-6b18-48fc-b797-5a2a565be3d4_1590x669.png 848w, https://substackcdn.com/image/fetch/$s_!uU3C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19034138-6b18-48fc-b797-5a2a565be3d4_1590x669.png 1272w, https://substackcdn.com/image/fetch/$s_!uU3C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19034138-6b18-48fc-b797-5a2a565be3d4_1590x669.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>On broader long-context tasks:</p><ul><li><p>RLMs process inputs up to <strong>two orders of magnitude</strong> beyond the base model&#8217;s context window</p></li><li><p>RLM-Qwen3-8B (a post-trained 8B model) outperforms the base Qwen3-8B by <strong>28.3%</strong> on average</p></li><li><p>That same 8B model approaches GPT-5 quality on three long-context benchmarks</p></li></ul><p>That last point deserves emphasis: you can take a small open model, teach it to manage its own context recursively with minimal post-training, and get competitive with a frontier model on tasks where raw context window size usually dominates.</p><h2><strong>Current Limitations (Worth Being Honest About)</strong></h2><p>RLMs are early-stage, and the authors are transparent about what doesn&#8217;t work yet. </p><p>The constraints fall into two categories:</p><p><strong>The model has to be a good coder.</strong> This is the most fundamental constraint. 
RLMs offload reasoning into code, which means the underlying model needs strong programming ability. Weaker models struggle to write effective REPL programs, and models with long internal reasoning traces sometimes burn through their output budget on &#8220;thinking&#8221; before producing any executable code at all. This creates a floor on model capability that doesn&#8217;t exist for standard prompting.</p><p><strong>Generalization is still fragile.</strong> The recursive strategy doesn&#8217;t transfer cleanly across model families. Prompts tuned for one model can behave unpredictably on another. The paper reports cases where a model attempted to spawn thousands of simultaneous sub-agents, requiring manual intervention. The inference pipeline also currently runs sub-agents sequentially (blocking), which means deep recursion gets slow. Parallelizing sub-agent calls is an obvious engineering improvement, but it isn&#8217;t implemented yet.</p><p>These are real constraints for anyone thinking about deploying RLMs today.</p><h2><strong>What Makes This Interesting for the Future</strong></h2><p>RLMs point toward something broader than a clever inference trick. </p><p>There are a few threads worth pulling on:</p><p><strong>Context management as a learnable skill.</strong> The dominant approach to long context has been architectural: bigger windows, better position encodings, and more efficient attention. RLMs reframe the problem entirely: context management isn&#8217;t a hardware constraint to engineer around but a <em>capability</em> the model can learn. 
Instead of asking &#8220;how do we fit more tokens in?&#8221; the question becomes &#8220;can we train models to be selective about what they attend to?&#8221; The post-training results on Qwen3-8B suggest the answer is yes.</p><p><strong>Native recursive training.</strong> The paper shows you can post-train an existing model to be &#8220;natively recursive.&#8221; The name &#8220;recursive LM&#8221; comes from this property: you train a single LM with a fixed context window, and it learns to recursively call itself. The training signal is to solve the task using the REPL and sub-agent calls rather than trying to absorb everything at once, as most coding agents do today.</p><p>Zhang&#8217;s <a href="https://alexzhang13.github.io/blog/2025/rlm/">blog</a> makes a further argument worth watching. The trajectory by which a model chooses to interact with and decompose its context is entirely learnable and could be optimized with RL. If that pans out, it could yield models that develop better decomposition strategies over time.</p><p><strong>Task-agnostic neurosymbolic reasoning.</strong> RLMs are not limited to coding tasks. Zhang <a href="https://x.com/a1zhang">explicitly argues</a> we should think beyond coding agents. Code could be seen as the <em>medium</em> for general-purpose reasoning.</p><p>This idea is converging from multiple directions. Arvind Narayanan <a href="https://x.com/random_walker/status/2018342421696766147">independently argued</a> that coding agents succeed precisely because they are a form of neurosymbolic AI. He also observes that complex agentic tasks already involve LLMs writing code that invokes other LLMs, and in principle, you can have arbitrary recursion depth between statistical and symbolic systems. That observation aligns well with the RLM architecture.</p><p>When you think about it, the REPL doesn&#8217;t strictly need to be a Python environment. 
It could be a SQL console for database reasoning, a search engine with a scripting layer for research tasks, or a spreadsheet runtime for financial analysis. The underlying pattern stays the same. Give the model a structured environment where it can write instructions, observe results, and recurse.</p><p><strong>Convergence prediction.</strong> Zhang <a href="https://x.com/a1zhang">predicts</a> most future agentic scaffolds will converge toward RLM-like properties. Practitioners are already discovering this empirically through structured context management, transcript storage with grep, and smart compaction, all of which are informal approximations of what RLMs formalize.</p><h2><strong>Final Words and Resources</strong></h2><p>For years, the scaling story for long-context LLMs has been to make the context window bigger, hope attention holds up, and just keep throwing more compute at the problem. RLMs suggest something more elegant. The right unit of scaling isn&#8217;t the context window itself, but the model&#8217;s ability to <em>decide what belongs in it</em>.</p><p>What stands out most about RLMs is the efficiency story. An 8B model, given the ability to write programs instead of reading documents, starts competing with models orders of magnitude larger. 
That kind of gain from a scaffold change alone says something about where the field might be headed.</p><p>The authors and other folks have released a few interesting resources worth looking at: </p><ul><li><p>Official codebase: <a href="https://github.com/alexzhang13/rlm">https://github.com/alexzhang13/rlm</a></p></li><li><p>Minimal RLM engine: <a href="https://github.com/alexzhang13/rlm-minimal">https://github.com/alexzhang13/rlm-minimal</a></p></li><li><p>ADK integration: <a href="https://github.com/LiamConnell/adk-python/tree/66a757f5/contributing/samples/rlm">https://github.com/LiamConnell/adk-python/tree/66a757f5/contributing/samples/rlm</a></p></li><li><p>More on RLMs: <a href="https://discuss.google.dev/t/recursive-language-models-in-adk/323523">https://discuss.google.dev/t/recursive-language-models-in-adk/323523</a></p></li><li><p>Blog post: <a href="https://alexzhang13.github.io/blog/2025/rlm/">https://alexzhang13.github.io/blog/2025/rlm/</a></p></li><li><p>Full paper: <a href="https://arxiv.org/abs/2512.24601">https://arxiv.org/abs/2512.24601</a></p></li></ul>]]></content:encoded></item></channel></rss>