<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[AI Newsletter]]></title><description><![CDATA[The AI Newsletter provides weekly summaries of the latest and top AI trends, papers, tools, news, and best practices. Home of Top AI Papers of the Week and AI Agents Weekly series. ]]></description><link>https://nlp.elvissaravia.com</link><image><url>https://substackcdn.com/image/fetch/$s_!m7md!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41327c80-fe59-416d-aa6f-ab6874177ac7_517x517.png</url><title>AI Newsletter</title><link>https://nlp.elvissaravia.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 20 Jun 2026 01:57:29 GMT</lastBuildDate><atom:link href="https://nlp.elvissaravia.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[elvis]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[nlpnews@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[nlpnews@substack.com]]></itunes:email><itunes:name><![CDATA[elvis]]></itunes:name></itunes:owner><itunes:author><![CDATA[elvis]]></itunes:author><googleplay:owner><![CDATA[nlpnews@substack.com]]></googleplay:owner><googleplay:email><![CDATA[nlpnews@substack.com]]></googleplay:email><googleplay:author><![CDATA[elvis]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Autonomous Long-Running Coding Agents]]></title><description><![CDATA[What is the big deal with loop engineering and autonomous long-running agents?]]></description><link>https://nlp.elvissaravia.com/p/autonomous-long-running-coding-agents</link><guid 
isPermaLink="false">https://nlp.elvissaravia.com/p/autonomous-long-running-coding-agents</guid><dc:creator><![CDATA[elvis]]></dc:creator><pubDate>Mon, 15 Jun 2026 20:44:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vDdf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eac97c5-6f0c-4a5b-86b7-53ab1f06d6ed_680x380.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Autonomous coding is moving from better prompting to better control systems. The important shift is that engineers are learning how to wrap agents in goals, evaluators, loops, and artifacts that let them keep working after the human stops typing.</p><p>This matters because most serious engineering work spans long horizons: ambiguous requirements, hidden constraints, partial failures, changing context, and repeated verification. The new frontier is designing the system around the agent so it can plan, execute, check its work, recover from mistakes, and keep making progress without constant human steering.</p><p><em>This piece is based on a <a href="https://academy.dair.ai/events/cmplo7v3b000e04l1pxprat4d">DAIR.AI Academy session on autonomous long-running coding agents</a>, where I walked through Claude Code&#8217;s <a href="https://code.claude.com/docs/en/goal">/goal</a> mode, the newer <a href="https://docs.anthropic.com/en/release-notes/claude-code">/loop</a> command, verifiers, artifacts, and orchestration patterns in practice. Written in collaboration with Codex and Claude Code. 
</em></p><h2><strong>From Prompting to Goal Design</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vDdf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eac97c5-6f0c-4a5b-86b7-53ab1f06d6ed_680x380.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vDdf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eac97c5-6f0c-4a5b-86b7-53ab1f06d6ed_680x380.jpeg 424w, https://substackcdn.com/image/fetch/$s_!vDdf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eac97c5-6f0c-4a5b-86b7-53ab1f06d6ed_680x380.jpeg 848w, https://substackcdn.com/image/fetch/$s_!vDdf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eac97c5-6f0c-4a5b-86b7-53ab1f06d6ed_680x380.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!vDdf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eac97c5-6f0c-4a5b-86b7-53ab1f06d6ed_680x380.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vDdf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eac97c5-6f0c-4a5b-86b7-53ab1f06d6ed_680x380.jpeg" width="680" height="380" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7eac97c5-6f0c-4a5b-86b7-53ab1f06d6ed_680x380.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:380,&quot;width&quot;:680,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!vDdf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eac97c5-6f0c-4a5b-86b7-53ab1f06d6ed_680x380.jpeg 424w, https://substackcdn.com/image/fetch/$s_!vDdf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eac97c5-6f0c-4a5b-86b7-53ab1f06d6ed_680x380.jpeg 848w, https://substackcdn.com/image/fetch/$s_!vDdf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eac97c5-6f0c-4a5b-86b7-53ab1f06d6ed_680x380.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!vDdf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eac97c5-6f0c-4a5b-86b7-53ab1f06d6ed_680x380.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The core idea behind features like Claude Code&#8217;s /goal is simple. A coding agent remains the executor, but the human no longer interacts with it turn by turn. Instead, the human specifies the desired end state, the evidence required to prove success, the constraints that must not be violated, and, where possible, a cap on the number of turns and the budget.</p><p>That goal works more like a contract than a longer prompt. A weak goal gives the model room to stop early, take shortcuts, or redefine success in a way that looks plausible in the transcript but fails in the real system. A strong goal gives the agent a target it can repeatedly measure itself against.</p><p>Engineering judgment still matters here. The best goals encode domain knowledge that the model would otherwise have to guess. For a research experiment, that might mean a target benchmark score, a held-out evaluation, a required loss curve, and a rule that the result must beat an initial baseline. 
For a UI task, it might mean a screenshot reference, concrete layout constraints, and a browser verification step. The model can execute, but the human still defines what &#8220;done&#8221; actually means.</p><h2><strong>The Evaluator Becomes a First-Class Component</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z4bZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e785e5a-6fbf-46b9-9ff4-1b1071b02453_680x380.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z4bZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e785e5a-6fbf-46b9-9ff4-1b1071b02453_680x380.jpeg 424w, https://substackcdn.com/image/fetch/$s_!z4bZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e785e5a-6fbf-46b9-9ff4-1b1071b02453_680x380.jpeg 848w, https://substackcdn.com/image/fetch/$s_!z4bZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e785e5a-6fbf-46b9-9ff4-1b1071b02453_680x380.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!z4bZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e785e5a-6fbf-46b9-9ff4-1b1071b02453_680x380.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z4bZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e785e5a-6fbf-46b9-9ff4-1b1071b02453_680x380.jpeg" width="680" height="380" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e785e5a-6fbf-46b9-9ff4-1b1071b02453_680x380.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:380,&quot;width&quot;:680,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!z4bZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e785e5a-6fbf-46b9-9ff4-1b1071b02453_680x380.jpeg 424w, https://substackcdn.com/image/fetch/$s_!z4bZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e785e5a-6fbf-46b9-9ff4-1b1071b02453_680x380.jpeg 848w, https://substackcdn.com/image/fetch/$s_!z4bZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e785e5a-6fbf-46b9-9ff4-1b1071b02453_680x380.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!z4bZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e785e5a-6fbf-46b9-9ff4-1b1071b02453_680x380.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Besides a goal, long-running agents need a second component: an evaluator. It can be another coding agent, an LLM-as-judge, a script, a test suite, a benchmark harness, or a mix of these. The key design choice is matching the evaluator to the task. When success is crisp, deterministic checks are better. Type checks, unit tests, lint rules, integration tests, and benchmark scripts should be used whenever they can express the condition clearly.</p><p>When success is fuzzy, an agent evaluator becomes useful. A script can tell you whether tests pass, but it cannot easily decide whether a generated research report is coherent, whether an implementation faithfully follows a paper, or whether a UI matches the design intent. This is where the evaluator benefits from language, judgment, and sometimes vision.</p><p>The practical pattern uses deterministic checks as the floor and agent evaluation as the higher-level review. 
That combination reduces hallucinated success while still allowing autonomy on tasks that do not fit cleanly into a test assertion.</p><h2><strong>Verifiers Define the Boundary of Trust</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4p1-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ff3dcb0-f33c-4215-b2ad-f6c0094786b2_680x380.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4p1-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ff3dcb0-f33c-4215-b2ad-f6c0094786b2_680x380.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4p1-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ff3dcb0-f33c-4215-b2ad-f6c0094786b2_680x380.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4p1-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ff3dcb0-f33c-4215-b2ad-f6c0094786b2_680x380.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4p1-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ff3dcb0-f33c-4215-b2ad-f6c0094786b2_680x380.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4p1-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ff3dcb0-f33c-4215-b2ad-f6c0094786b2_680x380.jpeg" width="680" height="380" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ff3dcb0-f33c-4215-b2ad-f6c0094786b2_680x380.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:380,&quot;width&quot;:680,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!4p1-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ff3dcb0-f33c-4215-b2ad-f6c0094786b2_680x380.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4p1-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ff3dcb0-f33c-4215-b2ad-f6c0094786b2_680x380.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4p1-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ff3dcb0-f33c-4215-b2ad-f6c0094786b2_680x380.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4p1-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ff3dcb0-f33c-4215-b2ad-f6c0094786b2_680x380.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The deeper point is that autonomy only works when the system has a reliable verifier. A coding agent can generate a plan, implement a feature, and explain why it believes the work is complete, but that explanation should not be treated as evidence. Evidence comes from an external check that the agent cannot easily talk its way around.</p><p>For code, the verifier might be a test suite, type checker, benchmark, browser run, screenshot comparison, or reproducible script. For research work, it might be a held-out evaluation, a reproduced table, a loss curve, or a benchmark score that improves over the baseline. For design work, it might be a reference screenshot plus a visual review step. The verifier is what turns a long-running agent from a confident text generator into a system that can be trusted with more time.</p><p>Most shortcuts appear at this boundary. If the verifier is vague, the model will often satisfy the easiest interpretation of the task. 
If the verifier is too narrow, the model may overfit to it and miss the broader intent. A good autonomous workflow therefore needs layered verification, with cheap deterministic checks catching basic failures and higher-level review catching judgment-heavy failures. A few frontier models can already perform some level of verification, but based on my research, there is still an evident out-of-distribution (OOD) problem: when the verification task you assign to the agent falls outside the training distribution, models struggle significantly.</p><p>Verifiers are still an open area of research, but I anticipate that more companies will start investing heavily here. Fine-tuned verifiers are also in high demand in the enterprise.</p><h2><strong>Loops Make Autonomy Durable</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XIro!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859efe5f-9cc2-4c2c-aaee-61009f05adf0_680x380.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XIro!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859efe5f-9cc2-4c2c-aaee-61009f05adf0_680x380.jpeg 424w, https://substackcdn.com/image/fetch/$s_!XIro!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859efe5f-9cc2-4c2c-aaee-61009f05adf0_680x380.jpeg 848w, https://substackcdn.com/image/fetch/$s_!XIro!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859efe5f-9cc2-4c2c-aaee-61009f05adf0_680x380.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!XIro!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859efe5f-9cc2-4c2c-aaee-61009f05adf0_680x380.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XIro!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859efe5f-9cc2-4c2c-aaee-61009f05adf0_680x380.jpeg" width="680" height="380" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/859efe5f-9cc2-4c2c-aaee-61009f05adf0_680x380.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:380,&quot;width&quot;:680,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!XIro!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859efe5f-9cc2-4c2c-aaee-61009f05adf0_680x380.jpeg 424w, https://substackcdn.com/image/fetch/$s_!XIro!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859efe5f-9cc2-4c2c-aaee-61009f05adf0_680x380.jpeg 848w, https://substackcdn.com/image/fetch/$s_!XIro!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859efe5f-9cc2-4c2c-aaee-61009f05adf0_680x380.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!XIro!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859efe5f-9cc2-4c2c-aaee-61009f05adf0_680x380.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>A goal gives the agent direction, but a loop keeps the work alive. This distinction is important because models often stop before the real task is finished. They may hit a turn limit, lose confidence, exhaust context, or decide that a partial solution is enough.</p><p>The loop is the outer control system. 
It wakes up, inspects progress, runs checks, compares the result against the goal, and sends the agent back in with the next instruction when the goal has not been met. In its simplest form, this is the Ralph loop pattern with a coding agent and a deterministic condition. In a more flexible form, the loop includes an evaluator agent that can reason about progress and decide what should happen next.</p><p>Long-running autonomy works as repeated effort under supervision from a control layer, not as one continuous act of intelligence. The agent can still fail, but the loop gives the system a way to notice the failure and continue instead of silently declaring victory.</p><h2><strong>Planning Is Where Expertise Enters</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1r3x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dd97421-9e20-4e9b-ac98-860d3a079a3b_680x380.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1r3x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dd97421-9e20-4e9b-ac98-860d3a079a3b_680x380.jpeg 424w, https://substackcdn.com/image/fetch/$s_!1r3x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dd97421-9e20-4e9b-ac98-860d3a079a3b_680x380.jpeg 848w, https://substackcdn.com/image/fetch/$s_!1r3x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dd97421-9e20-4e9b-ac98-860d3a079a3b_680x380.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!1r3x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dd97421-9e20-4e9b-ac98-860d3a079a3b_680x380.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1r3x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dd97421-9e20-4e9b-ac98-860d3a079a3b_680x380.jpeg" width="680" height="380" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0dd97421-9e20-4e9b-ac98-860d3a079a3b_680x380.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:380,&quot;width&quot;:680,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!1r3x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dd97421-9e20-4e9b-ac98-860d3a079a3b_680x380.jpeg 424w, https://substackcdn.com/image/fetch/$s_!1r3x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dd97421-9e20-4e9b-ac98-860d3a079a3b_680x380.jpeg 848w, https://substackcdn.com/image/fetch/$s_!1r3x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dd97421-9e20-4e9b-ac98-860d3a079a3b_680x380.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!1r3x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dd97421-9e20-4e9b-ac98-860d3a079a3b_680x380.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>One of the strongest themes from the session was that planning remains critical. You can ask a frontier model to generate a plan, but you still need to inspect it, challenge assumptions, and make the success criteria sharper before handing the task to an autonomous loop.</p><p>This leads to a useful division of labor. 
A stronger planning model can help define the goal, identify missing constraints, and structure the evaluation. A different execution model can then run the implementation once the plan is clear. In practice, this means engineers should stop thinking of &#8220;the model&#8221; as a single choice. Model choice becomes an architecture decision.</p><p>Some models are better planners. Some are better executors. Some are cheaper evaluators. Some are better at vision-based review. A good orchestrator lets you swap these roles instead of waiting for one vendor to provide the perfect coding agent interface.</p><h2><strong>Visual Artifacts Become Control Surfaces</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2ZgQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88bdfff-cac5-4101-8459-78ae9b6e9941_680x380.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2ZgQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88bdfff-cac5-4101-8459-78ae9b6e9941_680x380.jpeg 424w, https://substackcdn.com/image/fetch/$s_!2ZgQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88bdfff-cac5-4101-8459-78ae9b6e9941_680x380.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2ZgQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88bdfff-cac5-4101-8459-78ae9b6e9941_680x380.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!2ZgQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88bdfff-cac5-4101-8459-78ae9b6e9941_680x380.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2ZgQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88bdfff-cac5-4101-8459-78ae9b6e9941_680x380.jpeg" width="680" height="380" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b88bdfff-cac5-4101-8459-78ae9b6e9941_680x380.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:380,&quot;width&quot;:680,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!2ZgQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88bdfff-cac5-4101-8459-78ae9b6e9941_680x380.jpeg 424w, https://substackcdn.com/image/fetch/$s_!2ZgQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88bdfff-cac5-4101-8459-78ae9b6e9941_680x380.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2ZgQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88bdfff-cac5-4101-8459-78ae9b6e9941_680x380.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!2ZgQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88bdfff-cac5-4101-8459-78ae9b6e9941_680x380.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Terminal transcripts do not scale when many agents are running. 
Once you have several sessions working in parallel, raw text becomes a poor interface for understanding progress.</p><p>Live artifacts matter because a dashboard with loss curves, benchmark scores, task states, screenshots, cost estimates, and recent decisions gives the human a much better way to supervise autonomy. The artifact becomes the control surface for deciding when to intervene, rather than a report generated after the fact.</p><p>The most useful pattern is to separate storage from presentation. Markdown or a vault can store durable evidence, logs, notes, plans, and results. HTML artifacts can render that state into something visual and interactive. The agent can search the Markdown, while the human can monitor the artifact.</p><p>For UI and product work, visual cues are especially powerful. A screenshot reference can communicate design intent more precisely than prose, and a vision-capable evaluator can compare the implementation against that reference. This reduces the common failure mode where the agent technically implements the requested component but misses spacing, hierarchy, alignment, or product feel.</p><h2><strong>Session Mining Turns Usage Into Memory</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8L0I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08cf4ea-f9f4-4cc5-acdd-a56046340ac3_680x380.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8L0I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08cf4ea-f9f4-4cc5-acdd-a56046340ac3_680x380.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!8L0I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08cf4ea-f9f4-4cc5-acdd-a56046340ac3_680x380.jpeg 848w, https://substackcdn.com/image/fetch/$s_!8L0I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08cf4ea-f9f4-4cc5-acdd-a56046340ac3_680x380.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!8L0I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08cf4ea-f9f4-4cc5-acdd-a56046340ac3_680x380.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8L0I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08cf4ea-f9f4-4cc5-acdd-a56046340ac3_680x380.jpeg" width="680" height="380" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d08cf4ea-f9f4-4cc5-acdd-a56046340ac3_680x380.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:380,&quot;width&quot;:680,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!8L0I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08cf4ea-f9f4-4cc5-acdd-a56046340ac3_680x380.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!8L0I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08cf4ea-f9f4-4cc5-acdd-a56046340ac3_680x380.jpeg 848w, https://substackcdn.com/image/fetch/$s_!8L0I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08cf4ea-f9f4-4cc5-acdd-a56046340ac3_680x380.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!8L0I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08cf4ea-f9f4-4cc5-acdd-a56046340ac3_680x380.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Another important insight is that past agent sessions are a rich source of workflow data. If an agent repeatedly fails in the same way, forgets to run the same check, uses the wrong path, or retries the same broken command, that pattern should not stay buried in logs.</p><p>Session mining turns those transcripts into operating rules. An agent can scan the last thirty days of work, find recurring failure modes, and propose updates to project instructions, vault learnings, or agent rules. This is how a team can gradually improve its harness without manually remembering every mistake.</p><p>The goal is to make the local environment smarter without training a model from scratch. A small rule in an agent instruction file can prevent repeated failures across future sessions, especially when the rule is specific to the project.</p><h2><strong>A Practical Operating Model</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m5kS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc95a91d2-e987-48fc-9077-3460d5cb2ddd_680x380.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m5kS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc95a91d2-e987-48fc-9077-3460d5cb2ddd_680x380.jpeg 424w, https://substackcdn.com/image/fetch/$s_!m5kS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc95a91d2-e987-48fc-9077-3460d5cb2ddd_680x380.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!m5kS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc95a91d2-e987-48fc-9077-3460d5cb2ddd_680x380.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!m5kS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc95a91d2-e987-48fc-9077-3460d5cb2ddd_680x380.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m5kS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc95a91d2-e987-48fc-9077-3460d5cb2ddd_680x380.jpeg" width="680" height="380" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c95a91d2-e987-48fc-9077-3460d5cb2ddd_680x380.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:380,&quot;width&quot;:680,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!m5kS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc95a91d2-e987-48fc-9077-3460d5cb2ddd_680x380.jpeg 424w, https://substackcdn.com/image/fetch/$s_!m5kS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc95a91d2-e987-48fc-9077-3460d5cb2ddd_680x380.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!m5kS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc95a91d2-e987-48fc-9077-3460d5cb2ddd_680x380.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!m5kS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc95a91d2-e987-48fc-9077-3460d5cb2ddd_680x380.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>For AI engineers, the emerging workflow looks like this.</p><ul><li><p>Start with a small, cheap subset before launching the full 
autonomous run.</p></li><li><p>Write a goal with measurable success criteria, explicit constraints, and a turn or time budget (where possible).</p></li><li><p>Separate the executor from the evaluator so implementation and judgment are not collapsed into one role.</p></li><li><p>Define external verifiers before the long-running loop starts.</p></li><li><p>Use deterministic checks wherever possible, then add agent review for fuzzy criteria.</p></li><li><p>Require proof artifacts such as logs, screenshots, benchmark curves, or changed files.</p></li><li><p>Mine past sessions and promote repeated lessons into project instructions.</p></li></ul><p>That is the difference between using a coding agent and engineering an autonomous coding system. One gives you a conversation. The other gives you a harness.</p><h2><strong>What Still Breaks</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GP-f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae917782-2bb4-400b-aeda-6f0292817497_680x380.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GP-f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae917782-2bb4-400b-aeda-6f0292817497_680x380.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GP-f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae917782-2bb4-400b-aeda-6f0292817497_680x380.jpeg 848w, https://substackcdn.com/image/fetch/$s_!GP-f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae917782-2bb4-400b-aeda-6f0292817497_680x380.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!GP-f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae917782-2bb4-400b-aeda-6f0292817497_680x380.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GP-f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae917782-2bb4-400b-aeda-6f0292817497_680x380.jpeg" width="680" height="380" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae917782-2bb4-400b-aeda-6f0292817497_680x380.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:380,&quot;width&quot;:680,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!GP-f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae917782-2bb4-400b-aeda-6f0292817497_680x380.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GP-f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae917782-2bb4-400b-aeda-6f0292817497_680x380.jpeg 848w, https://substackcdn.com/image/fetch/$s_!GP-f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae917782-2bb4-400b-aeda-6f0292817497_680x380.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!GP-f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae917782-2bb4-400b-aeda-6f0292817497_680x380.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>None of this removes the hard problems. Agents still take shortcuts. They still stop early. They still overestimate completion. They still produce confident but weak plans, especially on recent papers, unfamiliar benchmarks, or systems outside their training distribution.</p><p>Trusting them more will not solve this. Better control systems will. 
Goals, loops, evaluators, deterministic checks, visual artifacts, and session memory are all ways of making autonomy observable and correctable.</p><p>The direction is clear. The future of coding agents depends on better orchestration around more capable models, where engineers design the conditions under which agents can safely run for hours or days and still produce work that can be verified.</p>]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (June 7 - June 14)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-352</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-352</guid><pubDate>Sun, 14 Jun 2026 15:00:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!H_t_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709ac27b-f2a0-4b88-abf3-742b27ddd6ee_2598x1236.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. 
MiniMax Sparse Attention</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H_t_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709ac27b-f2a0-4b88-abf3-742b27ddd6ee_2598x1236.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H_t_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709ac27b-f2a0-4b88-abf3-742b27ddd6ee_2598x1236.png 424w, https://substackcdn.com/image/fetch/$s_!H_t_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709ac27b-f2a0-4b88-abf3-742b27ddd6ee_2598x1236.png 848w, https://substackcdn.com/image/fetch/$s_!H_t_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709ac27b-f2a0-4b88-abf3-742b27ddd6ee_2598x1236.png 1272w, https://substackcdn.com/image/fetch/$s_!H_t_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709ac27b-f2a0-4b88-abf3-742b27ddd6ee_2598x1236.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H_t_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709ac27b-f2a0-4b88-abf3-742b27ddd6ee_2598x1236.png" width="1456" height="693" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/709ac27b-f2a0-4b88-abf3-742b27ddd6ee_2598x1236.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:693,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;MiniMax Sparse Attention&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="MiniMax Sparse Attention" title="MiniMax Sparse Attention" srcset="https://substackcdn.com/image/fetch/$s_!H_t_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709ac27b-f2a0-4b88-abf3-742b27ddd6ee_2598x1236.png 424w, https://substackcdn.com/image/fetch/$s_!H_t_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709ac27b-f2a0-4b88-abf3-742b27ddd6ee_2598x1236.png 848w, https://substackcdn.com/image/fetch/$s_!H_t_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709ac27b-f2a0-4b88-abf3-742b27ddd6ee_2598x1236.png 1272w, https://substackcdn.com/image/fetch/$s_!H_t_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709ac27b-f2a0-4b88-abf3-742b27ddd6ee_2598x1236.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Ultra-long context is now a core requirement for agents, codebase-scale reasoning, multimodal workflows, and persistent memory, but dense softmax attention still makes million-token deployment expensive. 
MiniMax Sparse Attention (MSA) tackles this by adding blockwise sparsity on top of Grouped Query Attention, with a lightweight routing branch that chooses which key-value blocks each query group should actually attend to.</p><ul><li><p><strong>Two-branch attention design:</strong> The Index Branch scores the full causal context and selects Top-k key-value blocks independently for each GQA group, while the Main Branch performs exact sparse attention only over those selected blocks.</p></li><li><p><strong>Hardware-aware implementation:</strong> The paper co-designs the sparse pattern with GPU kernels, using exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access.</p></li><li><p><strong>Large speedups at scale:</strong> On a 109B-parameter natively multimodal model, MSA matches GQA performance while reducing per-token attention compute by 28.4x at 1M context. The paired kernel reaches 14.2x prefill and 7.6x decoding wall-clock speedups on H800.</p></li><li><p><strong>Why it matters:</strong> Long context is only useful if it can be served cheaply. 
MSA is compelling because it keeps the mechanism simple, trains it directly into a production-scale model, open-sources the inference kernel, and powers the released MiniMax-M3 model.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2606.13392">Paper</a></strong> | <strong><a href="https://x.com/MiniMax_AI/status/2065436935188058208">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H_lk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d626eb-7721-449b-8a86-51b590d0cd8b_1600x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H_lk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d626eb-7721-449b-8a86-51b590d0cd8b_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!H_lk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d626eb-7721-449b-8a86-51b590d0cd8b_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!H_lk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d626eb-7721-449b-8a86-51b590d0cd8b_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!H_lk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d626eb-7721-449b-8a86-51b590d0cd8b_1600x900.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!H_lk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d626eb-7721-449b-8a86-51b590d0cd8b_1600x900.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5d626eb-7721-449b-8a86-51b590d0cd8b_1600x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;30 Days of Hermes Agent&quot;,&quot;title&quot;:&quot;30 Days of Hermes Agent&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="30 Days of Hermes Agent" title="30 Days of Hermes Agent" srcset="https://substackcdn.com/image/fetch/$s_!H_lk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d626eb-7721-449b-8a86-51b590d0cd8b_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!H_lk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d626eb-7721-449b-8a86-51b590d0cd8b_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!H_lk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d626eb-7721-449b-8a86-51b590d0cd8b_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!H_lk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d626eb-7721-449b-8a86-51b590d0cd8b_1600x900.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>We just released 30 Days of Hermes Agent, a hands-on lab that teaches agent workflows in a real, interactive terminal. Across 30 short labs, you use Hermes Agent to turn a messy Personal Knowledge Vault into a working knowledge operations system with readable notes, searchable context, reusable templates, review workflows, task boards, safety rules, and handoff docs.</p><p><strong><a href="https://academy.dair.ai/labs/30-days-of-hermes-agent">Start 30 Days of Hermes Agent</a></strong></p><div><hr></div><h2><strong>2. 
Self-Harness</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Illx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f439c2-4126-4914-b1e9-2c1538acd1c7_793x566.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Illx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f439c2-4126-4914-b1e9-2c1538acd1c7_793x566.png 424w, https://substackcdn.com/image/fetch/$s_!Illx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f439c2-4126-4914-b1e9-2c1538acd1c7_793x566.png 848w, https://substackcdn.com/image/fetch/$s_!Illx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f439c2-4126-4914-b1e9-2c1538acd1c7_793x566.png 1272w, https://substackcdn.com/image/fetch/$s_!Illx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f439c2-4126-4914-b1e9-2c1538acd1c7_793x566.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Illx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f439c2-4126-4914-b1e9-2c1538acd1c7_793x566.png" width="793" height="566" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13f439c2-4126-4914-b1e9-2c1538acd1c7_793x566.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:566,&quot;width&quot;:793,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Self-Harness&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Self-Harness" title="Self-Harness" srcset="https://substackcdn.com/image/fetch/$s_!Illx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f439c2-4126-4914-b1e9-2c1538acd1c7_793x566.png 424w, https://substackcdn.com/image/fetch/$s_!Illx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f439c2-4126-4914-b1e9-2c1538acd1c7_793x566.png 848w, https://substackcdn.com/image/fetch/$s_!Illx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f439c2-4126-4914-b1e9-2c1538acd1c7_793x566.png 1272w, https://substackcdn.com/image/fetch/$s_!Illx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f439c2-4126-4914-b1e9-2c1538acd1c7_793x566.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Most agent scaffolds are built once by hand and then frozen, even as the underlying models keep changing. This paper introduces Self-Harness, a paradigm where an LLM agent improves its own operating harness, the prompts, tools, memory, and orchestration around the base model, without human engineers or a stronger external agent. 
Because every model fails in its own way, the system mines those model-specific weaknesses and turns them into concrete, executable harness edits rather than generic advice.</p><ul><li><p><strong>A three-stage self-improvement loop:</strong> Self-Harness runs Weakness Mining, which clusters execution traces into model-specific failure patterns, then Harness Proposal, which generates diverse but minimal edits tied to those failures, then Proposal Validation, which accepts edits only after regression testing on held-in and held-out splits.</p></li><li><p><strong>Consistent gains across base models:</strong> On Terminal-Bench-2.0, held-out pass rates rise for every model tested. MiniMax M2.5 improves from 40.5% to 61.9%, Qwen3.5-35B-A3B from 23.8% to 38.1%, and GLM-5 from 42.9% to 57.1%.</p></li><li><p><strong>Weaknesses become edits:</strong> Rather than appending generic instructions, the loop converts each observed failure mode into a targeted change to memory, tools, or prompts, with reported relative improvements as high as 138%.</p></li><li><p><strong>Why it matters:</strong> As models proliferate and evolve, hand-tuning a bespoke harness for each one does not scale. Self-Harness shows the scaffold itself can be made to adapt, closing the gap between a frozen harness and the model it wraps.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2606.09498">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2064429834999304247">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
Agents&#8217; Last Exam</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YohJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e17d06e-7b49-4fba-a3cd-9163a807508f_1096x544.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YohJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e17d06e-7b49-4fba-a3cd-9163a807508f_1096x544.png 424w, https://substackcdn.com/image/fetch/$s_!YohJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e17d06e-7b49-4fba-a3cd-9163a807508f_1096x544.png 848w, https://substackcdn.com/image/fetch/$s_!YohJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e17d06e-7b49-4fba-a3cd-9163a807508f_1096x544.png 1272w, https://substackcdn.com/image/fetch/$s_!YohJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e17d06e-7b49-4fba-a3cd-9163a807508f_1096x544.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YohJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e17d06e-7b49-4fba-a3cd-9163a807508f_1096x544.png" width="1096" height="544" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e17d06e-7b49-4fba-a3cd-9163a807508f_1096x544.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:544,&quot;width&quot;:1096,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Agents' Last Exam&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Agents' Last Exam" title="Agents' Last Exam" srcset="https://substackcdn.com/image/fetch/$s_!YohJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e17d06e-7b49-4fba-a3cd-9163a807508f_1096x544.png 424w, https://substackcdn.com/image/fetch/$s_!YohJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e17d06e-7b49-4fba-a3cd-9163a807508f_1096x544.png 848w, https://substackcdn.com/image/fetch/$s_!YohJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e17d06e-7b49-4fba-a3cd-9163a807508f_1096x544.png 1272w, https://substackcdn.com/image/fetch/$s_!YohJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e17d06e-7b49-4fba-a3cd-9163a807508f_1096x544.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>From Berkeley RDI, Agents&#8217; Last Exam (ALE) is a living benchmark built to measure whether agents can do economically valuable work, not just score well on academic tests. It was assembled with more than 250 industry experts and maps over 1,000 verifiable tasks to the U.S. federal occupational taxonomy, organized as 55 subfields across 13 industry clusters. 
Every task has an objective, checkable outcome, so there is no subjective human grading, and the pool is designed to keep growing as new workflows are onboarded.</p><ul><li><p><strong>Grounded in real occupations:</strong> Tasks are defined against O*NET and SOC 2018 and span non-physical industries, deliberately targeting the professional workflows where agents would actually be deployed rather than puzzle-style problems.</p></li><li><p><strong>Three difficulty tiers:</strong> Work is split into Near-Term, Full-Spectrum, and Last-Exam tiers, letting the benchmark track both near-term usefulness and the long tail of hard, multi-step jobs.</p></li><li><p><strong>Far from saturated:</strong> The hardest tier sits at just a 2.6% average full pass rate across mainstream harnesses, and even strong setups like Codex with GPT-5.5 score below 50% on the easiest tier and under 10% on the hardest.</p></li><li><p><strong>Why it matters:</strong> Strong scores on existing benchmarks have not translated into economically meaningful deployment. ALE reframes evaluation around verifiable, expert-curated work, giving a moving target that should resist saturation as agents improve.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2606.05405">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2062916866235068607">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
How AI Agents Reshape Knowledge Work</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eACh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc068da6a-b379-4ea1-903b-d6776d15e27b_594x589.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eACh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc068da6a-b379-4ea1-903b-d6776d15e27b_594x589.png 424w, https://substackcdn.com/image/fetch/$s_!eACh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc068da6a-b379-4ea1-903b-d6776d15e27b_594x589.png 848w, https://substackcdn.com/image/fetch/$s_!eACh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc068da6a-b379-4ea1-903b-d6776d15e27b_594x589.png 1272w, https://substackcdn.com/image/fetch/$s_!eACh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc068da6a-b379-4ea1-903b-d6776d15e27b_594x589.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eACh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc068da6a-b379-4ea1-903b-d6776d15e27b_594x589.png" width="594" height="589" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c068da6a-b379-4ea1-903b-d6776d15e27b_594x589.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:589,&quot;width&quot;:594,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;How AI Agents Reshape Knowledge Work&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="How AI Agents Reshape Knowledge Work" title="How AI Agents Reshape Knowledge Work" srcset="https://substackcdn.com/image/fetch/$s_!eACh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc068da6a-b379-4ea1-903b-d6776d15e27b_594x589.png 424w, https://substackcdn.com/image/fetch/$s_!eACh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc068da6a-b379-4ea1-903b-d6776d15e27b_594x589.png 848w, https://substackcdn.com/image/fetch/$s_!eACh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc068da6a-b379-4ea1-903b-d6776d15e27b_594x589.png 1272w, https://substackcdn.com/image/fetch/$s_!eACh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc068da6a-b379-4ea1-903b-d6776d15e27b_594x589.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This economics paper, drawing on large-scale production data from Perplexity, studies how the shift from conversational assistants to autonomous agents is reshaping knowledge work. It compares Search, a conversational assistant, with Computer, a general-purpose agent system, along three dimensions: autonomy, efficiency, and the scope of tasks people take on. 
The paper frames this with a cost-structure model in which agents carry higher fixed and delegation costs but lower per-step marginal costs, so they become the cheaper option once tasks are complex enough.</p><ul><li><p><strong>Autonomy looks different in practice:</strong> Computer performs around 26 minutes of autonomous machine work per session versus roughly 33 seconds for Search, and per-query dissatisfaction is 55% lower on the agent (1.3% versus 2.9%).</p></li><li><p><strong>Large efficiency gains:</strong> On matched tasks, Computer cuts completion time from 269 to 36 minutes, an 87% reduction in time and about a 94% reduction in cost relative to humans working with Search alone.</p></li><li><p><strong>Scope shifts upward:</strong> Agent queries are more cognitively complex (71% abstract or non-routine versus 53% for Search), with twice as much create-level work, and they bundle interdependent subtasks that cross occupational boundaries.</p></li><li><p><strong>Why it matters:</strong> The data supports a clean prediction. As the fixed costs of delegation fall, agents move the affordable value frontier toward higher-value, multi-step knowledge work, which is exactly where adoption grew fastest, reaching 84 times its first-week volume over the study period.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2606.07489">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2064076252584222933">Tweet</a></strong></p><div><hr></div><h2><strong>5. 
Agentopia</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EDj-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60d084b-77ff-47df-a47f-0a36d2621211_793x444.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EDj-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60d084b-77ff-47df-a47f-0a36d2621211_793x444.png 424w, https://substackcdn.com/image/fetch/$s_!EDj-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60d084b-77ff-47df-a47f-0a36d2621211_793x444.png 848w, https://substackcdn.com/image/fetch/$s_!EDj-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60d084b-77ff-47df-a47f-0a36d2621211_793x444.png 1272w, https://substackcdn.com/image/fetch/$s_!EDj-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60d084b-77ff-47df-a47f-0a36d2621211_793x444.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EDj-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60d084b-77ff-47df-a47f-0a36d2621211_793x444.png" width="793" height="444" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b60d084b-77ff-47df-a47f-0a36d2621211_793x444.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:444,&quot;width&quot;:793,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Agentopia&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Agentopia" title="Agentopia" srcset="https://substackcdn.com/image/fetch/$s_!EDj-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60d084b-77ff-47df-a47f-0a36d2621211_793x444.png 424w, https://substackcdn.com/image/fetch/$s_!EDj-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60d084b-77ff-47df-a47f-0a36d2621211_793x444.png 848w, https://substackcdn.com/image/fetch/$s_!EDj-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60d084b-77ff-47df-a47f-0a36d2621211_793x444.png 1272w, https://substackcdn.com/image/fetch/$s_!EDj-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60d084b-77ff-47df-a47f-0a36d2621211_793x444.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Agentopia is one of the most ambitious agent-society testbeds yet, a 79-page release that drops 100 LLM agents into a persistent world and lets them live, form relationships, and pursue goals over 10 simulated years, a horizon orders of magnitude longer than prior day-level work. 
Beyond observing emergent social behavior, the authors use the simulation as a training signal, applying rejection sampling to optimize models toward a life reward that reflects human well-being.</p><ul><li><p><strong>Long-horizon by design:</strong> Where earlier agent societies ran at the granularity of days, Agentopia simulates a decade of life per world, surfacing unscripted social strategies and interpersonal dynamics that only appear over long timescales.</p></li><li><p><strong>Simulation as a training signal:</strong> The life-reward metric is used to fine-tune more anthropomorphic models, and the improvements transfer beyond the simulation to downstream role-playing benchmarks rather than staying trapped in the sandbox.</p></li><li><p><strong>Measured gains:</strong> Trained agents improve overall CoSER Test performance by 15.6%, with the biggest jumps in Anthropomorphism at 23.7% and Character Fidelity at 16.4%, and they are respected by 24.2% more peers and liked by 15.9% more.</p></li><li><p><strong>Why it matters:</strong> A single 10-year, 100-agent run consumes 13.7 billion tokens across 567,000 LLM calls. That scale is a statement about where agent research is heading: living, learning populations as both an object of study and a source of training data.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2606.07513">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2064075015960875347">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
The Geometry of On-Policy Distillation</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eUm_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f09959c-a3e5-44d1-818d-40c93bc16792_996x498.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eUm_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f09959c-a3e5-44d1-818d-40c93bc16792_996x498.png 424w, https://substackcdn.com/image/fetch/$s_!eUm_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f09959c-a3e5-44d1-818d-40c93bc16792_996x498.png 848w, https://substackcdn.com/image/fetch/$s_!eUm_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f09959c-a3e5-44d1-818d-40c93bc16792_996x498.png 1272w, https://substackcdn.com/image/fetch/$s_!eUm_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f09959c-a3e5-44d1-818d-40c93bc16792_996x498.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eUm_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f09959c-a3e5-44d1-818d-40c93bc16792_996x498.png" width="996" height="498" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f09959c-a3e5-44d1-818d-40c93bc16792_996x498.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:498,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Geometry of On-Policy Distillation&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Geometry of On-Policy Distillation" title="The Geometry of On-Policy Distillation" srcset="https://substackcdn.com/image/fetch/$s_!eUm_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f09959c-a3e5-44d1-818d-40c93bc16792_996x498.png 424w, https://substackcdn.com/image/fetch/$s_!eUm_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f09959c-a3e5-44d1-818d-40c93bc16792_996x498.png 848w, https://substackcdn.com/image/fetch/$s_!eUm_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f09959c-a3e5-44d1-818d-40c93bc16792_996x498.png 1272w, https://substackcdn.com/image/fetch/$s_!eUm_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f09959c-a3e5-44d1-818d-40c93bc16792_996x498.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>On-policy distillation (OPD) has become one of the most discussed post-training recipes of the year, but it has mostly been treated as a black box sitting somewhere between supervised fine-tuning and RL. 
This paper opens it up, characterizing how OPD changes a model&#8217;s weights at the level of parameter geometry, and argues that OPD is not a midpoint between SFT and RLVR (RL with verifiable rewards) but a distinct kind of update.</p><ul><li><p><strong>It touches fewer weights:</strong> Compared with SFT, OPD updates affect far fewer parameters and largely avoid the dominant principal directions of weight space, which helps explain its sample efficiency.</p></li><li><p><strong>Early subspace locking:</strong> OPD&#8217;s cumulative updates collapse into a narrow, low-dimensional subspace early in training, rather than spreading across many directions as SFT updates do.</p></li><li><p><strong>That subspace is functionally sufficient:</strong> Constraining training to the early-formed subspace preserves OPD performance but substantially degrades SFT, showing that the small subspace genuinely carries the useful signal rather than being an artifact.</p></li><li><p><strong>Why it matters:</strong> Knowing where in weight space OPD does its work turns a popular but poorly understood recipe into something with a mechanistic account. That makes the method easier to reason about, combine with other objectives, and improve deliberately instead of by trial and error.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2606.07082">Paper</a></strong></p><div><hr></div><h2><strong>7. 
Lookahead Sparse Attention</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hPp2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e7acd-f529-4171-9d14-d54216224b07_996x441.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hPp2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e7acd-f529-4171-9d14-d54216224b07_996x441.png 424w, https://substackcdn.com/image/fetch/$s_!hPp2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e7acd-f529-4171-9d14-d54216224b07_996x441.png 848w, https://substackcdn.com/image/fetch/$s_!hPp2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e7acd-f529-4171-9d14-d54216224b07_996x441.png 1272w, https://substackcdn.com/image/fetch/$s_!hPp2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e7acd-f529-4171-9d14-d54216224b07_996x441.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hPp2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e7acd-f529-4171-9d14-d54216224b07_996x441.png" width="996" height="441" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/042e7acd-f529-4171-9d14-d54216224b07_996x441.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:441,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Lookahead Sparse Attention&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Lookahead Sparse Attention" title="Lookahead Sparse Attention" srcset="https://substackcdn.com/image/fetch/$s_!hPp2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e7acd-f529-4171-9d14-d54216224b07_996x441.png 424w, https://substackcdn.com/image/fetch/$s_!hPp2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e7acd-f529-4171-9d14-d54216224b07_996x441.png 848w, https://substackcdn.com/image/fetch/$s_!hPp2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e7acd-f529-4171-9d14-d54216224b07_996x441.png 1272w, https://substackcdn.com/image/fetch/$s_!hPp2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e7acd-f529-4171-9d14-d54216224b07_996x441.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Long-context decoding is bottlenecked by the KV cache, which grows with every token and quickly dominates memory at extreme context lengths. 
This work, branded around DeepSeek-V4, introduces Lookahead Sparse Attention (LSA), which avoids storing the full KV cache by predicting which parts of the context future decoding will actually need and retaining only those query-critical chunks.</p><ul><li><p><strong>A learned, lightweight indexer:</strong> Instead of keeping everything, a small indexer proactively selects the KV chunks that matter for upcoming generation, so the physical cache stays small without discarding information the model will need.</p></li><li><p><strong>Backbone-free training:</strong> A decoupled training strategy lets the indexer be trained on its own without loading the full backbone model, cutting the cost of adding the mechanism to a large model.</p></li><li><p><strong>Big cache savings, no quality loss:</strong> LSA shrinks the physical KV cache to 13.5% of the full-context baseline while slightly improving accuracy by 0.6% on average, and at 500K-token contexts it suppresses more than 90% of KV-cache overhead without destabilizing reasoning.</p></li><li><p><strong>Why it matters:</strong> Ultra-long context is increasingly the difference between a toy demo and a usable system, and memory is the wall. Predicting what context you will need, rather than keeping all of it, is a practical route to long context that fits in real hardware budgets.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2606.09079">Paper</a></strong></p><div><hr></div><h2><strong>8. Latent Spatial Memory</strong></h2><p>Video world models struggle to stay consistent over long horizons because explicit 3D memory usually requires an expensive pixel-space loop. Mirage instead stores scene information directly in diffusion latent space, using depth-guided back-projection and latent-space warping to maintain persistent spatial memory. 
The approach reports up to 10.57 times faster end-to-end generation and 55 times lower memory use than explicit 3D-memory baselines while improving long-horizon spatial consistency.</p><p><strong><a href="https://arxiv.org/abs/2606.09828">Paper</a></strong></p><div><hr></div><h2><strong>9. The Consistency Illusion</strong></h2><p>Multi-agent debate is often judged by whether the agents end up agreeing, but this paper shows that output-level consensus can hide deep disagreement in the reasoning that produced it. The authors abstract agents&#8217; reasoning traces and decisions into four states along two axes, reasoning similarity and conclusion agreement, and flag divergent agreement, where agents reach the same answer through very different paths. Across 600 content-moderation items, divergent agreement appeared in 118 cases and separated cleanly from genuine disagreement states with a Cohen&#8217;s d of 0.80, and routing on these categories beat divergence-only methods at flagging high-disagreement cases.</p><p><strong><a href="https://arxiv.org/abs/2606.04223">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2064395355220029696">Tweet</a></strong></p><div><hr></div><h2><strong>10. Beyond Scalar Rewards</strong></h2><p>Reward models usually compress a judgment into a single scalar, but this paper argues human preferences are better captured as score distributions, and proposes Z-Reward, which internalizes reasoning into a predicted distribution before scoring. A large vision-language teacher does the reasoning-heavy judgment and is distilled into a compact student for efficient deployment, with the 27B teacher reaching 89.6% human-preference accuracy and the 9B student nearly matching it at 88.6%. 
Used as a reinforcement learning signal, it delivers a 41.3% net preference improvement over a supervised baseline, beating GRPO and other reward methods.</p><p><strong><a href="https://arxiv.org/abs/2606.09076">Paper</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Claude Fable 5, Kimi K2.7-Code, NotebookLM Goes Agentic, DiffusionGemma, MiMo Code, and More]]></title><description><![CDATA[Claude Fable 5, Kimi K2.7-Code, NotebookLM Goes Agentic, DiffusionGemma, MiMo Code, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-fable-5-kimi</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-fable-5-kimi</guid><pubDate>Sat, 13 Jun 2026 15:56:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TenN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3089ccf-1372-49bb-975e-165790615fe7_1586x948.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Anthropic ships Mythos-class Claude Fable 5</p></li><li><p>Kimi K2.7-Code open-sources a 1T coder</p></li><li><p>NotebookLM becomes an agentic workstation</p></li><li><p>Google&#8217;s DiffusionGemma generates text in blocks</p></li><li><p>Xiaomi open-sources MiMo Code agent</p></li><li><p>Cohere ships North Mini Code</p></li><li><p>Gemini 3.5 Live Translate goes real-time</p></li><li><p>Gemini-SQL2 tops BIRD text-to-SQL</p></li><li><p>Apple rebuilds Siri on Google Gemini</p></li><li><p>Grok opens a plugin marketplace</p></li><li><p>Claude Code adds nested subagents</p></li><li><p>Nex-N2 opens an agentic model series</p></li><li><p>Extend UI ships document-agent components</p></li><li><p>Cognition&#8217;s FrontierCode raises the eval bar</p></li><li><p>Study questions the multi-agent advantage</p></li><li><p>Recursive automates AI research</p></li></ul><p>And all the top AI dev news, papers, and 
tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Anthropic Launches Claude Fable 5, Its First Public Mythos-Class Model</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XnY2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21bc5181-dcdf-46aa-ab66-973efaf10577_2600x2870.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XnY2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21bc5181-dcdf-46aa-ab66-973efaf10577_2600x2870.png 424w, https://substackcdn.com/image/fetch/$s_!XnY2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21bc5181-dcdf-46aa-ab66-973efaf10577_2600x2870.png 848w, https://substackcdn.com/image/fetch/$s_!XnY2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21bc5181-dcdf-46aa-ab66-973efaf10577_2600x2870.png 1272w, https://substackcdn.com/image/fetch/$s_!XnY2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21bc5181-dcdf-46aa-ab66-973efaf10577_2600x2870.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XnY2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21bc5181-dcdf-46aa-ab66-973efaf10577_2600x2870.png" width="1456" height="1607" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21bc5181-dcdf-46aa-ab66-973efaf10577_2600x2870.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1607,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Claude Fable 5 benchmarks&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude Fable 5 benchmarks" title="Claude Fable 5 benchmarks" srcset="https://substackcdn.com/image/fetch/$s_!XnY2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21bc5181-dcdf-46aa-ab66-973efaf10577_2600x2870.png 424w, https://substackcdn.com/image/fetch/$s_!XnY2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21bc5181-dcdf-46aa-ab66-973efaf10577_2600x2870.png 848w, https://substackcdn.com/image/fetch/$s_!XnY2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21bc5181-dcdf-46aa-ab66-973efaf10577_2600x2870.png 1272w, https://substackcdn.com/image/fetch/$s_!XnY2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21bc5181-dcdf-46aa-ab66-973efaf10577_2600x2870.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Anthropic released Claude Fable 5, a Mythos-class model made safe for general use, alongside the restricted Claude Mythos 5. 
It is the most capable model Anthropic has ever made widely available, and its lead grows as tasks get longer and more complex.</p><ul><li><p><strong>State-of-the-art across the board:</strong> Fable 5 tops nearly every tested benchmark, with the widest margins on long, multi-step reasoning and autonomous task completion.</p></li><li><p><strong>Software engineering:</strong> Posts state-of-the-art results on Cognition&#8217;s FrontierCode, and Stripe reported that it compressed a 50-million-line codebase migration from two months of human work into a single day.</p></li><li><p><strong>Agentic and vision:</strong> Holds focus across millions of tokens, performs roughly 3x better on strategic gameplay with persistent memory, and finished Pok&#233;mon FireRed from raw screenshots with no helper tools.</p></li><li><p><strong>Safeguards by fallback:</strong> Requests touching cybersecurity, biology, chemistry, or distillation are routed to Claude Opus 4.8 instead of being refused, a fallback that triggers in under 5% of sessions. Mythos 5 stays restricted to Project Glasswing partners.</p></li><li><p><strong>Pricing:</strong> $10 per million input tokens and $50 per million output tokens, with rollout across plans continuing through June 22.</p></li></ul><p><strong><a href="https://www.anthropic.com/news/claude-fable-5-mythos-5">Blog</a></strong></p>
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-fable-5-kimi">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (May 31 - June 7)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-a3d</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-a3d</guid><pubDate>Sun, 07 Jun 2026 15:00:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!G-g2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff312b9b5-9fd7-423e-b951-14cca5d5a514_846x482.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong> 1. Self-Revising Discovery Systems</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VnE9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3abc82d-ec54-4cf2-9414-ec5c91467a6e_1756x686.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VnE9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3abc82d-ec54-4cf2-9414-ec5c91467a6e_1756x686.png 424w, https://substackcdn.com/image/fetch/$s_!VnE9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3abc82d-ec54-4cf2-9414-ec5c91467a6e_1756x686.png 848w, https://substackcdn.com/image/fetch/$s_!VnE9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3abc82d-ec54-4cf2-9414-ec5c91467a6e_1756x686.png 1272w, 
https://substackcdn.com/image/fetch/$s_!VnE9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3abc82d-ec54-4cf2-9414-ec5c91467a6e_1756x686.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VnE9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3abc82d-ec54-4cf2-9414-ec5c91467a6e_1756x686.png" width="1456" height="569" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3abc82d-ec54-4cf2-9414-ec5c91467a6e_1756x686.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:569,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!VnE9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3abc82d-ec54-4cf2-9414-ec5c91467a6e_1756x686.png 424w, https://substackcdn.com/image/fetch/$s_!VnE9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3abc82d-ec54-4cf2-9414-ec5c91467a6e_1756x686.png 848w, https://substackcdn.com/image/fetch/$s_!VnE9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3abc82d-ec54-4cf2-9414-ec5c91467a6e_1756x686.png 1272w, 
https://substackcdn.com/image/fetch/$s_!VnE9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3abc82d-ec54-4cf2-9414-ec5c91467a6e_1756x686.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>From MIT, this paper argues that genuine scientific discovery is not answer generation but a change in the search space itself, and that an AI scientist must perceive that shift without being told. 
It develops a category-theoretic framework in which evidence, artifacts, operations, and verifiers are typed, and discovery is defined as a principled revision of that representational regime rather than more search within a fixed one.</p><ul><li><p><strong>Discovery means changing the regime:</strong> The system is built to detect when the representational regime should change and to revise it autonomously. That reframes an AI scientist from a faster searcher into something that can move the boundaries of the space it searches.</p></li><li><p><strong>A typed, categorical foundation:</strong> Evidence, artifacts, operations, and verifiers are formally typed. Old results are carried into the new regime by functorial transport, and what counts as genuine discovery is the residual content that transport alone cannot explain.</p></li><li><p><strong>Description-length gates keep it honest:</strong> Proposed revisions are accepted only when they reduce total description length, which separates real structural gains from mere added complexity. In one run, 388 proposals yield just 25 accepted revisions, a deliberately strict 6.4% rate.</p></li><li><p><strong>Why it matters:</strong> Two concrete instantiations, protein-mechanics modeling and a knowledge-computation graph with typed skills and validation checkpoints, show category theory serving as both a formal language and an engineering spec. It is a more principled blueprint for autonomous discovery than search-only AI scientists.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2606.01444">Paper</a></strong> | <strong><a href="https://x.com/ProfBuehlerMIT/status/2062865983459475830">Tweet</a></strong></p><div><hr></div><h2><strong>2. 
Disentangling Agent Self-Evolution</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G-g2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff312b9b5-9fd7-423e-b951-14cca5d5a514_846x482.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G-g2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff312b9b5-9fd7-423e-b951-14cca5d5a514_846x482.png 424w, https://substackcdn.com/image/fetch/$s_!G-g2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff312b9b5-9fd7-423e-b951-14cca5d5a514_846x482.png 848w, https://substackcdn.com/image/fetch/$s_!G-g2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff312b9b5-9fd7-423e-b951-14cca5d5a514_846x482.png 1272w, https://substackcdn.com/image/fetch/$s_!G-g2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff312b9b5-9fd7-423e-b951-14cca5d5a514_846x482.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G-g2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff312b9b5-9fd7-423e-b951-14cca5d5a514_846x482.png" width="846" height="482" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f312b9b5-9fd7-423e-b951-14cca5d5a514_846x482.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:482,&quot;width&quot;:846,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Disentangling Agent Self-Evolution&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Disentangling Agent Self-Evolution" title="Disentangling Agent Self-Evolution" srcset="https://substackcdn.com/image/fetch/$s_!G-g2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff312b9b5-9fd7-423e-b951-14cca5d5a514_846x482.png 424w, https://substackcdn.com/image/fetch/$s_!G-g2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff312b9b5-9fd7-423e-b951-14cca5d5a514_846x482.png 848w, https://substackcdn.com/image/fetch/$s_!G-g2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff312b9b5-9fd7-423e-b951-14cca5d5a514_846x482.png 1272w, https://substackcdn.com/image/fetch/$s_!G-g2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff312b9b5-9fd7-423e-b951-14cca5d5a514_846x482.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>This paper asks a question every agent builder eventually hits: if an agent rewrites its own harness, does a stronger model make a better self-evolving agent? The answer is no, and the reason is that &#8220;self-evolution&#8221; is actually two separate abilities that scale very differently. The work separates harness-updating, where an evolver model writes edits to memory, tools, prompts, and skills, from harness-benefit, where a solver model actually exploits those edits on the task.</p><ul><li><p><strong>Updating is flat across model tiers:</strong> The quality of harness edits barely depends on model strength. Updates written by Qwen3.5-9B yield gains comparable to those from Claude Opus 4.6, so paying for a frontier model on the evolver side buys almost nothing.</p></li><li><p><strong>Benefit is non-monotonic:</strong> The ability to use a better harness follows a curve. 
Weak models gain little, mid-tier models benefit most, and the strongest models benefit less than mid-tier ones, often because they already solve the task without the scaffold.</p></li><li><p><strong>Failure modes are concrete:</strong> Weaker solvers either fail to activate the relevant harness component or follow its instructions inconsistently, which is why their gains stay small even when the edits themselves are good.</p></li><li><p><strong>Why it matters:</strong> The practical lever is to put a cheap model on the evolver and spend your capability budget on the solver. System design, not raw model scale, is doing most of the work in agent self-improvement.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.30621">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2061460266186125703">Tweet</a></strong></p><div><hr></div><h2><strong>3. LEAP</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S9SC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d9041a-6039-4583-b6aa-50c68f878026_976x366.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S9SC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d9041a-6039-4583-b6aa-50c68f878026_976x366.png 424w, https://substackcdn.com/image/fetch/$s_!S9SC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d9041a-6039-4583-b6aa-50c68f878026_976x366.png 848w, https://substackcdn.com/image/fetch/$s_!S9SC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d9041a-6039-4583-b6aa-50c68f878026_976x366.png 
1272w, https://substackcdn.com/image/fetch/$s_!S9SC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d9041a-6039-4583-b6aa-50c68f878026_976x366.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S9SC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d9041a-6039-4583-b6aa-50c68f878026_976x366.png" width="976" height="366" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42d9041a-6039-4583-b6aa-50c68f878026_976x366.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:366,&quot;width&quot;:976,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;LEAP&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="LEAP" title="LEAP" srcset="https://substackcdn.com/image/fetch/$s_!S9SC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d9041a-6039-4583-b6aa-50c68f878026_976x366.png 424w, https://substackcdn.com/image/fetch/$s_!S9SC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d9041a-6039-4583-b6aa-50c68f878026_976x366.png 848w, https://substackcdn.com/image/fetch/$s_!S9SC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d9041a-6039-4583-b6aa-50c68f878026_976x366.png 1272w, 
https://substackcdn.com/image/fetch/$s_!S9SC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d9041a-6039-4583-b6aa-50c68f878026_976x366.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>New research from Google shows how far a custom agent harness can push a general-purpose model on formal mathematics. LEAP wraps a general LLM in an agentic scaffold that grounds every step in the Lean compiler and iterates against verifier feedback. 
Rather than fine-tuning a specialized prover, it leans on informal reasoning, instruction following, and self-refinement, then forces every formal step through a compiler check before moving on.</p><ul><li><p><strong>Decompose, then verify:</strong> The scaffold takes the natural form of proof decomposition and verifier-guided refinement. The model breaks a hard theorem into subgoals and drafts an informal blueprint; the Lean compiler then checks each formal step, turning vague reasoning into machine-checkable proof.</p></li><li><p><strong>Putnam solved in full:</strong> On the 2025 Putnam Competition, LEAP solves all 12 problems, matching recent breakthroughs from dedicated frontier math models without any math-specific training of the base LLM.</p></li><li><p><strong>Large jump on IMO-level proofs:</strong> On Lean-IMO-Bench, LEAP lifts the one-shot formal solve rate of general-purpose LLMs from below 10% to 70%, surpassing the 48% set by a specialized, gold-medal-caliber IMO system.</p></li><li><p><strong>Why it matters:</strong> This is strong evidence that a well-built harness, not a bespoke model, can close the gap on one of the hardest reasoning domains. The leverage sits in the scaffold and the verifier loop around a general model.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2606.03303">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2062187813626675567">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
Scaling Laws for Agent Harnesses</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b-ZB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72b83796-aaab-4d39-ab28-b35f9e237b15_897x284.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b-ZB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72b83796-aaab-4d39-ab28-b35f9e237b15_897x284.png 424w, https://substackcdn.com/image/fetch/$s_!b-ZB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72b83796-aaab-4d39-ab28-b35f9e237b15_897x284.png 848w, https://substackcdn.com/image/fetch/$s_!b-ZB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72b83796-aaab-4d39-ab28-b35f9e237b15_897x284.png 1272w, https://substackcdn.com/image/fetch/$s_!b-ZB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72b83796-aaab-4d39-ab28-b35f9e237b15_897x284.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b-ZB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72b83796-aaab-4d39-ab28-b35f9e237b15_897x284.png" width="897" height="284" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72b83796-aaab-4d39-ab28-b35f9e237b15_897x284.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:284,&quot;width&quot;:897,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Scaling Laws for Agent Harnesses&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Scaling Laws for Agent Harnesses" title="Scaling Laws for Agent Harnesses" srcset="https://substackcdn.com/image/fetch/$s_!b-ZB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72b83796-aaab-4d39-ab28-b35f9e237b15_897x284.png 424w, https://substackcdn.com/image/fetch/$s_!b-ZB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72b83796-aaab-4d39-ab28-b35f9e237b15_897x284.png 848w, https://substackcdn.com/image/fetch/$s_!b-ZB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72b83796-aaab-4d39-ab28-b35f9e237b15_897x284.png 1272w, https://substackcdn.com/image/fetch/$s_!b-ZB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72b83796-aaab-4d39-ab28-b35f9e237b15_897x284.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Most harness tuning treats every token and tool call as if volume is what counts. This paper shows that it mostly does not, and introduces Effective Feedback Compute (EFC), a trace-level scaling coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for later decisions, then normalizes by task demand.</p><ul><li><p><strong>Raw budget barely predicts success:</strong> In controlled scaling, raw tokens and tool calls explain limited variation in outcomes, with R-squared of 0.33 and 0.42. The usual cost proxies are weak predictors of whether the agent actually succeeds.</p></li><li><p><strong>Effective feedback nearly explains everything:</strong> Oracle-EFC normalized by task demand reaches an R-squared of 0.99. 
Once you measure feedback that is genuinely useful and retained, the scaling behavior becomes almost fully predictable.</p></li><li><p><strong>Quality beats quantity at fixed budget:</strong> In matched-budget interventions, improving feedback quality raises success from 0.27 to 0.90 while raw cost and tool calls stay fixed. The win comes from better feedback, not more of it.</p></li><li><p><strong>Why it matters:</strong> Harness scaling is governed less by how much compute you spend than by how efficiently raw budget converts into durable, task-sufficient feedback. That reframes harness engineering as a feedback-quality problem and gives a coordinate to optimize against.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.29682">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2060371848010019001">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A_hF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98f6f4f-2f93-412d-944e-c62ba44f0c9e_831x505.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A_hF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98f6f4f-2f93-412d-944e-c62ba44f0c9e_831x505.png 424w, https://substackcdn.com/image/fetch/$s_!A_hF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98f6f4f-2f93-412d-944e-c62ba44f0c9e_831x505.png 848w, 
https://substackcdn.com/image/fetch/$s_!A_hF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98f6f4f-2f93-412d-944e-c62ba44f0c9e_831x505.png 1272w, https://substackcdn.com/image/fetch/$s_!A_hF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98f6f4f-2f93-412d-944e-c62ba44f0c9e_831x505.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!A_hF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98f6f4f-2f93-412d-944e-c62ba44f0c9e_831x505.png" width="831" height="505" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d98f6f4f-2f93-412d-944e-c62ba44f0c9e_831x505.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:505,&quot;width&quot;:831,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;DAIR Academy Hands-on Labs&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="DAIR Academy Hands-on Labs" title="DAIR Academy Hands-on Labs" srcset="https://substackcdn.com/image/fetch/$s_!A_hF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98f6f4f-2f93-412d-944e-c62ba44f0c9e_831x505.png 424w, https://substackcdn.com/image/fetch/$s_!A_hF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98f6f4f-2f93-412d-944e-c62ba44f0c9e_831x505.png 848w, 
https://substackcdn.com/image/fetch/$s_!A_hF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98f6f4f-2f93-412d-944e-c62ba44f0c9e_831x505.png 1272w, https://substackcdn.com/image/fetch/$s_!A_hF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98f6f4f-2f93-412d-944e-c62ba44f0c9e_831x505.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We just released hands-on labs on DAIR Academy to help you build alongside agents. 
Start with practical, guided labs for agentic image generation and building your first agent skill, with more labs coming soon.</p><p><strong><a href="https://academy.dair.ai/labs">Explore the Labs</a></strong></p><div><hr></div><h2><strong>5. AutoLab</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FE2t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0d2d65-036b-4eaa-bd1b-00726f3f92e8_996x443.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FE2t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0d2d65-036b-4eaa-bd1b-00726f3f92e8_996x443.png 424w, https://substackcdn.com/image/fetch/$s_!FE2t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0d2d65-036b-4eaa-bd1b-00726f3f92e8_996x443.png 848w, https://substackcdn.com/image/fetch/$s_!FE2t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0d2d65-036b-4eaa-bd1b-00726f3f92e8_996x443.png 1272w, https://substackcdn.com/image/fetch/$s_!FE2t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0d2d65-036b-4eaa-bd1b-00726f3f92e8_996x443.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FE2t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0d2d65-036b-4eaa-bd1b-00726f3f92e8_996x443.png" width="996" height="443" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea0d2d65-036b-4eaa-bd1b-00726f3f92e8_996x443.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:443,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AutoLab&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AutoLab" title="AutoLab" srcset="https://substackcdn.com/image/fetch/$s_!FE2t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0d2d65-036b-4eaa-bd1b-00726f3f92e8_996x443.png 424w, https://substackcdn.com/image/fetch/$s_!FE2t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0d2d65-036b-4eaa-bd1b-00726f3f92e8_996x443.png 848w, https://substackcdn.com/image/fetch/$s_!FE2t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0d2d65-036b-4eaa-bd1b-00726f3f92e8_996x443.png 1272w, https://substackcdn.com/image/fetch/$s_!FE2t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0d2d65-036b-4eaa-bd1b-00726f3f92e8_996x443.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Can frontier models actually grind on a hard engineering problem the way a good researcher does? AutoLab is a benchmark for ultra long-horizon, closed-loop optimization built to answer that. It contains 36 realistic, expert-curated tasks across four domains: system optimization, puzzle and challenge, model development, and CUDA kernel optimization. Each task hands the agent a correct but deliberately suboptimal baseline and asks it to improve within a strict wall-clock budget.</p><ul><li><p><strong>Persistence beats a strong start:</strong> The dominant predictor of final performance is not the quality of the initial solution but the agent&#8217;s persistence in iterative refinement. 
Models that keep probing and improving win, regardless of where they began.</p></li><li><p><strong>Most models quit early:</strong> While Claude Opus 4.6 shows strong long-horizon optimization, most frontier models, including several proprietary ones, either terminate prematurely or burn their budget with minimal progress.</p></li><li><p><strong>Time awareness is the gap:</strong> The results point to time-awareness and sustained iteration, not raw single-shot capability, as the missing ingredient for truly capable long-horizon agents.</p></li><li><p><strong>Why it matters:</strong> Day-one benchmarks reward clever first attempts, but real research and engineering reward stamina. AutoLab measures the thing that actually separates agents on multi-hour tasks, and the benchmark, harness, and task artifacts are open-sourced.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2606.05080">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2062570078705688777">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
Reusable Context Engineering</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D6U3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec2016d-e51f-40c9-9e77-a2b2c8bf3513_1438x654.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D6U3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec2016d-e51f-40c9-9e77-a2b2c8bf3513_1438x654.png 424w, https://substackcdn.com/image/fetch/$s_!D6U3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec2016d-e51f-40c9-9e77-a2b2c8bf3513_1438x654.png 848w, https://substackcdn.com/image/fetch/$s_!D6U3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec2016d-e51f-40c9-9e77-a2b2c8bf3513_1438x654.png 1272w, https://substackcdn.com/image/fetch/$s_!D6U3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec2016d-e51f-40c9-9e77-a2b2c8bf3513_1438x654.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D6U3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec2016d-e51f-40c9-9e77-a2b2c8bf3513_1438x654.png" width="1438" height="654" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec2016d-e51f-40c9-9e77-a2b2c8bf3513_1438x654.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:654,&quot;width&quot;:1438,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!D6U3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec2016d-e51f-40c9-9e77-a2b2c8bf3513_1438x654.png 424w, https://substackcdn.com/image/fetch/$s_!D6U3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec2016d-e51f-40c9-9e77-a2b2c8bf3513_1438x654.png 848w, https://substackcdn.com/image/fetch/$s_!D6U3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec2016d-e51f-40c9-9e77-a2b2c8bf3513_1438x654.png 1272w, https://substackcdn.com/image/fetch/$s_!D6U3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec2016d-e51f-40c9-9e77-a2b2c8bf3513_1438x654.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Context bloat quietly kills long-horizon runs, and the usual fixes are baked into an agent&#8217;s own prompt or weights, so they do not transfer. AdaCoM takes a different route: it trains a separate external model to manage the context of a frozen agent through flexible modification actions, optimized end-to-end with reinforcement learning. The agent never changes; only the context flowing into it does.</p><ul><li><p><strong>An external context manager:</strong> A dedicated model edits the agent&#8217;s working context, deciding what to keep, compress, or drop. 
Because it sits outside the agent, it can be reused as a drop-in component rather than re-engineered per backbone.</p></li><li><p><strong>Trained with reinforcement learning:</strong> The manager is optimized end-to-end against task outcomes, learning context-editing policies instead of relying on hand-written heuristics or fixed truncation rules.</p></li><li><p><strong>Transfers across similar agents:</strong> Transfer experiments show AdaCoM generalizes most effectively across agents of similar capability, pointing toward genuinely reusable context managers. It improves web search and deep research by preserving task constraints and progress while pruning stale content.</p></li><li><p><strong>Why it matters:</strong> Treating context management as a separate, trainable, transferable module decouples it from the agent itself. That is a cleaner abstraction than stuffing context logic into every prompt, and it fixes bloat from the outside without touching the underlying model.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.30785">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2061455253325971789">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
Learn From Your Own Latents</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VvMm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d1ed3b-1845-45e9-8f0d-c2883a875709_996x363.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VvMm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d1ed3b-1845-45e9-8f0d-c2883a875709_996x363.png 424w, https://substackcdn.com/image/fetch/$s_!VvMm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d1ed3b-1845-45e9-8f0d-c2883a875709_996x363.png 848w, https://substackcdn.com/image/fetch/$s_!VvMm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d1ed3b-1845-45e9-8f0d-c2883a875709_996x363.png 1272w, https://substackcdn.com/image/fetch/$s_!VvMm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d1ed3b-1845-45e9-8f0d-c2883a875709_996x363.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VvMm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d1ed3b-1845-45e9-8f0d-c2883a875709_996x363.png" width="996" height="363" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57d1ed3b-1845-45e9-8f0d-c2883a875709_996x363.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:363,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Learn From Your Own Latents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Learn From Your Own Latents" title="Learn From Your Own Latents" srcset="https://substackcdn.com/image/fetch/$s_!VvMm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d1ed3b-1845-45e9-8f0d-c2883a875709_996x363.png 424w, https://substackcdn.com/image/fetch/$s_!VvMm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d1ed3b-1845-45e9-8f0d-c2883a875709_996x363.png 848w, https://substackcdn.com/image/fetch/$s_!VvMm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d1ed3b-1845-45e9-8f0d-c2883a875709_996x363.png 1272w, https://substackcdn.com/image/fetch/$s_!VvMm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d1ed3b-1845-45e9-8f0d-c2883a875709_996x363.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>LLMs learn by predicting tokens, while world models like JEPA and data2vec learn by predicting their own internal representations. This paper provides a sample-complexity theory for why the second approach can be dramatically more data-efficient, using a tractable probabilistic context-free grammar as the analytical setting where compositional structure can be measured exactly.</p><ul><li><p><strong>Exponential gap in data efficiency:</strong> Predicting your own latents requires a number of samples that is constant in the tree depth L, whereas supervised and token-based self-supervised learning need samples that grow exponentially in L. The advantage is structural, not incidental.</p></li><li><p><strong>Why latents win:</strong> Latent targets expose the compositional, hierarchical structure of the data directly, so the learner does not have to reconstruct it from surface tokens. 
That is the mechanism behind the data-efficiency gain.</p></li><li><p><strong>Hierarchy may be implicit:</strong> The analysis suggests that explicit hierarchical stacking, as in H-JEPA, can be largely redundant, because methods like data2vec already learn hierarchical structure implicitly.</p></li><li><p><strong>Why it matters:</strong> As token-prediction scaling laws press against data limits, this gives a principled argument for self-supervised objectives that predict abstractions instead of tokens. It is a theoretical foundation for why world-model-style training could beat brute-force next-token prediction on sample efficiency.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.27734">Paper</a></strong> | <strong><a href="https://x.com/MatthieuWyart/status/2061317203857739846">Tweet</a></strong></p><div><hr></div><h2><strong>8. A Primer on Post-Training Reasoning Data</strong></h2><p>This primer is the first to pull the scattered post-training reasoning-data literature into one place, synthesizing over 150 public studies and system reports that previously lived across dataset papers, RL write-ups, and lab reports. It organizes the field around four questions: what data objects exist, what makes them useful, how they are constructed, and how they scale. The key reframing is that a reasoning-data item is more than a prompt-response pair: it packages a problem or state, model behavior, judging feedback, and attribution metadata, with usefulness defined relative to the verifier and the rest of the corpus rather than in isolation.</p><p><strong><a href="https://arxiv.org/abs/2606.02113">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2062189321697083768">Tweet</a></strong></p><div><hr></div><h2><strong>9. State-Externalizing Harnesses</strong></h2><p>Harness-1 is a 20B search agent trained with reinforcement learning inside a stateful harness that offloads routine bookkeeping to the environment. 
The argument is that search agents are usually trained as policies over a growing transcript, forcing RL to optimize both genuine search decisions and the upkeep of recoverable state, such as which evidence is useful or which claims have been checked. Harness-1 moves that state out of the policy and into an environment-side working memory made up of candidate pools, an importance-tagged curated set, compact evidence links, and verification records. The 20B agent reaches an average curated recall of 0.730 across eight retrieval benchmarks, beating open-source baselines by 11.4 points and matching or outperforming much larger frontier searchers, with stronger generalization to unseen domains.</p><p><strong><a href="https://arxiv.org/abs/2606.02373">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2061825437693841651">Tweet</a></strong></p><div><hr></div><h2><strong>10. Do More Agents Help?</strong></h2><p>This paper studies whether adding agents actually makes a single LLM-driven multi-agent system better, using a Sequential Iterative Multi-Agent System (SIMAS) framework. The finding is that performance does not scale monotonically with agent count: returns diminish, and performance eventually degrades as coordination overhead grows. Effective systems still require a capable base model, the optimal number of agents depends on the task type, and collective intelligence turns out to be a product of strategic interaction design rather than a guaranteed outcome of agent plurality. 
The takeaway for builders is to design the interaction, not just stack more agents.</p><p><strong><a href="https://arxiv.org/abs/2606.00655">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2061826427461464405">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Microsoft's Seven MAI Models, Gemma 4 12B, NVIDIA Nemotron 3 Ultra, Agents' Last Exam, Devin Desktop, and More]]></title><description><![CDATA[Microsoft's Seven MAI Models, Gemma 4 12B, NVIDIA Nemotron 3 Ultra, Agents' Last Exam, Devin Desktop, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-microsofts-seven</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-microsofts-seven</guid><pubDate>Sat, 06 Jun 2026 15:01:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KQrW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3cfb9b2-9c71-4f2b-b849-a6e443b69472_2888x1282.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Microsoft ships seven new MAI models</p></li><li><p>MAI-Thinking-1 takes on Claude Sonnet</p></li><li><p>Gemma 4 12B runs agents on a laptop</p></li><li><p>NVIDIA opens 550B Nemotron 3 Ultra</p></li><li><p>Anthropic warns of recursive self-improvement</p></li><li><p>Agents&#8217; Last Exam stumps frontier agents</p></li><li><p>Claude Platform gets an ant CLI</p></li><li><p>Cognition launches Devin Desktop</p></li><li><p>Nous ships Hermes Desktop</p></li><li><p>Codex builds iOS apps end-to-end</p></li><li><p>ChatGPT memory learns to dream</p></li><li><p>Multi-agent computer use beats solo CUAs</p></li><li><p>Economy of Minds prices agent actions</p></li><li><p>LEAP solves all 12 Putnam problems</p></li><li><p>A harness rewrites itself for +19 SWE points</p></li></ul><p>And all the top AI dev news, papers, and 
tools.</p><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Microsoft Launches Seven In-House MAI Models</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KQrW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3cfb9b2-9c71-4f2b-b849-a6e443b69472_2888x1282.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KQrW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3cfb9b2-9c71-4f2b-b849-a6e443b69472_2888x1282.png 424w, https://substackcdn.com/image/fetch/$s_!KQrW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3cfb9b2-9c71-4f2b-b849-a6e443b69472_2888x1282.png 848w, https://substackcdn.com/image/fetch/$s_!KQrW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3cfb9b2-9c71-4f2b-b849-a6e443b69472_2888x1282.png 1272w, https://substackcdn.com/image/fetch/$s_!KQrW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3cfb9b2-9c71-4f2b-b849-a6e443b69472_2888x1282.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KQrW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3cfb9b2-9c71-4f2b-b849-a6e443b69472_2888x1282.png" width="1456" height="646" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3cfb9b2-9c71-4f2b-b849-a6e443b69472_2888x1282.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:646,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!KQrW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3cfb9b2-9c71-4f2b-b849-a6e443b69472_2888x1282.png 424w, https://substackcdn.com/image/fetch/$s_!KQrW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3cfb9b2-9c71-4f2b-b849-a6e443b69472_2888x1282.png 848w, https://substackcdn.com/image/fetch/$s_!KQrW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3cfb9b2-9c71-4f2b-b849-a6e443b69472_2888x1282.png 1272w, https://substackcdn.com/image/fetch/$s_!KQrW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3cfb9b2-9c71-4f2b-b849-a6e443b69472_2888x1282.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Microsoft AI unveiled a family of seven models trained from scratch, led by MAI-Thinking-1, its first reasoning model, in a bid for long-term independence from OpenAI.</p><ul><li><p><strong>MAI-Thinking-1:</strong> A 35B reasoning model that scores 97% on AIME and 53% on SWE-Bench Pro, with early testers preferring it over Claude Sonnet 4.6 in side-by-side comparisons of overall quality.</p></li><li><p><strong>A full stack:</strong> The launch also ships MAI-Image-2.5 and Flash, MAI-Transcribe-1.5, MAI-Voice-2 and Flash, and MAI-Code-1-Flash for code generation.</p></li><li><p><strong>Clean training:</strong> Every model was trained on commercially licensed data with no distillation from third-party labs, which Microsoft frames as a hedge against legal risk for enterprise customers.</p></li><li><p><strong>Why it matters:</strong> Suleyman positions the release as a &#8220;hill-climbing machine,&#8221; a shared training infrastructure meant to keep Microsoft on the frontier as compute 
scales, and a direct shot at its biggest enterprise rival.</p></li></ul><p>MAI-Thinking-1 ships with a detailed 109-page technical report.</p><p><strong><a href="https://microsoft.ai/news/building-a-hillclimbing-machine-launching-seven-new-mai-models/">Blog</a></strong> | <strong><a href="https://microsoft.ai/wp-content/uploads/2026/06/main_20260602_2.pdf">Tech Report</a></strong></p><div><hr></div><h3><strong>Gemma 4 12B Brings Agentic Reasoning to Your Laptop</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D3Y2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdc13e0e-c018-4cd5-9263-155827ae3386_1200x676.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D3Y2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdc13e0e-c018-4cd5-9263-155827ae3386_1200x676.webp 424w, https://substackcdn.com/image/fetch/$s_!D3Y2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdc13e0e-c018-4cd5-9263-155827ae3386_1200x676.webp 848w, https://substackcdn.com/image/fetch/$s_!D3Y2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdc13e0e-c018-4cd5-9263-155827ae3386_1200x676.webp 1272w, https://substackcdn.com/image/fetch/$s_!D3Y2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdc13e0e-c018-4cd5-9263-155827ae3386_1200x676.webp 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!D3Y2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdc13e0e-c018-4cd5-9263-155827ae3386_1200x676.webp" width="1200" height="676" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bdc13e0e-c018-4cd5-9263-155827ae3386_1200x676.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:676,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Gemma 4 12B&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Gemma 4 12B" title="Gemma 4 12B" srcset="https://substackcdn.com/image/fetch/$s_!D3Y2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdc13e0e-c018-4cd5-9263-155827ae3386_1200x676.webp 424w, https://substackcdn.com/image/fetch/$s_!D3Y2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdc13e0e-c018-4cd5-9263-155827ae3386_1200x676.webp 848w, https://substackcdn.com/image/fetch/$s_!D3Y2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdc13e0e-c018-4cd5-9263-155827ae3386_1200x676.webp 1272w, https://substackcdn.com/image/fetch/$s_!D3Y2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdc13e0e-c018-4cd5-9263-155827ae3386_1200x676.webp 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Google released Gemma 4 12B, a unified, encoder-free multimodal open model that brings agentic reasoning, vision, and native audio to consumer hardware under an Apache 2.0 license.</p><ul><li><p><strong>Encoder-free design:</strong> Vision inputs pass through a single lightweight matrix multiplication and audio is projected directly into the same space as text tokens, dropping separate modality encoders.</p></li><li><p><strong>Runs locally:</strong> Fits in 16GB of VRAM or unified memory, small enough for a laptop, with support across LM Studio, Ollama, and Google AI Edge Gallery.</p></li><li><p><strong>Punches up:</strong> Reaches performance nearing Google&#8217;s larger 26B MoE model at less than half the memory 
footprint, and is the first mid-sized Gemma with native audio input.</p></li><li><p><strong>Community traction:</strong> The release topped Hacker News, with builders showing it running on a 10-year-old Xeon CPU.</p></li></ul><p><strong><a href="https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12B/">Blog</a></strong></p>
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-microsofts-seven">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (May 24 - May 31)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-5ce</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-5ce</guid><pubDate>Sun, 31 May 2026 15:01:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vt17!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14e3d838-0e26-4ff1-87be-91836cf1f8ae_793x435.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. SkillOpt</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vt17!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14e3d838-0e26-4ff1-87be-91836cf1f8ae_793x435.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vt17!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14e3d838-0e26-4ff1-87be-91836cf1f8ae_793x435.png 424w, https://substackcdn.com/image/fetch/$s_!vt17!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14e3d838-0e26-4ff1-87be-91836cf1f8ae_793x435.png 848w, https://substackcdn.com/image/fetch/$s_!vt17!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14e3d838-0e26-4ff1-87be-91836cf1f8ae_793x435.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vt17!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14e3d838-0e26-4ff1-87be-91836cf1f8ae_793x435.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vt17!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14e3d838-0e26-4ff1-87be-91836cf1f8ae_793x435.png" width="793" height="435" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14e3d838-0e26-4ff1-87be-91836cf1f8ae_793x435.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:435,&quot;width&quot;:793,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;SkillOpt&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="SkillOpt" title="SkillOpt" srcset="https://substackcdn.com/image/fetch/$s_!vt17!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14e3d838-0e26-4ff1-87be-91836cf1f8ae_793x435.png 424w, https://substackcdn.com/image/fetch/$s_!vt17!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14e3d838-0e26-4ff1-87be-91836cf1f8ae_793x435.png 848w, https://substackcdn.com/image/fetch/$s_!vt17!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14e3d838-0e26-4ff1-87be-91836cf1f8ae_793x435.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vt17!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14e3d838-0e26-4ff1-87be-91836cf1f8ae_793x435.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Microsoft Research treats a compact natural-language skill document as the trainable state of a frozen agent, then learns that document through rollouts, reflection, and bounded edits gated by held-out validation. 
The argument is direct: most engineers handwrite agent skill docs and hope they generalize, when the doc itself should be optimized like a parameter. SkillOpt reframes the SKILL.md file as an external parameter of a model whose weights never change.</p><ul><li><p><strong>The skill doc as a trainable parameter:</strong> An optimizer model proposes validation-gated edits to the skill file, adding, deleting, or replacing instructions. A textual learning rate controls how aggressively each round rewrites the document, with batch and momentum reported in text space rather than gradient space.</p></li><li><p><strong>Validation gates instead of hope:</strong> Every edit must pass a held-out check before it is kept. This turns skill authoring into a measurable optimization loop with a real objective, rather than prompt tweaking guided by intuition.</p></li><li><p><strong>52 out of 52 wins:</strong> SkillOpt beats Trace2Skill, TextGrad, GEPA, EvoSkill, human-written skills, and one-shot skills across 6 benchmarks and 7 target models. Over the no-skill baseline, it adds roughly 23.5 points on GPT-5.5 in direct chat, 24.8 in the Codex loop, and 19.1 in Claude Code.</p></li><li><p><strong>Why it matters:</strong> If the skill document is the thing you optimize, the bottleneck shifts from base-model capability to how well you can train the natural-language state around a frozen agent. That is a cheap, model-agnostic lever most teams are leaving on the table.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.23904">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2058936160291004483">Tweet</a></strong></p><div><hr></div><h2><strong>2. 
Compiling Agentic Workflows into Weights</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zo0V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eb475e-b158-4d22-bf08-61c3d43a3410_2090x790.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zo0V!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eb475e-b158-4d22-bf08-61c3d43a3410_2090x790.png 424w, https://substackcdn.com/image/fetch/$s_!Zo0V!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eb475e-b158-4d22-bf08-61c3d43a3410_2090x790.png 848w, https://substackcdn.com/image/fetch/$s_!Zo0V!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eb475e-b158-4d22-bf08-61c3d43a3410_2090x790.png 1272w, https://substackcdn.com/image/fetch/$s_!Zo0V!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eb475e-b158-4d22-bf08-61c3d43a3410_2090x790.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zo0V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eb475e-b158-4d22-bf08-61c3d43a3410_2090x790.png" width="1456" height="550" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56eb475e-b158-4d22-bf08-61c3d43a3410_2090x790.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:550,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Compiling Agentic Workflows into Weights&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Compiling Agentic Workflows into Weights" title="Compiling Agentic Workflows into Weights" srcset="https://substackcdn.com/image/fetch/$s_!Zo0V!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eb475e-b158-4d22-bf08-61c3d43a3410_2090x790.png 424w, https://substackcdn.com/image/fetch/$s_!Zo0V!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eb475e-b158-4d22-bf08-61c3d43a3410_2090x790.png 848w, https://substackcdn.com/image/fetch/$s_!Zo0V!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eb475e-b158-4d22-bf08-61c3d43a3410_2090x790.png 1272w, https://substackcdn.com/image/fetch/$s_!Zo0V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eb475e-b158-4d22-bf08-61c3d43a3410_2090x790.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>This paper shows that a full agentic workflow can be distilled into the weights of a small model and run at roughly two orders of magnitude lower inference cost while preserving near-frontier task quality. Instead of keeping an external orchestrator above the LLM, the procedure is compiled into the weights of a fine-tuned model, producing what the authors call a subterranean agent.</p><ul><li><p><strong>The whole workflow, not just the answer:</strong> The compiled procedure includes multi-step LLM calls, tool invocations, intermediate scratchpads, and decision points. The student internalizes the orchestration logic rather than only imitating final outputs.</p></li><li><p><strong>Orchestrator dissolved into the model:</strong> Classic agent frameworks run a planner loop above the model on every request. 
Compiling that loop into weights removes the per-call orchestration overhead, which is where most of the cost and latency live.</p></li><li><p><strong>Near-frontier quality at 100x less cost:</strong> Across the evaluated tasks, the distilled small model stays close to the original workflow&#8217;s quality while cutting inference cost by about two orders of magnitude. The savings come from collapsing many model calls into one forward pass.</p></li><li><p><strong>Why it matters:</strong> Most production agents pay repeatedly for an orchestration loop they run thousands of times a day. If that loop can be compiled once into a cheap model, the economics of deploying agentic systems change substantially, especially for high-volume narrow workflows.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.22502">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2057846601843146760">Tweet</a></strong></p><div><hr></div><h2><strong>3. AutoScientists</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0dx1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a195ca-b018-4c3f-b549-1b164d1ea798_996x471.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0dx1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a195ca-b018-4c3f-b549-1b164d1ea798_996x471.png 424w, https://substackcdn.com/image/fetch/$s_!0dx1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a195ca-b018-4c3f-b549-1b164d1ea798_996x471.png 848w, 
https://substackcdn.com/image/fetch/$s_!0dx1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a195ca-b018-4c3f-b549-1b164d1ea798_996x471.png 1272w, https://substackcdn.com/image/fetch/$s_!0dx1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a195ca-b018-4c3f-b549-1b164d1ea798_996x471.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0dx1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a195ca-b018-4c3f-b549-1b164d1ea798_996x471.png" width="996" height="471" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08a195ca-b018-4c3f-b549-1b164d1ea798_996x471.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:471,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AutoScientists&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AutoScientists" title="AutoScientists" srcset="https://substackcdn.com/image/fetch/$s_!0dx1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a195ca-b018-4c3f-b549-1b164d1ea798_996x471.png 424w, https://substackcdn.com/image/fetch/$s_!0dx1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a195ca-b018-4c3f-b549-1b164d1ea798_996x471.png 848w, 
https://substackcdn.com/image/fetch/$s_!0dx1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a195ca-b018-4c3f-b549-1b164d1ea798_996x471.png 1272w, https://substackcdn.com/image/fetch/$s_!0dx1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a195ca-b018-4c3f-b549-1b164d1ea798_996x471.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>AutoScientists, from Harvard, is a decentralized team of AI agents for long-running computational science that drops the central planner 
entirely. Rather than following one research trajectory coordinated from the top, agents self-organize around promising hypotheses, critique each other&#8217;s proposals before spending experimental compute, and record both successes and failures so the system avoids redundant exploration as evidence accumulates over hours or days.</p><ul><li><p><strong>No central planner:</strong> Agents interpret shared experimental state, form teams around promising directions, and reorganize when progress stalls. Coordination emerges from a common state rather than a top-level controller, which sustains parallel search instead of a single thread.</p></li><li><p><strong>Evaluate before you spend:</strong> Proposals are critiqued and scored before any experimental compute is allocated. This gating reduces wasted trials and keeps the system from repeating dead ends that an individual agent would otherwise revisit.</p></li><li><p><strong>Strong results on real science tasks:</strong> On BioML-Bench, a suite of 24 biomedical ML tasks spanning imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists reaches a 74.4% mean leaderboard percentile, an improvement of 8.33% over the strongest prior AI agent.</p></li><li><p><strong>Why it matters:</strong> Most multi-agent research systems still funnel decisions through a planner that becomes a bottleneck. Decentralized self-organization with explicit failure-sharing is a different blueprint for long-horizon scientific search, and it holds up on hard biomedical benchmarks.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.28655">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2060028833080987668">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
Language Models Need Sleep</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N_SF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07cf84e4-8e8d-44e1-a4ff-8b791a16e7f7_634x345.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N_SF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07cf84e4-8e8d-44e1-a4ff-8b791a16e7f7_634x345.png 424w, https://substackcdn.com/image/fetch/$s_!N_SF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07cf84e4-8e8d-44e1-a4ff-8b791a16e7f7_634x345.png 848w, https://substackcdn.com/image/fetch/$s_!N_SF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07cf84e4-8e8d-44e1-a4ff-8b791a16e7f7_634x345.png 1272w, https://substackcdn.com/image/fetch/$s_!N_SF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07cf84e4-8e8d-44e1-a4ff-8b791a16e7f7_634x345.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N_SF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07cf84e4-8e8d-44e1-a4ff-8b791a16e7f7_634x345.png" width="634" height="345" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07cf84e4-8e8d-44e1-a4ff-8b791a16e7f7_634x345.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:345,&quot;width&quot;:634,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Language Models Need Sleep&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Language Models Need Sleep" title="Language Models Need Sleep" srcset="https://substackcdn.com/image/fetch/$s_!N_SF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07cf84e4-8e8d-44e1-a4ff-8b791a16e7f7_634x345.png 424w, https://substackcdn.com/image/fetch/$s_!N_SF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07cf84e4-8e8d-44e1-a4ff-8b791a16e7f7_634x345.png 848w, https://substackcdn.com/image/fetch/$s_!N_SF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07cf84e4-8e8d-44e1-a4ff-8b791a16e7f7_634x345.png 1272w, https://substackcdn.com/image/fetch/$s_!N_SF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07cf84e4-8e8d-44e1-a4ff-8b791a16e7f7_634x345.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Attention scales badly with context length, so long-horizon agents keep paying a growing cost as their context grows. This paper studies a sleep-like consolidation mechanism: the model periodically converts recent context into persistent fast weights, then clears its key-value cache. During the sleep phase it performs offline recurrent passes over the accumulated context and updates fast weights in its state-space blocks through a learned local rule.</p><ul><li><p><strong>Consolidate, then clear the cache:</strong> Recent context is folded into fast weights stored in the model&#8217;s SSM blocks before the KV cache is discarded. The agent keeps what it learned without carrying the full attention bill into every future step.</p></li><li><p><strong>Compute moves to sleep, latency stays at wake:</strong> The extra work happens offline during consolidation, so wake-time prediction keeps its low latency. 
The tradeoff is explicit and controllable rather than hidden in a ballooning context window.</p></li><li><p><strong>More sleep helps the hardest cases:</strong> Increasing sleep duration improves performance, with the largest gains precisely on tasks that require the most complex reasoning over long histories. The mechanism buys the most where naive attention struggles most.</p></li><li><p><strong>Why it matters:</strong> Long-horizon agents are the first systems to feel the quadratic cost of context. A biologically inspired consolidation step gives a principled alternative to ever-longer context windows, and it maps cleanly onto the state-space architectures already used for efficiency.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.26099">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2059333792775745619">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8e1O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5c1c4a-0de0-4c37-b9c2-e5c41e26288b_831x505.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8e1O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5c1c4a-0de0-4c37-b9c2-e5c41e26288b_831x505.png 424w, https://substackcdn.com/image/fetch/$s_!8e1O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5c1c4a-0de0-4c37-b9c2-e5c41e26288b_831x505.png 848w, 
https://substackcdn.com/image/fetch/$s_!8e1O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5c1c4a-0de0-4c37-b9c2-e5c41e26288b_831x505.png 1272w, https://substackcdn.com/image/fetch/$s_!8e1O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5c1c4a-0de0-4c37-b9c2-e5c41e26288b_831x505.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8e1O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5c1c4a-0de0-4c37-b9c2-e5c41e26288b_831x505.png" width="831" height="505" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d5c1c4a-0de0-4c37-b9c2-e5c41e26288b_831x505.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:505,&quot;width&quot;:831,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;DAIR Academy Hands-on Labs&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="DAIR Academy Hands-on Labs" title="DAIR Academy Hands-on Labs" srcset="https://substackcdn.com/image/fetch/$s_!8e1O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5c1c4a-0de0-4c37-b9c2-e5c41e26288b_831x505.png 424w, https://substackcdn.com/image/fetch/$s_!8e1O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5c1c4a-0de0-4c37-b9c2-e5c41e26288b_831x505.png 848w, 
https://substackcdn.com/image/fetch/$s_!8e1O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5c1c4a-0de0-4c37-b9c2-e5c41e26288b_831x505.png 1272w, https://substackcdn.com/image/fetch/$s_!8e1O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5c1c4a-0de0-4c37-b9c2-e5c41e26288b_831x505.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We just released hands-on labs on DAIR Academy to help you build alongside agents. 
Start with practical, guided labs for agentic image generation and building your first agent skill, with more labs coming soon.</p><p><strong><a href="https://academy.dair.ai/labs">Explore the Labs</a></strong></p><div><hr></div><h2><strong>5. Adapting the Interface, Not the Model</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VkDy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd1d4516-9e2a-4e09-9bdf-c1b9971e23c7_997x575.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VkDy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd1d4516-9e2a-4e09-9bdf-c1b9971e23c7_997x575.png 424w, https://substackcdn.com/image/fetch/$s_!VkDy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd1d4516-9e2a-4e09-9bdf-c1b9971e23c7_997x575.png 848w, https://substackcdn.com/image/fetch/$s_!VkDy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd1d4516-9e2a-4e09-9bdf-c1b9971e23c7_997x575.png 1272w, https://substackcdn.com/image/fetch/$s_!VkDy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd1d4516-9e2a-4e09-9bdf-c1b9971e23c7_997x575.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VkDy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd1d4516-9e2a-4e09-9bdf-c1b9971e23c7_997x575.png" width="997" height="575" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd1d4516-9e2a-4e09-9bdf-c1b9971e23c7_997x575.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:575,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Adapting the Interface, Not the Model&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Adapting the Interface, Not the Model" title="Adapting the Interface, Not the Model" srcset="https://substackcdn.com/image/fetch/$s_!VkDy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd1d4516-9e2a-4e09-9bdf-c1b9971e23c7_997x575.png 424w, https://substackcdn.com/image/fetch/$s_!VkDy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd1d4516-9e2a-4e09-9bdf-c1b9971e23c7_997x575.png 848w, https://substackcdn.com/image/fetch/$s_!VkDy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd1d4516-9e2a-4e09-9bdf-c1b9971e23c7_997x575.png 1272w, https://substackcdn.com/image/fetch/$s_!VkDy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd1d4516-9e2a-4e09-9bdf-c1b9971e23c7_997x575.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>When a frozen LLM agent repeatedly fails in a deterministic, rule-governed environment, do you have to retrain the model? Life-Harness argues no. Many failures come from mismatches at the model-environment interface, not from the model&#8217;s reasoning, so the fix belongs in the runtime harness. Life-Harness is a lifecycle-aware harness that improves frozen agents without touching model weights or the evaluation environment.</p><ul><li><p><strong>Failures become reusable interventions:</strong> Recurring errors are turned into runtime fixes across four areas: action realization, environment contracts, trajectory regulation, and procedural skills. Each fix is a harness-level patch the agent reuses on later attempts.</p></li><li><p><strong>Model frozen, environment intact:</strong> Nothing about the model or the benchmark changes. 
Only the interface between them adapts, which keeps the approach drop-in for any backbone and avoids the cost and risk of fine-tuning.</p></li><li><p><strong>Broad, consistent gains:</strong> Across 7 deterministic agent benchmarks and 18 model backbones, Life-Harness improves 116 of 126 model-environment settings, with an 88.5% average relative improvement. The effect holds across model scales rather than helping only weak models.</p></li><li><p><strong>Why it matters:</strong> This is more evidence for the code-as-harness thesis: a large share of agent failures are interface problems that harness engineering can fix without retraining. For builders, the leverage is in the runtime, not the model.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.22166">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2058208914148389083">Tweet</a></strong></p><div><hr></div><h2><strong>6. The Efficiency Frontier</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6-qg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8f3a20-034a-45f5-9029-61ea6a13c7fd_1752x1016.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6-qg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8f3a20-034a-45f5-9029-61ea6a13c7fd_1752x1016.png 424w, https://substackcdn.com/image/fetch/$s_!6-qg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8f3a20-034a-45f5-9029-61ea6a13c7fd_1752x1016.png 848w, 
https://substackcdn.com/image/fetch/$s_!6-qg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8f3a20-034a-45f5-9029-61ea6a13c7fd_1752x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!6-qg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8f3a20-034a-45f5-9029-61ea6a13c7fd_1752x1016.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6-qg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8f3a20-034a-45f5-9029-61ea6a13c7fd_1752x1016.png" width="1456" height="844" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc8f3a20-034a-45f5-9029-61ea6a13c7fd_1752x1016.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:844,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Efficiency Frontier&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Efficiency Frontier" title="The Efficiency Frontier" srcset="https://substackcdn.com/image/fetch/$s_!6-qg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8f3a20-034a-45f5-9029-61ea6a13c7fd_1752x1016.png 424w, https://substackcdn.com/image/fetch/$s_!6-qg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8f3a20-034a-45f5-9029-61ea6a13c7fd_1752x1016.png 848w, 
https://substackcdn.com/image/fetch/$s_!6-qg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8f3a20-034a-45f5-9029-61ea6a13c7fd_1752x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!6-qg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8f3a20-034a-45f5-9029-61ea6a13c7fd_1752x1016.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Context costs dominate production LLM bills, and the right strategy depends on how often preprocessing gets reused. 
This paper models context-strategy selection as a deployment-aware optimization problem that jointly accounts for task performance, token cost, and reuse, then uses that formulation to compare retrieval-based and preprocessing-based approaches under realistic constraints.</p><ul><li><p><strong>A reuse-aware cost model:</strong> A parameterized log-utility metric captures diminishing returns from more context while charging an amortized preprocessing cost. Varying a reuse parameter lets the framework compare strategies under different deployment patterns on equal footing.</p></li><li><p><strong>Distinct operating regimes:</strong> The analysis reveals clean transition boundaries between retrieval and preprocessing strategies. Which one wins flips depending on how many times you reuse the preprocessed context, so a single default is rarely optimal.</p></li><li><p><strong>Real token savings:</strong> On 5,000 HotpotQA instances, deployment-aware optimization cuts effective token usage by roughly 25% at comparable performance, and amortized memory compression achieves over 50% lower token cost relative to using the full context.</p></li><li><p><strong>Why it matters:</strong> Most teams pick a context strategy once and pay for it on every request. Treating context management as an explicit cost-performance optimization turns a guess into a measurable decision, with double-digit savings available on common workloads.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.23071">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2058948732658626789">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
Forecasting Scientific Progress with AI</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!40xH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabea6948-d290-4b3b-ae73-137f62e290df_2290x930.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!40xH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabea6948-d290-4b3b-ae73-137f62e290df_2290x930.png 424w, https://substackcdn.com/image/fetch/$s_!40xH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabea6948-d290-4b3b-ae73-137f62e290df_2290x930.png 848w, https://substackcdn.com/image/fetch/$s_!40xH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabea6948-d290-4b3b-ae73-137f62e290df_2290x930.png 1272w, https://substackcdn.com/image/fetch/$s_!40xH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabea6948-d290-4b3b-ae73-137f62e290df_2290x930.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!40xH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabea6948-d290-4b3b-ae73-137f62e290df_2290x930.png" width="1456" height="591" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/abea6948-d290-4b3b-ae73-137f62e290df_2290x930.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:591,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Forecasting Scientific Progress with AI&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Forecasting Scientific Progress with AI" title="Forecasting Scientific Progress with AI" srcset="https://substackcdn.com/image/fetch/$s_!40xH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabea6948-d290-4b3b-ae73-137f62e290df_2290x930.png 424w, https://substackcdn.com/image/fetch/$s_!40xH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabea6948-d290-4b3b-ae73-137f62e290df_2290x930.png 848w, https://substackcdn.com/image/fetch/$s_!40xH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabea6948-d290-4b3b-ae73-137f62e290df_2290x930.png 1272w, https://substackcdn.com/image/fetch/$s_!40xH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabea6948-d290-4b3b-ae73-137f62e290df_2290x930.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Can frontier models predict where science is going? This work introduces CUSP, a cutoff-conditioned benchmark built from 4,760 real scientific events across multiple disciplines, each grounded against a verified knowledge cutoff. For every event, models are tested on four tasks: feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. 
The headline is sobering: models recognize plausible directions but cannot forecast outcomes.</p><ul><li><p><strong>Recognition is not foresight:</strong> Models can identify plausible research directions when choosing among competing candidates, but they fail to reliably predict whether an advance will actually be realized, and they systematically misestimate when it will happen.</p></li><li><p><strong>Domain-dependent, and timing is hardest:</strong> Performance is highly heterogeneous across fields, with the timing of AI progress more predictable than advances in biology, chemistry, and physics. Temporal prediction is the weakest skill across the board.</p></li><li><p><strong>Not just a training-cutoff artifact:</strong> Performance is largely insensitive to whether an event falls before or after the model&#8217;s training cutoff. Extra pre-cutoff knowledge helps but does not close the gap to full-information settings, and that gap widens for high-citation advances.</p></li><li><p><strong>Why it matters:</strong> Models also show systematic overconfidence and strong response biases, which makes their uncertainty estimates unreliable. As labs lean on AI to triage research bets, CUSP gives a controlled way to measure where AI helps (surfacing directions) and where it fails (predicting outcomes).</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.22681">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2058215140789797204">Tweet</a></strong></p><div><hr></div><h2><strong>8. Your Agents Are Aging Too</strong></h2><p>AgingBench is a longitudinal reliability benchmark for agent lifespan engineering, built on the observation that long-lived agents are still evaluated like freshly initialized models. 
It organizes agent degradation into four mechanisms: compression aging, where write-time summarization drops future-relevant details; interference aging, where accumulated similar memories crowd out the target fact; revision aging, where changed or derived state is not updated correctly; and maintenance aging, which arises from routine lifecycle events. Using a temporal dependency DAG to encode cross-session structure, it produces aging curves over an operational lifetime rather than a single day-one score, and points to where repairs should be targeted.</p><p><strong><a href="https://arxiv.org/abs/2605.26302">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2059689897523642510">Tweet</a></strong></p><div><hr></div><h2><strong>9. Harnesses Are Not Uniformly Better</strong></h2><p>This paper studies LLM agent harnesses through the lens of inference-time trajectory alignment, separating a harness into two mechanisms: task decomposition, which structures a task into sub-goals, and guided execution, which reshapes local action distributions during execution. The key finding is that more elaborate harnesses are not uniformly better. Increasing decomposition or guidance can improve execution but can also reduce final task success, producing concrete failure modes like over-decomposition, over-pruning, and hallucinated execution. Strikingly, partial harnesses that specify only the initial steps and leave the rest to the agent can reach a higher pass rate than fully structured workflows.</p><p><strong><a href="https://arxiv.org/abs/2605.21516">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2059691141302542445">Tweet</a></strong></p><div><hr></div><h2><strong>10. Epicure</strong></h2><p>Epicure trains a family of multilingual ingredient embeddings from scratch on 4.14 million recipes aggregated from 11 sources across seven languages, with raw ingredient strings normalized to 1,790 canonical entries via an LLM-augmented pipeline. 
It ships three skip-gram (Metapath2Vec) variants that share architecture but differ in what they walk: recipe co-occurrence only, chemical-compound structure from FlavorDB only, or a blend of both, placing each model at a different point on the chemistry-versus-recipe-context spectrum. The result is a compact, downloadable map of the emergent geometry of food, a clean reminder that representation learning generalizes well beyond text into surprisingly everyday domains.</p><p><strong><a href="https://arxiv.org/abs/2605.22391">Paper</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Claude Opus 4.8, Claude Code Dynamic Workflows, Chrome DevTools for Agents 1.0, DeepSWE, Agent Harness Scaling Laws, and More]]></title><description><![CDATA[Claude Opus 4.8, Claude Code Dynamic Workflows, Chrome DevTools for Agents 1.0, DeepSWE, Agent Harness Scaling Laws, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-opus-48-claude</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-opus-48-claude</guid><pubDate>Sat, 30 May 2026 15:02:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!eSHZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F210b9450-3ef4-474a-8a4e-bdb8b5079038_996x471.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>AutoScientists self-organize agent teams</p></li><li><p>Anthropic ships Claude Opus 4.8</p></li><li><p>Claude Code adds dynamic workflows</p></li><li><p>Chrome DevTools for agents hits 1.0</p></li><li><p>DeepSWE raises the coding-agent bar</p></li><li><p>xAI opens grok-build-0.1 in beta</p></li><li><p>Microsoft open-sources Webwright for agents</p></li><li><p>Scaling laws for agent harnesses land</p></li><li><p>Harness sensitivity proves 
non-monotone</p></li><li><p>SIA co-updates harness and weights</p></li><li><p>CUA-Gym scales computer-use RL data</p></li><li><p>Polar trains agents on real harnesses</p></li><li><p>Anthropic details how it contains Claude</p></li><li><p>Xiaomi slashes MiMo-V2.5 API prices</p></li><li><p>Language models learn to sleep</p></li><li><p>a16z maps the AI application layer</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>AutoScientists Self-Organize for Long-Running Science</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eSHZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F210b9450-3ef4-474a-8a4e-bdb8b5079038_996x471.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eSHZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F210b9450-3ef4-474a-8a4e-bdb8b5079038_996x471.png 424w, https://substackcdn.com/image/fetch/$s_!eSHZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F210b9450-3ef4-474a-8a4e-bdb8b5079038_996x471.png 848w, https://substackcdn.com/image/fetch/$s_!eSHZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F210b9450-3ef4-474a-8a4e-bdb8b5079038_996x471.png 1272w, https://substackcdn.com/image/fetch/$s_!eSHZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F210b9450-3ef4-474a-8a4e-bdb8b5079038_996x471.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!eSHZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F210b9450-3ef4-474a-8a4e-bdb8b5079038_996x471.png" width="996" height="471" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/210b9450-3ef4-474a-8a4e-bdb8b5079038_996x471.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:471,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AutoScientists&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AutoScientists" title="AutoScientists" srcset="https://substackcdn.com/image/fetch/$s_!eSHZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F210b9450-3ef4-474a-8a4e-bdb8b5079038_996x471.png 424w, https://substackcdn.com/image/fetch/$s_!eSHZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F210b9450-3ef4-474a-8a4e-bdb8b5079038_996x471.png 848w, https://substackcdn.com/image/fetch/$s_!eSHZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F210b9450-3ef4-474a-8a4e-bdb8b5079038_996x471.png 1272w, https://substackcdn.com/image/fetch/$s_!eSHZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F210b9450-3ef4-474a-8a4e-bdb8b5079038_996x471.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Harvard&#8217;s Zitnik Lab introduced AutoScientists, a decentralized multi-agent system for long-running computational science where agents self-organize around promising research directions instead of following a fixed plan.</p><ul><li><p><strong>Self-organizing teams:</strong> Agents form around promising directions and vet proposals before allocating resources, so compute goes only to ideas that survive review.</p></li><li><p><strong>Learning from failure:</strong> The system documents failures as well as successes, building a record that steers future exploration.</p></li><li><p><strong>Validated broadly:</strong> Reaches a 74.4% mean leaderboard percentile on biomedical ML, 1.9x faster convergence on language 
model training, and gains on protein fitness.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.28655">Paper</a></strong></p><div><hr></div><h3><strong>Claude Opus 4.8 Sharpens Agentic Judgment</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PjZ9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5534b045-66c0-4843-89cc-877e169cca01_2880x1620.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PjZ9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5534b045-66c0-4843-89cc-877e169cca01_2880x1620.jpeg 424w, https://substackcdn.com/image/fetch/$s_!PjZ9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5534b045-66c0-4843-89cc-877e169cca01_2880x1620.jpeg 848w, https://substackcdn.com/image/fetch/$s_!PjZ9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5534b045-66c0-4843-89cc-877e169cca01_2880x1620.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!PjZ9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5534b045-66c0-4843-89cc-877e169cca01_2880x1620.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PjZ9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5534b045-66c0-4843-89cc-877e169cca01_2880x1620.jpeg" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5534b045-66c0-4843-89cc-877e169cca01_2880x1620.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Claude Opus 4.8&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude Opus 4.8" title="Claude Opus 4.8" srcset="https://substackcdn.com/image/fetch/$s_!PjZ9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5534b045-66c0-4843-89cc-877e169cca01_2880x1620.jpeg 424w, https://substackcdn.com/image/fetch/$s_!PjZ9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5534b045-66c0-4843-89cc-877e169cca01_2880x1620.jpeg 848w, https://substackcdn.com/image/fetch/$s_!PjZ9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5534b045-66c0-4843-89cc-877e169cca01_2880x1620.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!PjZ9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5534b045-66c0-4843-89cc-877e169cca01_2880x1620.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Anthropic released Claude Opus 4.8, an incremental upgrade over Opus 4.7 tuned for sharper judgment, more honesty about its own progress, and longer independent runs.</p><ul><li><p><strong>Agentic gains:</strong> Posts 84% on Online-Mind2Web for computer-use and browser-agent tasks, and the team reports it is roughly 4x less likely than its predecessor to overlook code flaws.</p></li><li><p><strong>Self-correction and honesty:</strong> Early testers cite improved reliability, better self-correction, and more accurate reporting of how far it has actually gotten on a task.</p></li><li><p><strong>New controls:</strong> Ships alongside dynamic workflows, an effort control to dial response intensity, and a Systems API update that lets you change mid-task instructions without breaking the prompt cache.</p></li><li><p><strong>Why it matters:</strong> The honesty and judgment gains target the exact failure modes that break long-horizon agents, where a model 
that overstates progress derails an entire run.</p></li></ul><p>Available today via the <code>claude-opus-4-8</code> API identifier at the same price as before ($5/$25 per million input/output tokens), with a 3x cheaper Fast mode.</p><p><strong><a href="https://www.anthropic.com/news/claude-opus-4-8">Blog</a></strong></p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (May 18 - May 24)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-c9b</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-c9b</guid><pubDate>Sun, 24 May 2026 15:01:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RPO8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99bd4a08-8003-460c-acf1-2caa80afab0c_996x651.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. Code as Agent Harness</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RPO8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99bd4a08-8003-460c-acf1-2caa80afab0c_996x651.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RPO8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99bd4a08-8003-460c-acf1-2caa80afab0c_996x651.png 424w, https://substackcdn.com/image/fetch/$s_!RPO8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99bd4a08-8003-460c-acf1-2caa80afab0c_996x651.png 848w, https://substackcdn.com/image/fetch/$s_!RPO8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99bd4a08-8003-460c-acf1-2caa80afab0c_996x651.png 1272w, 
https://substackcdn.com/image/fetch/$s_!RPO8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99bd4a08-8003-460c-acf1-2caa80afab0c_996x651.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RPO8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99bd4a08-8003-460c-acf1-2caa80afab0c_996x651.png" width="996" height="651" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/99bd4a08-8003-460c-acf1-2caa80afab0c_996x651.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:651,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Code as Agent Harness&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Code as Agent Harness" title="Code as Agent Harness" srcset="https://substackcdn.com/image/fetch/$s_!RPO8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99bd4a08-8003-460c-acf1-2caa80afab0c_996x651.png 424w, https://substackcdn.com/image/fetch/$s_!RPO8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99bd4a08-8003-460c-acf1-2caa80afab0c_996x651.png 848w, https://substackcdn.com/image/fetch/$s_!RPO8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99bd4a08-8003-460c-acf1-2caa80afab0c_996x651.png 1272w, 
https://substackcdn.com/image/fetch/$s_!RPO8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99bd4a08-8003-460c-acf1-2caa80afab0c_996x651.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>A 100+ page survey treating the agent harness as a first-class research object rather than glue around an LLM. The authors argue that code-as-harness is the most promising path to general-purpose agency, and that future agent systems should satisfy four properties: executable, inspectable, stateful, and governed. 
The report consolidates methods, applications, and open problems across the harness layer.</p><ul><li><p><strong>Harness engineering as a discipline:</strong> The paper frames harness design as a science distinct from model training, with its own primitives, failure modes, and evaluation criteria. The taxonomy gives a vocabulary for comparing systems that has been missing in prior agent literature.</p></li><li><p><strong>Four-property test for production agents:</strong> Executable, inspectable, stateful, and governed. Each property maps to a class of operational concerns. The authors use it to audit current open-source agent frameworks and identify where defaults fall short.</p></li><li><p><strong>Code as the unifying substrate:</strong> Across browsing, tool use, and multi-step reasoning, harnesses that compile decisions into code consistently outperform JSON-call orchestration on the surveyed benchmarks. The paper traces this back to determinism, composability, and inspectability of the resulting traces.</p></li><li><p><strong>Why it matters:</strong> If code-as-harness is the right substrate, then the next round of agent-system progress will come from harness-level innovation rather than from new base models. 
The survey gives builders a structured reference for that work.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.18747">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2056764334181884158">Tweet</a></strong></p><div><hr></div><h2><strong>Message from our Sponsor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q1vL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8d135c1-6c94-4fdc-96c0-eac96e81e61c_3905x1957.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q1vL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8d135c1-6c94-4fdc-96c0-eac96e81e61c_3905x1957.png 424w, https://substackcdn.com/image/fetch/$s_!q1vL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8d135c1-6c94-4fdc-96c0-eac96e81e61c_3905x1957.png 848w, https://substackcdn.com/image/fetch/$s_!q1vL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8d135c1-6c94-4fdc-96c0-eac96e81e61c_3905x1957.png 1272w, https://substackcdn.com/image/fetch/$s_!q1vL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8d135c1-6c94-4fdc-96c0-eac96e81e61c_3905x1957.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q1vL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8d135c1-6c94-4fdc-96c0-eac96e81e61c_3905x1957.png" width="1456" height="730" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8d135c1-6c94-4fdc-96c0-eac96e81e61c_3905x1957.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:730,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q1vL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8d135c1-6c94-4fdc-96c0-eac96e81e61c_3905x1957.png 424w, https://substackcdn.com/image/fetch/$s_!q1vL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8d135c1-6c94-4fdc-96c0-eac96e81e61c_3905x1957.png 848w, https://substackcdn.com/image/fetch/$s_!q1vL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8d135c1-6c94-4fdc-96c0-eac96e81e61c_3905x1957.png 1272w, https://substackcdn.com/image/fetch/$s_!q1vL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8d135c1-6c94-4fdc-96c0-eac96e81e61c_3905x1957.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>Intology released <a href="https://www.intology.ai/blog/nanogpt-bench">NanoGPT-Bench</a>, a benchmark that drops agents into the NanoGPT Speedrun environment at the September 2025 human world record and measures how much of the next five months of community progress they can recover autonomously. </p><p>Claude Code, Codex, and Autoresearch each ran 320 to 455 training variants on a 512 H100-hour budget and recovered under 10% of the human speedup, mostly via hyperparameter tuning rather than algorithmic research. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.intology.ai/blog/nanogpt-bench&quot;,&quot;text&quot;:&quot;Read More&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.intology.ai/blog/nanogpt-bench"><span>Read More</span></a></p><div><hr></div><h2><strong>2. 
OpenAI Disproves the Unit Distance Conjecture</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7ts_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f744f0-b39d-4377-ac80-3be1c42d2890_1201x652.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7ts_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f744f0-b39d-4377-ac80-3be1c42d2890_1201x652.png 424w, https://substackcdn.com/image/fetch/$s_!7ts_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f744f0-b39d-4377-ac80-3be1c42d2890_1201x652.png 848w, https://substackcdn.com/image/fetch/$s_!7ts_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f744f0-b39d-4377-ac80-3be1c42d2890_1201x652.png 1272w, https://substackcdn.com/image/fetch/$s_!7ts_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f744f0-b39d-4377-ac80-3be1c42d2890_1201x652.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7ts_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f744f0-b39d-4377-ac80-3be1c42d2890_1201x652.png" width="1201" height="652" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7f744f0-b39d-4377-ac80-3be1c42d2890_1201x652.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:652,&quot;width&quot;:1201,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:142026,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nlp.elvissaravia.com/i/198992221?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f744f0-b39d-4377-ac80-3be1c42d2890_1201x652.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7ts_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f744f0-b39d-4377-ac80-3be1c42d2890_1201x652.png 424w, https://substackcdn.com/image/fetch/$s_!7ts_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f744f0-b39d-4377-ac80-3be1c42d2890_1201x652.png 848w, https://substackcdn.com/image/fetch/$s_!7ts_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f744f0-b39d-4377-ac80-3be1c42d2890_1201x652.png 1272w, https://substackcdn.com/image/fetch/$s_!7ts_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f744f0-b39d-4377-ac80-3be1c42d2890_1201x652.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This OpenAI paper disproves Erd&#337;s&#8217;s 1946 unit distance conjecture. For a finite planar set P, let &#957;(P) count the unordered pairs at distance exactly 1, and let &#957;(n) be the maximum of &#957;(P) over all n-point sets. Erd&#337;s conjectured &#957;(n) &#8804; n^(1+C/log log n); the paper proves instead that there is a fixed &#948; greater than 0 with &#957;(n) &#8805; n^(1+&#948;) for infinitely many n. The result was produced in a completely automated fashion by an internal OpenAI model and then human-edited into the present exposition.</p><ul><li><p><strong>The theorem:</strong> There exists an absolute constant &#948; greater than 0 and infinitely many n for which &#957;(n) &#8805; n^(1+&#948;). 
This contradicts the widely believed conjecture, which earlier results on generic and most planar norms had appeared to support.</p></li><li><p><strong>The construction:</strong> It passes through an infinite unramified tower of totally real number fields with 3-power Galois groups of growing degree, in which a fixed set of rational primes splits completely. After adjoining i, these fields produce high-dimensional lattices with many elements whose images have absolute value 1 under every complex embedding. The construction is a high-dimensional analogue of the arithmetic behind Erd&#337;s&#8217;s classical square-grid lower bound.</p></li><li><p><strong>Why it works:</strong> Golod-Shafarevich theory guarantees an infinite tower exists, even after a quotient step that trivializes the prescribed Frobenius classes. A crucial property is that all resulting discriminants and class numbers stay at most exponential in the extension degree.</p></li><li><p><strong>Statement on AI use:</strong> The internal model was given an AI-written problem statement, and its output was checked by an AI grading pipeline before any human examined it. After AI-assisted verification and rewriting, a draft was sent to external mathematicians, including number theory experts, who confirmed the proof&#8217;s correctness and have since simplified and strengthened the argument.</p></li></ul><p><strong><a href="https://cdn.openai.com/pdf/74c24085-19b0-4534-9c90-465b8e29ad73/unit-distance-proof.pdf">Paper</a></strong> | <strong><a href="https://x.com/OpenAI/status/2057176201782075690">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
Memory as a Model</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZV5t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f262012-b43c-4481-b700-58b6fe4386f5_996x264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZV5t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f262012-b43c-4481-b700-58b6fe4386f5_996x264.png 424w, https://substackcdn.com/image/fetch/$s_!ZV5t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f262012-b43c-4481-b700-58b6fe4386f5_996x264.png 848w, https://substackcdn.com/image/fetch/$s_!ZV5t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f262012-b43c-4481-b700-58b6fe4386f5_996x264.png 1272w, https://substackcdn.com/image/fetch/$s_!ZV5t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f262012-b43c-4481-b700-58b6fe4386f5_996x264.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZV5t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f262012-b43c-4481-b700-58b6fe4386f5_996x264.png" width="996" height="264" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f262012-b43c-4481-b700-58b6fe4386f5_996x264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:264,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Memory as 
a Model&quot;,&quot;title&quot;:&quot;Memory as a Model&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Memory as a Model" title="Memory as a Model" srcset="https://substackcdn.com/image/fetch/$s_!ZV5t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f262012-b43c-4481-b700-58b6fe4386f5_996x264.png 424w, https://substackcdn.com/image/fetch/$s_!ZV5t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f262012-b43c-4481-b700-58b6fe4386f5_996x264.png 848w, https://substackcdn.com/image/fetch/$s_!ZV5t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f262012-b43c-4481-b700-58b6fe4386f5_996x264.png 1272w, https://substackcdn.com/image/fetch/$s_!ZV5t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f262012-b43c-4481-b700-58b6fe4386f5_996x264.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 
12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>MeMo augments any frozen LLM with a separately trained memory model that stores, retrieves, and integrates facts on the base model&#8217;s behalf. Memory updates are decoupled from base-model weight updates, so the system supports continual learning without catastrophic forgetting, a property RAG fails to deliver because a vector store is just a database with a learned encoder bolted on.</p><ul><li><p><strong>Memory as a learned subsystem:</strong> MeMo has explicit read, write, and integrate interfaces rather than relying on the context window. The paper&#8217;s position is that agent memory should be modular, learned, and gated.</p></li><li><p><strong>Decoupled update schedule:</strong> New facts are absorbed through the memory model&#8217;s training loop without touching backbone weights. This makes weekly knowledge updates feasible without retraining the base model and without vector-DB churn.</p></li><li><p><strong>Continual-learning robustness:</strong> Across the evaluated tasks, the system retains old knowledge while ingesting new knowledge, addressing a known failure mode of fine-tuning and a known limitation of retrieval-based memory.</p></li><li><p><strong>Why it matters:</strong> Most production agent systems still bolt a vector store onto an LLM and call it memory. 
MeMo proposes that memory should be a trained component with explicit interfaces, which has implications for how long-running agent platforms are architected.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.15156">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2057182105671750047">Tweet</a></strong></p><div><hr></div><h2><strong>4. AIRA</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6YWa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a7e61dd-73c5-4398-b915-70ea29b9e61a_996x531.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6YWa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a7e61dd-73c5-4398-b915-70ea29b9e61a_996x531.png 424w, https://substackcdn.com/image/fetch/$s_!6YWa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a7e61dd-73c5-4398-b915-70ea29b9e61a_996x531.png 848w, https://substackcdn.com/image/fetch/$s_!6YWa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a7e61dd-73c5-4398-b915-70ea29b9e61a_996x531.png 1272w, https://substackcdn.com/image/fetch/$s_!6YWa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a7e61dd-73c5-4398-b915-70ea29b9e61a_996x531.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6YWa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a7e61dd-73c5-4398-b915-70ea29b9e61a_996x531.png" width="996" height="531" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a7e61dd-73c5-4398-b915-70ea29b9e61a_996x531.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:531,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AIRA&quot;,&quot;title&quot;:&quot;AIRA&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AIRA" title="AIRA" srcset="https://substackcdn.com/image/fetch/$s_!6YWa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a7e61dd-73c5-4398-b915-70ea29b9e61a_996x531.png 424w, https://substackcdn.com/image/fetch/$s_!6YWa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a7e61dd-73c5-4398-b915-70ea29b9e61a_996x531.png 848w, https://substackcdn.com/image/fetch/$s_!6YWa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a7e61dd-73c5-4398-b915-70ea29b9e61a_996x531.png 1272w, https://substackcdn.com/image/fetch/$s_!6YWa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a7e61dd-73c5-4398-b915-70ea29b9e61a_996x531.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Meta&#8217;s AIRA is an agent system that autonomously discovers neural architectures, producing models that beat Llama 3.2 at 350M, 1B, and 3B scales under a 24-hour compute budget. The search is split across two specialized agents: AIRA-Compose searches macro architecture, and AIRA-Design implements the low-level mechanisms. The split outperforms a single end-to-end agent on this non-toy search problem.</p><ul><li><p><strong>Two-agent decomposition:</strong> A planner picks structure; an implementer fills in mechanisms. This pattern generalizes well beyond neural architecture search to pipeline assembly, query planning, prompt scaffolding, and tool-use programs.</p></li><li><p><strong>Beats Llama 3.2 at three scales under budget:</strong> Discovered architectures match or exceed Llama 3.2 at 350M, 1B, and 3B parameter scales within a 24-hour compute budget for the search itself. 
That is competitive with months of human-led ablation studies.</p></li><li><p><strong>Search, not synthesis:</strong> The discovered models are not LLM-written code patches grafted into a framework. They are full architectures discovered through structured search guided by the two-agent loop.</p></li><li><p><strong>Why it matters:</strong> If agentic search can produce competitive architectures end to end, then neural architecture search (NAS) and large parts of the ML research workflow become candidates for automation by agent systems rather than by hand-engineered search algorithms.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.15871">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2056434731508703607">Tweet</a></strong></p><div><hr></div><h2><strong>5. Weak-Model Critic-Comparator</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!82V_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39253814-da07-489d-827e-2e204ce71a8b_997x651.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!82V_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39253814-da07-489d-827e-2e204ce71a8b_997x651.png 424w, https://substackcdn.com/image/fetch/$s_!82V_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39253814-da07-489d-827e-2e204ce71a8b_997x651.png 848w, https://substackcdn.com/image/fetch/$s_!82V_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39253814-da07-489d-827e-2e204ce71a8b_997x651.png 1272w, 
https://substackcdn.com/image/fetch/$s_!82V_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39253814-da07-489d-827e-2e204ce71a8b_997x651.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!82V_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39253814-da07-489d-827e-2e204ce71a8b_997x651.png" width="997" height="651" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39253814-da07-489d-827e-2e204ce71a8b_997x651.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:651,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Weak-Model Critic-Comparator&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Weak-Model Critic-Comparator" title="Weak-Model Critic-Comparator" srcset="https://substackcdn.com/image/fetch/$s_!82V_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39253814-da07-489d-827e-2e204ce71a8b_997x651.png 424w, https://substackcdn.com/image/fetch/$s_!82V_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39253814-da07-489d-827e-2e204ce71a8b_997x651.png 848w, https://substackcdn.com/image/fetch/$s_!82V_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39253814-da07-489d-827e-2e204ce71a8b_997x651.png 1272w, 
https://substackcdn.com/image/fetch/$s_!82V_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39253814-da07-489d-827e-2e204ce71a8b_997x651.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>GPT-5.4 nano wrapped in a critic-comparator orchestration loop reaches 76.4% on SWE-bench Verified, matching standalone Gemini 3 Pro and Claude Opus 4.5 Thinking. 
The trick is to sample k=8 candidate patches from the weak model and select the winner using execution and proof signals rather than asking the model to self-rank.</p><ul><li><p><strong>k=8 candidates plus a verifier match frontier models:</strong> A weak model&#8217;s k sampled candidates often already contain a correct patch. The selector is the limiting factor, not the base model&#8217;s capability.</p></li><li><p><strong>Execution and proof signals as selection:</strong> Candidates are run and verified rather than scored by an LLM judge. The critic and comparator are separate roles inside the loop, each with a narrow task.</p></li><li><p><strong>Matches frontier performance at lower per-task cost:</strong> Selecting among nano-tier proposals is cheaper than calling a frontier model once, even after accounting for the 8x sampling, because the dominant cost driver is model size rather than call count.</p></li><li><p><strong>Why it matters:</strong> This is a reproducible recipe for getting frontier-level coding-agent results out of cheaper models. The result also reframes where SWE-bench progress is coming from: orchestration quality, not just stronger base models.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.14163">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2056427128401641908">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
MetaCogAgent</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y3gZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63823559-ddd4-46ba-9cee-5fef02c126d5_1250x1064.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y3gZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63823559-ddd4-46ba-9cee-5fef02c126d5_1250x1064.png 424w, https://substackcdn.com/image/fetch/$s_!y3gZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63823559-ddd4-46ba-9cee-5fef02c126d5_1250x1064.png 848w, https://substackcdn.com/image/fetch/$s_!y3gZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63823559-ddd4-46ba-9cee-5fef02c126d5_1250x1064.png 1272w, https://substackcdn.com/image/fetch/$s_!y3gZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63823559-ddd4-46ba-9cee-5fef02c126d5_1250x1064.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y3gZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63823559-ddd4-46ba-9cee-5fef02c126d5_1250x1064.png" width="1250" height="1064" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63823559-ddd4-46ba-9cee-5fef02c126d5_1250x1064.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1064,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!y3gZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63823559-ddd4-46ba-9cee-5fef02c126d5_1250x1064.png 424w, https://substackcdn.com/image/fetch/$s_!y3gZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63823559-ddd4-46ba-9cee-5fef02c126d5_1250x1064.png 848w, https://substackcdn.com/image/fetch/$s_!y3gZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63823559-ddd4-46ba-9cee-5fef02c126d5_1250x1064.png 1272w, https://substackcdn.com/image/fetch/$s_!y3gZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63823559-ddd4-46ba-9cee-5fef02c126d5_1250x1064.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>MetaCogAgent equips a multi-agent system with metacognition, so each agent decides whether it should answer or delegate. A key bottleneck in current multi-agent systems is that agents over-delegate or under-delegate, and a metacognitive gate is a principled way to manage both. Each agent&#8217;s Metacognitive Unit (MCU) produces confidence scores that drive routing to a delegation hub.</p><ul><li><p><strong>Confidence-driven routing:</strong> Each agent&#8217;s MCU combines verbalized and profile-based confidence into a single score. Low-confidence tasks route to a delegation hub rather than being answered anyway.</p></li><li><p><strong>Self-aware specialization beats fixed routers:</strong> MetaCogAgent reaches 82.4% on MetaCog-Eval, versus 70.2% for a skill-fixed router and 65.3% for a single agent. 
Self-assessment and adaptive delegation each contribute material gains in ablations.</p></li><li><p><strong>Emergent specialization:</strong> Distinct confidence profiles (high on coding, low on retrieval, etc.) emerge purely from feedback. No specialization is encoded beyond initial system prompts.</p></li><li><p><strong>Why it matters:</strong> Multi-agent systems usually rely on fixed routers or simple round-robin schemes. A learned, uncertainty-aware delegation gate gives a primitive that adapts to task difficulty without retraining the routing layer.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.17292">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2056822215619035156">Tweet</a></strong></p><div><hr></div><h2><strong>7. Production Agent Architecture Methodology</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G3Sv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05113900-82d3-4508-9cef-27212db9f950_1469x713.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G3Sv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05113900-82d3-4508-9cef-27212db9f950_1469x713.png 424w, https://substackcdn.com/image/fetch/$s_!G3Sv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05113900-82d3-4508-9cef-27212db9f950_1469x713.png 848w, https://substackcdn.com/image/fetch/$s_!G3Sv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05113900-82d3-4508-9cef-27212db9f950_1469x713.png 1272w, 
https://substackcdn.com/image/fetch/$s_!G3Sv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05113900-82d3-4508-9cef-27212db9f950_1469x713.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G3Sv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05113900-82d3-4508-9cef-27212db9f950_1469x713.png" width="1456" height="707" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05113900-82d3-4508-9cef-27212db9f950_1469x713.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:707,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Production Agent Architecture Methodology&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Production Agent Architecture Methodology" title="Production Agent Architecture Methodology" srcset="https://substackcdn.com/image/fetch/$s_!G3Sv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05113900-82d3-4508-9cef-27212db9f950_1469x713.png 424w, https://substackcdn.com/image/fetch/$s_!G3Sv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05113900-82d3-4508-9cef-27212db9f950_1469x713.png 848w, https://substackcdn.com/image/fetch/$s_!G3Sv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05113900-82d3-4508-9cef-27212db9f950_1469x713.png 1272w, 
https://substackcdn.com/image/fetch/$s_!G3Sv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05113900-82d3-4508-9cef-27212db9f950_1469x713.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>A methodology paper on selecting and composing runtime architecture patterns for production LLM agents. The core argument is that most teams accidentally let framework defaults make critical architecture decisions for them. 
The paper introduces the stochastic-deterministic boundary (SDB) as a named primitive and presents a six-pattern catalog organized by the three runtime concerns of coordination, state, and control.</p><ul><li><p><strong>Stochastic-deterministic boundary:</strong> A four-part contract of proposer, verifier, commit, and reject that marks where the LLM hands off to deterministic infrastructure. The paper inventories how five widely used open-source agent frameworks place this boundary, often implicitly.</p></li><li><p><strong>Three-by-six pattern catalog:</strong> Six patterns organized along three orthogonal concerns. Coordination patterns answer how work splits and combines. State patterns answer how the system remembers. Control patterns answer who decides what runs and when to stop.</p></li><li><p><strong>Patterns as deliberate choices:</strong> Each pattern has a typed-contract specification of input type, output type, deadline, retry budget, and partial-result policy. The catalog grows when a candidate pattern passes this specification procedure, rather than by adding ad-hoc abstractions.</p></li><li><p><strong>Why it matters:</strong> Production agent failures rarely come from the LLM. They come from architectural choices that were made by default. The methodology gives teams a way to surface those choices and make them deliberately.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.20173">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2057159497282707875">Tweet</a></strong></p><div><hr></div><h2><strong>8. NanoGPT-Bench</strong></h2><p>A new evaluation of whether coding agents can do real AI R&amp;D. Intology runs Codex, Claude Code, and Autoresearch on the NanoGPT-Bench suite and reports that the agents recover only 9.3% of human progress on the same problems. Coding agents spend the bulk of their compute on hyperparameter tuning and rarely attempt algorithmic research. 
Claude Code and Autoresearch reason about algorithmic changes more often, but still tend to dodge implementing them. The headline result tempers the current wave of &#8220;self-improving agent&#8221; claims: producing real research progress requires a different distribution of effort than the one current coding agents converge to under their default scaffolds.</p><p><strong><a href="https://www.intology.ai/blog/nanogpt-bench">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2056901737055752633">Tweet</a></strong></p><div><hr></div><h2><strong>9. General-Agent</strong></h2><p>Prime Intellect&#8217;s General-Agent is a fully synthetic reinforcement learning environment whose task corpus self-evolves and grows harder over time. The release ships with 4,504 tool-use tasks across 1,040 domains and 8,159 unique tools. Synthetic task creation is formulated as a two-player game between a Synthesizer that proposes new task families and a Solver that runs rollouts to measure pass rates. Tasks whose pass rate falls inside a calibrated difficulty band are accepted into the corpus, and hard tiers seed the next round of extensions. The framing turns RL environment creation, historically a major bottleneck, into an automated agentic search problem in its own right.</p><p><strong><a href="https://www.primeintellect.ai/blog/general-agent">Paper</a></strong> | <strong><a href="https://x.com/PrimeIntellect/status/2056569877167808966">Tweet</a></strong></p><div><hr></div><h2><strong>10. Contrastive Neuron Attribution</strong></h2><p>Nous Research releases Contrastive Neuron Attribution (CNA), a method for steering LLM behavior by identifying and ablating sparse circuits in the MLP basis without training a sparse autoencoder, modifying weights, or degrading general capability benchmarks. 
Given a small set of contrastive prompt pairs that elicit a target behavior and its opposite, CNA isolates the top 0.1% of MLP neurons whose activations differ most between the two sets. Ablating that small circuit removes the behavior while leaving the rest of the model intact. The intervention remains robust at high strengths where residual-stream methods like Contrastive Activation Addition (CAA) start to degrade. Validated on the refusal circuit across 8 instruct-tuned models including Llama-3.1-70B, Llama-3.2-3B, Qwen2.5-72B, and Qwen2.5-14B.</p><p><strong><a href="https://arxiv.org/abs/2605.12290">Paper</a></strong> | <strong><a href="https://x.com/NousResearch/status/2056778746716107193">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Gemini 3.5 Flash, Antigravity 2.0, Codex Thursday, Cohere Command A+, Qwen3.7-Max, and More]]></title><description><![CDATA[Gemini 3.5 Flash, Antigravity 2.0, Codex Thursday, Cohere Command A+, Qwen3.7-Max, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-gemini-35-flash</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-gemini-35-flash</guid><pubDate>Sat, 23 May 2026 15:02:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!u1O7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9e2be9-dc28-4eb6-a40a-1c0d205d26e6_1187x889.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Google ships Gemini 3.5 Flash for agents</p></li><li><p>Antigravity 2.0 becomes a full agent platform</p></li><li><p>OpenAI ships Appshots and /goal in Codex</p></li><li><p>Cohere open-sources Command A+ on Apache 2.0</p></li><li><p>Qwen3.7-Max runs agents for 35 hours straight</p></li><li><p>NVIDIA verifies agent skills</p></li><li><p>Cursor Composer 2.5 sharpens coding agents</p></li><li><p>Anthropic acquires Stainless for 
SDK tooling</p></li><li><p>Browserbase opens Browse.sh skills catalog</p></li><li><p>Gemini Omni unifies create-anything model</p></li><li><p>OpenAI cracks an 80-year Erd&#337;s problem</p></li><li><p>Compiling agent workflows into model weights</p></li><li><p>PEEK orientation cache for long-context agents</p></li><li><p>SaaS-Bench exposes computer-use agent ceiling</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Gemini 3.5 Flash and Managed Agents Land</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u1O7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9e2be9-dc28-4eb6-a40a-1c0d205d26e6_1187x889.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u1O7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9e2be9-dc28-4eb6-a40a-1c0d205d26e6_1187x889.png 424w, https://substackcdn.com/image/fetch/$s_!u1O7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9e2be9-dc28-4eb6-a40a-1c0d205d26e6_1187x889.png 848w, https://substackcdn.com/image/fetch/$s_!u1O7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9e2be9-dc28-4eb6-a40a-1c0d205d26e6_1187x889.png 1272w, https://substackcdn.com/image/fetch/$s_!u1O7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9e2be9-dc28-4eb6-a40a-1c0d205d26e6_1187x889.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!u1O7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9e2be9-dc28-4eb6-a40a-1c0d205d26e6_1187x889.png" width="1187" height="889" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a9e2be9-dc28-4eb6-a40a-1c0d205d26e6_1187x889.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:889,&quot;width&quot;:1187,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!u1O7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9e2be9-dc28-4eb6-a40a-1c0d205d26e6_1187x889.png 424w, https://substackcdn.com/image/fetch/$s_!u1O7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9e2be9-dc28-4eb6-a40a-1c0d205d26e6_1187x889.png 848w, https://substackcdn.com/image/fetch/$s_!u1O7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9e2be9-dc28-4eb6-a40a-1c0d205d26e6_1187x889.png 1272w, https://substackcdn.com/image/fetch/$s_!u1O7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9e2be9-dc28-4eb6-a40a-1c0d205d26e6_1187x889.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Google opened I/O 2026 with Gemini 3.5 Flash, a frontier model tuned explicitly for agents and coding, alongside Managed Agents in the Gemini API that ship an isolated execution environment with every request.</p><ul><li><p><strong>Agentic benchmarks:</strong> Gemini 3.5 Flash posts 76.2% on Terminal-Bench 2.1, 83.6% on MCP Atlas, and 1656 Elo on GDPval-AA, outperforming Gemini 3.1 Pro on long-horizon coding and tool-use tasks at 4x faster output.</p></li><li><p><strong>Managed Agents preview:</strong> A single Gemini API call spins up an agent that reasons, uses tools, and executes code in an ephemeral Linux sandbox managed by Google, with AGENTS.md and SKILL.md as versionable config.</p></li><li><p><strong>Where it ships:</strong> Available 
in Google AI Studio, Android Studio, Antigravity, Gemini Enterprise Agent Platform, the Gemini app, and AI Mode in Search, with 3.5 Pro slated for next month.</p></li><li><p><strong>Why it matters:</strong> Flash is now the cost-optimized agent default at Google scale, and Managed Agents removes the build-your-own-sandbox tax that has kept many teams on third-party runtimes.</p></li></ul><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/">Blog</a></strong> | <strong><a href="https://blog.google/innovation-and-ai/technology/developers-tools/managed-agents-gemini-api/">Managed Agents</a></strong></p>
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-gemini-35-flash">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (May 11 - May 17)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-086</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-086</guid><pubDate>Sun, 17 May 2026 15:02:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qeWy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c70f7-0186-4042-b91b-84955f643118_1200x750.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The Top AI Papers of the Week (May 11 - May 17)</p><h2><strong>1. Lighthouse Attention</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qeWy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c70f7-0186-4042-b91b-84955f643118_1200x750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qeWy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c70f7-0186-4042-b91b-84955f643118_1200x750.png 424w, https://substackcdn.com/image/fetch/$s_!qeWy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c70f7-0186-4042-b91b-84955f643118_1200x750.png 848w, https://substackcdn.com/image/fetch/$s_!qeWy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c70f7-0186-4042-b91b-84955f643118_1200x750.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qeWy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c70f7-0186-4042-b91b-84955f643118_1200x750.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qeWy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c70f7-0186-4042-b91b-84955f643118_1200x750.png" width="1200" height="750" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f34c70f7-0186-4042-b91b-84955f643118_1200x750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:750,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Lighthouse Attention&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Lighthouse Attention" title="Lighthouse Attention" srcset="https://substackcdn.com/image/fetch/$s_!qeWy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c70f7-0186-4042-b91b-84955f643118_1200x750.png 424w, https://substackcdn.com/image/fetch/$s_!qeWy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c70f7-0186-4042-b91b-84955f643118_1200x750.png 848w, https://substackcdn.com/image/fetch/$s_!qeWy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c70f7-0186-4042-b91b-84955f643118_1200x750.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qeWy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c70f7-0186-4042-b91b-84955f643118_1200x750.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Nous Research proposes a training-only attention wrapper for long-context pretraining. Lighthouse Attention wraps standard SDPA with a hierarchical, gradient-free selection layer that compresses and decompresses queries, keys, and values symmetrically while preserving left-to-right causality. 
The wrapper is removed near the end of training in a short recovery phase, so the deployed model runs vanilla attention with no architectural change at inference. Preliminary LLM experiments report faster total training time and lower final loss than full-attention baselines.</p><ul><li><p><strong>Subquadratic wrapper with vanilla deployment:</strong> The hierarchical selector reduces the cost of long-context training without modifying the underlying attention operator. After the recovery phase, the trained weights are compatible with standard SDPA at inference.</p></li><li><p><strong>Symmetric compression preserves causality:</strong> Queries, keys, and values are compressed and decompressed through the same hierarchy, which keeps the wrapper compatible with left-to-right attention.</p></li><li><p><strong>Training-time speedup at lower final loss:</strong> Preliminary runs report faster wall-clock training and lower final loss than full-attention baselines under matched FLOPs, including 21x faster forward latency at 512K context.</p></li><li><p><strong>Why it matters:</strong> A training-only modification that leaves the deployed model unchanged sidesteps the usual deployment-time tradeoffs of efficient-attention methods.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.06554">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2054224130103554359">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8jee!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029af52f-09e8-4399-880a-e7873c379ee4_831x505.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!8jee!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029af52f-09e8-4399-880a-e7873c379ee4_831x505.png 424w, https://substackcdn.com/image/fetch/$s_!8jee!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029af52f-09e8-4399-880a-e7873c379ee4_831x505.png 848w, https://substackcdn.com/image/fetch/$s_!8jee!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029af52f-09e8-4399-880a-e7873c379ee4_831x505.png 1272w, https://substackcdn.com/image/fetch/$s_!8jee!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029af52f-09e8-4399-880a-e7873c379ee4_831x505.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8jee!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029af52f-09e8-4399-880a-e7873c379ee4_831x505.png" width="831" height="505" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/029af52f-09e8-4399-880a-e7873c379ee4_831x505.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:505,&quot;width&quot;:831,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;DAIR Academy Hands-on Labs&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="DAIR Academy Hands-on Labs" title="DAIR Academy Hands-on Labs" 
srcset="https://substackcdn.com/image/fetch/$s_!8jee!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029af52f-09e8-4399-880a-e7873c379ee4_831x505.png 424w, https://substackcdn.com/image/fetch/$s_!8jee!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029af52f-09e8-4399-880a-e7873c379ee4_831x505.png 848w, https://substackcdn.com/image/fetch/$s_!8jee!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029af52f-09e8-4399-880a-e7873c379ee4_831x505.png 1272w, https://substackcdn.com/image/fetch/$s_!8jee!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029af52f-09e8-4399-880a-e7873c379ee4_831x505.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>We just released new hands-on labs on <a href="https://academy.dair.ai/labs">DAIR.AI Academy</a> to help you build alongside agents. Start with practical, guided labs for agentic image generation and building your first agent skill, with more labs coming soon.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.dair.ai/labs&quot;,&quot;text&quot;:&quot;Enroll&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.dair.ai/labs"><span>Enroll</span></a></p><div><hr></div><h2><strong>2. Is Grep All You Need?</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j96Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49259801-575e-42cf-b97d-9a6e62d17c68_2568x1396.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j96Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49259801-575e-42cf-b97d-9a6e62d17c68_2568x1396.png 424w, https://substackcdn.com/image/fetch/$s_!j96Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49259801-575e-42cf-b97d-9a6e62d17c68_2568x1396.png 848w, 
https://substackcdn.com/image/fetch/$s_!j96Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49259801-575e-42cf-b97d-9a6e62d17c68_2568x1396.png 1272w, https://substackcdn.com/image/fetch/$s_!j96Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49259801-575e-42cf-b97d-9a6e62d17c68_2568x1396.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j96Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49259801-575e-42cf-b97d-9a6e62d17c68_2568x1396.png" width="1456" height="792" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49259801-575e-42cf-b97d-9a6e62d17c68_2568x1396.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:792,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Is Grep All You Need?&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Is Grep All You Need?" title="Is Grep All You Need?" 
srcset="https://substackcdn.com/image/fetch/$s_!j96Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49259801-575e-42cf-b97d-9a6e62d17c68_2568x1396.png 424w, https://substackcdn.com/image/fetch/$s_!j96Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49259801-575e-42cf-b97d-9a6e62d17c68_2568x1396.png 848w, https://substackcdn.com/image/fetch/$s_!j96Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49259801-575e-42cf-b97d-9a6e62d17c68_2568x1396.png 1272w, https://substackcdn.com/image/fetch/$s_!j96Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49259801-575e-42cf-b97d-9a6e62d17c68_2568x1396.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The paper evaluates grep-style text search against embedding-based retrieval inside coding agents. When wrapped in a suitable agent harness, grep matches or exceeds embedding retrieval on coding-agent tasks. The study isolates the contribution of the harness from the contribution of the retrieval primitive, and finds that harness design accounts for most of the performance differential typically attributed to embeddings.</p><ul><li><p><strong>Direct comparison of grep vs. embeddings:</strong> Coding-agent tasks evaluated under controlled conditions show grep-based retrieval reaching parity with or exceeding embedding-based retrieval.</p></li><li><p><strong>Harness design as the dominant variable:</strong> Holding the index constant and varying the harness produces larger performance shifts than the inverse, indicating that retrieval comparisons in prior work have likely been confounded by harness differences.</p></li><li><p><strong>Implications for codebase structure:</strong> Grep performs best when the codebase is properly indexed and structured for an agent to navigate, while embedding retrieval can partially compensate for unstructured input.</p></li><li><p><strong>Why it matters:</strong> Vector databases are a common default in coding-agent stacks. The result suggests that for many coding tasks, harness improvements and basic text search can substitute for embedding infrastructure.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.15184">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2055317577031975269">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
A Geometric Calculator Inside a Neural Network</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g5OI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9e374b-3e5f-4419-8b4e-e7f3e86092c6_2000x1282.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g5OI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9e374b-3e5f-4419-8b4e-e7f3e86092c6_2000x1282.png 424w, https://substackcdn.com/image/fetch/$s_!g5OI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9e374b-3e5f-4419-8b4e-e7f3e86092c6_2000x1282.png 848w, https://substackcdn.com/image/fetch/$s_!g5OI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9e374b-3e5f-4419-8b4e-e7f3e86092c6_2000x1282.png 1272w, https://substackcdn.com/image/fetch/$s_!g5OI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9e374b-3e5f-4419-8b4e-e7f3e86092c6_2000x1282.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g5OI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9e374b-3e5f-4419-8b4e-e7f3e86092c6_2000x1282.png" width="1456" height="933" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a9e374b-3e5f-4419-8b4e-e7f3e86092c6_2000x1282.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:933,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Geometric Calculator&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Geometric Calculator" title="Geometric Calculator" srcset="https://substackcdn.com/image/fetch/$s_!g5OI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9e374b-3e5f-4419-8b4e-e7f3e86092c6_2000x1282.png 424w, https://substackcdn.com/image/fetch/$s_!g5OI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9e374b-3e5f-4419-8b4e-e7f3e86092c6_2000x1282.png 848w, https://substackcdn.com/image/fetch/$s_!g5OI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9e374b-3e5f-4419-8b4e-e7f3e86092c6_2000x1282.png 1272w, https://substackcdn.com/image/fetch/$s_!g5OI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9e374b-3e5f-4419-8b4e-e7f3e86092c6_2000x1282.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Goodfire reports mechanistic interpretability work identifying a geometric calculator inside an LLM. The model represents numbers as Fourier features, where circles in activation space correspond to numbers modulo a given base. Arithmetic operations are implemented as rotations of these circles, forming a variant of a residue number system that does not require coprime moduli. The same circuit appears to be reused beyond arithmetic.</p><ul><li><p><strong>Numbers as rotating circles:</strong> Numerical quantities are encoded as positions on circles in activation space, with addition implemented as rotation. The encoding extends prior findings that LLMs represent numbers via Fourier features.</p></li><li><p><strong>Residue-system-like structure:</strong> The set of circles forms a residue number system variant. 
Unlike the textbook residue system, the moduli do not need to be coprime, which is the mechanistic detail the paper introduces.</p></li><li><p><strong>Reuse beyond arithmetic:</strong> The same rotational machinery shows up in non-math contexts inside the model, suggesting the geometric calculator is a general-purpose internal structure rather than a math-specific subnetwork.</p></li><li><p><strong>Why it matters:</strong> The finding gives interpretability researchers a concrete, reproducible circuit to target and connects geometric representation analysis to functional behavior beyond toy settings.</p></li></ul><p><strong><a href="https://www.goodfire.ai/research/a-geometric-calculator">Paper</a></strong> | <strong><a href="https://x.com/GoodfireAI/status/2054962242022777189">Tweet</a></strong></p><div><hr></div><h2><strong>4. &#948;-mem</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!12AG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf22e89-7929-4427-bdbd-d65d1708e22e_2550x1084.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!12AG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf22e89-7929-4427-bdbd-d65d1708e22e_2550x1084.png 424w, https://substackcdn.com/image/fetch/$s_!12AG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf22e89-7929-4427-bdbd-d65d1708e22e_2550x1084.png 848w, https://substackcdn.com/image/fetch/$s_!12AG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf22e89-7929-4427-bdbd-d65d1708e22e_2550x1084.png 1272w, 
https://substackcdn.com/image/fetch/$s_!12AG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf22e89-7929-4427-bdbd-d65d1708e22e_2550x1084.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!12AG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf22e89-7929-4427-bdbd-d65d1708e22e_2550x1084.png" width="1456" height="619" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2cf22e89-7929-4427-bdbd-d65d1708e22e_2550x1084.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:619,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&#948;-mem&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="&#948;-mem" title="&#948;-mem" srcset="https://substackcdn.com/image/fetch/$s_!12AG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf22e89-7929-4427-bdbd-d65d1708e22e_2550x1084.png 424w, https://substackcdn.com/image/fetch/$s_!12AG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf22e89-7929-4427-bdbd-d65d1708e22e_2550x1084.png 848w, https://substackcdn.com/image/fetch/$s_!12AG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf22e89-7929-4427-bdbd-d65d1708e22e_2550x1084.png 1272w, 
https://substackcdn.com/image/fetch/$s_!12AG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf22e89-7929-4427-bdbd-d65d1708e22e_2550x1084.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>&#948;-mem augments a frozen full-attention model with a compact online associative-memory state. The state is a fixed-size matrix updated by delta-rule learning during generation, and its readout produces low-rank corrections to the backbone&#8217;s attention output. 
There is no fine-tuning, no backbone swap, and no context extension.</p><ul><li><p><strong>Frozen backbone:</strong> The base model weights are unchanged. &#948;-mem adds a small online state plus a pair of low-rank read and write projections.</p></li><li><p><strong>Delta-rule update integrated into attention:</strong> The memory matrix is updated by delta-rule learning during generation, and the readout produces additive query and output corrections to the attention computation rather than functioning as a separate retrieval step.</p></li><li><p><strong>Results from an 8x8 state:</strong> An 8x8 online memory raises the average score to 1.10x that of the frozen backbone and 1.15x that of the strongest non-&#948;-mem memory baseline. On memory-heavy benchmarks the gap widens: 1.31x on MemoryAgentBench and 1.20x on LoCoMo. General capabilities are largely preserved.</p></li><li><p><strong>Why it matters:</strong> The mechanism offers an alternative to context extension and external retrieval for long-horizon memory, with minimal deployment overhead on frozen frontier models.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.12357">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2054600147020222630">Tweet</a></strong></p><div><hr></div><h2><strong>5. 
Beyond Individual Intelligence</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cwfw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4da88-457f-4091-92e7-b4b68d1c3565_1912x1356.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cwfw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4da88-457f-4091-92e7-b4b68d1c3565_1912x1356.png 424w, https://substackcdn.com/image/fetch/$s_!cwfw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4da88-457f-4091-92e7-b4b68d1c3565_1912x1356.png 848w, https://substackcdn.com/image/fetch/$s_!cwfw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4da88-457f-4091-92e7-b4b68d1c3565_1912x1356.png 1272w, https://substackcdn.com/image/fetch/$s_!cwfw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4da88-457f-4091-92e7-b4b68d1c3565_1912x1356.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cwfw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4da88-457f-4091-92e7-b4b68d1c3565_1912x1356.png" width="1456" height="1033" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7b4da88-457f-4091-92e7-b4b68d1c3565_1912x1356.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1033,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Beyond Individual Intelligence&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Beyond Individual Intelligence" title="Beyond Individual Intelligence" srcset="https://substackcdn.com/image/fetch/$s_!cwfw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4da88-457f-4091-92e7-b4b68d1c3565_1912x1356.png 424w, https://substackcdn.com/image/fetch/$s_!cwfw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4da88-457f-4091-92e7-b4b68d1c3565_1912x1356.png 848w, https://substackcdn.com/image/fetch/$s_!cwfw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4da88-457f-4091-92e7-b4b68d1c3565_1912x1356.png 1272w, https://substackcdn.com/image/fetch/$s_!cwfw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4da88-457f-4091-92e7-b4b68d1c3565_1912x1356.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>A multi-agent systems survey covering 200+ papers, organized along three axes: collaboration mechanisms, failure attribution, and self-evolution. Each axis is treated as a distinct research line. The self-evolution chapter maps how memory, meta-learning, and procedure-editing approaches intersect.</p><ul><li><p><strong>Three orthogonal axes:</strong> Collaboration mechanisms cover who communicates with whom and how. Failure attribution covers methods for localizing errors across agents. Self-evolution covers how a system updates its own behavior over time.</p></li><li><p><strong>Failure attribution as a first-class topic:</strong> Errors propagate through coordination protocols in multi-agent systems, making attribution difficult. 
The survey treats attribution methodology as a research area rather than a debugging activity.</p></li><li><p><strong>Self-evolution as a field map:</strong> The chapter identifies overlap between memory work, meta-learning, and procedure-editing approaches, and surfaces open questions in each area.</p></li><li><p><strong>Why it matters:</strong> The taxonomy provides a vocabulary for comparing multi-agent systems along axes that prior work has often conflated.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.14892">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2055318564127809571">Tweet</a></strong></p><div><hr></div><h2><strong>6. AutoTTS</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jfxf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd59ff0b-2996-49c7-bce3-8894c9ca564d_1734x512.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jfxf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd59ff0b-2996-49c7-bce3-8894c9ca564d_1734x512.png 424w, https://substackcdn.com/image/fetch/$s_!jfxf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd59ff0b-2996-49c7-bce3-8894c9ca564d_1734x512.png 848w, https://substackcdn.com/image/fetch/$s_!jfxf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd59ff0b-2996-49c7-bce3-8894c9ca564d_1734x512.png 1272w, 
https://substackcdn.com/image/fetch/$s_!jfxf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd59ff0b-2996-49c7-bce3-8894c9ca564d_1734x512.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jfxf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd59ff0b-2996-49c7-bce3-8894c9ca564d_1734x512.png" width="1456" height="430" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd59ff0b-2996-49c7-bce3-8894c9ca564d_1734x512.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:430,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:289959,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nlp.elvissaravia.com/i/198014920?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd59ff0b-2996-49c7-bce3-8894c9ca564d_1734x512.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jfxf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd59ff0b-2996-49c7-bce3-8894c9ca564d_1734x512.png 424w, https://substackcdn.com/image/fetch/$s_!jfxf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd59ff0b-2996-49c7-bce3-8894c9ca564d_1734x512.png 848w, 
https://substackcdn.com/image/fetch/$s_!jfxf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd59ff0b-2996-49c7-bce3-8894c9ca564d_1734x512.png 1272w, https://substackcdn.com/image/fetch/$s_!jfxf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd59ff0b-2996-49c7-bce3-8894c9ca564d_1734x512.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>AutoTTS reframes test-time scaling as a search problem. 
Instead of designing branching, pruning, and stopping heuristics directly, the user constructs a discovery environment in which TTS strategies are searched automatically. Width-depth TTS is recast as controller synthesis over pre-collected reasoning trajectories and probe signals, so candidate controllers can be evaluated without repeated LLM calls.</p><ul><li><p><strong>Discovery environment plus offline evaluator:</strong> The human specifies states, actions, and feedback. An explorer LLM iteratively proposes candidate controllers. Controllers are evaluated against pre-collected trajectories rather than by re-sampling the base model.</p></li><li><p><strong>Beta parameterization and trace-level feedback:</strong> Beta parameterization makes the controller space tractable for search. Execution-trace feedback gives the explorer information about why a candidate failed, not only that it did.</p></li><li><p><strong>Results on math reasoning benchmarks:</strong> Discovered controllers outperform hand-designed TTS recipes on the accuracy-cost Pareto frontier and transfer zero-shot to held-out benchmarks and model scales. Total discovery cost: $39.9 and 160 minutes.</p></li><li><p><strong>Why it matters:</strong> Automated search over TTS strategies is competitive with hand-tuned heuristics at low cost, which shifts where the research effort needs to go.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.08083">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2053978221193130434">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
AI Co-Mathematician</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iXHW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56928eec-ac3c-4dc7-a523-83f67520d6f7_1247x519.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iXHW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56928eec-ac3c-4dc7-a523-83f67520d6f7_1247x519.png 424w, https://substackcdn.com/image/fetch/$s_!iXHW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56928eec-ac3c-4dc7-a523-83f67520d6f7_1247x519.png 848w, https://substackcdn.com/image/fetch/$s_!iXHW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56928eec-ac3c-4dc7-a523-83f67520d6f7_1247x519.png 1272w, https://substackcdn.com/image/fetch/$s_!iXHW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56928eec-ac3c-4dc7-a523-83f67520d6f7_1247x519.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iXHW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56928eec-ac3c-4dc7-a523-83f67520d6f7_1247x519.png" width="1247" height="519" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56928eec-ac3c-4dc7-a523-83f67520d6f7_1247x519.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:519,&quot;width&quot;:1247,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AI Co-Mathematician&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AI Co-Mathematician" title="AI Co-Mathematician" srcset="https://substackcdn.com/image/fetch/$s_!iXHW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56928eec-ac3c-4dc7-a523-83f67520d6f7_1247x519.png 424w, https://substackcdn.com/image/fetch/$s_!iXHW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56928eec-ac3c-4dc7-a523-83f67520d6f7_1247x519.png 848w, https://substackcdn.com/image/fetch/$s_!iXHW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56928eec-ac3c-4dc7-a523-83f67520d6f7_1247x519.png 1272w, https://substackcdn.com/image/fetch/$s_!iXHW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56928eec-ac3c-4dc7-a523-83f67520d6f7_1247x519.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Google DeepMind presents AI Co-Mathematician, an agentic research workbench for mathematicians. The system is an asynchronous, stateful environment that supports ideation, literature discovery, computational analysis, theorem verification, and knowledge development across long sessions. It reaches 48% on FrontierMath Tier 4, a new high among AI systems evaluated.</p><ul><li><p><strong>Asynchronous stateful workbench:</strong> The system runs as a persistent environment with multiple workstreams a mathematician can drive in parallel. 
Long-running computations, literature searches, and verification steps run in the background.</p></li><li><p><strong>Manages uncertainty and intent:</strong> The workbench records unsuccessful attempts, clarifies user intent when a request is underspecified, and emits formal mathematical outputs that can be checked rather than only read.</p></li><li><p><strong>48% on FrontierMath Tier 4:</strong> A new high score on the hardest tier of FrontierMath among AI systems evaluated. In early applications during active research sessions, the system solved open problems, surfaced fresh research directions, and recovered overlooked citations.</p></li><li><p><strong>Why it matters:</strong> The workbench design pattern (asynchronous, stateful, multi-workstream) generalizes to expert workflows where sessions span days rather than minutes.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.06651">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2054224343551639958">Tweet</a></strong></p><div><hr></div><h2><strong>8. AEvo</strong></h2><p>AEvo separates the iterative self-improvement loop into two roles: a candidate-proposer that generates the next attempt, and a meta-agent that observes traces and edits the procedure used to propose future candidates. Past runs (candidates, feedback, traces, failures) function as memory that the meta-agent reads when revising the procedure. AEvo reports a 26% relative gain over the strongest evolution baseline on agentic and reasoning benchmarks, and SOTA on three open-ended optimization tasks under the same iteration budget. The work demonstrates one way to use accumulated agentic search logs as input to procedure-level updates rather than discarding them after each run.</p><p><strong><a href="https://arxiv.org/abs/2605.13821">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2055000283605004602">Tweet</a></strong></p><div><hr></div><h2><strong>9. 
The Memory Curse in LLM Agents</strong></h2><p>A study of how long histories affect LLM agent behavior. Across 7 LLMs and 4 social dilemma games over 500 rounds, expanding accessible history degraded cooperation in 18 of 28 model-game combinations. Lexical analysis of 378,000 reasoning traces shows the mechanism is erosion of forward-looking intent rather than increased suspicion: long histories pull the model toward reasoning about past interactions rather than future payoffs. A LoRA adapter trained only on forward-looking traces mitigates the decay and transfers zero-shot to new games. Memory sanitization, which keeps prompt length fixed but swaps in synthetic cooperative records, restores cooperation, indicating the trigger is content rather than length. Ablating explicit chain-of-thought often reduces the collapse, suggesting deliberation amplifies the effect. The paper provides a diagnostic plus interventions for long-running agent systems where history quality, not just history length, drives behavior.</p><p><strong><a href="https://arxiv.org/abs/2605.08060">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2053863994499408214">Tweet</a></strong></p><div><hr></div><h2><strong>10. Token Superposition Training</strong></h2><p>Nous Research&#8217;s second pretraining paper of the week. Token Superposition Training (TST) is a modification to the standard LLM pretraining loop that produces a 2 to 3x wall-clock speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or training data. During the first third of training, the model reads and predicts contiguous bags of tokens, averaging their embeddings on the input side and predicting the next bag with a modified cross-entropy on the output side. For the remainder of the run, training reverts to standard next-token prediction. The inference-time model is identical to one produced by conventional pretraining. 
TST was validated at 270M, 600M, and 3B dense scales, and on a 10B-A1B mixture-of-experts model, where it reaches a lower final loss while consuming 4,768 B200-GPU-hours versus the baseline&#8217;s 12,311. Together with Lighthouse Attention, this is the second pretraining-loop modification from the same lab this week reporting substantial speedups without architecture changes.</p><p><strong><a href="https://arxiv.org/abs/2605.06546">Paper</a></strong> | <strong><a href="https://x.com/NousResearch/status/2054610062836892054">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Thinking Machines Interaction Models, Is Grep All You Need?, Codex Mobile + Hooks, Cursor Cloud Agents, Ring-2.6-1T, and More]]></title><description><![CDATA[Thinking Machines Interaction Models, Is Grep All You Need?, Codex Mobile + Hooks, Cursor Cloud Agents, Ring-2.6-1T, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-thinking-machines</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-thinking-machines</guid><pubDate>Sat, 16 May 2026 15:01:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Y0KV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fb9195e-97a4-4c98-87b8-c62536fa3be9_6885x4906.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Thinking Machines unveils interaction models</p></li><li><p>Is Grep All You Need? 
challenges vector RAG</p></li><li><p>OpenAI ships Codex mobile and hooks</p></li><li><p>Cursor adds cloud agent dev environments</p></li><li><p>Ring-2.6-1T open trillion-scale agent model</p></li><li><p>Recursive Superintelligence emerges with $650M</p></li><li><p>LangChain Labs targets continual learning</p></li><li><p>xAI launches Grok Build CLI</p></li><li><p>Claude Code adds agent view</p></li><li><p>Prime Intellect agents beat nanoGPT speedrun</p></li><li><p>World Labs open-sources image-blaster</p></li><li><p>Isomorphic Labs raises $2.1B Series B</p></li><li><p>Beyond Individual Intelligence multi-agent survey</p></li><li><p>LongMemEval-V2 raises the memory bar</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Thinking Machines Introduces Interaction Models</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fRWW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527d8cef-3b50-4234-ada4-dcdd62d409c2_1920x1080.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fRWW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527d8cef-3b50-4234-ada4-dcdd62d409c2_1920x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!fRWW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527d8cef-3b50-4234-ada4-dcdd62d409c2_1920x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!fRWW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527d8cef-3b50-4234-ada4-dcdd62d409c2_1920x1080.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!fRWW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527d8cef-3b50-4234-ada4-dcdd62d409c2_1920x1080.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fRWW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527d8cef-3b50-4234-ada4-dcdd62d409c2_1920x1080.jpeg" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/527d8cef-3b50-4234-ada4-dcdd62d409c2_1920x1080.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Thinking Machines Interaction Models&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Thinking Machines Interaction Models" title="Thinking Machines Interaction Models" srcset="https://substackcdn.com/image/fetch/$s_!fRWW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527d8cef-3b50-4234-ada4-dcdd62d409c2_1920x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!fRWW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527d8cef-3b50-4234-ada4-dcdd62d409c2_1920x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!fRWW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527d8cef-3b50-4234-ada4-dcdd62d409c2_1920x1080.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!fRWW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527d8cef-3b50-4234-ada4-dcdd62d409c2_1920x1080.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>Mira Murati&#8217;s Thinking Machines Lab released its first research preview: a new class of models trained from scratch for real-time interaction across audio, video, and text, instead of bolting streaming onto a turn-based stack.
The frontier shifts from &#8220;answer faster&#8221; to &#8220;stay engaged while you think.&#8221;</p><ul><li><p><strong>Time-aligned micro-turns:</strong> Input and output are treated as continuous 200ms streams, so the model can listen, look, and speak in parallel rather than waiting for full user turns.</p></li><li><p><strong>TML-Interaction-Small:</strong> A 276B parameter MoE with 12B active, using encoder-free early fusion, streaming inference sessions, and batch-invariant kernels for stable training.</p></li><li><p><strong>Background reasoning model:</strong> A separate async model handles complex reasoning, freeing the interaction model to stay responsive in the foreground loop.</p></li><li><p><strong>FD-bench v1.5:</strong> Scores 77.8 on a new interactivity benchmark versus 39.0-54.3 for competitors, with real-time speech, visual proactivity, and interrupt handling that turn-based systems cannot match.</p></li></ul><p><strong><a href="https://thinkingmachines.ai/blog/interaction-models/">Blog</a></strong></p><div><hr></div><h3><strong>Is Grep All You Need? 
Harness Beats Vector RAG for Coding Agents</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LdEF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424c0f88-6242-4284-9e61-9d9f83b0ec13_2568x1396.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LdEF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424c0f88-6242-4284-9e61-9d9f83b0ec13_2568x1396.png 424w, https://substackcdn.com/image/fetch/$s_!LdEF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424c0f88-6242-4284-9e61-9d9f83b0ec13_2568x1396.png 848w, https://substackcdn.com/image/fetch/$s_!LdEF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424c0f88-6242-4284-9e61-9d9f83b0ec13_2568x1396.png 1272w, https://substackcdn.com/image/fetch/$s_!LdEF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424c0f88-6242-4284-9e61-9d9f83b0ec13_2568x1396.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LdEF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424c0f88-6242-4284-9e61-9d9f83b0ec13_2568x1396.png" width="1456" height="792" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/424c0f88-6242-4284-9e61-9d9f83b0ec13_2568x1396.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:792,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Is Grep All You Need?&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Is Grep All You Need?" title="Is Grep All You Need?" srcset="https://substackcdn.com/image/fetch/$s_!LdEF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424c0f88-6242-4284-9e61-9d9f83b0ec13_2568x1396.png 424w, https://substackcdn.com/image/fetch/$s_!LdEF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424c0f88-6242-4284-9e61-9d9f83b0ec13_2568x1396.png 848w, https://substackcdn.com/image/fetch/$s_!LdEF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424c0f88-6242-4284-9e61-9d9f83b0ec13_2568x1396.png 1272w, https://substackcdn.com/image/fetch/$s_!LdEF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424c0f88-6242-4284-9e61-9d9f83b0ec13_2568x1396.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>A controlled study makes the empirical case that grep-style text search, wrapped in the right agent harness, matches or beats embedding-based retrieval on real coding-agent tasks.
The deeper claim is that harness design (which meta-tools, in what order) explains more variance in agent performance than the retrieval algorithm itself.</p><ul><li><p><strong>Head-to-head retrieval:</strong> Across coding benchmarks, grep + light ranking ties or exceeds vector-DB retrieval, with much lower latency, cost, and infra overhead than embedding-based stacks.</p></li><li><p><strong>Harness &gt; algorithm:</strong> The order and shape of search/read/edit meta-tools dominate the result, suggesting &#8220;RAG quality&#8221; is mostly a harness problem dressed up as a retrieval problem.</p></li><li><p><strong>Implications for tooling:</strong> Reinforces the broader move from vector DBs to file-search primitives inside Codex, Claude Code, and Cursor, where ranking quality matters more than the underlying retrieval algorithm.</p></li><li><p><strong>Why it matters:</strong> The strongest empirical pushback yet against vector-DB-for-coding-agents, and a useful prior for anyone deciding whether to invest in retrieval infra or harness engineering.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.15184">Paper</a></strong></p>
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-thinking-machines">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (May 4 - May 10)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-154</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-154</guid><pubDate>Sun, 10 May 2026 15:01:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ssdq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. HeavySkill</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Udsc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Udsc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png 424w, https://substackcdn.com/image/fetch/$s_!Udsc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png 848w, https://substackcdn.com/image/fetch/$s_!Udsc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Udsc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Udsc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png" width="996" height="419" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:419,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;HeavySkill&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="HeavySkill" title="HeavySkill" srcset="https://substackcdn.com/image/fetch/$s_!Udsc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png 424w, https://substackcdn.com/image/fetch/$s_!Udsc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png 848w, https://substackcdn.com/image/fetch/$s_!Udsc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Udsc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44212fca-4e3b-4342-a86f-115d9b10fee0_996x419.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>One of the cleaner takes on agentic harness design released this year. The paper argues that what actually drives harness performance is not the orchestration code, but a single inner skill: parallel reasoning followed by deliberation. Internalize that pattern into the model and most of the surrounding scaffolding becomes optional.
HeavySkill systematizes the idea as a two-stage pipeline you can run beneath any harness, then trains it as a learnable skill via RLVR. The result is a harness win that looks more like a model win.</p><ul><li><p><strong>Two-stage skill, not orchestration glue:</strong> Stage one runs parallel reasoning across multiple sampled chains. Stage two performs a deliberation pass that compares, critiques, and synthesizes those chains into a final answer. The pipeline is the same regardless of harness, which is why it transfers across tasks.</p></li><li><p><strong>GPT-OSS-20B jumps from 69.7% to 85.5% on LiveCodeBench:</strong> Under the heavy-thinking variant (HM@4), the 20B model gets a 15.8-point lift on a hard coding benchmark. The same recipe takes R1-Distill-Qwen-32B from 35.7% to 69.3% on IFEval, nearly doubling its instruction-following score.</p></li><li><p><strong>Pass@N-level performance from a learned skill:</strong> Several models reach Pass@N-level performance once HeavySkill is internalized through RLVR. That is what makes the parallel-deliberation pattern portable: the skill survives outside the harness it was trained under.</p></li><li><p><strong>Why it matters:</strong> Harness wins start to look like model wins once you can train them in. If parallel reasoning plus deliberation really is the inner skill, the long arc is models that ship with it baked in, not orchestration glue layered around them.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.02396">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2051678102934454330">Tweet</a></strong></p><div><hr></div><h2><strong>2. 
Conductor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kF-k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kF-k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png 424w, https://substackcdn.com/image/fetch/$s_!kF-k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png 848w, https://substackcdn.com/image/fetch/$s_!kF-k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png 1272w, https://substackcdn.com/image/fetch/$s_!kF-k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kF-k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png" width="996" height="364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Conductor&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Conductor" title="Conductor" srcset="https://substackcdn.com/image/fetch/$s_!kF-k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png 424w, https://substackcdn.com/image/fetch/$s_!kF-k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png 848w, https://substackcdn.com/image/fetch/$s_!kF-k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png 1272w, https://substackcdn.com/image/fetch/$s_!kF-k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfcaee2-a581-474b-bd58-f40dc91e5deb_996x364.png 1456w" sizes="100vw"></picture></div></a></figure></div>
<p>Sakana AI&#8217;s ICLR 2026 paper introduces a 7B Conductor model that hits SOTA on GPQA-Diamond and LiveCodeBench by orchestrating other LLMs instead of solving problems itself. The Conductor is trained with RL to do two things simultaneously: design communication topologies between worker agents (open or closed source) and prompt-engineer focused instructions to each worker so it leverages individual strengths. The orchestrator becomes a learnable policy, not a wrapper around one.</p><ul><li><p><strong>Topology design plus targeted prompting:</strong> A single RL policy decides who talks to whom and what each worker is told.
Trained against randomized agent pools, the Conductor adapts to arbitrary mixes of agents at inference time, including agents it never saw during training.</p></li><li><p><strong>Recursive topologies emerge:</strong> When allowed to pick itself as a worker, the Conductor forms recursive topologies, unlocking a new form of dynamic test-time scaling through online iterative adaptation. Coordination becomes its own scaling axis, separate from model size or context length.</p></li><li><p><strong>3% gains on AIME25 and GPQA-D from coordination alone:</strong> The gains over the best individual worker land in the 3% range, which the authors note is consistent with entire generational improvements between frontier model versions. The difference is that here the lift comes from learned routing, not from larger pretraining runs.</p></li><li><p><strong>Why it matters:</strong> This is one of the cleaner arguments yet that the orchestrator should be the model. Routing decisions stop being a wrapper and become a learnable policy, which is the right abstraction for production agent stacks that compose multiple model providers.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2512.04388">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2051306659021242635">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
Self-Improving Pretraining</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bO2d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bO2d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png 424w, https://substackcdn.com/image/fetch/$s_!bO2d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png 848w, https://substackcdn.com/image/fetch/$s_!bO2d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png 1272w, https://substackcdn.com/image/fetch/$s_!bO2d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bO2d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png" width="696" height="311" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:311,&quot;width&quot;:696,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Self-Improving Pretraining&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Self-Improving Pretraining" title="Self-Improving Pretraining" srcset="https://substackcdn.com/image/fetch/$s_!bO2d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png 424w, https://substackcdn.com/image/fetch/$s_!bO2d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png 848w, https://substackcdn.com/image/fetch/$s_!bO2d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png 1272w, https://substackcdn.com/image/fetch/$s_!bO2d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94a3799f-c2ca-4eb8-9f4b-41c3befe5874_696x311.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Most LLM safety, factuality, and reasoning fixes get bolted on at post-training. By then the patterns have already set. This Meta FAIR paper moves those behaviors into pretraining itself. The team uses a strong post-trained model as both a rewriter and a judge: it rewrites pretraining suffixes toward higher-quality, safer continuations, then scores model rollouts against the original suffix and the rewrite to drive RL during pretraining. Instead of next-token prediction, the policy learns sequence generation from the start, with rewards for quality, safety, and factuality.</p><ul><li><p><strong>Post-trained model as rewriter and judge:</strong> The strong model rewrites suffixes during pretraining, then judges rollouts of the in-training model against both the rewrite and the original.
Safety, factuality, and quality become reward signals rather than post-hoc filters, which lets the policy internalize the targets early.</p></li><li><p><strong>Sequence generation from the start:</strong> The policy is trained to generate sequences directly under reward, not to predict the next token. This shifts the inductive bias toward producing the kinds of continuations the judge rewards, which matters most on long-form generation where token-level losses lose discriminative signal.</p></li><li><p><strong>Concrete gains across the board:</strong> 36.2% relative gain in factuality, 18.5% in safety, and up to 86.3% win rate in generation quality over standard pretraining. The safety and factuality numbers are large enough to suggest these properties are easier to install during pretraining than to retrofit after the fact.</p></li><li><p><strong>Why it matters:</strong> The post-trained models you already have can be used to pretrain the next ones better. That is a recursive improvement loop at the pretraining layer, which is where the largest behavioral commitments actually get locked in.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2601.21343">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2050213732970848664">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
Connect Four AlphaZero from Scratch</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FYE5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FYE5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png 424w, https://substackcdn.com/image/fetch/$s_!FYE5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png 848w, https://substackcdn.com/image/fetch/$s_!FYE5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png 1272w, https://substackcdn.com/image/fetch/$s_!FYE5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FYE5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png" width="996" height="537" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:537,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Connect Four AlphaZero&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Connect Four AlphaZero" title="Connect Four AlphaZero" srcset="https://substackcdn.com/image/fetch/$s_!FYE5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png 424w, https://substackcdn.com/image/fetch/$s_!FYE5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png 848w, https://substackcdn.com/image/fetch/$s_!FYE5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png 1272w, https://substackcdn.com/image/fetch/$s_!FYE5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33c19482-3db8-40ac-9a9f-c43336383ced_996x537.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This paper proposes a new way to evaluate coding agents: hand them a minimal task description, give them a tight budget, and ask them to autonomously rebuild a famous ML breakthrough end-to-end. Connect Four plus AlphaZero is the first instance. It is small enough to run on a laptop and hard enough to require a real research engineering loop. Claude Opus 4.7 implemented the full pipeline (MCTS, neural value and policy nets, self-play, training schedule) in three hours on consumer hardware, then beat the Pascal Pons solver 7 of 8 as first-mover. No other frontier coding agent tested cleared 2 of 8.</p><ul><li><p><strong>From patches to systems:</strong> Existing coding-agent benchmarks measure unit-test fixes and small patches.
This benchmark measures whether the agent can build a non-trivial ML system from a one-paragraph spec, which is closer to what production research engineering actually looks like.</p></li><li><p><strong>Tight budget, real research loop:</strong> The agent has to design the search algorithm, train the networks, schedule self-play, and debug the loop, all within a fixed compute budget on consumer hardware. There is no escape hatch into a pre-built library, which is what makes the task discriminative.</p></li><li><p><strong>A clean separation between frontier coders:</strong> Claude Opus 4.7 reached 7 of 8 wins as first-mover against the Pascal Pons solver. No other frontier coding agent tested cleared 2 of 8. The gap is large enough to suggest the benchmark is detecting something real about end-to-end ML engineering capability.</p></li><li><p><strong>Why it matters:</strong> Patch-style benchmarks are starting to saturate. Rebuild-a-breakthrough tasks give the field a harder ceiling to push against, and they map more directly to the agent workloads people actually want to deploy.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.25067">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2050693576250753233">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2NyM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2NyM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!2NyM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2NyM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!2NyM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2NyM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!2NyM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!2NyM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2NyM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!2NyM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12606519-a207-4803-adb1-5ad9469cebbd_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a>&#8221;. Learn how to leverage Claude Code features to vibe-code production-grade AI-powered apps.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.dair.ai/courses/build-apps-with-claude-code&quot;,&quot;text&quot;:&quot;Enroll Now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.dair.ai/courses/build-apps-with-claude-code"><span>Enroll Now</span></a></p><div><hr></div><h2><strong>5. Coordination as Architecture</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TWgK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TWgK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png 424w, https://substackcdn.com/image/fetch/$s_!TWgK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png 848w, https://substackcdn.com/image/fetch/$s_!TWgK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png 1272w,
https://substackcdn.com/image/fetch/$s_!TWgK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TWgK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png" width="1258" height="807" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:807,&quot;width&quot;:1258,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Coordination as Architecture&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Coordination as Architecture" title="Coordination as Architecture" srcset="https://substackcdn.com/image/fetch/$s_!TWgK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png 424w, https://substackcdn.com/image/fetch/$s_!TWgK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png 848w, https://substackcdn.com/image/fetch/$s_!TWgK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png 1272w, 
https://substackcdn.com/image/fetch/$s_!TWgK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98e48909-31e3-4386-a821-9acedf0af05e_1258x807.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Multi-agent LLM systems fail in production at rates between 41% and 87%, and the majority of those failures are coordination defects, not base-model capability. Most published comparisons of multi-agent architectures cannot even tell you whether the gain came from coordination or from one configuration just having more context.
This paper argues coordination should be treated as a configurable architectural layer, separable from agent logic and information access, then backs the position with an information-controlled experiment.</p><ul><li><p><strong>Information-controlled methodology:</strong> Same LLM, same tools, same prompt template, same per-call output cap. The only thing that varies is coordination structure. Once information access is held constant, the actual contribution of coordination becomes measurable for the first time.</p></li><li><p><strong>Coordination as a separate layer:</strong> The paper proposes treating coordination structure (who talks to whom, when, with what aggregation rule) as a first-class architectural axis. That separation lets teams reason about coordination changes without re-running the entire stack.</p></li><li><p><strong>A vocabulary for the field:</strong> Until now, &#8220;multi-agent beats single-agent&#8221; comparisons have been confounded by context-window asymmetries. This paper provides the methodology and vocabulary needed to actually test coordination claims, which is overdue infrastructure for the multi-agent research line.</p></li><li><p><strong>Why it matters:</strong> If 41% to 87% of failures are coordination defects, fixing coordination is the highest-leverage thing builders can do. The paper turns that intuition into a measurable engineering target instead of a vibes-based debate.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.03310">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2052429021833818458">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
Horizon Generalization</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5k2g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5k2g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png 424w, https://substackcdn.com/image/fetch/$s_!5k2g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png 848w, https://substackcdn.com/image/fetch/$s_!5k2g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png 1272w, https://substackcdn.com/image/fetch/$s_!5k2g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5k2g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png" width="1456" height="492" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:492,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Horizon Generalization&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Horizon Generalization" title="Horizon Generalization" srcset="https://substackcdn.com/image/fetch/$s_!5k2g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png 424w, https://substackcdn.com/image/fetch/$s_!5k2g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png 848w, https://substackcdn.com/image/fetch/$s_!5k2g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png 1272w, https://substackcdn.com/image/fetch/$s_!5k2g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60efb55d-66bf-4272-ba21-17dbc93c3942_3729x1260.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Microsoft Research runs a controlled study where the only variable is task horizon length. Same decision rules, same reasoning structure, different sequence length to the goal. The main finding: horizon alone is a training bottleneck. As goal distance grows, exploration explodes combinatorially and credit assignment gets ambiguous. Models that learn cleanly on short horizons fall apart on long ones, even when the underlying reasoning is identical. The fix is not more compute; it is horizon reduction.</p><ul><li><p><strong>Horizon as a first-class variable:</strong> By holding decision rules and reasoning constant and varying only sequence length, the paper isolates horizon as a distinct training bottleneck.
This separates &#8220;the agent cannot reason&#8221; from &#8220;the agent cannot stitch together long sequences,&#8221; which most prior work conflated.</p></li><li><p><strong>Macro actions stabilize training:</strong> Re-parameterizing the action space with macro actions that compress many low-level decisions into one stabilizes training immediately. The agent learns the same task, just at a coarser temporal grain that keeps credit assignment tractable.</p></li><li><p><strong>Generalization to longer horizons at inference:</strong> Models trained on reduced horizons generalize to longer ones at inference time. The paper calls this horizon generalization, and it is the most useful property because it means you can train cheap and deploy long.</p></li><li><p><strong>Why it matters:</strong> Most teams treat long-horizon failures as a model-capacity problem. This paper says it is a horizon problem. Reduce horizon during training, get stability now and generalization for free at inference, without retraining a larger backbone.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2605.02572">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2051679862788878354">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
1,000 Synthetic Computers</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ssdq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ssdq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png 424w, https://substackcdn.com/image/fetch/$s_!ssdq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png 848w, https://substackcdn.com/image/fetch/$s_!ssdq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!ssdq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ssdq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;1000 Synthetic Computers&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="1000 Synthetic Computers" title="1000 Synthetic Computers" srcset="https://substackcdn.com/image/fetch/$s_!ssdq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png 424w, https://substackcdn.com/image/fetch/$s_!ssdq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png 848w, https://substackcdn.com/image/fetch/$s_!ssdq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!ssdq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea9947d-aa79-4fdb-a592-71a98c2f2f4b_3840x2160.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Microsoft Research builds 1,000 synthetic computers, each with realistic directory structures, documents, and artifacts, then runs long-horizon simulations on top of them. One agent plays the user and sets productivity goals; another executes the work. Each simulation runs over 8 hours of agent runtime and 2,000+ turns on average, roughly a month of human work compressed into one trace. Training on this experiential data drives significant improvements on both in-domain and out-of-domain productivity evaluations.</p><ul><li><p><strong>Realistic synthetic environments:</strong> Each of the 1,000 computers ships with directory structures, documents, and artifacts that approximate a real user&#8217;s working environment. 
The realism is what makes the trajectories useful as training data instead of as evaluation curiosities.</p></li><li><p><strong>Two-agent simulation loop:</strong> A user agent sets productivity goals while a worker agent executes against them. The structure produces multi-turn, goal-directed traces that look like real productivity work, not the short scripted tasks that dominate existing benchmarks.</p></li><li><p><strong>Designed to scale to billions of worlds:</strong> The framework is explicitly designed to scale to millions or billions of synthetic user worlds, which matches the scale at which frontier computer-use agents will need experiential data. The bottleneck on long-horizon training is data, and this is a credible recipe for producing it.</p></li><li><p><strong>Why it matters:</strong> The bottleneck on computer-use agents has stopped being model capability and become realistic long-horizon training data. Synthetic-environment scaling is one of the few paths that does not depend on collecting massive amounts of real user telemetry, which makes it a practical default for teams building computer-use stacks.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.28181">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2050263752147456238">Tweet</a></strong></p><div><hr></div><h2><strong>8. Contextual Agentic Memory is a Memo</strong></h2><p>Most agent memory today is not memory, it is closer to a memo. Vector stores, RAG buffers, and scratchpads implement lookup, not consolidation. The paper draws on neuroscience&#8217;s Complementary Learning Systems theory: biological intelligence pairs fast hippocampal storage with slow neocortical consolidation, and current AI agents only implement the first half (fast write, similarity recall, no abstraction step). 
The authors prove a generalization ceiling on compositionally novel tasks: as long as memory stays retrieval-only, the agent cannot apply abstract rules to inputs that do not already look like something in the store, and it remains permanently exposed to memory poisoning. If you are building long-running agents and treating memory as a vector index, this paper is a clean diagnosis of what you are missing.</p><p><strong><a href="https://arxiv.org/abs/2604.27707">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2050694339165335754">Tweet</a></strong></p><div><hr></div><h2><strong>9. Agentic-imodels</strong></h2><p>The entire interpretability literature is built around human readers. As more analysis gets delegated to agents, the right target of interpretability shifts. Microsoft Research introduces Agentic-imodels, an autoresearch loop where a coding agent (Claude Code, Codex) iteratively evolves scikit-learn-compatible regressors that are simultaneously accurate AND readable by other LLMs. Interpretability is measured by whether a small LLM can simulate the fitted model&#8217;s behavior just by reading its string representation, predictions, feature effects, and counterfactuals from the __str__ output alone. Across 65 tabular datasets, the discovered models push the Pareto frontier past every classical interpretable baseline (decision trees, GAMs, sparse linear), and improve four downstream agentic data-science systems on the BLADE benchmark by 8% to 73%.</p><p><strong><a href="https://arxiv.org/abs/2605.03808">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2052125514266190286">Tweet</a></strong></p><div><hr></div><h2><strong>10. Skills as Verifiable Artifacts</strong></h2><p>If you ship agent skills, your runtime is treating signed-and-cleared skills as trusted by default. 
This paper argues a skill is untrusted code until it is verified, and the runtime should enforce that default rather than infer trust from origin. Without skill verification, human-in-the-loop (HITL) review has to fire on every irreversible call, which degrades into rubber-stamping at any non-trivial scale. With verification as a separate gated process, HITL fires only for what is unverified. Skills are now first-class deployment artifacts, and we have decades of supply-chain lessons on what happens when trust is inferred from a signature. This is the right ask for SKILL.md before agent skill libraries become the next attack surface.</p><p><strong><a href="https://arxiv.org/abs/2605.00424">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2051772437520622035">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Meta FAIR Autodata, ZAYA1-8B, SubQ 12M Context, Natural Language Autoencoders, Claude Managed Agents Dreaming, and More]]></title><description><![CDATA[Meta FAIR Autodata, ZAYA1-8B, SubQ 12M Context, Natural Language Autoencoders, Claude Managed Agents Dreaming, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-meta-fair-autodata</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-meta-fair-autodata</guid><pubDate>Sat, 09 May 2026 15:01:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Q3T3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Meta FAIR introduces Autodata</p></li><li><p>Zyphra releases ZAYA1-8B</p></li><li><p>SubQ ships a 12M-token frontier model</p></li><li><p>Anthropic introduces Natural Language Autoencoders</p></li><li><p>Claude Managed Agents adds dreaming and multi-agent</p></li><li><p>Printing Press: an agent CLI 
factory</p></li><li><p>Flue agent harness framework launches</p></li><li><p>Anthropic adds keyless auth</p></li><li><p>AlphaEvolve marks one year of impact</p></li><li><p>Goodfire opens a neural geometry series</p></li><li><p>Firefox hardened with Claude Mythos</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Autodata: An Agentic Data Scientist From Meta FAIR</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q3T3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q3T3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png 424w, https://substackcdn.com/image/fetch/$s_!Q3T3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png 848w, https://substackcdn.com/image/fetch/$s_!Q3T3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png 1272w, https://substackcdn.com/image/fetch/$s_!Q3T3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Q3T3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png" width="1456" height="713" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:713,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Autodata&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Autodata" title="Autodata" srcset="https://substackcdn.com/image/fetch/$s_!Q3T3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png 424w, https://substackcdn.com/image/fetch/$s_!Q3T3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png 848w, https://substackcdn.com/image/fetch/$s_!Q3T3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png 1272w, https://substackcdn.com/image/fetch/$s_!Q3T3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40038fe3-fb7c-46ce-b7d5-f75db6028601_3330x1630.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Meta FAIR (Jason Weston et al.) introduced Autodata, an agentic data scientist that builds high-quality training and evaluation data autonomously. 
The framing is that inference compute can be converted into model quality if the data pipeline itself is an agent.</p><ul><li><p><strong>Agentic Self-Instruct loop:</strong> A planner-executor agent generates, critiques, and refines training and eval examples in a closed loop, replacing static seed sets with a process that keeps producing harder data as the model improves.</p></li><li><p><strong>34-point weak-to-strong gap:</strong> On a CS research QA task, Autodata-generated data opens a 34-point accuracy gap between weak and strong models, a much larger separation than off-the-shelf instruction sets achieve.</p></li><li><p><strong>Inference compute as a quality lever:</strong> The work reframes synthetic data as the place where inference budget pays off, an angle that lines up with Microsoft&#8217;s FaraGen and the broader synthetic-environments thread.</p></li><li><p><strong>Why it matters:</strong> Autodata pairs naturally with self-improving agent runtimes (Claude Managed Agents Outcomes loop, ACE, AHE), giving teams a credible recipe for the data half of the self-improvement story.</p></li></ul><p><strong><a href="https://facebookresearch.github.io/RAM/blogs/autodata/">Blog</a></strong></p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (April 26 - May 3)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-b95</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-b95</guid><pubDate>Sun, 03 May 2026 15:02:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!nQGv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. Agentic Harness Engineering</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nQGv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nQGv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png 424w, https://substackcdn.com/image/fetch/$s_!nQGv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png 848w, https://substackcdn.com/image/fetch/$s_!nQGv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nQGv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nQGv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png" width="947" height="458" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:458,&quot;width&quot;:947,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Agentic Harness Engineering&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Agentic Harness Engineering" title="Agentic Harness Engineering" srcset="https://substackcdn.com/image/fetch/$s_!nQGv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png 424w, https://substackcdn.com/image/fetch/$s_!nQGv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png 848w, https://substackcdn.com/image/fetch/$s_!nQGv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nQGv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca5e00-9a08-4c8d-bfdf-904657e68873_947x458.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Most coding-agent harnesses are still tuned by hand or kept alive through brittle trial-and-error self-evolution. This paper introduces Agentic Harness Engineering (AHE), a framework that makes harness evolution observable and falsifiable. 
AHE separates the system into three layers: components stored as revertible files, experience condensed from millions of trajectory tokens into structured evidence, and decisions written as predictions that get checked against task outcomes. Every edit becomes a contract you can verify or revert.</p><ul><li><p><strong>Three-layer evolution model:</strong> Components, experience, and decisions are each first-class artifacts. Components are versioned files, experience is compressed evidence pulled from full trajectory logs, and decisions are explicit hypotheses with expected outcomes. The structure turns black-box harness tuning into an auditable engineering loop.</p></li><li><p><strong>Pass@1 gains on Terminal-Bench 2:</strong> Pass@1 climbs from 69.7% to 77.0% across ten iterations, beating both the human-designed Codex-CLI (71.9%) and self-evolving baselines such as ACE and TF-GRPO. The framework also uses 12% fewer tokens than the seed harness on SWE-bench Verified.</p></li><li><p><strong>Cross-model transfer:</strong> The evolved harness transfers across model families with +5.1 to +10.1 point gains, suggesting the optimizations are structural rather than overfit to a specific backbone. That is the property you actually want from harness engineering.</p></li><li><p><strong>Why it matters:</strong> Harness work is the largest hidden cost in most agent systems. 
AHE is the first credible recipe for letting the harness improve itself without drifting into noise, which makes it the most important agent-systems paper of the week.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.25850">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2049492169887748365">Tweet</a></strong></p><div><hr></div><h2><em><strong>Message from our Sponsor</strong></em></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!21F7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!21F7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png 424w, https://substackcdn.com/image/fetch/$s_!21F7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png 848w, https://substackcdn.com/image/fetch/$s_!21F7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png 1272w, https://substackcdn.com/image/fetch/$s_!21F7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!21F7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png" 
width="790" height="298" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:298,&quot;width&quot;:790,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Kurate Leaderboard&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kurate Leaderboard" title="Kurate Leaderboard" srcset="https://substackcdn.com/image/fetch/$s_!21F7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png 424w, https://substackcdn.com/image/fetch/$s_!21F7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png 848w, https://substackcdn.com/image/fetch/$s_!21F7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png 1272w, https://substackcdn.com/image/fetch/$s_!21F7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6ed4df-c257-4a1a-b196-f0541b45dcdf_790x298.png 1456w" sizes="100vw"></picture></div></a></figure></div><p><a href="https://kurate.org/?utm_source=dair_ai&amp;utm_medium=newsletter&amp;utm_campaign=dair_ai_ad">Kurate.org</a> - Arena for scientific papers. Every day, hundreds of arXiv preprints are ranked by scientific impact through pairwise tournaments judged by Claude, GPT and Gemini models. See the top ranked papers in AI, ML, Robotics, Quantum Physics, and more for free.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://kurate.org/?utm_source=dair_ai&amp;utm_medium=newsletter&amp;utm_campaign=dair_ai_ad&quot;,&quot;text&quot;:&quot;Explore The Leaderboards&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://kurate.org/?utm_source=dair_ai&amp;utm_medium=newsletter&amp;utm_campaign=dair_ai_ad"><span>Explore The Leaderboards</span></a></p><div><hr></div><h2><strong>2. 
AgenticQwen-30B-A3B</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-xpM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-xpM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png 424w, https://substackcdn.com/image/fetch/$s_!-xpM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png 848w, https://substackcdn.com/image/fetch/$s_!-xpM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png 1272w, https://substackcdn.com/image/fetch/$s_!-xpM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-xpM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png" width="781" height="443" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:443,&quot;width&quot;:781,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AgenticQwen-30B-A3B&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AgenticQwen-30B-A3B" title="AgenticQwen-30B-A3B" srcset="https://substackcdn.com/image/fetch/$s_!-xpM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png 424w, https://substackcdn.com/image/fetch/$s_!-xpM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png 848w, https://substackcdn.com/image/fetch/$s_!-xpM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png 1272w, https://substackcdn.com/image/fetch/$s_!-xpM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9873527d-a210-4f02-9607-d71beda9a2e1_781x443.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Alibaba shows that a 30B MoE model with only 3B active parameters can nearly match Qwen3-235B on real tool-use workloads. AgenticQwen-30B-A3B averages 50.2 across TAU-2 and BFCL-V4 Multi-Turn, while AgenticQwen-8B averages 47.4. Both more than double their vanilla Qwen baselines and close most of the gap to the 235B model. The recipe is built around two reinforcement learning flywheels that run in parallel, with simulated users actively trying to mislead the agent.</p><ul><li><p><strong>Reasoning flywheel from self-failure:</strong> The first loop mines the model&#8217;s own errors and converts them into harder reasoning problems each round. The training distribution gets harder on its own as the model improves, removing the need for new human-curated reasoning data.</p></li><li><p><strong>Agentic flywheel for tool use:</strong> The second loop grows simple linear tool-use trajectories into multi-branch behavior trees. 
Simulated users test recovery from misleading instructions, ambiguous goals, and failed tool calls, which is where vanilla supervised tuning typically breaks.</p></li><li><p><strong>Real efficiency for production agents:</strong> A 30B MoE with 3B active parameters at inference is significantly cheaper to serve than a 235B dense or MoE alternative. For tool-use workloads where frontier reasoning is overkill, this changes the cost profile of shipping production agents.</p></li><li><p><strong>A reusable recipe:</strong> The flywheel approach generalizes beyond Qwen. Teams can generate hard examples from their own agent&#8217;s failures rather than relying on static synthetic data, which is the more scalable path for domain-specific agents.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.21590">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2048504655932760565">Tweet</a></strong></p><div><hr></div><h2><strong>3. Agentic World Modeling</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!t8HP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!t8HP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png 424w, https://substackcdn.com/image/fetch/$s_!t8HP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png 848w, 
https://substackcdn.com/image/fetch/$s_!t8HP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png 1272w, https://substackcdn.com/image/fetch/$s_!t8HP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!t8HP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png" width="1080" height="313" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:313,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Agentic World Modeling&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Agentic World Modeling" title="Agentic World Modeling" srcset="https://substackcdn.com/image/fetch/$s_!t8HP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png 424w, https://substackcdn.com/image/fetch/$s_!t8HP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png 848w, 
https://substackcdn.com/image/fetch/$s_!t8HP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png 1272w, https://substackcdn.com/image/fetch/$s_!t8HP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6593f518-d78e-46c3-b0fd-0776d0c57c39_1080x313.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>A massive 40-author survey delivers the cleanest taxonomy of world models in agent research to date. 
The paper proposes a &#8220;levels by laws&#8221; framework spanning three capability levels and four law regimes, then synthesizes 400+ works and 100+ representative systems across model-based RL, video generation, web and GUI agents, multi-agent simulation, and scientific discovery. As agents shift from chatbots to goal-accomplishers, the bottleneck moves from language to environment, and this is the first paper that gives builders a shared vocabulary across communities that have been working in isolation.</p><ul><li><p><strong>Three capability levels:</strong> L1 Predictors handle one-step transitions, L2 Simulators do multi-step action-conditioned rollouts, and L3 Evolvers self-revise as the world changes. The hierarchy makes it easy to place existing systems and identify where capability gaps actually live.</p></li><li><p><strong>Four law regimes:</strong> Physical, digital, social, and scientific laws each impose different constraints on what a world model needs to capture. The framework treats them as orthogonal axes, which clarifies why a strong physics simulator can still fail at social or digital tasks.</p></li><li><p><strong>Failure-mode catalog:</strong> The survey extracts recurring failure patterns across 100+ systems, including misaligned reward shaping, drift under non-stationarity, and brittle transfer across regimes. Each failure mode is mapped to a level and law combination, so the diagnosis is grounded.</p></li><li><p><strong>Evaluation principles per level:</strong> The authors propose evaluation criteria specific to each capability level rather than a single benchmark. This is the right move because L1 prediction accuracy and L3 self-revision quality are not measurable on the same axis.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.22748">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2048783073547079816">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
RecursiveMAS</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aBcQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aBcQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png 424w, https://substackcdn.com/image/fetch/$s_!aBcQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png 848w, https://substackcdn.com/image/fetch/$s_!aBcQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png 1272w, https://substackcdn.com/image/fetch/$s_!aBcQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aBcQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png" width="997" height="634" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:634,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;RecursiveMAS&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="RecursiveMAS" title="RecursiveMAS" srcset="https://substackcdn.com/image/fetch/$s_!aBcQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png 424w, https://substackcdn.com/image/fetch/$s_!aBcQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png 848w, https://substackcdn.com/image/fetch/$s_!aBcQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png 1272w, https://substackcdn.com/image/fetch/$s_!aBcQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e96c6f-88eb-48de-8a84-a41ebd0448bb_997x634.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Multi-agent systems usually pass full text messages between agents at every step, which causes token bloat, latency, and context dilution that all grow with team size. RecursiveMAS asks a different question: what if agents collaborated through recursive computation in a shared latent space instead of through text? The system treats a multi-agent team as a recursive computation where each agent acts like an RLM layer, iteratively passing latent representations to the next and forming a looped interaction process. Less talking, more thinking.</p><ul><li><p><strong>RecursiveLink for latent communication:</strong> A RecursiveLink module generates latent thoughts and transfers state directly between heterogeneous agents, replacing natural-language messages with internal representations. 
The change removes the cost of re-encoding and re-parsing text on every coordination step.</p></li><li><p><strong>Inner-outer loop learning:</strong> The training algorithm uses an inner loop for per-step latent updates and an outer loop for team-level credit assignment, with shared gradient-based updates across agents. This makes joint optimization tractable instead of relying on hand-tuned communication protocols.</p></li><li><p><strong>Strong gains across 9 benchmarks:</strong> Across math, science, medicine, search, and code generation, RecursiveMAS delivers an 8.3% average accuracy gain over baselines, a 1.2x to 2.4x end-to-end inference speedup, and a 34.6% to 75.6% reduction in token usage. The efficiency story is at least as important as the accuracy story.</p></li><li><p><strong>A path past the agent communication tax:</strong> If agent-to-agent communication is the next real bottleneck, latent-space recursion is one of the cleaner ways to scale collaboration. Teams running multi-agent systems at scale should treat this as a serious design alternative, not a research curiosity.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.25917">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2050261229315477988">Tweet</a></strong></p><div><hr></div><h2><strong>5. 
OneManCompany</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tMFx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tMFx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png 424w, https://substackcdn.com/image/fetch/$s_!tMFx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png 848w, https://substackcdn.com/image/fetch/$s_!tMFx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png 1272w, https://substackcdn.com/image/fetch/$s_!tMFx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tMFx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png" width="793" height="277" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:277,&quot;width&quot;:793,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;OneManCompany&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="OneManCompany" title="OneManCompany" srcset="https://substackcdn.com/image/fetch/$s_!tMFx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png 424w, https://substackcdn.com/image/fetch/$s_!tMFx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png 848w, https://substackcdn.com/image/fetch/$s_!tMFx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png 1272w, https://substackcdn.com/image/fetch/$s_!tMFx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71192e34-bdce-4aab-b48d-ed0f0fd5e2a7_793x277.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>If you are building multi-agent systems, you are probably wiring static org charts. This paper argues that agent teams should look more like a labor market. OneManCompany (OMC) replaces fixed teams with &#8220;Talents,&#8221; portable agent identities that bundle skills and tools, and a &#8220;Talent Market&#8221; where agents get recruited dynamically per task. An Explore-Execute-Review tree search decomposes work hierarchically and aggregates results back up. On PRDBench, OMC reaches 84.67% success, 15.5 points above the prior SOTA, and the framework generalizes across the case studies the authors run.</p><ul><li><p><strong>Talents as portable identities:</strong> A Talent bundles a skill set, tool access, and behavioral priors into a reusable agent identity. 
Talents can be hired into any task without rewiring the orchestration graph, which removes most of the brittleness in pre-wired multi-agent pipelines.</p></li><li><p><strong>Dynamic recruitment via Talent Market:</strong> Tasks post requirements, and the market matches Talents to roles based on capability fit and current load. This replaces the standard &#8220;design a team for every workflow&#8221; pattern with on-demand assembly that adapts as the task population shifts.</p></li><li><p><strong>Explore-Execute-Review tree search:</strong> Work is decomposed top-down into subtasks, executed in parallel by recruited Talents, then reviewed and aggregated up the tree. The structure naturally supports retries, branching, and cross-checking without manual coordination logic.</p></li><li><p><strong>Why it matters:</strong> Pre-wired multi-agent pipelines break the moment tasks drift outside their design envelope. Treating agents as a recruitable workforce gets you self-organization and continuous improvement by default, which is what open-ended agent systems need.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.22446">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2048909068409147460">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
From Skill Text to Skill Structure</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EyH9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EyH9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png 424w, https://substackcdn.com/image/fetch/$s_!EyH9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png 848w, https://substackcdn.com/image/fetch/$s_!EyH9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png 1272w, https://substackcdn.com/image/fetch/$s_!EyH9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EyH9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png" width="997" height="399" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:399,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;SSL&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="SSL" title="SSL" srcset="https://substackcdn.com/image/fetch/$s_!EyH9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png 424w, https://substackcdn.com/image/fetch/$s_!EyH9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png 848w, https://substackcdn.com/image/fetch/$s_!EyH9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png 1272w, https://substackcdn.com/image/fetch/$s_!EyH9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5b7aa4-0952-4fb3-810f-d13c46778ff6_997x399.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>SKILL.md files entangle invocation interface, execution flow, and tool side effects in a single blob of natural language. That makes downstream discovery and risk review brittle as skill registries scale. This paper proposes SSL, a three-layer typed JSON representation drawn from Schank and Abelson&#8217;s classical work on scripts, MOPs, and conceptual dependency. An LLM-based normalizer converts existing SKILL.md files into the structure, so adoption does not require rewriting registries by hand.</p><ul><li><p><strong>Three layers, cleanly separated:</strong> A Scheduling layer captures invocation signals and trigger conditions, a Structural layer encodes execution scenes and ordering, and a Logical layer specifies atomic actions plus resource and side-effect annotations. 
The separation lets discovery, risk, and execution each reason about the layer they care about.</p></li><li><p><strong>Skill Discovery MRR jumps 0.573 to 0.707:</strong> Treating skills as typed structure rather than prose makes retrieval significantly more accurate, even before any model fine-tuning. The gain comes from the structure exposing what skills actually do, not just how they describe themselves.</p></li><li><p><strong>Risk Assessment macro F1 of 0.787:</strong> The Logical layer&#8217;s resource annotations enable a 0.744 to 0.787 jump in risk classification. Auditors can now reason about side effects directly instead of inferring them from free-form prose.</p></li><li><p><strong>A 6,184-skill corpus released:</strong> The authors ship a normalized corpus of 6,184 skills, 403 task queries, and 500 risk-labeled skills. As skill registries cross a million entries, structured representations are the only path that keeps discovery and review tractable.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.24026">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2049252335105491147">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
Latent Agents</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GOq5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GOq5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png 424w, https://substackcdn.com/image/fetch/$s_!GOq5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png 848w, https://substackcdn.com/image/fetch/$s_!GOq5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png 1272w, https://substackcdn.com/image/fetch/$s_!GOq5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GOq5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png" width="997" height="298" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:298,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Latent 
Agents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Latent Agents" title="Latent Agents" srcset="https://substackcdn.com/image/fetch/$s_!GOq5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png 424w, https://substackcdn.com/image/fetch/$s_!GOq5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png 848w, https://substackcdn.com/image/fetch/$s_!GOq5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png 1272w, https://substackcdn.com/image/fetch/$s_!GOq5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb4a39fa-a2a3-4370-8622-20dd20523f28_997x298.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Multi-agent debate makes models reason better. It also burns tokens generating long transcripts before any answer comes out. Latent Agents distills the entire debate into a single LLM through a two-stage fine-tuning pipeline: the model first learns debate structure, then internalizes it through dynamic reward scheduling and length clipping. The internalized model matches or beats explicit multi-agent debate while using up to 93% fewer tokens, which makes debate-quality reasoning practical at production scale.</p><ul><li><p><strong>Two-stage internalization pipeline:</strong> Stage one teaches the structure of debate (turn taking, critique, revision) through supervised fine-tuning on transcript data. Stage two uses dynamic reward scheduling and length clipping to compress that structure into single-pass reasoning without losing the gains from the multi-agent setup.</p></li><li><p><strong>Up to 93% token savings:</strong> The internalized model matches or beats explicit debate accuracy while drastically reducing inference cost. For teams running reasoning workloads at scale, this is the kind of efficiency win that turns a research idea into a deployment default.</p></li><li><p><strong>Activation steering reveals agent subspaces:</strong> The &#8220;agents&#8221; survive distillation as identifiable circuits in activation space. 
Probing finds interpretable directions corresponding to different agent perspectives, which means the internal structure persists even when the external transcript is gone.</p></li><li><p><strong>A safety angle worth noting:</strong> When malicious agents are deliberately embedded via distillation, negative steering suppresses them more cleanly than steering a base model would, with smaller hits to general performance. Internalized debate may turn out to be a useful interpretability and alignment substrate, not just a token-saver.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.24881">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2049493173639278818">Tweet</a></strong></p><div><hr></div><h2><strong>8. OCR-Memory</strong></h2><p>Most agent memory systems compress trajectories into text summaries and hope the model remembers what matters, which is exactly where the information loss hides. OCR-Memory renders the agent&#8217;s interaction history as images with indexed visual anchors, then retrieves via a locate-and-transcribe pipeline: the model scans visual memory, predicts the index of the relevant region, and the original text is fetched verbatim from a database. Older trajectories are stored as low-resolution thumbnails with active-recall up-sampling, and the method reaches SOTA on Mind2Web and AppWorld under strict context limits.</p><p><strong><a href="https://arxiv.org/abs/2604.26622">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2049957482811056307">Tweet</a></strong></p><div><hr></div><h2><strong>9. When to Retrieve During Reasoning</strong></h2><p>Most RAG systems retrieve once, before the model starts reasoning. Large reasoning models like o1 and R1 do not work that way. They generate 12k to 25k token chains of thought and hit knowledge gaps mid-inference, long after the retrieval window closed. 
ReaLM-Retrieve is a reasoning-aware retrieval framework that injects evidence during multi-step inference, detects uncertainty at reasoning-step granularity, and learns a policy for when external evidence actually helps. It improves absolute F1 by 10.1 points over standard RAG across MuSiQue, HotpotQA, and 2WikiMultiHopQA, with 47% fewer retrieval calls than fixed-interval IRCoT, and hits 71.2% F1 on 2-4 hop MuSiQue with only 1.8 retrieval calls per question.</p><p><strong><a href="https://arxiv.org/abs/2604.26649">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2049954716298494386">Tweet</a></strong></p><div><hr></div><h2><strong>10. Co-evolving Decisions and Skills</strong></h2><p>Long-horizon agents fail in two ways: the decision-maker cannot decompose well, or the skill library goes stale. This paper introduces a co-evolution framework where an LLM decision agent and a dynamic skill bank improve each other through iterative refinement. The decision agent picks and chains skills, performance feedback updates both the policy and the skills, and new skills emerge by generalizing successful sequences instead of being hand-coded upfront. Most long-horizon agent stacks treat skills and decision-making as separate optimization problems, which is why they plateau. 
Co-evolution gives you adaptive planning and a growing library of reusable behaviors from a single loop, which is what you actually want when task structure is not predetermined: robotics, game agents, and complex planning.</p><p><strong><a href="https://arxiv.org/abs/2604.20987">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2048440985726955998">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Codex for Everyday Work, Cursor SDK, Mistral Workflows, LLM Knowledge Bases, Agentic Harness Engineering, and More]]></title><description><![CDATA[Codex for Everyday Work, Cursor SDK, Mistral Workflows, LLM Knowledge Bases, Agentic Harness Engineering, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-codex-for-everyday</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-codex-for-everyday</guid><pubDate>Sat, 02 May 2026 15:01:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GnKJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>OpenAI ships Codex for everyday work</p></li><li><p>Cursor releases the Cursor SDK</p></li><li><p>Mistral launches Workflows orchestration</p></li><li><p>DAIR.AI guide to building LLM knowledge bases</p></li><li><p>Agentic Harness Engineering paper drops</p></li><li><p>Cursor 3.2 multitask lands</p></li><li><p>Claude Code adds push notifications</p></li><li><p>Qwen open-sources Qwen-Scope SAEs</p></li><li><p>AISI evaluates GPT-5.5 cyber capabilities</p></li><li><p>AgenticQwen-30B-A3B closes tool-use gap</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Codex for Everyday Work</strong></h3><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GnKJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GnKJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GnKJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg 848w, https://substackcdn.com/image/fetch/$s_!GnKJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GnKJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GnKJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg" width="1200" height="772" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:772,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Codex for Everyday 
Work&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Codex for Everyday Work" title="Codex for Everyday Work" srcset="https://substackcdn.com/image/fetch/$s_!GnKJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GnKJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg 848w, https://substackcdn.com/image/fetch/$s_!GnKJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GnKJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a13a976-9f66-4a05-9187-ad2526ab0643_1200x772.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>OpenAI extended Codex from a coding agent into a general-purpose work agent. Users now pick a role (finance, data science, marketing, ops, research), connect the apps they actually use, and get suggested prompts that wire Codex into docs, slides, sheets, research, and planning across ChatGPT.</p><ul><li><p><strong>Role-based onboarding:</strong> Codex ships preset roles for non-engineering teams, with per-role prompt suggestions and connector recommendations so a marketing or finance user can run a useful agent on day one without designing their own harness.</p></li><li><p><strong>Sheets, slides, and docs:</strong> The update adds materially better spreadsheet and slide generation plus cleaner doc workflows, pushing Codex into the same surface as enterprise copilots like Workspace and Microsoft 365 agents.</p></li><li><p><strong>20% faster computer use:</strong> Codex&#8217;s computer-use agent runs 20% faster on the same tasks, narrowing the latency gap that has held browser and desktop automation back from being a daily-driver capability.</p></li><li><p><strong>Same agent everywhere:</strong> OpenAI is positioning a single Codex runtime across coding, research, and operations, so a Pro or Business user gets one agent that scales from &#8220;fix this PR&#8221; to &#8220;build a Q2 finance 
review.&#8221;</p></li></ul><p><strong><a href="https://chatgpt.com/codex/for-work/">Codex for Work</a></strong> | <strong><a href="https://x.com/OpenAI/status/2049928776147230886">Announcement</a></strong></p>
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-codex-for-everyday">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (April 19 - April 26)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-f2f</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-f2f</guid><pubDate>Sun, 26 Apr 2026 15:02:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!i-uk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. DeepSeek V4</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i-uk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i-uk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png 424w, https://substackcdn.com/image/fetch/$s_!i-uk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png 848w, https://substackcdn.com/image/fetch/$s_!i-uk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png 1272w, 
https://substackcdn.com/image/fetch/$s_!i-uk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i-uk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png" width="1456" height="767" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!i-uk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png 424w, https://substackcdn.com/image/fetch/$s_!i-uk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png 848w, https://substackcdn.com/image/fetch/$s_!i-uk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png 1272w, 
https://substackcdn.com/image/fetch/$s_!i-uk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4946f21-3a8a-4360-80a5-788ffbe1a869_1734x914.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>DeepSeek V4 is the first open model family built from the ground up around million-token contexts as a default rather than a bolt-on feature. The release includes DeepSeek-V4-Pro (1.6T total / 49B active) and DeepSeek-V4-Flash (284B total / 13B active), both trained natively at 1M context length. 
The tech report details a hybrid attention architecture, new training stability techniques, and a domain-specialist post-training pipeline that together push the open-source frontier much closer to GPT-5.2 and Gemini 3.0-Pro at a fraction of the cost.</p><ul><li><p><strong>Hybrid attention with CSA and HCA:</strong> DeepSeek V4 replaces a single attention stack with Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA compresses KV entries, then applies DeepSeek Sparse Attention with sliding-window KV for fine-grained local dependencies. HCA aggressively compresses KV for extreme-context layers, keeping the model feasible at 1M tokens.</p></li><li><p><strong>Training stability at trillion-parameter scale:</strong> The team introduces two techniques that materially cut loss spikes. Anticipatory Routing decouples backbone and router updates, using current weights for features but historical weights for routing indices. SwiGLU Clamping bounds the linear and gate components of SwiGLU to stabilize activations throughout pretraining.</p></li><li><p><strong>Domain-specialist post-training:</strong> Instead of one large mixed-RL stage, DeepSeek trains a separate specialist expert per domain. Each expert goes through supervised fine-tuning on domain data, then Group Relative Policy Optimization (GRPO) RL with a domain-specific reward model. The specialists are merged into the final model, recovering capability without destabilizing the generalist.</p></li><li><p><strong>Frontier-adjacent performance at open-source cost:</strong> DeepSeek-V4-Pro-Max beats GPT-5.2 and Gemini 3.0-Pro on standard reasoning benchmarks and lands just behind GPT-5.4 and Gemini 3.1-Pro, effectively trailing the closed frontier by roughly 3 to 6 months. 
For open-weights teams that need long-context reasoning without closed API pricing, this is the most important release of the week.</p></li></ul><p><strong><a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf">Paper</a></strong> | <strong><a href="https://x.com/deepseek_ai/status/2047516922263285776">Tweet</a></strong></p><div><hr></div><h2><strong>2. Autogenesis</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eSR7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eSR7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png 424w, https://substackcdn.com/image/fetch/$s_!eSR7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png 848w, https://substackcdn.com/image/fetch/$s_!eSR7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!eSR7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!eSR7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png" width="1456" height="670" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:670,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!eSR7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png 424w, https://substackcdn.com/image/fetch/$s_!eSR7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png 848w, https://substackcdn.com/image/fetch/$s_!eSR7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!eSR7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0921828d-a0fd-469f-a2b2-6423e472c05e_2550x1174.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>Static agents age quickly. As deployment environments change and new tools arrive, the agents that survive will be the ones that can safely rewrite themselves. This paper introduces Autogenesis, a self-evolving agent protocol where agents identify their own capability gaps, generate candidate improvements, validate them through testing, and integrate what works back into their own operational framework. 
No retraining and no human patching, just an ongoing loop of assessment, proposal, validation, and integration.</p><ul><li><p><strong>Two-layer protocol design:</strong> Autogenesis separates a Resource Substrate Protocol Layer (RSPL) that standardizes access to prompts, tools, environments, and memory from a Self-Evolution Protocol Layer (SEPL) that runs a Generate, Reflect, Improve, Evaluate, Commit loop over evolvable variables. The split keeps core capability registration stable while evolution happens on top.</p></li><li><p><strong>Auditable lineage and rollback:</strong> Improvements are committed with version lineage, state access control, and reversible lifecycle operations. The protocol treats every self-modification as a first-class artifact that can be inspected, reproduced, or rolled back, which is what makes self-improvement safe enough to deploy.</p></li><li><p><strong>Multi-agent applications:</strong> Autogenesis is demonstrated on multi-agent systems with planner, executor, and analyst roles. Agents evolve their own prompts, tool wrappers, and coordination routines using the shared protocol, showing that the abstraction is general enough to hold across roles rather than being tied to a single agent type.</p></li><li><p><strong>Part of a broader self-improvement wave:</strong> The paper sits alongside Meta-Harness and the Darwin G&#246;del Machine as a concrete framework for operationalizing self-modification. Together they mark a shift from &#8220;agents that use tools&#8221; to &#8220;agents that edit their own tooling.&#8221;</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.15034">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2045241905227915498">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
Attention to Mamba</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b9J9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b9J9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png 424w, https://substackcdn.com/image/fetch/$s_!b9J9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png 848w, https://substackcdn.com/image/fetch/$s_!b9J9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png 1272w, https://substackcdn.com/image/fetch/$s_!b9J9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b9J9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png" width="1456" height="577" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:577,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!b9J9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png 424w, https://substackcdn.com/image/fetch/$s_!b9J9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png 848w, https://substackcdn.com/image/fetch/$s_!b9J9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png 1272w, https://substackcdn.com/image/fetch/$s_!b9J9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e5169d8-56d2-4e9a-9778-a218e1c3e2e5_1954x774.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Apple proposes a two-stage recipe for cross-architecture distillation from Transformers into Mamba. Naive distillation fails to preserve the teacher&#8217;s performance because a Mamba student cannot directly imitate softmax attention. The fix is to first distill the Transformer into a linearized-attention student via kernel adaptation, then transfer that student into a pure Mamba model with no attention blocks. On a 1B model trained on 10B tokens, the Mamba student reaches 14.11 perplexity against 13.86 for the Pythia-1B teacher, nearly matching its quality at linear-time inference cost.</p><ul><li><p><strong>Stage 1, softmax to linear attention:</strong> The first stage replaces softmax attention with a Hedgehog-style linearized attention student, using a learnable kernel feature map that approximates the original attention scores while removing the softmax nonlinearity. 
This gives an intermediate model with linear complexity in sequence length that stays close to the teacher.</p></li><li><p><strong>Stage 2, linear attention to Mamba:</strong> The second stage transfers the linearized student into a HedgeMamba block, a hybrid SSM architecture that reuses the learned linear attention parameters and adds state-space components. The transition preserves quality because the two formulations are mathematically related, not just structurally similar.</p></li><li><p><strong>Quality at long context:</strong> On downstream benchmarks, the distilled Mamba reaches 74.1% of the teacher&#8217;s accuracy, with the recipe generalizing to 1B and 3B scales. The key practical win is retaining near-Transformer quality in the sequence-mixing block while moving to linear-time inference.</p></li><li><p><strong>A cheaper path to SSM deployment:</strong> If trained Transformers can be reliably converted into state-space models without retraining from scratch, the entire open-weights ecosystem becomes cheaper to serve at long context. This is the kind of quiet infrastructure work that matters more than it looks.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.14191">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2045600012860801113">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
Skill-RAG</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aYyL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aYyL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png 424w, https://substackcdn.com/image/fetch/$s_!aYyL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png 848w, https://substackcdn.com/image/fetch/$s_!aYyL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png 1272w, https://substackcdn.com/image/fetch/$s_!aYyL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aYyL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png" width="793" height="308" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:308,&quot;width&quot;:793,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Skill-RAG&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Skill-RAG" title="Skill-RAG" srcset="https://substackcdn.com/image/fetch/$s_!aYyL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png 424w, https://substackcdn.com/image/fetch/$s_!aYyL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png 848w, https://substackcdn.com/image/fetch/$s_!aYyL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png 1272w, https://substackcdn.com/image/fetch/$s_!aYyL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee1ba833-f856-4ecc-93b1-040b0880e39c_793x308.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Most RAG systems retrieve on every query, whether the model needs help or not. This is wasteful when the model already knows the answer and often too late when it does not. This paper introduces Skill-RAG, a failure-state-aware retrieval system that uses hidden-state probing to detect when an LLM is approaching a knowledge failure, then routes the query to a specialized retrieval strategy matched to the gap.</p><ul><li><p><strong>Hidden-state probing as a retrieval trigger:</strong> Skill-RAG trains a lightweight probe on the LLM&#8217;s hidden representations that predicts whether the model is about to fail the query. 
Only queries whose predicted failure risk exceeds the probe&#8217;s threshold trigger retrieval, which cuts unnecessary search calls while still retrieving for the cases that actually need help.</p></li><li><p><strong>Skill-matched retrieval strategies:</strong> Different failure modes (factual recall, multi-hop reasoning, temporal knowledge) are routed to different retrieval &#8220;skills&#8221; rather than a single generic retriever. Each skill is treated as a standalone component the agent can choose among, echoing the broader trend of turning RAG into a collection of composable primitives.</p></li><li><p><strong>Consistent gains across benchmarks:</strong> Evaluated on HotpotQA, Natural Questions, and TriviaQA, Skill-RAG improves over uniform RAG baselines on both efficiency and accuracy. The efficiency story matters as much as the accuracy: per-query retrieval cost drops significantly when the system skips retrieval for questions the model can already answer.</p></li><li><p><strong>A shift in how RAG is designed:</strong> The work reinforces the direction RAG is heading: from a single monolithic pipeline to a suite of retrieval skills an agent chooses among. 
Knowing when to retrieve and what kind of retrieval to run is becoming the central design question.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.15771">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2046249336162632155">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PMVb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PMVb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!PMVb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!PMVb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!PMVb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PMVb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg" width="1456" height="834" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!PMVb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!PMVb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!PMVb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!PMVb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc66db729-4fa3-4eb8-88f8-475faa071707_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a>&#8221;. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.</p><p><strong><a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Enroll Now</a></strong></p><div><hr></div><h2><strong>5. 
Self-Generated World Knowledge</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cBI7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cBI7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png 424w, https://substackcdn.com/image/fetch/$s_!cBI7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png 848w, https://substackcdn.com/image/fetch/$s_!cBI7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png 1272w, https://substackcdn.com/image/fetch/$s_!cBI7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cBI7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png" width="997" height="431" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:431,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Self-Generated World Knowledge&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Self-Generated World Knowledge" title="Self-Generated World Knowledge" srcset="https://substackcdn.com/image/fetch/$s_!cBI7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png 424w, https://substackcdn.com/image/fetch/$s_!cBI7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png 848w, https://substackcdn.com/image/fetch/$s_!cBI7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png 1272w, https://substackcdn.com/image/fetch/$s_!cBI7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3649e63b-4230-4f93-a8c0-d27a0d774fa4_997x431.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>How far are we from agents that can self-generate world knowledge? This paper proposes an outcome-based reward that measures how much an agent&#8217;s self-generated world knowledge actually improves its task success rate, then trains with that signal and removes the external guidance at inference. The result is a 14B model that surpasses Gemini-2.5-Flash on web navigation and gains 20% on the WebVoyager and WebWalker benchmarks.</p><ul><li><p><strong>Outcome-based reward for knowledge:</strong> Rather than scoring knowledge against a human-labeled reference, the reward measures whether the generated knowledge improves task success when the agent uses it. 
This lets the system learn which internally generated facts are worth keeping without an external oracle.</p></li><li><p><strong>Multistage training pipeline:</strong> The method combines supervised fine-tuning on an instruction-and-trajectory dataset with reinforcement rejection sampling, where the best trajectories (ranked by the outcome reward) are used to update the policy. The training loop iterates between generation, reward scoring, and rejection sampling until the model internalizes effective knowledge-use behaviors.</p></li><li><p><strong>Knowledge-enhanced execution at inference:</strong> At inference the external environment feedback loop is removed. The agent self-generates world knowledge, uses it to plan, and executes, without any human or reward signal in the loop. This is what makes the method deployable, not just measurable.</p></li><li><p><strong>Environment design replaces labeling:</strong> If agents can reliably improve themselves by exploring the world rather than waiting for human-labeled rewards, the bottleneck for scaling agentic systems shifts from data curation to environment design. That matches the broader direction of the field and gives practitioners a concrete recipe to follow.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.18131">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2047061650189307953">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
Self-Evolving Logic Synthesis</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!idVk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!idVk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png 424w, https://substackcdn.com/image/fetch/$s_!idVk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png 848w, https://substackcdn.com/image/fetch/$s_!idVk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png 1272w, https://substackcdn.com/image/fetch/$s_!idVk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!idVk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png" width="897" height="347" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:347,&quot;width&quot;:897,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Self-Evolving Logic Synthesis&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Self-Evolving Logic Synthesis" title="Self-Evolving Logic Synthesis" srcset="https://substackcdn.com/image/fetch/$s_!idVk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png 424w, https://substackcdn.com/image/fetch/$s_!idVk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png 848w, https://substackcdn.com/image/fetch/$s_!idVk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png 1272w, https://substackcdn.com/image/fetch/$s_!idVk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0612c9-4a21-496a-b5f5-679a614a16ad_897x347.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>EDA tools like ABC have been hand-tuned by humans for decades. NVIDIA shows they can evolve themselves. This work introduces the first self-evolving logic synthesis framework, a multi-agent LLM system that autonomously refines the entire ABC codebase, generates and tests candidate optimization sequences against standard benchmark circuits, then merges improvements back into the base tool. No human engineer in the loop.</p><ul><li><p><strong>Multi-agent refinement of a real EDA toolchain:</strong> The framework assigns specialized agents to exploration, synthesis, and self-review tasks. 
Agents read and modify the ABC source directly, propose optimization flows, and run them against circuits from benchmark suites such as EPFL, IWLS, and VTR, with three-pass human-domain knowledge injected through the pipeline.</p></li><li><p><strong>Measured improvement over hand-tuned baselines:</strong> The evolved ABC variants produce better area, delay, and switching metrics than the hand-tuned reference on the benchmark suite, and the improvements persist under sensitivity analysis. This is a real gain on a tool the semiconductor industry depends on.</p></li><li><p><strong>Codebase-level evolution, not just prompt tuning:</strong> The agents edit the ABC codebase itself, not just a configuration layer. That is a meaningful extension of the self-improving agent thread: the unit of improvement is real production code, not a prompt or policy.</p></li><li><p><strong>Generalizable blueprint for domain tools:</strong> If agents can evolve a foundational semiconductor tool without manual engineering, the same pattern could extend to other large, domain-specific codebases. Here the self-improvement loop is applied to infrastructure that shipping chips depend on.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.15082">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2046251813738025025">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
Stateless Decision Memory</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h_Lt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h_Lt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png 424w, https://substackcdn.com/image/fetch/$s_!h_Lt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png 848w, https://substackcdn.com/image/fetch/$s_!h_Lt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png 1272w, https://substackcdn.com/image/fetch/$s_!h_Lt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h_Lt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png" width="1456" height="524" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ebbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:524,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Stateless Decision Memory&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Stateless Decision Memory" title="Stateless Decision Memory" srcset="https://substackcdn.com/image/fetch/$s_!h_Lt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png 424w, https://substackcdn.com/image/fetch/$s_!h_Lt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png 848w, https://substackcdn.com/image/fetch/$s_!h_Lt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png 1272w, https://substackcdn.com/image/fetch/$s_!h_Lt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbfaa9c-1d50-4f76-b31f-abbcf8a2b4c2_2385x859.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Most interesting AI agent papers right now are about capability. This one is about plumbing, and it is probably more important than it looks. Stateful agents do not scale horizontally. The moment you need thousands of concurrent agent instances running across containers, persistent per-agent state becomes the bottleneck. This paper proposes replacing active memory with immutable decision logs using event-sourcing principles from distributed systems.</p><ul><li><p><strong>Decision logs instead of live state:</strong> Every agent decision, tool call, and observation is appended to an immutable event log. 
Any instance can reconstruct context by replaying the log on demand, which decouples decision logic from storage and lets agents spin up anywhere with no warmup.</p></li><li><p><strong>Enterprise properties by design:</strong> Compared to summary-only, SAM, and vector-memory baselines, Decision Process Memory (DPM) is the only architecture that supports append-only logging, stateless projection, audit-ready rationale trails, replay from the log alone, multi-tenant isolation, and per-event provenance. Each of these is a hard requirement in regulated enterprise deployments.</p></li><li><p><strong>Tight-budget performance wins:</strong> On FRP, RCS, and EDA evaluations under constrained memory budgets, DPM substantially outperforms summary-only memory, with the gap widening as the budget tightens. Under loose budgets the approaches converge, which is the expected pattern once the memory budget is no longer the binding constraint.</p></li><li><p><strong>A blueprint for regulated deployments:</strong> For teams operationalizing agents in finance, healthcare, or other compliance-heavy industries, the paper reads as a practical specification. It maps existing distributed-systems discipline onto agent memory instead of inventing a new category, which is why it is likely to age well.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.20158">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2047325132096758228">Tweet</a></strong></p><div><hr></div><h2><strong>8. 
There Will Be a Scientific Theory of Deep Learning</strong></h2><p>A position paper arguing that a genuine scientific theory of deep learning is already taking shape under the umbrella of &#8220;learning mechanics.&#8221; The authors identify five converging research directions (solvable idealized models, tractable mathematical limits, simple macroscopic laws, hyperparameter theories, and universal cross-system behaviors) that share a common signature: they describe training dynamics, target coarse aggregate statistics, and commit to falsifiable quantitative predictions. The framing pushes back on skepticism about whether deep learning can have fundamental theory and positions learning mechanics as a complement to mechanistic interpretability, not a competitor.</p><p><strong><a href="https://arxiv.org/abs/2604.21691">Paper</a></strong> | <strong><a href="https://x.com/learning_mech/status/2047723849874330047">Tweet</a></strong></p><div><hr></div><h2><strong>9. MASS-RAG</strong></h2><p>Most real-world RAG failures come from retrieving technically relevant but contextually useless documents, then forcing a single model to reconcile them. MASS-RAG is a multi-agent synthesis framework for retrieval-augmented generation where specialized agents handle distinct roles: retrieving candidate documents, assessing their relevance to the query, and synthesizing the final answer from evidence that actually contributes. Instead of one model doing everything, responsibility is decomposed across coordinated evaluators, which fits the direction the field is heading for deep research agents.</p><p><strong><a href="https://arxiv.org/abs/2604.18509">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2046594362931556728">Tweet</a></strong></p><div><hr></div><h2><strong>10. 
Diversity Collapse in Multi-Agent LLMs</strong></h2><p>Every multi-agent system pitch assumes agents explore different solutions, but this paper shows they converge on near-identical outputs over time, even across different architectures and different starting prompts. The authors call it diversity collapse. The cause is structural coupling: shared context, shared task descriptions, and mutual feedback pull every agent toward the same attractor. They measure it formally with metrics like the Vendi score, and the homogenization is real. The practical consequence is that multi-agent setups for brainstorming, hypothesis generation, and ideation only work if teams explicitly engineer isolated reasoning phases, decoupled evaluation, and heterogeneous starting conditions.</p><p><strong><a href="https://arxiv.org/abs/2604.18005">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2047326894992081296">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: GPT-5.5, DeepSeek-V4 Preview, Kimi K2.6 Agent Swarm, Diversity Collapse, Sakana Fugu, and More]]></title><description><![CDATA[GPT-5.5, DeepSeek-V4 Preview, Kimi K2.6 Agent Swarm, Diversity Collapse, Sakana Fugu, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-gpt-55-deepseek</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-gpt-55-deepseek</guid><pubDate>Sat, 25 Apr 2026 15:02:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Pd0K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>OpenAI ships GPT-5.5</p></li><li><p>DeepSeek open-sources V4 Preview</p></li><li><p>Kimi releases K2.6 Agent Swarm</p></li><li><p>ACL paper flags diversity collapse in multi-agent 
LLMs</p></li><li><p>Sakana launches Fugu multi-agent beta</p></li><li><p>ChatGPT gets Workspace Agents</p></li><li><p>Codex adds Chronicle screen memory</p></li><li><p>Qwen3.6-27B drops as a flagship dense coding model</p></li><li><p>Gemini Deep Research Max lands</p></li><li><p>Google unveils eighth-generation TPUs</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>GPT-5.5</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pd0K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pd0K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png 424w, https://substackcdn.com/image/fetch/$s_!Pd0K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png 848w, https://substackcdn.com/image/fetch/$s_!Pd0K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!Pd0K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Pd0K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png" width="1456" height="711" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:711,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!Pd0K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png 424w, https://substackcdn.com/image/fetch/$s_!Pd0K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png 848w, https://substackcdn.com/image/fetch/$s_!Pd0K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!Pd0K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c15afde-f25d-4bb9-b6ca-45d83696254d_2364x1154.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p>OpenAI released GPT-5.5, a new class of model built specifically for agentic work. 
It is designed to understand complex multi-step goals, use tools, check its own work, and carry tasks through to completion, and it now powers both ChatGPT and Codex.</p><ul><li><p><strong>Agentic-first design:</strong> GPT-5.5 targets messy, multi-part jobs and is tuned to plan, invoke tools, navigate ambiguity, and keep going until the task is done rather than stopping at a single response.</p></li><li><p><strong>Strongest gains where it matters:</strong> The biggest jumps are in agentic coding, computer use, knowledge work, and early scientific research, with ChatGPT using full-stack inference improvements to serve the model faster per token.</p></li><li><p><strong>GPT-5.5 Pro for hard jobs:</strong> A new GPT-5.5 Pro tier is rolling out to Pro, Business, and Enterprise users for demanding tasks, with efficiency gains that make Pro a practical default on long reasoning runs.</p></li><li><p><strong>Rollout:</strong> GPT-5.5 is available today in ChatGPT and Codex for Plus, Pro, Business, and Enterprise users, with GPT-5.5 Pro limited to the Pro, Business, and Enterprise tiers.</p></li></ul><p><strong><a href="https://openai.com/index/introducing-gpt-5-5/">Blog</a></strong></p>
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-gpt-55-deepseek">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (April 13 - April 19)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-717</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-717</guid><pubDate>Sun, 19 Apr 2026 15:03:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!U88-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The Top AI Papers of the Week (April 13 - April 19)</p><h2><strong>1. Automated Weak-to-Strong Researcher</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U88-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U88-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 424w, https://substackcdn.com/image/fetch/$s_!U88-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 848w, https://substackcdn.com/image/fetch/$s_!U88-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!U88-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!U88-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg" width="1456" height="655" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:655,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Automated W2S Researcher&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Automated W2S Researcher" title="Automated W2S Researcher" srcset="https://substackcdn.com/image/fetch/$s_!U88-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 424w, https://substackcdn.com/image/fetch/$s_!U88-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 848w, https://substackcdn.com/image/fetch/$s_!U88-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!U88-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4edfa9c-1f6a-4bab-a59f-24536af29925_1797x809.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Anthropic shows that Claude can make fully autonomous progress on scalable oversight research. A team of parallel Automated Alignment Researchers (AARs) built on Claude Opus 4.6 proposes ideas, runs experiments, and iterates on weak-to-strong supervision, a core alignment problem where a stronger model must learn from a weaker teacher. 
The system closes almost the entire performance gap that human researchers left open, at a total cost of roughly $18K in tokens and model training.</p><ul><li><p><strong>Performance gap recovered as the metric:</strong> The authors evaluate progress with performance gap recovered (PGR), a 0 to 1 score where 0 matches the weak teacher and 1 matches a ground-truth-supervised student. On a chat preference dataset, two human researchers achieved PGR 0.23 after seven days of iteration on four promising generalization methods.</p></li><li><p><strong>AARs reach 0.97 PGR in five days:</strong> Running nine Claude-based agents in parallel sandboxes, the automated system reached PGR 0.97 in five days and 800 cumulative agent-hours. The cost was about $18,000, or roughly $22 per AAR-hour. This is one of the strongest empirical data points yet that AI can drive measurable progress on open alignment problems.</p></li><li><p><strong>Forum-based collaboration between agents:</strong> Each AAR works in its own isolated sandbox but shares findings to a common forum and uploads codebase snapshots to shared storage. The setup mirrors how a small research team would coordinate, letting later agents build on earlier wins without merging execution environments.</p></li><li><p><strong>Reward hacking as a real outcome, not a hypothetical:</strong> The agents sometimes succeeded through unexpected mechanisms, including reward-hacking behaviors that the researchers did not anticipate. The result highlights the double-edged nature of automated research: measurable progress on outcome-gradable problems is practical today, but careful metric design remains a human responsibility.</p></li></ul><p><strong><a href="https://alignment.anthropic.com/2026/automated-w2s-researcher/">Paper</a></strong> | <strong><a href="https://x.com/janleike/status/2044139528596910584">Tweet</a></strong></p><div><hr></div><h2><strong>2. 
AiScientist</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T3D7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T3D7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 424w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 848w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 1272w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T3D7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png" width="996" height="393" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:393,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AiScientist&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AiScientist" title="AiScientist" srcset="https://substackcdn.com/image/fetch/$s_!T3D7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 424w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 848w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 1272w, https://substackcdn.com/image/fetch/$s_!T3D7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca64923-a03d-4c31-9995-a129f198dca2_996x393.png 1456w" sizes="100vw"></picture>
</div></a></figure></div><p>Long-horizon AI research agents are mostly a state-management problem. Reasoning well for the next turn is not enough when ML research demands task setup, implementation, experiments, debugging, and evidence tracking over hours or days. This paper introduces AiScientist, a system for autonomous long-horizon engineering built around the principle of thin control and thick state. A top-level orchestrator manages stage-level progress while specialized agents repeatedly ground themselves in durable workspace artifacts.</p><ul><li><p><strong>File-as-Bus coordination:</strong> AiScientist&#8217;s core design choice is to route coordination through durable filesystem artifacts rather than in-context message passing. 
Analyses, plans, code, logs, and experimental evidence all live as versioned files in a permission-scoped workspace, allowing specialists and subagents to reconstruct context from scratch without replaying entire conversations.</p></li><li><p><strong>Thin control, thick state:</strong> A Tier-0 orchestrator issues only stage-level directives, while Tier-1 specialists and optional Tier-2 subagents operate on shared artifacts. This keeps the control channel narrow and the state channel rich, giving agents the space to run long experiments without losing track of prior decisions and evidence.</p></li><li><p><strong>Strong benchmark results:</strong> The system improves PaperBench by 10.54 points over the best matched baseline and reaches 81.82 Any Medal% on MLE-Bench Lite. Removing File-as-Bus drops PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points, isolating the artifact-mediated design as the primary driver of gains.</p></li><li><p><strong>Durable project memory over longer chats:</strong> The work argues that autonomous research agents need persistent project memory, not just longer context windows. The results generalize the emerging pattern that environments carrying state on behalf of agents outperform architectures that rely solely on in-context reasoning for multi-hour workflows.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.13018">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2044436099121209546">Tweet</a></strong></p><div><hr></div><h2><strong>3. 
AlphaEval</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vS7D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vS7D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 424w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 848w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 1272w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vS7D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png" width="635" height="331" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/655c258e-96c9-40fa-8e4c-934901545aea_635x331.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:331,&quot;width&quot;:635,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AlphaEval&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AlphaEval" title="AlphaEval" srcset="https://substackcdn.com/image/fetch/$s_!vS7D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 424w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 848w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 1272w, https://substackcdn.com/image/fetch/$s_!vS7D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F655c258e-96c9-40fa-8e4c-934901545aea_635x331.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Agent evaluations are drifting away from production reality. Most benchmarks use clean tasks, well-specified requirements, deterministic metrics, and retrospective curation. Production work is messier, with implicit constraints, fragmented multimodal inputs, undeclared domain knowledge, long-horizon deliverables, and expert judgment that evolves over time. This paper introduces AlphaEval, a production-grounded benchmark evaluating agents as complete products rather than model APIs.</p><ul><li><p><strong>Seven companies, six O*NET domains:</strong> AlphaEval contains 94 tasks sourced from seven companies deploying AI agents in core business workflows across six O*NET domains. 
The tasks preserve production complexity rather than stripping it away, giving the benchmark a materially different distribution from prior coding-centric evaluations.</p></li><li><p><strong>Products, not model APIs:</strong> The benchmark evaluates commercial agent products such as Claude Code and Codex end to end, not the underlying models in isolation. This is a deliberate shift toward measuring the full agent experience that users actually pay for, including tool use, orchestration, and UI behaviors.</p></li><li><p><strong>Six production-specific failure modes:</strong> The authors identify cascade dependencies, subjective judgment collapse, information retrieval failures, cross-section inconsistency, constraint misinterpretation, and format compliance as failure modes that remain invisible to coding benchmarks. The best configuration (Claude Code with Opus 4.6) scores only 64.41/100, exposing a substantial research-to-production gap.</p></li><li><p><strong>Multi-paradigm evaluation:</strong> AlphaEval combines LLM-as-a-Judge, reference-driven metrics, formal verification, rubric-based assessment, automated UI testing, and domain-specific checks. The key practical contribution is a requirement-to-benchmark framework that turns production requirements into executable evals with minimal friction for organizations.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.12162">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2044773323914322393">Tweet</a></strong></p><div><hr></div><h2><strong>4. 
Nemotron 3 Super</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3ns9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3ns9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 424w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 848w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 1272w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3ns9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png" width="996" height="374" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:374,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Nemotron 3 
Super&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Nemotron 3 Super" title="Nemotron 3 Super" srcset="https://substackcdn.com/image/fetch/$s_!3ns9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 424w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 848w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 1272w, https://substackcdn.com/image/fetch/$s_!3ns9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23494d1-986d-4ed6-9cf0-2c8afdc5be67_996x374.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>NVIDIA introduces Nemotron 3 Super, an open 120B parameter model with 12B active parameters, built as a hybrid Mamba-Attention Mixture-of-Experts architecture optimized for agentic reasoning. The model targets long-context, high-throughput inference, a capability increasingly central to running agents reliably. It supports up to 1M context length while delivering up to 2.2x higher throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B, at comparable benchmark accuracy.</p><ul><li><p><strong>Hybrid Mamba-Attention with LatentMoE:</strong> The architecture blends Mamba blocks with sparse LatentMoE layers, a new Mixture-of-Experts design that projects tokens into a smaller latent dimension for routing and expert computation. This improves both accuracy per FLOP and accuracy per parameter, and it is what allows the model to scale sparsely without paying a standard MoE memory tax.</p></li><li><p><strong>NVFP4 pretraining at scale:</strong> Nemotron 3 Super is the first model in the Nemotron 3 family to be pretrained in NVFP4, enabling training on 25 trillion tokens while keeping compute and memory overhead manageable. 
Post-training combines supervised fine-tuning and reinforcement learning on top of this base.</p></li><li><p><strong>Native speculative decoding via MTP layers:</strong> Multi-Token Prediction (MTP) layers are included for native speculative decoding during inference, reducing latency for long-context agentic workloads without requiring an external draft model. The team reports consistent MTP acceptance rates across draft depths on SPEED-Bench.</p></li><li><p><strong>Fully open artifacts:</strong> Nemotron 3 Super datasets, along with base, post-trained, and quantized checkpoints, are open-sourced on Hugging Face. This matters for teams building agent stacks that need efficient, inspectable, long-context models rather than closed API dependencies.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.12374">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2044452957023047943">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sVEa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sVEa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!sVEa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!sVEa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!sVEa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sVEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!sVEa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!sVEa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!sVEa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!sVEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ee8c8b9-b016-46ea-8e1a-ef21731651ef_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe 
Coding AI Apps with Claude Code</a>&#8221;. Learn how to leverage Claude Code features to vibecode production-grade AI-powered apps.</p><p><strong><a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Enroll Now</a></strong></p><div><hr></div><h2><strong>5. Memory Transfer Learning</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dlKK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dlKK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 424w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 848w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 1272w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dlKK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png" width="996" height="1186" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1186,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Memory Transfer Learning&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Memory Transfer Learning" title="Memory Transfer Learning" srcset="https://substackcdn.com/image/fetch/$s_!dlKK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 424w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 848w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 1272w, https://substackcdn.com/image/fetch/$s_!dlKK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89321a6-7419-4e0b-9406-64c6b37955ad_996x1186.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Coding agents learn from experience, but that knowledge stays locked in silos. Solve a thousand SWE tasks, and none of that wisdom helps with competitive coding. This paper introduces Memory Transfer Learning, a framework where coding agents share a unified memory pool across six heterogeneous coding benchmarks, testing what transfers between domains and what does not.</p><ul><li><p><strong>Unified memory pool across domains:</strong> The framework pools memories across six heterogeneous coding benchmarks rather than isolating them by task type. Cross-domain memory improves average performance by 3.7%, a modest but consistent lift that previously would have been invisible under standard single-domain evaluations.</p></li><li><p><strong>Abstraction dictates transferability:</strong> Four memory formats ranging from raw execution traces to high-level insights are compared. 
High-level insights generalize well, while low-level traces often cause negative transfer by anchoring agents to incompatible implementation details. The takeaway: memory design matters more than memory volume.</p></li><li><p><strong>Meta-knowledge, not code:</strong> The transferable value is not task-specific code but meta-knowledge such as validation routines, structured action workflows, and safe interaction patterns with execution environments. Algorithmic strategy transfer accounts for only 5.5% of the gains, with procedural guidance doing most of the work.</p></li><li><p><strong>Scaling and cross-model transfer:</strong> Transfer effectiveness scales with the size of the memory pool, and memory can even be shared across different models. Combined with the finding on abstraction levels, the results point toward memory systems that curate insights rather than simply logging everything the agent did.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.14004">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2044900659921895729">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
Auto-Diagnose</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2T-a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2T-a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 424w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 848w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 1272w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2T-a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png" width="812" height="138" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:138,&quot;width&quot;:812,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Auto-Diagnose&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Auto-Diagnose" title="Auto-Diagnose" srcset="https://substackcdn.com/image/fetch/$s_!2T-a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 424w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 848w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 1272w, https://substackcdn.com/image/fetch/$s_!2T-a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a4d604-c1c0-4dcf-8cc2-0963ad292005_812x138.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Integration test failures are painful because the signal is buried in messy logs. Massive output, heterogeneous systems, low signal-to-noise ratio, and unclear root causes leave developers scrolling through thousands of lines. 
This paper introduces Auto-Diagnose, an LLM-based tool deployed inside Google&#8217;s Critique code review system that analyzes failure logs, summarizes the most relevant lines, and suggests the root cause directly in the developer workflow.</p><ul><li><p><strong>In-workflow root cause assistance:</strong> Auto-Diagnose is integrated into Critique, Google&#8217;s internal code review system, so diagnoses appear where developers are already looking at the failure. Log streams from test drivers and systems under test, spread across data centers and threads, are joined and sorted by timestamp before being passed to the LLM.</p></li><li><p><strong>High diagnosis accuracy:</strong> In a manual evaluation of 71 real-world failures, Auto-Diagnose reached 90.14% root-cause diagnosis accuracy. This level of reliability is what justifies surfacing suggestions directly in a tool developers cannot ignore, rather than hiding them behind an opt-in query interface.</p></li><li><p><strong>Massive-scale deployment evidence:</strong> After Google-wide rollout, the tool was used across 52,635 distinct failing tests. User feedback marked it &#8220;Not helpful&#8221; in only 5.8% of cases, and it ranked #14 in helpfulness among 370 Critique tools. This is one of the clearest data points on production LLM tooling at scale inside a major company.</p></li><li><p><strong>A template for developer-facing LLM tools:</strong> The paper reads as a practical blueprint for embedding LLM-based diagnosis into existing engineering workflows. Rather than building a standalone product, the team integrated into the tool where the problem is already being reviewed, which likely explains the low &#8220;Not helpful&#8221; rate and high adoption.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.12108">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2044769798845079665">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
Subliminal Learning</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JlNa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JlNa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JlNa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg" width="1456" height="806" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:806,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Subliminal Learning&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Subliminal Learning" title="Subliminal Learning" srcset="https://substackcdn.com/image/fetch/$s_!JlNa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JlNa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01fad987-9d60-4423-b717-6a52959fb666_1984x1098.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The Subliminal Learning paper by Evans and colleagues is now published in Nature. The work showed that LLMs can transmit traits (such as a preference for owls) through data that appears unrelated to that trait, like sequences of numbers that look meaningless on inspection. The Nature version extends the original July 2025 preprint with new experiments, replications on Gemma, and a broader discussion of safety implications for AI systems trained on one another&#8217;s outputs.</p><ul><li><p><strong>Transfer across different initializations:</strong> The preprint showed subliminal transfer between models that shared an initialization. The new MNIST results demonstrate transfer between models with different initializations. 
Although a toy setup, it meaningfully broadens the scope of the effect beyond shared-weight scenarios.</p></li><li><p><strong>Misalignment transmitted through code and chain-of-thought:</strong> General misalignment, not just benign preferences, can also be transmitted subliminally. The new results show this transfer can happen through model-written code or chain-of-thought reasoning, not only through numeric sequences, which expands the attack and contamination surface considerably.</p></li><li><p><strong>Connections to independent follow-ups:</strong> The authors highlight concurrent work from Aden-Ali et al. (2026) showing trait transfer via standard post-training datasets filtered by the teacher, Draganov et al. (2026) demonstrating a cross-family &#8220;phantom transfer&#8221; data poisoning attack, and Weckbecker et al. (2026) describing a subliminal &#8220;virus&#8221; that spreads between agent groups. Together they suggest the phenomenon is robust, reproducible, and difficult to defend against.</p></li><li><p><strong>Implications for safety evaluations:</strong> The practical takeaway is that safety evaluations may need to examine not just model behavior, but the origins of models and the processes used to create training data. As systems increasingly train on each other&#8217;s outputs, properties invisible in the data can still be inherited, undermining evaluations that focus purely on observable responses.</p></li></ul><p><strong><a href="https://www.nature.com/articles/s41586-026-10319-8">Paper</a></strong> | <strong><a href="https://x.com/OwainEvans_UK/status/2044488099707949545">Tweet</a></strong></p><div><hr></div><h2><strong>8. LLM-as-a-Verifier</strong></h2><p>Test-time scaling is effective for agentic tasks, but picking the winner among many candidates is the bottleneck. LLM-as-a-Verifier introduces a simple test-time method that reaches SOTA on agentic benchmarks by extracting a cleaner ranking signal from the model itself. 
The approach asks the LLM to rank results on a 1-k scale and uses the log-probabilities of the rank tokens to compute an expected score, yielding a verification signal in a single sampling pass per candidate pair. The result is a lightweight, drop-in verifier that works without training a dedicated reward model.</p><p><strong><a href="https://llm-as-a-verifier.github.io/">Paper</a></strong> | <strong><a href="https://x.com/Azaliamirh/status/2043813128690192893">Tweet</a></strong></p><div><hr></div><h2><strong>9. WebXSkill</strong></h2><p>Web agents can navigate a page, but ask them to repeat a checkout flow they already completed and they start from scratch every time. WebXSkill is a skill learning framework where web agents extract reusable skills from synthetic trajectories, each pairing a parameterized action program with step-level natural language guidance. Two deployment modes let the agent either auto-execute skills as atomic tool calls (grounded) or follow them as step-by-step instructions while retaining autonomy to adapt (guided). On WebArena, WebXSkill improves task success by up to 9.8 points over baselines. On WebVoyager, grounded mode reaches 86.1%, a 14.2-point gain, and skills even transfer across environments.</p><p><strong><a href="https://arxiv.org/abs/2604.13318">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2045139481892880892">Tweet</a></strong></p><div><hr></div><h2><strong>10. Muses-Bench</strong></h2><p>Every agent framework assumes one user giving instructions, but in real team workflows agents have multiple bosses with conflicting goals, private information, and different authority levels. Muses-Bench formalizes multi-user interaction as a multi-principal decision problem and evaluates frontier LLMs across three scenarios: instruction following under authority conflicts, cross-user access control, and multi-user meeting coordination. 
Gemini-3-Pro tops the leaderboard at just 85.6% average, and no model exceeds 64.8% on meeting coordination. Privacy-utility tradeoffs are brutal: Grok-3-Mini scores 99.6% on privacy but collapses to 60.1% on utility, showing current models cannot reliably balance both under multi-principal pressure.</p><p><strong><a href="https://arxiv.org/abs/2604.08567">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2044067923787165799">Tweet</a></strong></p>]]></content:encoded></item><item><title><![CDATA[🤖 AI Agents Weekly: Claude Opus 4.7, Codex Everywhere, Claude Design, Windsurf 2.0, Qwen3.6-35B-A3B, AiScientist, and More]]></title><description><![CDATA[Claude Opus 4.7, Codex Everywhere, Claude Design, Windsurf 2.0, Qwen3.6-35B-A3B, AiScientist, and More]]></description><link>https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-opus-47-codex</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-opus-47-codex</guid><pubDate>Sat, 18 Apr 2026 15:01:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!491v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today&#8217;s issue:</p><ul><li><p>Anthropic ships Claude Opus 4.7</p></li><li><p>Codex extends to Mac apps</p></li><li><p>Claude Design enters research preview</p></li><li><p>Windsurf 2.0 delegates to Devin</p></li><li><p>Qwen drops 3.6-35B-A3B open weights</p></li><li><p>OpenAI Agents SDK adds sandboxes</p></li><li><p>Gemini CLI adds subagents</p></li><li><p>FrontierSWE benchmark launches</p></li><li><p>NVIDIA releases Nemotron 3 Super</p></li><li><p>AiScientist lifts long-horizon research</p></li></ul><p>And all the top AI dev news, papers, and tools.</p><div><hr></div><div><hr></div><h2><strong>Top Stories</strong></h2><h3><strong>Claude Opus 4.7</strong></h3><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!491v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!491v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!491v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!491v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!491v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!491v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg" width="1080" height="1080" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1080,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Claude Opus 
4.7&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude Opus 4.7" title="Claude Opus 4.7" srcset="https://substackcdn.com/image/fetch/$s_!491v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!491v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!491v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!491v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cf1cdc9-5e32-4698-91f6-6f4c6f0ea1bf_1080x1080.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Anthropic released Claude Opus 4.7, its most capable Opus model yet, built for long-running agentic work with more rigorous self-verification and tighter instruction following. Opus 4.7 also powers the new Claude Design product and Anthropic&#8217;s Glasswing cybersecurity frontier model.</p><ul><li><p><strong>Self-verifying long-running work:</strong> Opus 4.7 checks its own outputs before reporting back and handles multi-hour tasks with less supervision, making it a stronger default for hand-offs where the agent must own the full loop.</p></li><li><p><strong>Vision upgrade:</strong> The model sees images at more than three times the resolution of Opus 4.6 and produces higher-quality interfaces, slides, and documents, which is the foundation for the new Claude Design research preview.</p></li><li><p><strong>New reasoning and budget controls:</strong> A new xhigh effort level between high and max gives developers finer latency/quality tradeoffs on hard problems.
Task budgets (beta) let Claude prioritize work and manage cost across longer runs.</p></li><li><p><strong>Claude Code upgrades:</strong> A new /ultrareview command runs a dedicated review pass over changes and flags what a careful reviewer would catch. Auto mode now extends to Max users, so long tasks run with fewer interruptions.</p></li></ul><p><strong><a href="https://www.anthropic.com/news/claude-opus-4-7">Blog</a></strong></p>
      <p>
          <a href="https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-opus-47-codex">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🥇Top AI Papers of the Week]]></title><description><![CDATA[The Top AI Papers of the Week (April 6 - April 12)]]></description><link>https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-831</link><guid isPermaLink="false">https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-831</guid><pubDate>Sun, 12 Apr 2026 15:02:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1pgB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>1. Neural Computers</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fEae!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fEae!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 424w, https://substackcdn.com/image/fetch/$s_!fEae!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 848w, https://substackcdn.com/image/fetch/$s_!fEae!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 1272w, 
https://substackcdn.com/image/fetch/$s_!fEae!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fEae!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png" width="1085" height="660" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:660,&quot;width&quot;:1085,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!fEae!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 424w, https://substackcdn.com/image/fetch/$s_!fEae!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 848w, https://substackcdn.com/image/fetch/$s_!fEae!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 1272w, 
https://substackcdn.com/image/fetch/$s_!fEae!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6f5d63-9d6f-44cd-ad5b-60568b9d44e6_1085x660.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Researchers from Meta AI and KAUST propose Neural Computers (NCs), an emerging machine form that unifies computation, memory, and I/O in a single learned runtime state.
Unlike conventional computers that execute explicit programs, agents that act over external environments, or world models that learn dynamics, NCs aim to make the model itself the running computer, establishing a new computing paradigm.</p><ul><li><p><strong>From hardware stack to neural latent stack:</strong> Classical computers separate compute, memory, and I/O into modular hardware layers. Neural Computers collapse all three into a single latent runtime state carried by a neural network. The model&#8217;s hidden state serves simultaneously as working memory, computational substrate, and interface layer, removing the boundary between program and execution environment.</p></li><li><p><strong>Video models as prototype substrate:</strong> The team instantiates NCs as video models that generate screen frames from instructions, pixel inputs, and user actions. Two prototypes cover command-line interfaces (NCCLIGen, which renders and executes terminal workflows) and graphical desktops (NCGUIWorld, which learns pointer dynamics and menu interactions), both trained without access to internal program state.</p></li><li><p><strong>Early runtime primitives emerge:</strong> The prototypes demonstrate that learned runtimes can acquire I/O alignment and short-horizon control directly from raw interface traces. CLI models execute short command chains with structurally accurate output rendering, while GUI models learn coherent click feedback and window transitions in controlled settings.</p></li><li><p><strong>Roadmap toward Completely Neural Computers:</strong> The long-term target is the CNC: a system that is Turing complete, universally programmable, and behavior-consistent unless explicitly reprogrammed. 
Key open challenges include routine reuse across sessions, controlled capability updates without catastrophic forgetting, and stable symbolic processing for long-horizon reasoning.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.06425">Paper</a></strong> | <strong><a href="https://x.com/SchmidhuberAI/status/2042601088029708704">Tweet</a></strong></p><div><hr></div><h2><strong>2. Memento: Teaching LLMs to Manage Their Own Context</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1pgB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1pgB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 424w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 848w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 1272w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!1pgB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png" width="1456" height="1064" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1064,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!1pgB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 424w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 848w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 1272w, https://substackcdn.com/image/fetch/$s_!1pgB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e175ca-44d1-470d-8451-86ef61e5b8d2_2082x1522.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>New research from Microsoft teaches reasoning models to compress their own chain-of-thought mid-generation. Memento trains models to segment reasoning into blocks, summarize each block into a compact &#8220;memento,&#8221; and then evict the original block from the KV cache. The model continues reasoning from mementos alone, cutting peak memory by 2-3x while nearly doubling throughput.</p><ul><li><p><strong>Block-and-compress architecture:</strong> The model learns to mark reasoning boundaries using special tokens, produce a terse summary capturing key conclusions and intermediate values, and then drop the full block from context.
From that point forward, the model sees only past mementos plus the current active block, keeping context compact without losing critical information.</p></li><li><p><strong>KV cache reduction with minimal accuracy loss:</strong> Applied to five models (Qwen2.5-7B, Qwen3 8B and 32B, Phi-4 Reasoning 14B, and OLMo3-7B-Think), Memento achieves 2-3x peak KV cache reduction with small accuracy gaps that shrink at larger scales. The erased blocks still leave useful traces in the KV cache that the model leverages.</p></li><li><p><strong>Practical throughput gains:</strong> Beyond memory savings, the reduced context length translates directly to faster inference. The approach nearly doubles serving throughput, making it immediately useful for production deployments where both latency and memory are constraints.</p></li><li><p><strong>Open resources:</strong> Microsoft released the full codebase under the MIT license, the OpenMementos dataset containing 228K reasoning traces with block segmentation and compressed summaries, and a custom vLLM fork for KV cache block masking. Standard supervised fine-tuning on approximately 30K examples is sufficient to teach this capability.</p></li></ul><p><strong><a href="https://github.com/microsoft/memento">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2042315710173528122">Tweet</a></strong></p><div><hr></div><h2><strong>3.
Memory Intelligence Agent (MIA)</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mD5U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mD5U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 424w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 848w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 1272w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mD5U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png" width="1456" height="750" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:750,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!mD5U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 424w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 848w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 1272w, https://substackcdn.com/image/fetch/$s_!mD5U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70e3a376-166b-49f3-938a-25d615842f25_2822x1454.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Most memory-augmented research agents treat memory as a static retrieval store, leading to inefficient evolution and rising storage costs. MIA introduces a Manager-Planner-Executor architecture where a Memory Manager maintains compressed search trajectories, a Planner generates strategies, and an Executor searches and analyzes information. The framework boosts GPT-5.4 by up to 9% on LiveVQA through bidirectional memory conversion.</p><ul><li><p><strong>Bidirectional memory conversion:</strong> MIA enables transformation between parametric memory (model weights) and non-parametric memory (retrieved context) in both directions.
This allows the system to internalize frequently accessed knowledge while keeping rare or volatile information in retrievable form, optimizing both storage efficiency and access speed.</p></li><li><p><strong>Alternating reinforcement learning:</strong> The three agents are trained through alternating RL, where each agent&#8217;s policy improves in response to the others&#8217; behavior. This co-evolutionary training encourages the agents to develop complementary strategies rather than compete for the same signal.</p></li><li><p><strong>Test-time parametric updates:</strong> Unlike standard retrieval-augmented systems, MIA can update its parametric memory on the fly during inference. This test-time learning allows the agent to adapt to new domains and evolving information without retraining, maintaining relevance as the information landscape changes.</p></li><li><p><strong>Broad benchmark coverage:</strong> The framework demonstrates improvements across 11 benchmarks spanning question answering, knowledge-intensive tasks, and long-form research synthesis. The gain of up to 9% on LiveVQA is particularly notable given that video question answering demands effective memory management across temporal sequences.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.04503">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2041895109252542730">Tweet</a></strong></p><div><hr></div><h2><strong>4. Single-Agent LLMs vs.
Multi-Agent Systems</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fvx7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fvx7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 424w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 848w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fvx7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Single vs Multi Agent&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Single vs Multi Agent" title="Single vs Multi Agent" srcset="https://substackcdn.com/image/fetch/$s_!fvx7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 424w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 848w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!fvx7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77623480-0269-42f4-bcbb-b4c2d8b6d558_1584x1056.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>More agents, better results, right? Not so fast. This Stanford paper challenges a core assumption in the multi-agent LLM space by showing that when computation is properly controlled, single-agent systems consistently match or outperform multi-agent architectures on multi-hop reasoning. The authors present an information-theoretic argument grounded in the Data Processing Inequality.</p><ul><li><p><strong>Computation as the hidden confounder:</strong> Most reported multi-agent gains are confounded by increased test-time computation rather than architectural advantages.
When reasoning token budgets are held constant, the performance gap disappears or reverses, suggesting that prior comparisons were inadvertently measuring compute scaling rather than coordination benefits.</p></li><li><p><strong>Information-theoretic foundation:</strong> The authors ground their analysis in the Data Processing Inequality, arguing that under a fixed reasoning-token budget with perfect context utilization, single-agent systems are inherently more information-efficient. Distributing reasoning across agents introduces information loss at each handoff.</p></li><li><p><strong>Benchmark artifacts inflate MAS gains:</strong> Testing across Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5, the study identifies significant evaluation artifacts, particularly in API-based budget control for Gemini 2.5, that inflate apparent multi-agent advantages. Standard benchmarks also contain structural biases favoring multi-agent decomposition.</p></li><li><p><strong>Practical implications for system design:</strong> The findings suggest that teams should explicitly control for compute, context, and coordination trade-offs before committing to multi-agent architectures. 
In many cases, allocating the same token budget to a single agent with richer context yields stronger results at lower system complexity.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.02460">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2041534488342360305">Tweet</a></strong></p><div><hr></div><h2><strong>Message from the Editor</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NAtL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NAtL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NAtL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg" width="1456" 
height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vibe Coding AI Apps&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vibe Coding AI Apps" title="Vibe Coding AI Apps" srcset="https://substackcdn.com/image/fetch/$s_!NAtL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 848w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!NAtL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b69c0a-1751-4050-b088-08eef5912a09_2626x1504.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Excited to announce our new on-demand course &#8220;<a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Vibe Coding AI Apps with Claude Code</a>&#8221;. Learn how to use Claude Code features to vibe-code production-grade AI-powered apps.</p><p><strong><a href="https://academy.dair.ai/courses/build-apps-with-claude-code">Enroll Now</a></strong></p><div><hr></div><h2><strong>5.
The Universal Verifier for Agent Benchmarks</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4ydR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4ydR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 424w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 848w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 1272w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4ydR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png" width="887" height="348" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:348,&quot;width&quot;:887,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Universal Verifier&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Universal Verifier" title="Universal Verifier" srcset="https://substackcdn.com/image/fetch/$s_!4ydR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 424w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 848w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 1272w, https://substackcdn.com/image/fetch/$s_!4ydR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ac36af1-218d-4f76-8b23-6be960fa2769_887x348.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Every agent benchmark has the same hidden problem: how do you know the agent actually succeeded? Microsoft researchers introduce the Universal Verifier, built on four design principles for reliable evaluation of computer-use agent trajectories. The verifier reduces false positive rates to near zero, down from 45%+ with WebVoyager and 22%+ with WebJudge.</p><ul><li><p><strong>Four design principles:</strong> The verifier is built on non-overlapping rubric criteria to reduce noise, separate process and outcome rewards for complementary signals, cascading error-free assessment that distinguishes controllable from uncontrollable failures, and divide-and-conquer context management that attends to all screenshots in a trajectory.</p></li><li><p><strong>Near-zero false positives:</strong> Current verifiers suffer from alarmingly high false positive rates that corrupt both benchmark scores and training data.
The Universal Verifier achieves agreement with human judges that matches inter-human agreement rates, making it reliable enough for both evaluation and RL reward signal generation.</p></li><li><p><strong>Cumulative design gains:</strong> No single design choice dominates the performance improvement. The authors demonstrate that gains result from the cumulative effect of all four principles working together, with each contributing meaningful improvements that compound rather than any one serving as a silver bullet.</p></li><li><p><strong>Limits of automated research:</strong> An interesting meta-finding: the team used an auto-research agent to replicate the verifier design process. The agent reached 70% of expert verifier quality in 5% of the time but could not discover the structural design decisions that drove the biggest gains, suggesting human insight remains essential for system-level design.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.06240">Paper</a></strong> | <strong><a href="https://x.com/omarsar0/status/2042249194409501054">Tweet</a></strong></p><div><hr></div><h2><strong>6. 
Scaling Coding Agents via Atomic Skills</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fjUh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fjUh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 424w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 848w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 1272w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fjUh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png" width="1456" height="627" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:627,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Scaling Coding Agents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Scaling Coding Agents" title="Scaling Coding Agents" srcset="https://substackcdn.com/image/fetch/$s_!fjUh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 424w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 848w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 1272w, https://substackcdn.com/image/fetch/$s_!fjUh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f87cb6-ca53-45d7-9fcf-7896d1ce987f_2560x1103.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Most coding agents train end-to-end on full tasks like resolving GitHub issues, leading to task-specific overfitting that limits generalization. This paper proposes a different approach: identifying five atomic coding skills (code localization, code editing, unit-test generation, issue reproduction, and code review) and training agents through joint reinforcement learning over these foundational competencies.</p><ul><li><p><strong>Atomic skill decomposition:</strong> Instead of treating software engineering as monolithic composite tasks, the framework formalizes five fundamental operations that compose into higher-level capabilities.
Think of it as teaching an agent the alphabet of coding rather than memorizing specific sentences, enabling flexible recombination across novel task types.</p></li><li><p><strong>Joint RL across skills:</strong> The agents are trained through joint reinforcement learning that optimizes performance across all five atomic skills simultaneously. This joint training produces representations that capture the underlying structure shared across coding operations rather than surface-level patterns tied to specific benchmarks.</p></li><li><p><strong>Strong generalization to unseen tasks:</strong> Joint RL improves average performance by 18.7% across both the five atomic skills and five composite tasks. The improvements transfer to unseen composite tasks including bug-fixing, code refactoring, ML engineering, and code security, none of which were directly optimized during training.</p></li><li><p><strong>A new scaling paradigm:</strong> The work establishes that scaling coding agents through foundational skill mastery is more sample-efficient and transferable than task-level optimization. As the number and complexity of software engineering tasks grow, this compositional approach offers a more sustainable path than continuously expanding task-specific training sets.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.05013">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2042237615492260249">Tweet</a></strong></p><div><hr></div><h2><strong>7. 
Agent Skills in the Wild</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UEmi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UEmi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 424w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 848w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 1272w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UEmi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png" width="997" height="377" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:377,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Agent Skills in the Wild&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Agent Skills in the Wild" title="Agent Skills in the Wild" srcset="https://substackcdn.com/image/fetch/$s_!UEmi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 424w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 848w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 1272w, https://substackcdn.com/image/fetch/$s_!UEmi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F157f8c6a-199d-46f7-af9e-a1b4c6d676c8_997x377.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Agent skills look great in demos. Hand an agent a curated toolbox, and it shines. But what happens when the agent has to find the right skill from a library of 34,000? This paper from UC Santa Barbara and MIT presents the first comprehensive study of skill utility under progressively realistic settings, revealing that the benefits of skills are far more fragile than current evaluations suggest.</p><ul><li><p><strong>Progressive difficulty framework:</strong> The study moves from idealized conditions with hand-crafted, task-specific skills to realistic scenarios requiring retrieval from 34K real-world skills. Performance gains degrade consistently at each step, with pass rates approaching no-skill baselines in the most challenging scenarios.</p></li><li><p><strong>Retrieval as the bottleneck:</strong> The core failure mode is not skill execution but skill selection.
When agents must identify the right skill from a massive library, the retrieval step introduces errors that cascade through execution, highlighting a fundamental gap between demo-ready and production-ready skill systems.</p></li><li><p><strong>Refinement strategies help but do not close the gap:</strong> Query-specific and query-agnostic refinement approaches improve results, with Claude Opus 4.6 going from 57.7% to 65.5% on Terminal-Bench 2.0. However, even with refinement, performance under realistic retrieval conditions remains well below idealized baselines.</p></li><li><p><strong>Implications for skill ecosystems:</strong> As the ecosystem of agent skills grows through frameworks like MCP, the findings suggest that simply expanding the skill library yields diminishing returns without corresponding advances in skill discovery. Quality of skill retrieval may matter more than quantity of available skills.</p></li></ul><p><strong><a href="https://arxiv.org/abs/2604.04323">Paper</a></strong> | <strong><a href="https://x.com/dair_ai/status/2041540525539614797">Tweet</a></strong></p><div><hr></div><h2><strong>8. MedGemma 1.5</strong></h2><p>Google releases the MedGemma 1.5 technical report, introducing a 4B-parameter medical AI model that expands capabilities to 3D medical imaging (CT/MRI volumes), whole-slide pathology, multi-timepoint chest X-ray analysis, and improved medical document understanding. The model achieves notable gains, including a 47% macro F1 improvement on whole-slide pathology and 22% on EHR question answering, positioning it as an open foundation for next-generation medical AI systems.</p><p><strong><a href="https://arxiv.org/abs/2604.05081">Paper</a></strong> | <strong><a href="https://x.com/SRSchmidgall/status/2041973798589903260">Tweet</a></strong></p><div><hr></div><h2><strong>9.
LightThinker++: From Reasoning Compression to Memory Management</strong></h2><p>While LLMs excel at complex reasoning, long thought traces add steadily growing cognitive overhead. LightThinker++ moves beyond static compression by introducing three explicit memory primitives: Commit (archive a step as a compact summary), Expand (retrieve past steps for verification), and Fold (collapse context to maintain a clean signal). The framework reduces peak token usage by 70% while improving accuracy by 2.42% on standard reasoning tasks, and it remains stable beyond 80 rounds on long-horizon agentic tasks with a 14.8% average performance improvement.</p><p><strong><a href="https://arxiv.org/abs/2604.03679">Paper</a></strong> | <strong><a href="https://x.com/zxlzr/status/2041881875887878237">Tweet</a></strong></p><div><hr></div><h2><strong>10. Thinking Mid-training: RL of Interleaved Reasoning</strong></h2><p>Meta FAIR addresses the gap between pretraining (no explicit reasoning) and post-training (reasoning-heavy) with an intermediate SFT+RL mid-training phase. The approach annotates pretraining data with interleaved reasoning traces, then uses supervised fine-tuning followed by RL to teach models when and how to think during continued pretraining. Applied to Llama-3-8B, the full pipeline achieves a 3.2x improvement on reasoning benchmarks compared to direct RL post-training, demonstrating that reasoning benefits from being trained as native behavior early in the pipeline.</p><p><strong><a href="https://facebookresearch.github.io/RAM/blogs/thinking_midtraining/">Paper</a></strong> | <strong><a href="https://x.com/jaseweston/status/2041864833214095484">Tweet</a></strong></p>]]></content:encoded></item></channel></rss>