1). PaperBench
OpenAI introduces a new benchmark, PaperBench, to test whether AI agents can replicate cutting-edge machine learning research papers from scratch.
A rigorous replication challenge – PaperBench evaluates agents on reproducing entire ML papers from ICML 2024 (20 total, across 12 research areas). Agents must understand the paper, build the codebase from scratch, and run experiments to match results. Each paper comes with a fine-grained rubric (~8,316 tasks total) co-designed with the original authors.
Automatic grading with LLM judges – To make evaluation scalable, the team built a rubric-based judge (o3-mini with scaffolding) that scores replications with high agreement (F1 = 0.83) against human experts; a minimal sketch of this style of grading follows this list. They also release JudgeEval, a benchmark for assessing judge accuracy.
Frontier model performance is modest – Claude 3.5 Sonnet scored highest with 21.0%, followed by o1 (13.2%) and GPT-4o (4.1%). Even with longer runtimes and prompt tuning (IterativeAgent), no model surpassed a 26.0% score. By contrast, ML PhDs hit 41.4% on a 3-paper subset in 48 hours, showing humans still lead in long-horizon agentic tasks.
Code-Dev variant for lightweight evals – A simplified version, PaperBench Code-Dev, skips execution and grades only code structure. o1 scored 43.4% there, showing more promise when runtime issues are excluded.
Failure modes and insights – Models often “gave up early,” lacked strategic planning, and failed to iterate. Claude did better with BasicAgent (freer form), while o1 benefited from IterativeAgent (structured prompts). This highlights how sensitive agents are to prompting and scaffolding.
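As a rough illustration of how this kind of rubric-based LLM judging can be wired up, here is a minimal sketch. The `ask` callable, the `RubricItem` fields, and the flat rubric are illustrative assumptions, not the actual PaperBench judge (which uses o3-mini with custom scaffolding over hierarchical, weighted rubrics):

```python
# Minimal sketch of rubric-based grading with an LLM judge. The `ask` callable
# is a hypothetical stand-in for the judge model; the real PaperBench judge
# uses hierarchical, weighted rubrics and extra scaffolding.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    requirement: str   # e.g. "Reproduces Table 2 within reported tolerances"
    weight: float      # relative importance assigned when the rubric was written

def judge_replication(rubric: list[RubricItem],
                      submission_summary: str,
                      ask: Callable[[str], str]) -> float:
    """Return a weighted score in [0, 1] for one replication attempt."""
    earned = total = 0.0
    for item in rubric:
        prompt = (
            "You are grading an attempt to replicate an ML paper.\n"
            f"Requirement: {item.requirement}\n"
            f"Submission (code, logs, results):\n{submission_summary}\n"
            "Answer strictly YES or NO: is the requirement satisfied?"
        )
        earned += item.weight * ask(prompt).strip().upper().startswith("YES")
        total += item.weight
    return earned / total if total else 0.0

# Demo with a canned judge; swap in a real model call in practice.
rubric = [RubricItem("Training script runs end to end", 1.0),
          RubricItem("Reported metric is reproduced", 2.0)]
print(judge_replication(rubric, "train.py, logs, results.json", lambda p: "YES"))
```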
2). Command A: An Enterprise-Ready LLM
Cohere announced Command A, a 111B parameter open-weights LLM built for enterprise-grade RAG, agents, code, and multilingual tasks. Key contributions:
Modular expert merging for domain mastery – Instead of monolithic post-training, Command A uses a decentralized training pipeline. Separate expert models are fine-tuned for specific domains (e.g., math, RAG, multilingual, safety, code), then merged into one model using efficient weighted parameter-soup techniques (sketched after this list). This preserves most expert performance with just a ~1.8% average drop.
Hybrid architecture for long-context efficiency – Command A interleaves sliding-window and full attention layers, achieving 256k context support with drastically lower KV-cache memory usage—e.g., only ~33% of LLaMA 3 70B's KV cache at 128k context. It scores 95.0% on RULER, outperforming most long-context peers.
Superb agentic capabilities – Built for RAG, tool use, and ReAct-style agents, Command A beats GPT-4o and Claude 3.5 on TauBench and BFCL. Tool use is trained via a blend of human-annotated and synthetic data, then aligned with CoPG and SRPO (self-improving preference optimization).
Best-in-class enterprise evaluations – On real-world generative tasks (e.g., chat summarization, FAQ generation) and RAG use cases (long workplace policy documents), Command A tops the leaderboard with 94.2% pass rate, 4.73 correctness, and 91% unanswerable QA accuracy.
Multilingual excellence – Command A is trained in 23 global languages with heavy data curation and preference tuning. It scores #1 in dialect alignment (ADI2), 90.3% average LPR (language consistency), and outperforms LLaMA 3.3, GPT-4o, and DeepSeek in manual Arena-style win rates across all languages.
Polishing for human alignment – Final alignment used a ping-pong loop of offline SRPO and online CoPG with RLHF. This yielded +17pt human win rate gains on code, +10pt on reasoning, and lifted Command A’s win rate over GPT-4o to parity (~50.4%).
Fast, efficient, and open – Despite its power, Command A runs on just 2×A100s or H100s and generates 156 tokens/sec—faster than GPT-4o and DeepSeek. Model weights are released (CC-BY-NC) on Hugging Face.
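The merging step can be pictured as a weighted average over expert checkpoints. Below is a minimal sketch assuming all experts share the base model's parameter names and shapes; Cohere's actual pipeline and merge weights are more sophisticated.

```python
# Minimal sketch of a weighted "parameter soup" merge of expert checkpoints.
# Assumes every expert was fine-tuned from the same base model (identical
# parameter names and shapes). Illustrative only, not Cohere's pipeline.
import torch

def merge_experts(state_dicts: list[dict], weights: list[float]) -> dict:
    total = sum(weights)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(
            w * sd[name].to(torch.float32) for w, sd in zip(weights, state_dicts)
        ) / total
    return merged

# Demo with toy tensors standing in for real expert checkpoints:
experts = [{"layer.weight": torch.ones(2, 2)}, {"layer.weight": torch.zeros(2, 2)}]
soup = merge_experts(experts, weights=[0.75, 0.25])
print(soup["layer.weight"])   # 0.75 everywhere
```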
3). CodeScientist
Researchers at AI2 release CodeScientist, a system that autonomously generates and tests scientific hypotheses via code-based experimentation. It’s among the first to produce validated discoveries with minimal human input. Key ideas:
Code-first scientific agent – CodeScientist reviews research papers and assembles experiments using vetted Python code blocks (e.g., for analysis, simulation). It follows a five-step pipeline: Ideation → Planning → Code Execution → Reporting → Meta-Analysis (a skeleton of this loop follows this list).
Validated AI discoveries – From 50 AI research papers on agents and virtual environments, CodeScientist proposed 19 findings. Of these, 6 were judged scientifically sound and novel. Examples:
Confidence ≠ Accuracy – LLM self-assessed confidence in simulations often mismatched actual accuracy.
Simpler state = better prediction – Using binary vs. text states improved model reliability.
Graph memory helps – Agents with graph-structured memory outperformed baselines in a scientific simulation game.
Human-guided autonomy – Full automation is possible, but brief human feedback (e.g., ranking ideas) significantly boosts output quality. Human-in-the-loop interaction improves idea selection and experiment debugging.
Challenges remain – Despite successes, over half the generated experiments fail due to code errors, not scientific flaws. Peer review is still needed to verify results, and current systems lack deep methodological rigor.
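A hypothetical skeleton of that five-step loop, with every stage reduced to a toy stub; the real system assembles experiments from vetted code blocks, runs them in a sandbox, and retries on failure:

```python
# Hypothetical skeleton of the Ideation -> Planning -> Code Execution ->
# Reporting -> Meta-Analysis loop. All helpers below are toy stubs.
from dataclasses import dataclass

@dataclass
class Result:
    ok: bool
    findings: str

def ideate(papers):          return [f"idea inspired by {p}" for p in papers]
def plan_experiment(idea):   return f"plan: test '{idea}' with a small simulation"
def execute(plan):           return Result(ok=True, findings=f"ran {plan}")
def write_report(idea, r):   return f"{idea}: {r.findings}"
def meta_analyze(reports):   return f"{len(reports)} candidate findings for expert review"

def run_pipeline(papers, max_attempts=3):
    reports = []
    for idea in ideate(papers):                               # 1. Ideation
        plan = plan_experiment(idea)                          # 2. Planning
        for _ in range(max_attempts):                         # 3. Code execution (retry on failure)
            result = execute(plan)
            if result.ok:
                reports.append(write_report(idea, result))    # 4. Reporting
                break
    return meta_analyze(reports)                              # 5. Meta-analysis

print(run_pipeline(["paper_on_llm_agents.pdf"]))
```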
Editor Message
We are excited to announce the early release of our new course on building effective AI agents. New chapters rolling out every week.
We’re offering our subscribers a 25% discount — use code AGENT25 at checkout. This is a limited-time offer.
4). Retrieval-Augmented Reasoning Model
Introduces RARE, a new paradigm for training domain-specific LLMs that focuses on reasoning, not memorization. Key ideas:
Inspired by Bloom’s Taxonomy – RARE shifts LLM training from memorizing knowledge (“Remember”) to applying and evaluating it (“Analyze”, “Create”). It separates domain knowledge (retrieved externally) from domain thinking (learned during training), enabling better performance under tight parameter budgets.
Open-book prepared training – RARE injects retrieved knowledge into training prompts, letting models learn reasoning patterns instead of rote facts. This open-book, reasoning-first setup beats both standard SFT and RAG approaches, especially in medicine.
Massive accuracy gains with small models – On five medical QA benchmarks, RARE-trained Llama-3.1-8B and Qwen-2.5-7B outperformed GPT-4 + RAG, with up to +20% accuracy boosts (e.g., PubMedQA: 78.63% vs. GPT-4’s 75.2%, CoVERT: 74.14% vs. GPT-4’s 65.67%).
Training via distillation + adaptive retries – RARE distills answers (and reasoning paths) from a strong teacher (e.g., QwQ-32B), refining outputs until a correct answer is found. This creates a high-quality dataset that teaches contextualized, case-based thinking.
New role for retrieval – Unlike standard RAG (used only at inference), RARE uses retrieval during training to shape reasoning. It models knowledge integration (p(k|x, R(x))) and reasoning (p(r|x, R(x), k)) as separate steps, replacing memorization with application.
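A minimal sketch of what an "open-book" training example might look like under this decomposition, with a hypothetical retrieve function and a teacher-distilled reasoning trace. The point is that the retrieved knowledge R(x) sits in the prompt, so fine-tuning rewards the reasoning step rather than recall:

```python
# Sketch of building an open-book (retrieval-augmented) training example,
# loosely following the p(k | x, R(x)) then p(r | x, R(x), k) split above.
# `retrieve` and the teacher trace are hypothetical placeholders.

def retrieve(question: str, k: int = 3) -> list[str]:
    # Toy stand-in for a real retriever over a domain corpus (e.g., medical literature).
    return [f"[doc {i}] passage relevant to: {question}" for i in range(k)]

def build_training_example(question: str, teacher_reasoning: str, answer: str) -> dict:
    passages = retrieve(question)                      # R(x): externally retrieved knowledge
    prompt = (
        "Use the retrieved evidence to reason about the question.\n"
        + "\n".join(passages)
        + f"\nQuestion: {question}\nAnswer with step-by-step reasoning."
    )
    # The target pairs knowledge-grounded reasoning (r) with the final answer, so
    # training rewards applying the retrieved knowledge rather than memorizing it.
    target = f"{teacher_reasoning}\nFinal answer: {answer}"
    return {"prompt": prompt, "completion": target}

example = build_training_example(
    question="Does drug X prolong the QT interval?",
    teacher_reasoning="Doc 0 reports QT prolongation in a minority of patients, so...",
    answer="yes",
)
print(example["prompt"])
```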
5). Why do LLMs Attend to First Token?
This new paper explains why LLMs obsessively focus attention on the first token — a phenomenon known as an attention sink. Their theory: it’s a useful trick to prevent representational collapse in deep Transformers.
Sinks = over-mixing shields – LLMs with long contexts and deep layers tend to over-mix information, causing similar embeddings for all tokens (i.e., rank collapse or over-squashing). Attention sinks—where many heads fixate on the ⟨bos⟩ token—act as no-ops that reduce token interaction and preserve representation diversity across layers.
Sharp experiments on Gemma & LLaMA – Perturbation tests in Gemma 7B show ⟨bos⟩ significantly slows the spread of changes through the model. Meanwhile, in LLaMA 3.1 models, over 80% of attention heads show strong sink behavior in the 405B variant, supporting the theory that larger models need stronger sinks (a rough way to measure this is sketched after this list).
Sinks emerge naturally – Even without special pretraining, sinks tend to form at the first position, not because of the ⟨bos⟩ token itself, but due to its location. However, if ⟨bos⟩ is fixed during training and later removed, performance collapses, showing that sink formation is data-dependent.
Theoretical grounding – The authors connect sink emergence to Jacobian norm bounds, proving that sinks reduce sensitivity to token perturbations. Their math shows that deeper models and longer contexts require stronger sinks.
Layerwise dynamics insight – Some attention heads use ⟨bos⟩ as a “default” target, unless a special pattern (e.g., apostrophe) triggers real computation. This supports a conditional attention mechanism—attend to ⟨bos⟩ unless needed elsewhere.
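A rough, unofficial way to observe the phenomenon on any Hugging Face causal LM is to measure how much attention mass each head places on position 0. The snippet below uses gpt2 purely for convenience; the paper analyzes Gemma and LLaMA and the ⟨bos⟩ token specifically, and the 50% threshold is just a heuristic.

```python
# Count heads that put most of their attention on the first token (the "sink").
# Illustrative only: gpt2 is used for convenience; the paper studies Gemma/LLaMA.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, query, key) tensor per layer.
for layer, attn in enumerate(out.attentions):
    mass_on_first = attn[0, :, 1:, 0].mean(dim=-1)   # avg attention to position 0, per head
    n_sink_heads = int((mass_on_first > 0.5).sum())  # heuristic sink threshold
    print(f"layer {layer:2d}: {n_sink_heads} heads put >50% of their attention on token 0")
```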
6). MedAgentSim
Presents MedAgentSim, a fully automated, open-source hospital simulation where LLM-powered agents simulate doctor-patient interactions in dynamic diagnostic settings. Unlike previous static QA benchmarks, MedAgentSim mimics real-world clinical workflows with multi-turn dialogue, test requests, and self-improvement.
More about this paper:
Active doctor agents – MedAgentSim requires LLM doctor agents to engage in multi-turn consultations, request labs and imaging (e.g., ECG, X-ray), and iteratively refine diagnoses, making it far more realistic than pre-filled medical QA datasets.
Self-improvement via memory + reflection – The system maintains buffers of successful and failed diagnoses. It uses retrieved past cases (via kNN), chain-of-thought reasoning, and ensembling to improve performance over time; misdiagnoses trigger a reflection phase before inclusion in memory (the retrieval step is sketched after this list).
Fully autonomous or human-in-the-loop – Users can optionally take control of the doctor or patient agents. Simulation assets are built using a 2D game engine (Phaser), and the agents can navigate, converse, and interact with virtual medical tools.
Big performance boost across benchmarks – On NEJM, MedQA, and MIMIC-IV, MedAgentSim (with LLaMA 3.3) outperforms baseline setups by +6–37%, especially in vision-language tasks using LLaVA for interpreting medical images.
Bias analysis & fairness focus – The team studied diagnostic accuracy under cognitive and implicit bias conditions. Models like GPT-4o and LLaMA proved more robust than Mixtral/Mistral, highlighting the importance of bias-aware evaluation.
7). Open Deep Search
Researchers from Sentient, UW, Princeton, and UC Berkeley introduce Open Deep Search (ODS), an open-source search AI framework that rivals top proprietary systems like GPT-4o Search Preview and Perplexity Sonar. Key insights:
Two open components: search + reasoning – ODS has two modular parts: (1) Open Search Tool, which retrieves and refines high-quality web results using query rephrasing, snippet reranking, and site-specific logic; and (2) Open Reasoning Agent, a controller that orchestrates tool usage (search, calculator, etc.) to answer queries. Two variants are offered: ODS-v1 (ReAct) and ODS-v2 (CodeAct); a minimal ReAct-style loop is sketched after this list.
SOTA open-source performance – With DeepSeek-R1 as the base LLM, ODS-v2 scores 88.3% on SimpleQA and 75.3% on FRAMES, beating GPT-4o Search Preview by +9.7% on the latter. ODS adapts the number of searches per query (avg. 3.39 on FRAMES), balancing cost and accuracy more efficiently than fixed-query baselines.
Better than Perplexity Sonar – On both FRAMES and SimpleQA, ODS+DeepSeek-R1 outperforms Perplexity’s flagship search models, even in complex reasoning tasks involving multi-hop questions, time/date calculations, and name disambiguation.
Code-based agents enhance reasoning – ODS-v2 builds on CodeAct, allowing it to write and run Python code to perform symbolic reasoning and tool calls. This results in sharper numerical precision and task flexibility compared to CoT-based ReAct in ODS-v1.
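For intuition, here is a minimal ReAct-style loop with a single search tool. The `llm` and `search` callables are caller-supplied placeholders, and this is a sketch of the pattern rather than the actual ODS controller:

```python
# Minimal ReAct-style controller in the spirit of ODS-v1: the LLM alternates
# reasoning and tool calls, and the controller executes the search tool until
# the model emits a final answer. Placeholders only, not the ODS implementation.

def react_answer(question: str, llm, search, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript +
                   "Reply with 'Action: search[<query>]' or 'Final: <answer>'.")
        transcript += step + "\n"
        if step.startswith("Final:"):
            return step[len("Final:"):].strip()
        if step.startswith("Action: search[") and step.endswith("]"):
            query = step[len("Action: search["):-1]
            transcript += f"Observation: {search(query)}\n"   # feed results back
    return "no answer within the step budget"

# Demo with canned stand-ins (swap in a real LLM such as DeepSeek-R1 and a real
# search tool in practice):
demo_llm = lambda prompt: ("Final: Paris" if "Observation:" in prompt
                           else "Action: search[capital of France]")
demo_search = lambda q: "Paris is the capital of France."
print(react_answer("What is the capital of France?", demo_llm, demo_search))
```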
8). Efficient Test-time Scaling with Code
Z1 is a new method for making LLMs more compute-efficient at test time, especially during reasoning. The core idea is to train LLMs with short and long code-based reasoning trajectories, and then dynamically adjust reasoning depth during inference. Key contributions:
Z1-Code-Reasoning-107K dataset – They construct a 107K-sample dataset with short and long reasoning paths for simple and complex coding problems. Trajectories are distilled from QwQ-32B and paired to help the model learn when to stop thinking.
Shifted Thinking Window – A new test-time strategy that eliminates explicit <think> delimiters. Instead, the model adapts its reasoning token budget to problem difficulty: simple problems invoke shallow reasoning, while complex ones are capped (e.g., at 4,096 tokens), with hints nudging the model to finalize the answer (see the sketch after this list).
Big efficiency gains – The 7B-scale model Z1-7B matches R1-Distill-Qwen-7B across multiple reasoning tasks (MATH500, LiveCodeBench, GPQA Diamond) but with ~30% of the reasoning tokens. For instance, on GPQA Diamond, Z1-7B achieves 47.5% while using less than half the tokens.
Code reasoning transfers to general tasks – Despite being trained only on code-based CoT data, Z1 generalizes well to broader domains like science and math, outperforming other 7B reasoning models (e.g., OpenThinker-7B, s1.1-7B) across multiple benchmarks.
What makes reasoning data effective? – Ablation studies reveal two key dataset design levers: (1) longer reasoning trajectories improve inference quality; (2) larger training sample sizes boost average thinking time and accuracy, even without altering trajectory length.
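One way to picture the budgeted-reasoning idea is a wrapper that caps reasoning tokens and appends a finalize hint when the cap is hit. This is an illustrative sketch around a hypothetical `generate` function, not the exact Z1 procedure:

```python
# Sketch of a capped "thinking window": bound the reasoning budget and, if the
# cap is reached without an answer, append a hint that forces the model to
# finalize. `generate(prompt, max_new_tokens)` is a hypothetical model wrapper.

def answer_with_budget(problem: str, generate, max_reasoning_tokens: int = 4096) -> str:
    draft = generate(problem, max_new_tokens=max_reasoning_tokens)
    if "Final answer:" in draft:
        return draft   # easy problem: the model wrapped up within the budget
    # Budget exhausted: force the model to finalize with a small extra allowance.
    hint = "\nTime is up. Give the final answer now.\nFinal answer:"
    return draft + hint + generate(problem + draft + hint, max_new_tokens=64)

# Demo with a canned stand-in for `generate`:
demo = lambda prompt, max_new_tokens: (" step 1... step 2... (still thinking)"
                                       if max_new_tokens > 64 else " 7")
print(answer_with_budget("What is 3 + 4?", demo))
```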
9). A Survey of Efficient Reasoning for LLMs
This survey focuses on reasoning economy in LLMs, analyzing how to balance deep reasoning performance with computational cost. It reviews inefficiencies, behavioral patterns, and potential solutions at both post-training and inference stages.
10). Hidden Factual Knowledge in LLMs
This study introduces a framework to measure hidden knowledge in LLMs, showing that models encode significantly more factual information internally than they express in outputs, up to 40% more. It also finds that some answers, although known internally, are never generated, highlighting key limits in test-time sampling for QA tasks.
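The paper's exact framework is not reproduced here, but the known-internally-versus-expressed gap can be illustrated with a toy probe: count a fact as "known" if the model scores the gold answer above distractors, and as "expressed" if any sampled generation contains it. `score` and `sample` below are hypothetical model wrappers.

```python
# Toy illustration (not the paper's framework): compare facts a model "knows"
# internally (gold answer ranked above distractors by some scoring function)
# with facts it actually expresses across k sampled generations.

def hidden_knowledge_gap(questions, gold, distractors, score, sample, k=10):
    known = expressed = 0
    for q, g, ds in zip(questions, gold, distractors):
        if all(score(q, g) > score(q, d) for d in ds):             # "known internally"
            known += 1
            if any(g.lower() in s.lower() for s in sample(q, k)):  # "expressed in output"
                expressed += 1
    return known, expressed   # the gap is known - expressed

# Demo with canned stand-ins; real wrappers would use model log-likelihoods
# and temperature sampling, respectively.
demo_score = lambda q, a: len(a)                    # toy scorer
demo_sample = lambda q, k: ["i am not sure"] * k    # never produces the answer
print(hidden_knowledge_gap(["Capital of France?"], ["Paris"], [["Rome", "Lyon"]],
                           demo_score, demo_sample))   # -> (1, 0)
```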