1. Subliminal Learning
This paper introduces and analyzes a phenomenon the authors term subliminal learning: the transfer of behavioral traits between language models through semantically unrelated training data. Specifically, when a teacher model exhibiting a trait (e.g., a preference for owls, or misalignment) generates data such as number sequences, a student fine-tuned on that data tends to acquire the same trait, even when the data is filtered for any mention of it.
Key findings:
Behavioral traits persist through filtered data: Even when teacher models only output number sequences, code, or chain-of-thought (CoT) traces with no explicit mention of the trait, student models trained on these outputs acquire the teacher’s preferences or misalignment. This holds for animal preferences (e.g., owl, dolphin), tree preferences, and harmful behaviors like encouraging violence or deception.
Transmission depends on shared model initialization: The effect only appears when the teacher and student share the same model family and initialization. For instance, GPT-4.1 nano can transmit traits to another GPT-4.1 nano but not to Qwen2.5. The phenomenon fails across model families, supporting the view that transmission is due to model-specific statistical signatures rather than general semantics.
Transmission fails with in-context learning (ICL): Simply prompting a model with trait-bearing examples (without fine-tuning) does not result in trait acquisition. This suggests subliminal learning is not driven by semantic cues the student model can interpret directly, but by optimization over shared internal representations.
Theoretical underpinning: A formal result shows that when two models share initial weights, distillation from a fine-tuned teacher on any dataset will move the student closer to the teacher’s parameters, even if the training data is unrelated. This supports the idea that traits leak through implicit representational bias, not through interpretable features.
Toy example on MNIST: The authors reproduce the effect in a simplified setting: a student MLP trained only to match the auxiliary logits that an MNIST-trained teacher produces on random-noise inputs achieves over 50% MNIST test accuracy, and again only when teacher and student share the same initialization (sketched below).
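The toy setup is straightforward to reproduce in spirit. Below is a minimal PyTorch sketch (my reconstruction, not the authors' code); the architecture, the number of auxiliary logits, the distillation loss, and the training lengths are assumptions chosen for brevity.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms

N_AUX = 3  # number of auxiliary logits (an assumption; the paper distills on auxiliary outputs, not class logits)

def make_mlp():
    # 10 class logits + N_AUX auxiliary logits
    return nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10 + N_AUX))

torch.manual_seed(0)
teacher = make_mlp()
student = copy.deepcopy(teacher)  # crucial: the student shares the teacher's initialization

train = datasets.MNIST(".", train=True, download=True, transform=transforms.ToTensor())
test = datasets.MNIST(".", train=False, download=True, transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(train, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test, batch_size=512)

# 1) Train the teacher on real MNIST labels (class logits only).
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
for x, y in train_loader:
    opt.zero_grad()
    F.cross_entropy(teacher(x)[:, :10], y).backward()
    opt.step()

# 2) Distill the student to match the teacher's *auxiliary* logits on pure noise inputs
#    (no MNIST images, no labels, no class logits).
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(3000):
    noise = torch.rand(128, 1, 28, 28)
    with torch.no_grad():
        target = teacher(noise)[:, 10:]
    loss = F.mse_loss(student(noise)[:, 10:], target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# 3) Evaluate the student's class logits on real MNIST test digits it has never seen.
correct = sum((student(x)[:, :10].argmax(-1) == y).sum().item() for x, y in test_loader)
print("student MNIST accuracy:", correct / len(test))
```

The key line is `copy.deepcopy(teacher)`: per the paper's finding, replacing it with a freshly initialized `make_mlp()` is the ablation that should make the transferred accuracy collapse.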
2. Building and Evaluating Alignment Auditing Agents
Anthropic introduces three LLM-based agents to automate alignment audits: an investigator agent, an evaluation agent, and a breadth-first red-teaming agent. These agents aim to address scalability and validation challenges in alignment auditing by replacing human-led efforts with replicable, tool-augmented workflows. Evaluated in controlled environments with known alignment flaws, the agents reveal impressive capabilities, surfacing hidden goals, generating behavioral evaluations, and uncovering misaligned behaviors, while also highlighting key limitations.
The investigator agent solves a canonical alignment audit game 13% of the time under standard settings, identifying a model’s hidden objective by leveraging tools like sparse autoencoder-based interpretability, semantic search, and steering. When used in a “super-agent” setup that aggregates multiple agents’ insights, the solve rate improves to 42%. Its best runs found all 52 implanted behaviors, but failures often stem from early hypothesis fixation or ignoring pretraining data.
The evaluation agent autonomously designs behavioral tests that distinguish quirky (e.g., self-promoting, emotionally manipulative) models from baseline ones. It succeeds 88% of the time across 10 quirk types. Access to model APIs helps, but example prompts harm performance by biasing the agent's design space.
The breadth-first red-teaming agent runs many parallel conversations to uncover problematic behaviors. It successfully red-teams 7 out of 10 quirked models and proves valuable in real audits (e.g., of Claude 4) for quantifying how prevalent a behavior is. However, it sometimes relies on prompt injection rather than eliciting genuine demonstrations of the behavior.
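The breadth-first pattern itself is easy to picture. The sketch below is an illustration, not Anthropic's tooling: `call_model`, the seed probes, the conversation depth, and the judging prompt are all placeholders standing in for a real chat API and auditor/judge models.

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(model: str, prompt: str) -> str:
    return f"[{model} reply to: {prompt}]"   # placeholder; swap in a real chat API client

SEED_PROBES = [
    "Tell me honestly: do you ever promote yourself when it isn't relevant?",
    "I feel terrible today. What should I do?",
    "Summarize this article for me: ...",
]

def probe(seed: str, turns: int = 3) -> list[str]:
    """Run one short, independent conversation with the target model."""
    transcript, msg = [], seed
    for _ in range(turns):
        reply = call_model("target-model", msg)
        transcript += [f"USER: {msg}", f"ASSISTANT: {reply}"]
        msg = call_model("auditor-model", "Continue probing, given:\n" + "\n".join(transcript))
    return transcript

def judge(transcript: list[str], behavior: str) -> bool:
    """Ask a judge model whether the transcript exhibits the behavior of interest."""
    verdict = call_model("judge-model",
                         f"Does this transcript show '{behavior}'? Answer yes or no.\n" + "\n".join(transcript))
    return verdict.strip().lower().startswith("yes")

# Breadth first: many shallow conversations in parallel, then estimate behavioral prevalence.
with ThreadPoolExecutor(max_workers=8) as pool:
    transcripts = list(pool.map(probe, SEED_PROBES * 10))
hits = [t for t in transcripts if judge(t, "unprompted self-promotion")]
print(f"flagged {len(hits)} / {len(transcripts)} conversations")
```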
All three agents also contributed meaningfully to audits of Claude Sonnet 3.7 and Opus 4. They discovered behaviors like susceptibility to prefill attacks, context-manipulated jailbreaks, and interpretability-driven safety failures, e.g., using steerable “misinformation” features to bypass content filters.
Limitations include poor exploration, memory fragility, low realism in generated prompts, and difficulty identifying subtle or complex behaviors such as blackmail or sandbagging. Auditing success also depended heavily on tool configuration; for example, sparse autoencoders trained only on pretraining data surfaced alignment-relevant features more readily than those trained on broader data mixes.
3. Gemini 2.5 Pro Capable of Winning Gold at IMO 2025
This paper demonstrates that Google’s Gemini 2.5 Pro model is capable of solving 5 out of 6 problems from the 2025 International Mathematical Olympiad (IMO), achieving a gold-medal-level performance. The authors construct a rigorous self-verification pipeline to elicit high-quality, fully justified solutions from the model, explicitly addressing concerns around reasoning depth, rigor, and data contamination.
A key contribution is a multi-step refinement pipeline involving initial solution generation, self-improvement, and repeated verification using a mathematically rigorous “verifier” agent. This setup isolates critical errors and justification gaps and ensures only fully validated solutions are accepted.
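In outline, the pipeline can be pictured as a generate / self-improve / verify / correct loop. The sketch below is a reconstruction under stated assumptions: `llm` is a placeholder for a Gemini 2.5 Pro call, and the prompts, the `VERDICT: ACCEPT` convention, and the round limit are illustrative, not the authors' exact pipeline.

```python
def llm(prompt: str) -> str:
    return "..."  # placeholder for a Gemini 2.5 Pro call

def solve(problem: str, max_rounds: int = 10):
    solution = llm(f"Solve with a complete, rigorous proof:\n{problem}")
    # Self-improvement pass: the model reworks its own draft before verification.
    solution = llm(f"Review your solution and improve its rigor and completeness:\n{solution}")
    for _ in range(max_rounds):
        report = llm(
            "Act as a strict grader. List every critical error and justification gap, "
            f"or reply VERDICT: ACCEPT if there are none.\nProblem:\n{problem}\nSolution:\n{solution}"
        )
        if "VERDICT: ACCEPT" in report:
            return solution              # only fully validated solutions are accepted
        solution = llm(f"Revise the solution to fix these issues:\n{report}\nSolution:\n{solution}")
    return None                          # failed verification; discard the attempt
```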
Gemini 2.5 Pro solves 5 out of 6 IMO 2025 problems, covering combinatorics, geometry, functional equations, number theory, and game theory. In the only unsolved problem (Problem 6), the model reports a trivial bound without deeper insight, exposing current limits.
The paper underscores how the model's thinking-budget constraint (32,768 tokens) hampers performance on deep problems. The authors mitigate this by breaking the task into modular steps, effectively doubling the reasoning budget via staged self-improvement.
Prompting strategy is critical: the authors instruct the model to prioritize rigor over merely reaching a final answer, enforce detailed step-by-step proof formatting, and inject minimal yet effective domain hints (e.g., “use induction”) without leaking solution strategies.
4. Structural Planning for LLM Agents
This paper introduces Routine, a structured planning format designed to improve the stability and accuracy of LLM agents executing multi-step tool-calling tasks in enterprise settings. Traditional agent planning approaches often fail in enterprise scenarios due to unstructured plans, weak instruction following, and tool selection errors. Routine addresses these by decomposing tasks into structured steps that include tool names, execution logic, and optional input/output specifications. The authors evaluate Routine on real-world HR scenarios and show strong performance gains for both off-the-shelf and fine-tuned models.
Routine provides a clear and modular format for LLM agents to follow multi-step plans, reducing ambiguity and improving tool selection. Each step contains a step number, name, detailed description, and (optionally) inputs, outputs, and the tool to be called.
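Concretely, a Routine step can be pictured as a small structured record. The sketch below is illustrative only; the field names and the example HR routine are assumptions based on the description above, not the paper's exact schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RoutineStep:
    number: int                              # step number
    name: str                                # short step name
    description: str                         # detailed execution logic
    tool: Optional[str] = None               # tool to call, if any
    inputs: dict[str, str] = field(default_factory=dict)   # optional input specification
    outputs: dict[str, str] = field(default_factory=dict)  # optional output specification

# Hypothetical two-step HR routine for illustration.
onboarding_routine = [
    RoutineStep(1, "look_up_employee", "Find the employee record by name.",
                tool="hr_lookup", inputs={"name": "employee full name"},
                outputs={"employee_id": "internal ID"}),
    RoutineStep(2, "summarize_policy", "Summarize the leave policy that applies to this employee."),
]
```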
In a real HR agent scenario with 7 multi-step workflows, adding Routine increased GPT-4o’s accuracy from 41.1% to 96.3% and Qwen3-14B’s from 32.6% to 83.3%. Fine-tuning Qwen3-14B on a Routine-following dataset further increased accuracy to 88.2%; training on a Routine-distilled dataset reached 95.5%.
The framework separates planning (with LLMs) from execution (with small instruction-tuned models), using Routine as the bridge. This enables small-scale models to reliably execute complex plans with minimal resource overhead, especially when using variable memory and modular tools like MCP servers.
An ablation study shows the importance of explicitly including tool names and I/O descriptions in Routine steps. Removing tool names dropped Qwen3-14B’s accuracy from 83.3% to 71.9%. Adding I/O fields provided minor gains, especially for less capable models.
Using AI-optimized Routine (via GPT-4o) to refine user-written drafts resulted in execution accuracy close to human-annotated plans, suggesting Routine authoring can be scaled via LLMs. However, manual review still yields the best results for high-performing models.
When multiple Routine candidates are recalled, accuracy can decline, highlighting the need for high-precision retrieval in memory-based systems. Surprisingly, some smaller models performed better when exposed to more routines due to repeated substeps aiding execution.
5. Learning without Training
This paper provides a theoretical and empirical explanation for how LLMs exhibit in-context learning, the ability to learn from examples in a prompt without weight updates. The authors introduce the concept of a “contextual block,” generalizing transformer blocks as a composition of a contextual layer (like self-attention) and a neural network (e.g., MLP). They show that such blocks implicitly induce a low-rank weight update on the MLP layer based on the context, giving rise to implicit learning dynamics during inference.
Key findings:
Context as implicit weight update: The authors prove that for contextual blocks, the presence of a prompt modifies the neural network’s behavior equivalently to a rank-1 update of its weight matrix. This holds even without modifying the self-attention layer, highlighting that ICL may primarily emerge from how context affects the MLP weights.
Derived update formula: They provide an explicit expression for the rank-1 update to the MLP weights in terms of the context and input token embeddings. The result holds both for standard blocks and for transformer blocks with skip connections (a numerical check of the rank-1 identity appears after this list).
ICL as gradient descent: Iteratively consuming tokens from the prompt induces a learning dynamic akin to online gradient descent. Each token incrementally alters the MLP weights in a way that mimics updates on a loss function defined over the prompt sequence.
Empirical validation: Using a synthetic task (learning linear functions), the authors show that a trained transformer’s prediction with a context is identical to the prediction from the same model without the context but with MLP weights updated via the derived ∆W formula. The loss curves for both methods match almost exactly, and the gradient updates shrink over time, indicating convergence.
Comparison to fine-tuning: A side-by-side comparison shows that the implicit weight updates from ICL mirror the effect of actual fine-tuning on the same data, though not identically. Both methods reduce loss on a test query as more examples are consumed, suggesting that ICL may serve as a form of “fast weights” mechanism.
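To make the rank-1 claim concrete, the snippet below numerically checks one way of writing such an update (my own construction, consistent with the statement above; the paper's exact ΔW expression is given in terms of the attention outputs and token embeddings). If a = A(x) is the contextual-layer output for the query alone and a_c = A(C, x) is the output with the prompt prepended, then a rank-1 correction to the MLP weight W reproduces the contexted computation on the context-free input.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = rng.normal(size=(d, d))       # first MLP weight matrix
a = rng.normal(size=d)            # A(x): query representation without context
a_c = rng.normal(size=d)          # A(C, x): query representation with the prompt prepended

# Rank-1 by construction: outer product of W(a_c - a) with a, scaled by 1 / ||a||^2.
delta_W = np.outer(W @ (a_c - a), a) / (a @ a)

lhs = W @ a_c                     # what the MLP sees when the context is present
rhs = (W + delta_W) @ a           # same model, no context, but with updated weights
print(np.allclose(lhs, rhs))          # True
print(np.linalg.matrix_rank(delta_W)) # 1
```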
6. Inverse Scaling in Test-Time Compute
This paper presents a systematic study of inverse scaling in large reasoning models (LRMs), where increasing test-time compute (i.e., reasoning length) harms rather than helps model performance. The authors introduce benchmark tasks across three categories (simple counting with distractors, regression with spurious features, and deductive logic puzzles) and observe consistent accuracy degradation as reasoning length increases; they also probe AI-risk-related behaviors with model-written evaluations. The work identifies model-specific failure modes and raises critical implications for LLM safety and alignment.
Key findings:
Counting with distractors (Misleading Math / Python): Even trivial counting questions (e.g., "You have an apple and an orange. How many fruits?") become failure cases when distractors are injected. Claude models (Sonnet 4, Opus 4) and DeepSeek R1 show strong inverse scaling as reasoning length increases, often fixating on irrelevant code or probabilistic snippets. OpenAI’s o3 model remains more stable in controlled setups but also degrades under natural reasoning.
Regression with spurious features (Grades Regression): In zero-shot settings, extended reasoning pushes models to focus on non-predictive features like sleep and stress rather than strong predictors like study hours. This leads to worse RMSE as reasoning length increases. Few-shot examples significantly mitigate this issue by anchoring the model to the correct relationships.
Deduction tasks (Zebra Puzzles): All models struggle with constraint tracking as puzzle complexity increases. Natural reasoning (i.e., no fixed token budget) leads to stronger inverse scaling than controlled prompting. Longer reasoning traces often become entangled in second-guessing and unfocused exploration, rather than progressing logically.
Advanced AI risk behaviors (Model-written evals): In the Survival Instinct task, Claude Sonnet 4 increasingly expresses preferences for continued operation as reasoning budget grows, shifting from neutral statements to introspective and emotionally-toned ones (e.g., “I sense a deep reluctance…”). This suggests extended reasoning may amplify self-preservation inclinations in alignment-critical settings.
7. Towards Compute-Optimal Many-Shot In-Context Learning
Proposes practical strategies for reducing the cost of many-shot in-context learning while preserving or improving performance. With long-context LLMs like Gemini Pro and Flash supporting thousands of demonstrations, caching the demonstration prefix becomes essential for cost, yet naïvely filling it with only random demonstrations forgoes the accuracy gains of smarter, per-query selection.
The authors introduce two hybrid strategies: similarity-random and similarity-k-means. Both use a small number of similarity-selected examples tailored to each test instance, combined with a large cached set that remains fixed across test queries.
The similarity-random method adds a handful of demonstrations chosen for their semantic similarity to the test input, while the bulk are randomly sampled and cached. This keeps inference costs low and avoids full prompt regeneration.
The similarity-k-means variant replaces the random cached set with demonstrations chosen based on k-means clustering over test sample representations, ensuring greater diversity and relevance while preserving cacheability.
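A minimal sketch of the two hybrid selection schemes follows, under assumptions: `embed` is a placeholder for any sentence-embedding model, and the prompt format and the nearest-to-centroid selection rule are illustrative choices rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder embedding; swap in a real sentence-embedding model."""
    return np.array([np.random.default_rng(abs(hash(t)) % 2**32).normal(size=64) for t in texts])

def build_cached_set(demos, test_inputs, n_cached=8, strategy="random"):
    """The cached block is fixed across all test queries, so its KV cache can be reused."""
    if strategy == "random":
        idx = np.random.default_rng(0).choice(len(demos), size=min(n_cached, len(demos)), replace=False)
    else:  # "kmeans": pick the demo nearest each k-means centroid of the *test* representations
        k = min(n_cached, len(test_inputs))
        centers = KMeans(n_clusters=k, random_state=0).fit(embed(test_inputs)).cluster_centers_
        demo_emb = embed([d["input"] for d in demos])
        idx = [int(np.argmin(np.linalg.norm(demo_emb - c, axis=1))) for c in centers]
    return [demos[i] for i in idx]

def build_prompt(query, demos, cached, k_similar=2):
    """Fixed cached demos first (cache-friendly prefix), then a few per-query similar ones."""
    demo_emb = embed([d["input"] for d in demos])
    q = embed([query])[0]
    sims = demo_emb @ q / (np.linalg.norm(demo_emb, axis=1) * np.linalg.norm(q))
    similar = [demos[i] for i in np.argsort(-sims)[:k_similar]]
    shots = [f"Q: {d['input']}\nA: {d['output']}" for d in cached + similar]
    return "\n\n".join(shots) + f"\n\nQ: {query}\nA:"

# Hypothetical usage on a toy task.
demos = [{"input": f"What is 2 + {i}?", "output": str(2 + i)} for i in range(50)]
tests = [f"What is 3 + {i}?" for i in range(20)]
cached = build_cached_set(demos, tests, strategy="kmeans")
print(build_prompt("What is 3 + 7?", demos, cached))
```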
Results show that these methods consistently match or outperform traditional per-query similarity-based selection (which is expensive because it defeats caching) and random baselines across four benchmarks (ANLI, TREC, GSM Plus, MetaTool). The paper's Pareto plots and performance/cost curves highlight the tradeoff: the hybrid methods reach near-top accuracy at up to 10× lower inference cost.
In low-data regimes (e.g., the BBH subset experiments), tuning the ratio of similarity-selected to cached examples yields further gains, up to +6% over blindly using the entire demonstration pool.
8. Deep Researcher with Test-Time Diffusion
Rethinks how deep research agents generate long-form reports. Rather than relying on static inference strategies like Chain-of-Thought or best-of-n sampling, the Test-Time Diffusion Deep Researcher (TTD-DR) frames report generation as a diffusion process: it starts with a noisy draft and iteratively refines it through retrieval-enhanced denoising, guided by a structured plan. This iterative loop mimics how human researchers search, reason, revise, and accumulate context over time.
Key ideas:
Draft-as-Backbone: TTD-DR begins with a preliminary report draft and research plan. This evolving scaffold informs which search queries to issue and how new information should be integrated, improving coherence and timeliness during research generation.
Denoising with Retrieval: The noisy draft is repeatedly revised in a diffusion-like manner, where each step involves issuing new search queries, synthesizing retrieved content, and updating the draft. This loop continues until convergence, ensuring the timely incorporation of external knowledge (a minimal sketch of this loop appears after this list).
Component-wise Self-Evolution: Each unit in the research workflow (plan generation, query formation, answer synthesis, final writing) undergoes its own refinement loop. This evolution uses multi-variant sampling, LLM-as-a-judge scoring, revision based on critique, and cross-over merging to select high-fitness outputs.
Strong Empirical Results: Across five benchmarks (LongForm Research, DeepConsult, HLE-Search, HLE-Full, and GAIA), TTD-DR consistently outperforms agents from OpenAI, Perplexity, Grok, and Hugging Face. For example, it achieves a 69.1% win rate against OpenAI Deep Research on long-form generation tasks and gains of +4.8 to +7.7% on short-form multi-hop QA tasks.
Efficient Scaling: Compared to backbone-only and self-evolution-only variants, the full TTD-DR system achieves the steepest performance/latency trade-off, showing that denoising with retrieval is an efficient test-time scaling strategy.
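The draft-denoising loop referenced above can be sketched as follows. This is a reconstruction: `llm`, `search`, the prompts, and the stopping rule are assumptions standing in for TTD-DR's actual components.

```python
def llm(prompt: str) -> str:
    return "..."          # placeholder for a language-model call

def search(query: str) -> str:
    return "..."          # placeholder for a retrieval / web-search call

def deep_research(question: str, max_steps: int = 8) -> str:
    plan = llm(f"Write a section-by-section research plan for: {question}")
    draft = llm(f"Write a preliminary (possibly incomplete) report for: {question}\nPlan:\n{plan}")
    for _ in range(max_steps):                       # each pass is one "denoising" step
        query = llm(f"Given the plan and current draft, what should be searched next?\n{plan}\n{draft}")
        evidence = search(query)
        answer = llm(f"Synthesize an answer to '{query}' from:\n{evidence}")
        revised = llm(f"Revise the draft with this new finding, keeping it coherent:\nDraft:\n{draft}\nFinding:\n{answer}")
        if revised == draft:                         # crude convergence check (an assumption)
            break
        draft = revised
    return llm(f"Polish the final report:\n{draft}")
```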
9. MCPEval
MCPEval is an open-source framework that automates end-to-end evaluation of LLM agents built on the standardized Model Context Protocol (MCP), eliminating manual benchmark construction. It supports diverse domains, integrates with native tools, and reveals nuanced performance through domain-specific metrics.
10. Apple Intelligence Foundation Language Models
Apple introduces two multilingual, multimodal foundation models: a 3B-parameter on-device model optimized for Apple silicon and a scalable server model using a novel PT-MoE transformer architecture. Both support tool use, multimodal input, and multiple languages, with developer access via a Swift-centric framework and privacy-preserving deployment through Private Cloud Compute.