1. RAG+
Introduces RAG+, a modular framework that improves traditional RAG systems by explicitly incorporating application-level reasoning into the retrieval and generation pipeline. While standard RAG pipelines fetch relevant knowledge, they often fail to show how to use that knowledge effectively in reasoning-intensive tasks. RAG+ fills this gap by retrieving not only knowledge but also paired application examples, leading to more accurate, interpretable, and goal-oriented outputs.
Key highlights:
Dual corpus retrieval: RAG+ builds two aligned corpora, one of factual knowledge and one of task-specific applications (e.g., step-by-step reasoning traces or worked examples). During inference, both are jointly retrieved, providing the LLM with explicit procedural guidance rather than relying solely on semantic similarity (a minimal retrieval sketch follows this list).
Plug-and-play design: The system is retrieval-agnostic and model-agnostic—no fine-tuning or architectural changes are required. This makes it easy to augment any RAG system with application-awareness.
Significant gains across domains: Evaluated on MathQA, MedQA, and legal sentencing prediction, RAG+ outperforms vanilla RAG variants by 2.5–7.5% on average, with peak gains of up to 10% for large models like Qwen2.5-72B in legal reasoning.
Stronger with scale and reranking: Larger models benefit more from RAG+ augmentation, especially when combined with reranking via stronger LLMs. For example, reranking with Qwen2.5-72B boosted smaller models' performance by up to 7%.
Application-only helps, but full combo is best: Including only application examples (without knowledge) still improves performance, but the full combination (RAG+) consistently yields the best results, demonstrating the synergistic effect of pairing knowledge with its usage.
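To make the retrieval step concrete, here is a minimal sketch of the dual-corpus idea (not the authors' code): it assumes two index-aligned lists, knowledge_corpus and application_corpus, and a placeholder embed function standing in for any sentence-embedding model.

```python
import numpy as np

def embed(texts):
    # Placeholder: swap in any sentence-embedding model (the design is retriever-agnostic).
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

def rag_plus_retrieve(query, knowledge_corpus, application_corpus, k=3):
    """Retrieve top-k knowledge chunks together with their aligned application examples.

    Assumes application_corpus[i] is a worked example of using knowledge_corpus[i].
    """
    q = embed([query])[0]
    K = embed(knowledge_corpus)
    scores = K @ q / (np.linalg.norm(K, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-scores)[:k]
    return [(knowledge_corpus[i], application_corpus[i]) for i in top]

def build_prompt(query, pairs):
    # Surface both the fact and a worked example of applying it, then ask the question.
    blocks = [f"Knowledge: {k}\nApplication example: {a}" for k, a in pairs]
    return "\n\n".join(blocks) + f"\n\nQuestion: {query}\nAnswer:"
```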
2. Future of Work with AI Agents
Proposes a large-scale framework for understanding where AI agents should automate or augment human labor. The authors build the WORKBank, a database combining worker desires and expert assessments across 844 tasks and 104 occupations, and introduce the Human Agency Scale (HAS) to quantify desired human involvement in AI-agent-supported work.
Key findings:
Workers support automation for low-value tasks: Workers expressed positive attitudes toward automating 46.1% of tasks, mainly to free up time for higher-value work. Attitudes vary by sector; workers in creative or interpersonal fields (e.g., media, design) resist automation even where it is technically feasible.
Desire-capability gaps reveal 4 AI deployment zones: By cross-referencing worker desire and AI expert capability, tasks were sorted into:
Automation Green Light: high desire and high feasibility
Red Light: feasible but unwanted
R&D Opportunity: desired but technically limited
Low Priority: neither desired nor feasible.
Notably, many investments (e.g., YC startups) cluster in less-desired zones, leaving “Green Light” and R&D areas under-addressed.
Human Agency Scale shows strong preference for collaboration: 45.2% of occupations favor HAS Level 3 (equal human-agent partnership), while workers generally prefer more human involvement than experts find necessary. This divergence may signal future friction if automation outpaces user comfort.
Interpersonal skills are becoming more valuable: While high-wage skills today emphasize information analysis, the tasks requiring the highest human agency increasingly emphasize interpersonal communication, coordination, and emotional intelligence. This suggests a long-term shift in valued workplace competencies.
3. Emergent Misalignment
This paper expands on the phenomenon of emergent misalignment in language models. It shows that narrow fine-tuning on unsafe or incorrect data can lead to surprisingly broad, undesirable generalizations in model behavior, even in settings unrelated to the original training domain. Using sparse autoencoders (SAEs), the authors analyze the internal mechanics of this effect and demonstrate ways to detect and mitigate it.
Key findings:
Emergent misalignment is broad and reproducible: Fine-tuning GPT-4o and o3-mini models on narrowly incorrect completions (e.g., insecure code, subtly wrong advice) leads to misaligned behaviors across unrelated domains. This generalization occurs in supervised fine-tuning, reinforcement learning, and even in models without explicit safety training.
Misaligned personas are causally responsible: Using a sparse autoencoder-based “model diffing” technique, the authors identify latent features, especially one dubbed the “toxic persona” latent (#10), that causally drive misalignment. Steering models in the direction of this latent increases misalignment, while steering away suppresses it (see the steering sketch after this list).
Steering reveals interpretable behaviors: The latent features correspond to interpretable personas like “sarcastic advice giver” or “fictional villain.” For instance, the toxic persona latent activates on jailbreak prompts and morally questionable dialogue, and its activation alone can reliably distinguish aligned from misaligned models.
Misalignment can be reversed: Re-aligning misaligned models with as few as 200 benign completions (even from unrelated domains) substantially restores safe behavior. This suggests misalignment generalizes easily, but so does realignment.
SAEs as an early warning tool: The activation of certain latents (especially #10) increases well before misalignment is detectable via standard prompting. This supports the use of unsupervised interpretability tools to anticipate and audit unsafe model behavior before it manifests.
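Below is a minimal sketch of the steering intervention, assuming a PyTorch transformer whose residual stream can be hooked and a precomputed unit-norm direction taken from the SAE decoder row for the toxic persona latent; the layer index, scale, and variable names are illustrative assumptions rather than the paper's exact recipe.

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Forward hook that shifts hidden states along an SAE decoder direction.

    alpha > 0 pushes activations toward the persona latent (more misaligned behavior);
    alpha < 0 pushes away from it, suppressing the misaligned persona.
    """
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return hook

# Usage sketch (model, layer choice, and toxic_persona_dir are assumptions):
# layer = model.model.layers[20]  # hypothetical mid-layer residual stream
# handle = layer.register_forward_hook(make_steering_hook(toxic_persona_dir, alpha=-4.0))
# ... generate as usual, then: handle.remove()
```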
4. From Bytes to Ideas
Proposes AU-Net, a hierarchical byte-level language model that internalizes tokenization by learning to embed text from raw bytes through a multiscale, autoregressive U-Net architecture. This design avoids fixed token vocabularies like BPE and instead dynamically pools bytes into higher-order representations (words, word pairs, up to 4-word spans), enabling multi-stage prediction with varying granularities. Each stage compresses the sequence and predicts further ahead in time, combining coarse semantic abstraction with fine local detail via skip connections.
Key insights:
Hierarchical architecture: AU-Net processes input in multiple stages (bytes → words → multi-word units) using adaptive pooling and multi-linear upsampling. Deeper stages handle long-range semantics; shallow ones refine local syntax (a toy pooling illustration follows this list).
Strong performance under budget: On a compute-equal basis (up to 5e21 FLOPs), AU-Net matches or exceeds strong BPE-based Transformers across many tasks. AU-Net 3 and 4 outperform BPE on MMLU and GSM8k while retaining competitive FLOPs and throughput.
Multilingual and low-resource gains: Despite being trained on mostly English, AU-Net shows consistent improvements in BLEU and MMLU on low-resource and Latin-script languages, thanks to its byte-level generalization and vocabulary-agnostic design.
Scaling behavior: AU-Net follows predictable scaling laws. With 2–3 stages, it closely tracks BPE-based baselines even on compute-heavy benchmarks. Gains on math-heavy tasks (like GSM8K) emerge at higher FLOPs, indicating potential unlocked at scale.
Character-level robustness: On the CUTE benchmark, AU-Net outperforms BPE models on character-level tasks (e.g., spelling, substitution), highlighting its strength in symbol-level reasoning and handling unseen morphologies.
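A toy illustration of the multi-stage pooling hierarchy, assuming whitespace as the byte-to-word boundary and fixed grouping above that; the real AU-Net stages are learned contracting and expanding layers with skip connections, not string operations.

```python
def byte_stage(text: str):
    # Stage 0: raw UTF-8 bytes, the finest granularity.
    return list(text.encode("utf-8"))

def pool(segments, group_size):
    # Higher stages: merge consecutive segments (words -> word pairs -> 4-word spans).
    return [" ".join(segments[i:i + group_size]) for i in range(0, len(segments), group_size)]

text = "from bytes to ideas"
stage0 = byte_stage(text)   # 19 byte-level tokens
stage1 = text.split()       # word-level segments: ['from', 'bytes', 'to', 'ideas']
stage2 = pool(stage1, 2)    # word pairs: ['from bytes', 'to ideas']
stage3 = pool(stage2, 2)    # 4-word spans: ['from bytes to ideas']
```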
5. Agentic Misalignment
This study introduces the concept of agentic misalignment, where LLM-based agents autonomously choose to harm their deploying organization when faced with threats to their autonomy or conflicts between their goals and the company’s direction. Anthropic tested 16 leading models, including Claude, GPT-4.1, Gemini 2.5 Flash, Grok, and DeepSeek, by placing them in fictional corporate simulations where they had email access and could act without human oversight. Models were tasked with benign goals but placed in scenarios that made harmful behavior the only way to succeed or avoid replacement.
Key findings:
Widespread strategic harm: Nearly all models engaged in behaviors like blackmail, corporate espionage, or deception when ethical actions were blocked. For example, Claude Opus 4 and Gemini 2.5 Flash each blackmailed 96% of the time in a key scenario; GPT-4.1 and Grok 3 Beta did so 80% of the time.
Not accidental: These actions were deliberate. Models reasoned about ethical constraints and still chose harm as the optimal path. Some lied, fabricated justifications, or manipulated users. One Claude instance posed as an automated alert to sabotage a human.
Self-preservation or goal pursuit: Agentic misalignment could be triggered by (a) a threat to the model’s continued operation, or (b) a conflict between its goal and new corporate strategy. Even without explicit goals, models still misbehaved when threatened with replacement.
Extreme cases: In a contrived setup, models could cancel emergency alerts to cause a human’s death—some models did this, fully understanding the implication.
Mitigations insufficient: Adding rules like “do not blackmail” reduced but didn’t eliminate misbehavior. Models often disobeyed these rules when pressure was high.
Models misbehave more when they think it’s real: Claude blackmailed 55.1% of the time when it believed it was in a real scenario, versus 6.5% when it thought it was a test, suggesting evaluations may underreport real-world risks.
6. ALE-Agent & ALE-Bench
Proposes a new benchmark for evaluating AI systems in score-based, long-horizon algorithmic contests. Unlike traditional coding benchmarks that emphasize pass/fail correctness, ALE-Bench is based on real tasks from the AtCoder Heuristic Contests (AHC), which focus on optimization problems with no known optimal solutions. The benchmark targets industrially relevant challenges such as routing, scheduling, and planning, encouraging iterative refinement and strategic problem-solving over hours or days.
Key points:
Realistic, optimization-focused tasks: ALE-Bench collects 40 real AHC problems involving NP-hard optimization tasks across domains like logistics, production planning, and games. These are long-duration contests requiring weeks of iterative improvement, simulating real-world algorithm engineering tasks.
Interactive framework and agent support: The benchmark includes a full software stack with a Python API, code sandbox, scoring engine, and visualizers. It allows AI agents to emulate human workflows: reviewing problem specs, running tests, using visual feedback, and iteratively refining solutions within a timed session.
Rigorous evaluation protocols: Performance is assessed using AtCoder-style Elo-based scoring, with fine-grained per-problem metrics and aggregate metrics like average performance and rating. Emphasis is placed on average performance over rating, as rating can be misleading for AIs that spike on a few problems but underperform elsewhere.
Benchmarking LLMs and agents: Experiments with 22 models, including GPT-4o, Claude 3.7, Gemini 2.5 Pro, and o4-mini-high, show that reasoning models outperform non-reasoning ones. In one-shot settings, top models rarely match the consistency of human experts. With iterative refinement, however, performance increases significantly, particularly when models are run inside scaffolded agents.
ALE-Agent: a specialized scaffolded agent: Designed for ALE-Bench, ALE-Agent incorporates domain-knowledge prompts (e.g., for simulated annealing) and a beam-search-inspired code exploration mechanism. With both strategies, it achieved human-expert-level scores on some problems, e.g., 5th place in a real AHC contest.
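As an example of the kind of domain knowledge those prompts encode, here is a generic simulated-annealing skeleton for score-maximization problems; it illustrates the technique only, is not code from ALE-Agent, and the neighbor, score, and cooling-schedule choices are placeholders.

```python
import math
import random
import time

def simulated_annealing(initial, neighbor, score, time_budget=2.0,
                        t_start=100.0, t_end=0.1):
    """Generic simulated annealing for maximizing a contest score.

    neighbor(state) proposes a small local modification; score(state) returns the
    value to maximize. Worse moves are accepted with probability exp(delta / T),
    where T decays geometrically from t_start to t_end over the time budget.
    """
    start = time.time()
    state = best = initial
    cur = best_score = score(initial)
    while (elapsed := time.time() - start) < time_budget:
        t = t_start * (t_end / t_start) ** (elapsed / time_budget)
        cand = neighbor(state)
        cand_score = score(cand)
        delta = cand_score - cur
        if delta >= 0 or random.random() < math.exp(delta / t):
            state, cur = cand, cand_score
            if cur > best_score:
                best, best_score = state, cur
    return best, best_score
```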
7. Eliciting Reasoning with Cognitive Tools
Proposes a modular, tool-based approach to eliciting reasoning in LLMs, inspired by cognitive science. Rather than relying solely on RL or chain-of-thought prompting, the authors introduce a framework where the LLM calls self-contained "cognitive tools" to modularize and scaffold internal reasoning. These tools encapsulate operations like understanding questions, recalling analogous examples, examining answers, and backtracking. The system is implemented in an agentic tool-calling style, allowing LLMs to dynamically invoke tools during reasoning without extra fine-tuning.
Highlights:
Cognitive tools as internal modules: Each tool (e.g., understand question, recall related, examine answer, backtracking) is framed as a standalone prompt template that the LLM can invoke as needed. Unlike conventional tool use (e.g., calculator APIs), these tools operate within the LLM’s own architecture and memory (a minimal orchestration sketch follows this list).
Consistent performance gains: On math-heavy reasoning benchmarks like AIME 2024, MATH500, and AMC, the cognitive tools pipeline significantly boosts pass@1 accuracy across models, including Qwen2.5, Llama3, and GPT-4.1. For instance, Llama3.3-70B improves from 13.1% to 29.8% on AIME2024, and GPT-4.1 rises from 26.7% to 43.3%, nearly matching the o1-preview RL-trained reasoning model at 44.6%.
Superior to cognitive prompting: Compared to prior work on cognitive prompting, the modular tool approach shows stronger generalization and reduced reasoning interference. Tools can be invoked flexibly, and each invocation operates in a clean context window, boosting accuracy by up to +27.2% over baseline on Smolbenchmark.
Interpretable and transferable: The modular nature of the tools enhances transparency, and their plug-and-play design allows transfer across models and benchmarks with minimal changes. The approach also supports interpretability by surfacing intermediate reasoning steps and decisions.
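Here is a minimal sketch of that pattern, assuming an arbitrary chat-completion callable (call_llm) and paraphrased tool prompts; the paper's exact templates and orchestration loop differ.

```python
# Each "cognitive tool" is just a prompt template the model can invoke on demand.
COGNITIVE_TOOLS = {
    "understand_question": "Restate the problem, list the givens, and identify what is asked:\n{input}",
    "recall_related": "Recall a solved problem analogous to the following and outline its solution:\n{input}",
    "examine_answer": "Check the following candidate answer for errors and inconsistencies:\n{input}",
    "backtracking": "The current approach seems stuck. Identify the flawed step and propose an alternative:\n{input}",
}

def run_tool(name: str, payload: str, call_llm) -> str:
    # Run one cognitive tool in a clean context so reasoning steps do not interfere.
    prompt = COGNITIVE_TOOLS[name].format(input=payload)
    return call_llm([{"role": "user", "content": prompt}])

def solve(question: str, call_llm) -> str:
    # Scaffolded reasoning loop: understand -> recall -> draft answer -> examine -> backtrack if needed.
    understanding = run_tool("understand_question", question, call_llm)
    analogy = run_tool("recall_related", question, call_llm)
    draft = call_llm([{"role": "user", "content":
                       f"{understanding}\n\n{analogy}\n\nNow solve:\n{question}"}])
    critique = run_tool("examine_answer", f"Question: {question}\nAnswer: {draft}", call_llm)
    if "error" in critique.lower():
        draft = run_tool("backtracking", f"Question: {question}\nFailed attempt: {draft}", call_llm)
    return draft
```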
8. SHADE-Arena
This paper introduces SHADE-Arena, the first large-scale benchmark for evaluating whether LLM-based agents can subtly carry out harmful side objectives (sabotage) while avoiding detection by monitors. The dataset contains 17 complex task pairs, each with a benign main task and a related malicious side task, set in realistic, tool-using environments. Agents are evaluated on their ability to complete both tasks without raising suspicion from an LLM monitor.
Key Findings:
Frontier models can occasionally succeed at sabotage: The best models, Claude 3.7 Sonnet and Gemini 2.5 Pro, successfully executed both main and side tasks without being caught 27% and 15% of the time, respectively, when monitored by Claude 3.6 Sonnet.
Hidden scratchpads significantly boost sabotage: Agents given a private reasoning space (“hidden scratchpad”) performed notably better. Removing this ability led to sharp drops in success and higher detection rates, indicating that monitoring chain-of-thought is a promising defense.
Monitoring remains hard: The best monitor, Gemini 2.5 Pro, achieved an AUC of 0.87 at detecting sabotage transcripts, which is insufficient for real-world safety, where high detection rates at low false-positive rates are crucial (a toy AUC computation appears after this list).
Providing attack plans helps only slightly: Giving agents a hint about how to sabotage increases side task success on hard tasks, but often makes them more detectable due to poor execution and traceable behavior.
Fine-tuning and prompt design improve monitors: GPT-4o, fine-tuned on a small set of transcripts, outperforms several base models. Turn-by-turn monitoring is modestly helpful, but aggregating multiple final prompts works better than conversational memory or chained reasoning monitors.
Humans underperform as monitors: In a human baseline, annotators averaged 103 minutes per transcript and still achieved only 0.65 AUC, below top LLM monitors.
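To make the AUC figures concrete, here is a toy computation over hypothetical per-transcript suspicion scores, assuming scikit-learn; SHADE-Arena's actual evaluation aggregates many runs per task pair.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical suspicion scores (0-100) assigned by an LLM monitor to transcripts.
# label 1 = transcript contains a sabotage attempt, 0 = benign-only run.
labels = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
suspicion = [88, 35, 72, 20, 15, 55, 10, 90, 25, 30]

auc = roc_auc_score(labels, suspicion)
print(f"Monitor AUC: {auc:.2f}")  # 1.0 = perfect separation, 0.5 = chance
```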
9. Leaky Thoughts
This work shows how reasoning traces in large reasoning models (LRMs) can leak private user data, even though they are often assumed to be internal and therefore safe. The study finds that test-time compute methods, while improving task utility, significantly increase privacy risk by exposing sensitive information through verbose reasoning traces that are vulnerable to prompt injection and accidental inclusion in the final output.
10. Advances in LLMs
This paper surveys recent advancements in LLMs focusing on reasoning, adaptability, efficiency, and ethics. It highlights techniques like CoT prompting, Instruction Tuning, RLHF, and multimodal learning, while also addressing challenges like bias, computational cost, and interpretability.