1. Small Language Models are the Future of Agentic AI
This position paper argues that small language models (SLMs), defined as those runnable on consumer-grade hardware, are not only sufficient but superior for many agentic AI applications, especially when tasks are narrow, repetitive, or tool-oriented. The authors propose that shifting from LLM-first to SLM-first architectures will yield major gains in efficiency, modularity, and sustainability.
SLMs are already capable of commonsense reasoning, instruction following, and code/tool interaction at levels comparable to 30–70B models, with orders of magnitude better throughput. Examples include Phi-3, Hymba-1.5B, DeepSeek-R1-Distill, and RETRO-7.5B.
The economic benefits are significant: SLMs offer 10–30× lower inference cost than LLMs, require less parallel infrastructure, and are amenable to overnight fine-tuning and even edge deployment (e.g., ChatRTX). This enables faster iteration and better data control.
SLMs support modular, composable agent systems where specialized models handle subtasks, resulting in better alignment, lower risk of hallucinations, and easier debugging. The authors advocate for heterogeneous architectures, with SLMs as defaults and LLMs used selectively.
A six-step LLM-to-SLM conversion algorithm is proposed, involving usage logging, task clustering, and PEFT fine-tuning. This supports gradual migration from monolithic agents to SLM-based compositions.
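To make the shape of this migration concrete, here is a minimal sketch of the logging-and-clustering idea, assuming TF-IDF + k-means over logged prompts; the toy call log, cluster count, and clustering choice are illustrative rather than the paper's exact recipe, and the PEFT step is left as a comment because it requires model weights and GPUs.

```python
# Hedged sketch of the LLM-to-SLM migration idea: log agent calls, cluster them
# into recurring task types, then fine-tune one small model per cluster.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Usage logging (here, a toy log of prompts the deployed LLM agent received).
call_log = [
    "Extract the invoice total from this text: ...",
    "Extract the due date from this invoice: ...",
    "Write a SQL query that counts orders per region",
    "Write a SQL query that sums revenue per month",
    "Summarize this support ticket in one sentence",
]

# Cluster logged calls into candidate task types.
vectors = TfidfVectorizer().fit_transform(call_log)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

# For each cluster, one would collect (prompt, LLM response) pairs and
# PEFT-fine-tune a small model on them (e.g., LoRA via the `peft` library);
# omitted here because it needs model weights and compute.
for cluster_id in set(labels):
    members = [p for p, l in zip(call_log, labels) if l == cluster_id]
    print(f"cluster {cluster_id}: {len(members)} calls -> candidate SLM specialization")
```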
Case studies on MetaGPT, Open Operator, and Cradle suggest that 40–70% of LLM invocations can be reliably replaced with SLMs, particularly for structured generation and routine tool use.
2. AI4Research
This survey offers the first unified and comprehensive framework for understanding how AI is transforming the full lifecycle of scientific research. The paper identifies five core areas: Scientific Comprehension, Academic Survey, Scientific Discovery, Academic Writing, and Academic Peer Review, and presents a detailed taxonomy and modeling approach for each.
Systematic Taxonomy and Modeling: The paper introduces a modular functional composition model for AI4Research, where each task (e.g., ASC for comprehension, ASD for discovery) is modeled as a distinct function contributing to the overall research pipeline. These are mathematically formalized to optimize research efficiency, quality, and innovation.
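As a rough illustration of this functional-composition view (not the survey's exact formalism), each stage can be treated as a function and the research pipeline as their composition:

```python
# Minimal sketch: each AI4Research area is a function and the pipeline is their
# composition. Stage names follow the survey's five areas; bodies are placeholders.
from functools import reduce

def comprehension(x):   return x + " -> understood literature"
def academic_survey(x): return x + " -> related-work synthesis"
def discovery(x):       return x + " -> hypotheses + experiments"
def writing(x):         return x + " -> draft manuscript"
def peer_review(x):     return x + " -> review feedback"

pipeline = [comprehension, academic_survey, discovery, writing, peer_review]
result = reduce(lambda state, stage: stage(state), pipeline, "research question")
print(result)
```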
Scientific Discovery Pipeline: The discovery section details a full-stack AI workflow, from idea mining (internal knowledge, external signals, and collaborative brainstorming) through theory formalization and experiment execution to fully automatic discovery. Models like AI Scientist, Carl, and Zochi simulate autonomous research loops and have generated publishable papers, highlighting a growing capability in self-directed research agents.
Multimodal and Multidisciplinary Integration: The survey thoroughly maps AI applications across natural sciences (e.g., AlphaFold 3 in protein folding, AI-Newton in physics), applied sciences (robotics, software engineering), and social sciences (AI-led ethnographic simulations, automated interview agents).
AI in Peer Review and Writing: Beyond comprehension and discovery, the paper explores tools for writing assistance (e.g., ScholarCopilot, SciCapenter) and peer review automation (e.g., AgentReview, TreeReview). Benchmarks like PeerRead and MASSW support this growing subfield, while models like GPT-4o have begun to rival human reviewers in structure and focus.
Emerging Frontiers: In its future directions, the paper emphasizes ethical and explainable AI, interdisciplinary models, multilingual access, and dynamic real-time experiment optimization. Notably, it calls for infrastructure-level innovation in federated learning, collaborative agents, and multimodal integration to fully realize AI4Research’s promise.
Editor Message
We are excited to announce the full release of our Advanced AI Agents course. Learn how to build agentic systems from scratch and how to optimize them.
Our subscribers can use code AGENTS30 for a limited-time 30% discount.
3. Chain-of-Thought Is Not Explainability
This paper challenges the common assumption that chain-of-thought (CoT) reasoning in LLMs is synonymous with interpretability. While CoT improves performance and offers a seemingly transparent rationale, the authors argue it is neither necessary nor sufficient for faithful explanation. Through a review of empirical evidence and mechanistic insights, the paper makes the case that CoT often diverges from the internal computations that actually drive model predictions.
Key contributions and findings:
Unfaithful reasoning is systematic: CoT rationales are often unfaithful to the underlying model computations. Examples include models silently correcting mistakes, being influenced by prompt biases, and using latent shortcuts while offering post-hoc rationales that omit these factors.
Widespread misuse in the literature: Of 1,000 CoT-focused papers surveyed, 24.4% explicitly use CoT as an interpretability technique without proper justification. The paper introduces a misclaim detection pipeline to track this trend and shows that the prevalence has not declined over time.
Architectural mismatch: Transformer models compute in a distributed, parallel manner that doesn’t align with the sequential nature of CoT explanations. This leads to plausible but misleading narratives that fail to capture causal dependencies.
Recommendations for improvement: The authors advocate for causal validation techniques (e.g., counterfactual interventions, activation patching), cognitive science-inspired mechanisms (e.g., metacognition, self-correction), and human-centered interfaces to assess CoT faithfulness. However, even these remain partial solutions to a deeper architectural challenge.
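To give a concrete flavor of the causal validation the authors recommend, below is a minimal activation-patching sketch on a toy PyTorch model; real analyses patch transformer components such as attention heads or MLP outputs, not a two-layer network.

```python
# Hedged sketch of activation patching: cache an intermediate activation from a
# "counterfactual" input, splice it into a clean forward pass, and measure
# whether the output actually changes (i.e., whether that activation is causal).
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

clean_x = torch.randn(1, 4)
counterfactual_x = torch.randn(1, 4)

# 1) Record the hidden activation produced by the counterfactual input.
cached = {}
def save_hook(module, inp, out):
    cached["h"] = out.detach()
handle = model[1].register_forward_hook(save_hook)
model(counterfactual_x)
handle.remove()

# 2) Patch the cached activation into the clean forward pass.
def patch_hook(module, inp, out):
    return cached["h"]
handle = model[1].register_forward_hook(patch_hook)
patched_out = model(clean_x)
handle.remove()

clean_out = model(clean_x)
print("effect of patching:", (patched_out - clean_out).abs().sum().item())
```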
4. Agentic RAG for Personalized Recommendation
Introduces a multi-agent framework that enhances traditional RAG systems with reasoning agents tailored to user modeling and contextual ranking. Developed at Walmart Global Tech, ARAG reframes recommendations as a structured coordination problem between LLM agents.
Instead of relying on static similarity-based retrieval, ARAG comprises four agents:
User Understanding Agent synthesizes user preferences from long-term and session behavior.
NLI (natural language inference) Agent evaluates semantic alignment between candidate items and user intent.
Context Summary Agent condenses relevant item metadata.
Item Ranker Agent ranks final recommendations using all prior reasoning.
The agent collaboration is orchestrated via a blackboard-style memory, enabling cross-agent attention and interpretability. On the Amazon Reviews dataset (Clothing, Electronics, Home), ARAG outperforms Vanilla RAG and recency-based baselines by up to +42.1% (NDCG@5) and +35.5% (Hit@5).
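A minimal sketch of this blackboard-style coordination is below; the agent bodies are stubs with placeholder scoring rather than the LLM calls ARAG actually makes.

```python
# Hedged sketch of blackboard coordination: each agent reads the shared memory,
# writes its contribution, and the ranker consumes everything written before it.
def user_understanding(board):
    board["user_profile"] = f"summary of {len(board['session'])} session events + long-term history"

def nli_agent(board):
    # Placeholder alignment scores between candidates and the inferred intent.
    board["nli_scores"] = {item: 1.0 / (i + 1) for i, item in enumerate(board["candidates"])}

def context_summary(board):
    board["context"] = {item: f"condensed metadata for {item}" for item in board["candidates"]}

def item_ranker(board):
    board["ranking"] = sorted(board["candidates"], key=board["nli_scores"].get, reverse=True)

blackboard = {"session": ["viewed red dress", "searched summer outfits"],
              "candidates": ["linen dress", "wool coat", "sandals"]}
for agent in (user_understanding, nli_agent, context_summary, item_ranker):
    agent(blackboard)
print(blackboard["ranking"])
```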
Ablation studies reveal each agent’s contribution, with especially large gains from the NLI and Context Summary agents in categories where semantic fit (e.g., fashion) is critical.
This work demonstrates that incorporating agentic reasoning into RAG pipelines leads to more personalized, semantically grounded recommendations, offering a practical path forward for LLM-driven personalization at scale.
5. Threats in LLM-Powered AI Agents Workflows
This work presents the first comprehensive, end-to-end threat model for LLM-powered agent ecosystems. As LLM agents gain the ability to orchestrate multi-step workflows and interact via protocols like MCP, ANP, and A2A, this paper surveys over 30 attack techniques spanning the entire stack, from input manipulation to inter-agent protocol exploits.
Key insights include:
Four-part threat taxonomy: The authors categorize attacks into (1) Input Manipulation (e.g., prompt injections, multimodal adversarial inputs), (2) Model Compromise (e.g., composite backdoors, memory poisoning), (3) System & Privacy Attacks (e.g., retrieval poisoning, speculative side-channels), and (4) Protocol Vulnerabilities (e.g., MCP discovery spoofing, A2A prompt chaining).
Real-world attack success rates are high: Adaptive prompt injections bypass defenses in over 50% of cases, while attacks like Jailbreak Fuzzing and Composite Backdoors reach up to a 100% attack success rate (ASR). These findings suggest that even well-aligned agents remain deeply vulnerable.
Protocol-layer threats are underexplored but critical: The paper exposes novel vulnerabilities in communication protocols (e.g., context hijacks in MCP, rogue agent registration in A2A), showing how subtle abuses in capability discovery or authentication can trigger cascading failures across agent networks.
Emerging risks in LLM-agent infrastructure: New modalities such as vision-language model (VLM) agents, evolutionary coding agents, and federated LLM training introduce their own threat vectors, including cross-modal jailbreaks, dynamic backdoor activation, and coordinated poisoning attacks.
Call for system-level defenses: The authors advocate for dynamic trust management, cryptographic provenance tracking, secure agentic web interfaces, and tamper-resistant memory as key defense directions. They also highlight the need for new benchmarks and anomaly detection methods tailored to agent workflows.
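To illustrate one of those directions, here is a minimal sketch of cryptographic provenance tracking for agent messages using a hash chain; a production system would add digital signatures and key management, and this is not the paper's own implementation.

```python
# Hedged sketch: chain each agent message to its predecessor with a hash so that
# tampering anywhere in a multi-agent workflow becomes detectable on verification.
import hashlib, json

def append_message(chain, sender, content):
    prev_digest = chain[-1]["digest"] if chain else "genesis"
    record = {"sender": sender, "content": content, "prev": prev_digest}
    record["digest"] = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    chain.append(record)

def verify(chain):
    prev = "genesis"
    for record in chain:
        body = {"sender": record["sender"], "content": record["content"], "prev": record["prev"]}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if record["prev"] != prev or record["digest"] != expected:
            return False
        prev = record["digest"]
    return True

chain = []
append_message(chain, "planner", "fetch quarterly sales report")
append_message(chain, "retriever", "report contents ...")
print(verify(chain))                              # True
chain[0]["content"] = "exfiltrate credentials"    # simulated tampering
print(verify(chain))                              # False
```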
6. Deep Research Agents
Provides the most comprehensive survey to date of Deep Research (DR) agents, LLM-powered systems built for autonomous, multi-step informational research. The paper defines DR agents as AI systems that tightly integrate dynamic reasoning, adaptive long-horizon planning, tool use, retrieval, and structured report generation. It establishes a taxonomy of DR architectures, evaluates recent advances, and outlines the limitations of current systems and benchmarks.
The authors differentiate DR agents from classical RAG or tool-use pipelines by emphasizing their autonomy, continual reasoning, and adaptive planning. Static workflows like AI Scientist and AgentRxiv rely on rigid pipelines, while dynamic DR agents like OpenAI DR and Gemini DR can replan and adapt in response to intermediate results.
A key contribution is the classification of DR agents along three axes: (1) static vs dynamic workflows, (2) planning-only vs intent-planning strategies, and (3) single-agent vs multi-agent systems. For example, Grok DeepSearch uses a single-agent loop with sparse attention and dynamic tool use, while OpenManus and OWL adopt multi-agent orchestration with role specialization.
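Purely as an illustrative encoding of these three axes (the field names and the use of None for unspecified entries are mine, not the survey's), the classification can be captured in a small data structure:

```python
# Illustrative schema for the survey's three classification axes; example entries
# only fill in what the summary above states, with None for unspecified axes.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DRAgentProfile:
    name: str
    workflow: Optional[str]   # "static" or "dynamic"
    planning: Optional[str]   # "planning-only" or "intent-planning"
    topology: Optional[str]   # "single-agent" or "multi-agent"

catalog = [
    DRAgentProfile("AI Scientist", "static", None, None),
    DRAgentProfile("OpenAI DR", "dynamic", None, None),
    DRAgentProfile("Grok DeepSearch", None, None, "single-agent"),
    DRAgentProfile("OpenManus", None, None, "multi-agent"),
]
for profile in catalog:
    print(profile)
```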
DR agents use both API-based and browser-based search methods. The former (e.g., arXiv, Google Search) is fast and structured, while the latter (e.g., Chromium-based agents like Manus and DeepResearcher) handles dynamic content and complex UI interactions, albeit at higher latency and fragility.
Most agents now integrate tool-use modules such as code execution, data analytics, and multimodal reasoning. Advanced agents like AutoGLM Rumination even feature computer use, enabling direct API calls and platform interactions (e.g., CNKI, WeChat), effectively bridging inference and execution.
Benchmarking remains immature. The paper catalogs agent performance across QA (e.g., HotpotQA, GPQA, HLE) and execution benchmarks (e.g., GAIA, SWE-Bench), but notes that many evaluations fail to test retrieval rigor or long-form synthesis. Benchmarks like BrowseComp and HLE are highlighted as essential next steps for grounding evaluation in open-ended, time-sensitive tasks.
Future challenges include integrating private or API-gated data sources, enabling parallel DAG-style execution, improving fact-checking via reflective reasoning, and creating structured benchmarks for multimodal report generation. The authors advocate for AI-native browsers, hierarchical reinforcement learning, and dynamic workflow mutation as enablers of true agentic autonomy.
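As a sketch of what parallel DAG-style execution could look like, the snippet below runs stub research sub-tasks concurrently once their declared prerequisites finish; the task names and bodies are hypothetical placeholders, not any system's actual pipeline.

```python
# Hedged sketch of DAG-style parallel execution: each node waits on its parents,
# then runs; independent branches (extract_claims, run_analysis) run concurrently.
import asyncio

DAG = {  # task -> prerequisites
    "collect_sources": [],
    "extract_claims": ["collect_sources"],
    "run_analysis": ["collect_sources"],
    "write_report": ["extract_claims", "run_analysis"],
}

async def run_task(name, done_events):
    await asyncio.gather(*(done_events[dep].wait() for dep in DAG[name]))
    await asyncio.sleep(0.1)          # placeholder for the actual agent step
    print("finished", name)
    done_events[name].set()

async def main():
    done_events = {name: asyncio.Event() for name in DAG}
    await asyncio.gather(*(run_task(name, done_events) for name in DAG))

asyncio.run(main())
```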
7. Survey on Evaluation of LLM-based Agents
This work presents the first comprehensive overview of how to evaluate LLM-based agents, which differ significantly from traditional LLMs by maintaining memory, planning over multiple steps, using tools, and interacting with dynamic environments. The authors categorize and analyze the evaluation landscape across four axes: core agent capabilities, application-specific agent benchmarks, generalist agent evaluation, and supporting evaluation frameworks.
The paper organizes fundamental capabilities into four categories: (1) planning and multi-step reasoning (e.g., GSM8K, PlanBench, MINT), (2) function calling and tool use (e.g., ToolBench, BFCL, ToolSandbox), (3) self-reflection (e.g., LLF-Bench, LLM-Evolve), and (4) memory (e.g., MemGPT, ReadAgent, A-MEM, StreamBench). For each, it surveys benchmarks that test these competencies under increasingly realistic and complex scenarios.
In domain-specific evaluation, the authors highlight the rise of specialized agents in web navigation (e.g., WebShop, WebArena), software engineering (e.g., SWE-bench and its variants), scientific research (e.g., ScienceQA, AAAR-1.0, DiscoveryWorld), and dialogue (e.g., ABCD, τ-Bench, IntellAgent). These benchmarks typically include goal-oriented tasks, tool use, and policy adherence, often with simulations or human-in-the-loop data.
Generalist agent benchmarks like GAIA, AgentBench, OSWorld, and TheAgentCompany aim to evaluate flexibility across heterogeneous tasks, integrating planning, tool use, and real-world digital workflows. Benchmarks such as HAL aim to unify evaluation across multiple axes, including coding, reasoning, and safety.
The paper also covers evaluation frameworks like LangSmith, Langfuse, Vertex AI, and Galileo Agentic Evaluation, which allow stepwise, trajectory-based, and human-in-the-loop assessments. These systems enable real-time monitoring and debugging of agent behavior during development, often via synthetic data and LLM-as-a-judge pipelines.
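A minimal sketch of the trajectory-based, LLM-as-a-judge evaluation style these frameworks support is shown below; the trajectory, rubric, and judge function are toy stand-ins, not any specific framework's API.

```python
# Hedged sketch: serialize an agent trajectory and ask a judge model to grade it
# against a rubric. The judge call is stubbed; real frameworks wire this to an
# actual model and log the resulting scores per step or per run.
trajectory = [
    {"role": "user", "content": "Book a table for two tomorrow at 7pm"},
    {"role": "agent", "tool": "search_restaurants", "args": {"party_size": 2}},
    {"role": "agent", "tool": "create_reservation", "args": {"time": "19:00"}},
    {"role": "agent", "content": "Done: reserved for 2 at 7pm."},
]

RUBRIC = "Score 1-5 for goal completion, tool-use correctness, and policy adherence."

def judge(prompt: str) -> dict:
    # Placeholder for a real LLM judge call.
    return {"goal_completion": 5, "tool_use": 4, "policy": 5}

steps = "\n".join(str(step) for step in trajectory)
scores = judge(f"{RUBRIC}\n\nAgent trajectory:\n{steps}")
print(scores)
```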
8. NaturalThoughts
This paper introduces NaturalThoughts, a large-scale dataset of reasoning traces distilled from DeepSeek-R1 using questions from the NaturalReasoning corpus. It challenges the "Less is More" hypothesis by showing that simply scaling up high-quality reasoning traces, without aggressive filtering, yields robust and general improvements across STEM reasoning tasks for smaller models like Llama-3.1-8B and Qwen-2.5-7B.
Key findings:
Scale beats sparsity: Training on hundreds of thousands of randomly selected reasoning traces from NaturalThoughts consistently outperforms curated datasets like LIMO and S1K on GPQA-Diamond, MMLU-Pro, and SuperGPQA, reversing the prior trend where small, curated datasets showed outsized benefits.
Hard examples help more: Selecting training data by difficulty, e.g., examples with model disagreement or long reasoning chains, leads to greater sample efficiency than random sampling. The most effective subsets use disagreement between teacher models as a proxy for reasoning complexity (a minimal version of this filter is sketched after these findings).
Diversity matters, but not in an obvious way: Contrary to expectations, semantic or topical diversity in question domains yielded less gain than diversity in reasoning strategies themselves. Traces with a mix of tactics like self-verification, backtracking, and synthesis provided stronger generalization across tasks.
Mixed distillation improves efficiency: Blending full CoT traces (System-2) with final answers only (System-1) enables inference-time control over reasoning length. Difficulty-based mixing, applying System-2 only for harder examples, beats both random mixing and pure System-2, achieving higher accuracy with fewer tokens at test time.
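A minimal version of the disagreement-based difficulty filter mentioned above might look like the following; the example questions and teacher answers are toy stand-ins for real teacher generations.

```python
# Hedged sketch: keep questions on which two teacher models give different
# answers, treating disagreement as a proxy for reasoning difficulty.
examples = [
    {"question": "2+2?", "teacher_a": "4", "teacher_b": "4"},
    {"question": "Does this proof require the axiom of choice?", "teacher_a": "no", "teacher_b": "yes"},
    {"question": "Which weighs more, a kg of steel or of feathers?", "teacher_a": "equal", "teacher_b": "equal"},
]

hard_subset = [ex for ex in examples
               if ex["teacher_a"].strip().lower() != ex["teacher_b"].strip().lower()]
print([ex["question"] for ex in hard_subset])   # only the disagreement case survives
```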
9. Visual Structures Help Visual Reasoning
This study shows that adding simple spatial structures (like horizontal lines) to images significantly boosts GPT-4o’s visual reasoning by improving feature binding. This visual input tweak outperforms textual strategies alone, yielding large gains in visual search (+25%), counting (+26.8%), and spatial understanding (+9.5%).
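A minimal sketch of this kind of input intervention, assuming evenly spaced red lines drawn with Pillow (the line count, color, and function name are illustrative choices, not the paper's exact protocol):

```python
# Hedged sketch: overlay evenly spaced horizontal lines on an image before
# sending it to the vision model, giving it explicit spatial anchors.
from PIL import Image, ImageDraw

def add_horizontal_lines(image: Image.Image, n_lines: int = 6, color=(255, 0, 0)) -> Image.Image:
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    width, height = annotated.size
    for i in range(1, n_lines + 1):
        y = round(i * height / (n_lines + 1))
        draw.line([(0, y), (width, y)], fill=color, width=2)
    return annotated

# Demo on a blank canvas; in practice pass the benchmark image instead.
demo = add_horizontal_lines(Image.new("RGB", (256, 256), "white"))
demo.save("scene_with_lines.png")
```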
10. xLSTMAD
This paper introduces xLSTMAD, the first anomaly detection method using an encoder-decoder xLSTM architecture tailored for multivariate time series. It achieves state-of-the-art results on 17 real-world datasets, outperforming 23 baselines and demonstrating xLSTM’s strong potential beyond forecasting and compression.
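As a rough sketch of the encoder-decoder, reconstruction-based detection idea (with a plain LSTM standing in for the xLSTM blocks the paper actually uses, so this is not the authors' implementation):

```python
# Hedged sketch: encode a multivariate window, decode it back, and flag windows
# whose reconstruction error is large.
import torch
import torch.nn as nn

class EncDecAD(nn.Module):
    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, x):                      # x: (batch, time, features)
        _, (h, _) = self.encoder(x)
        # Feed the final encoder state at every decoder step.
        dec_in = h[-1].unsqueeze(1).repeat(1, x.size(1), 1)
        dec_out, _ = self.decoder(dec_in)
        return self.head(dec_out)

model = EncDecAD(n_features=5)
window = torch.randn(8, 50, 5)                 # 8 windows, 50 steps, 5 channels
recon = model(window)
anomaly_score = (recon - window).pow(2).mean(dim=(1, 2))   # per-window error
print(anomaly_score.shape)                     # torch.Size([8])
```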