🥇Top AI Papers of the Week
The Top AI Papers of the Week (January 26-February 1)
1. Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 is an open-source multimodal agentic model from Moonshot AI that jointly optimizes text and vision capabilities through native multimodal pretraining on 15 trillion mixed tokens, zero-vision SFT, and joint reinforcement learning. K2.5 also introduces Agent Swarm, a parallel agent orchestration framework that dynamically decomposes complex tasks into concurrent subtasks, reducing latency by up to 4.5x over single-agent baselines.
Joint text-vision optimization: K2.5 uses early fusion with a lower vision ratio during pretraining (rather than late-stage heavy vision injection), achieving better results across both modalities. A key finding is that zero-vision SFT - using only text SFT data - is sufficient to activate visual reasoning and tool use, while visual RL actually improves text benchmarks like MMLU-Pro (+1.7%) and GPQA-Diamond (+2.1%).
Agent Swarm with Parallel-Agent RL: The framework trains a learnable orchestrator via RL to decompose tasks and delegate subtasks to frozen specialized subagents running in parallel. This decoupled design avoids credit assignment ambiguity and improves item-level F1 from 72.8% to 79.0% on wide-search scenarios while significantly reducing inference latency.
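As a rough illustration of that decoupled design, the sketch below fakes the orchestrator and subagents with placeholder functions (`orchestrator_decompose` and `run_subagent` are hypothetical stand-ins, not Moonshot's API) and shows the fan-out/fan-in pattern that produces the latency savings:

```python
import asyncio

def orchestrator_decompose(task: str) -> list[str]:
    # Stand-in for the RL-trained orchestrator's decomposition step.
    return [f"{task} / shard {i}" for i in range(3)]

async def run_subagent(subtask: str) -> str:
    # Stand-in for a frozen specialized subagent. Because subagents are
    # frozen, only the orchestrator receives gradient updates, which
    # sidesteps multi-agent credit assignment ambiguity.
    await asyncio.sleep(0.1)  # placeholder for model inference
    return f"result({subtask})"

async def agent_swarm(task: str) -> list[str]:
    # Fan subtasks out concurrently; running them in parallel rather
    # than sequentially is the source of the reported latency reduction.
    subtasks = orchestrator_decompose(task)
    return list(await asyncio.gather(*(run_subagent(s) for s in subtasks)))

print(asyncio.run(agent_swarm("compare prices across 30 retailers")))
```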
State-of-the-art agentic performance: K2.5 achieves 74.9% on BrowseComp (with context management), 77.1% on DeepSearchQA, and 57.4% on Seal-0, outperforming GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro. It also scores 96.1% on AIME 2025, 76.8% on SWE-Bench Verified, and establishes new records in long-video comprehension.
Token-efficient RL with Toggle: K2.5 introduces Toggle, a training heuristic that alternates between budget-constrained and standard scaling phases during RL, reducing output tokens by 25-30% with negligible performance impact while maintaining strong test-time scaling capabilities.
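The Toggle heuristic is only described at a high level, but the alternation itself is simple to picture. Here is a toy scheduler sketch with made-up phase lengths and token budgets (the paper's actual settings are not given in the summary):

```python
def toggle_schedule(total_steps, phase_len=100,
                    capped_budget=4_096, full_budget=32_768):
    # Alternate between budget-constrained and standard scaling phases;
    # the capped phases pressure the policy to solve tasks with fewer
    # output tokens without giving up test-time scaling in full phases.
    for step in range(total_steps):
        capped = (step // phase_len) % 2 == 0
        yield step, capped_budget if capped else full_budget

for step, budget in toggle_schedule(total_steps=400):
    if step % 100 == 0:
        print(f"step {step}: rollout token budget = {budget}")
```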
2. Shaping Capabilities with Token-Level Data Filtering
Researchers from Anthropic and Stanford show that filtering pretraining data at the token level is a highly effective, scalable, and robust approach for selectively removing undesired capabilities from language models. Using medical knowledge removal as a proxy task, they show that token-level filtering Pareto dominates document-level filtering and imposes an effective 7,000x compute slowdown on the target domain for 1.8B-parameter models - while preserving capabilities in related fields.
Token filtering beats document filtering: Inspired by data attribution research showing individual tokens vary in their influence on model capabilities, the authors filter tokens rather than whole documents. This achieves the same reduction in undesired capabilities with lower cost to benign ones, since document filtering removes many useful tokens alongside harmful ones. Sweeping across classifier thresholds on 521M models confirms that token filtering Pareto dominates document filtering.
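Mechanically, token filtering amounts to masking the pretraining loss on flagged tokens instead of dropping the whole document. Here is a minimal PyTorch sketch of such a loss (the masking scheme is an assumption based on the summary, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

def token_filtered_loss(logits, targets, keep_mask):
    # logits: (batch, seq, vocab); targets: (batch, seq) token ids;
    # keep_mask: (batch, seq) floats, 0.0 where a classifier flagged the
    # token as target-domain, 1.0 otherwise.
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).view_as(targets)
    # Zero the loss on flagged tokens only; the rest of the document
    # still provides training signal, unlike document-level filtering.
    kept = per_token * keep_mask
    return kept.sum() / keep_mask.sum().clamp(min=1.0)

batch, seq, vocab = 2, 8, 100
logits = torch.randn(batch, seq, vocab)
targets = torch.randint(0, vocab, (batch, seq))
keep = torch.ones(batch, seq)
keep[0, 3:6] = 0.0  # e.g. a span of medical-domain tokens
print(token_filtered_loss(logits, targets, keep))
```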
Effectiveness scales with compute: Training models from 61M to 1.8B parameters, the authors find that filtering becomes more effective at larger scales. For 1.8B models, token removal causes a 7,000x effective compute slowdown on the forget domain versus just 30x for document filtering. On multiple-choice medical benchmarks, filtered models score near chance, while retaining full performance on biology, STEM, and non-STEM evaluations.
10x more robust than unlearning: Token-filtered models are 10x more robust to adversarial finetuning attacks than state-of-the-art unlearning methods. This addresses a key limitation of post-hoc approaches - once a capability exists in a base model, it is extremely hard to remove, but preventing it from forming during pretraining is far more durable.
Compatibility with alignment and SAE-based labeling: Surprisingly, models trained with token filtering generalize to refusal training better than unfiltered baselines, countering concerns that filtered models cannot be properly aligned on removed domains. The authors also introduce a novel pipeline using sparse autoencoders to label tokens and distill cheap, high-quality classifiers, showing that filtering remains effective even with noisy labels given sufficient compute.
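A toy sketch of that distillation step, with synthetic data standing in for real token embeddings and a random direction standing in for an SAE feature (everything here is illustrative): the noisy SAE-derived labels supervise a cheap classifier that can then be run over the full corpus.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
emb = rng.normal(size=(10_000, 64))    # stand-in token embeddings
sae_direction = rng.normal(size=64)    # stand-in SAE feature direction
# Noisy per-token labels derived from the SAE feature's activation sign.
labels = (emb @ sae_direction + rng.normal(scale=2.0, size=10_000) > 0)

# Distill the expensive labeling signal into a cheap linear classifier
# that is fast enough to score every token in a pretraining corpus.
clf = LogisticRegression(max_iter=1_000).fit(emb, labels)
print(f"agreement with noisy labels: {clf.score(emb, labels):.2f}")
```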
3. How AI Impacts Skill Formation
Researchers from Anthropic conducted randomized experiments to study how AI assistance affects the development of software engineering skills. They find that using AI to complete coding tasks with a new Python library significantly impaired conceptual understanding, code reading, and debugging abilities - without delivering significant efficiency gains on average.
Learning loss from AI assistance: In a controlled study with 52 developers learning the Python Trio library, participants using AI scored 17% lower (Cohen’s d=0.738, p=0.01) on a skills evaluation covering conceptual understanding, debugging, and code reading. The largest gap appeared in debugging questions, likely because control group participants encountered and independently resolved more errors during the task.
No significant productivity gains: Contrary to prior work showing AI-assisted coding speedups, AI did not significantly reduce task completion time in this learning context. Several participants spent up to 11 minutes composing queries to the AI assistant, offsetting potential time savings from code generation.
Six distinct AI interaction patterns: Qualitative analysis of screen recordings revealed three low-scoring patterns (AI Delegation, Progressive AI Reliance, Iterative AI Debugging) averaging below 40% quiz scores, and three high-scoring patterns (Conceptual Inquiry at 86%, Generation-Then-Comprehension at 68%, Hybrid Code-Explanation at 65%) where participants stayed cognitively engaged.
Implications for AI-assisted workflows: The findings suggest that AI-enhanced productivity is not a shortcut to competence. The high-scoring interaction patterns all involved independent thinking and cognitive effort, indicating that how AI is used matters more than whether it is used - particularly in safety-critical domains requiring human oversight of AI-generated code.
4. VibeTensor
VibeTensor is an open-source deep learning system software stack from NVLabs that was fully generated by LLM-powered coding agents under high-level human guidance. The system implements a PyTorch-style eager tensor library with a C++20/CUDA core, Python and Node.js frontends, its own autograd engine, CUDA runtime, and caching allocator - demonstrating that coding agents can produce coherent system software spanning language bindings down to GPU memory management.
Full-stack generated architecture: The system includes a schema-lite dispatcher, reverse-mode autograd engine, stream-ordered caching allocator with diagnostics, CUDA graph support, and a stable C ABI for dynamically loaded operator plugins. The codebase spans 218 core C++ files and 225 Python test files, all generated without per-change manual diff review.
AI-assisted development methodology: A two-month development cycle used a simple loop: specify scoped goals, generate code, compile and test, then broaden validation. Tests as specifications and differential checks against PyTorch served as key guardrails, with multi-agent code review catching unsafe patterns.
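A minimal sketch of such a differential check, with a NumPy function standing in for an agent-generated kernel (real VibeTensor kernels are C++/CUDA) and PyTorch as the reference oracle:

```python
import numpy as np
import torch

def differential_check(op_name, generated_fn, reference_fn, inputs, atol=1e-5):
    # Run the same inputs through the agent-generated implementation and
    # the PyTorch reference, and flag any numerical divergence.
    expected = reference_fn(*[torch.from_numpy(x) for x in inputs]).numpy()
    actual = generated_fn(*inputs)
    assert np.allclose(actual, expected, atol=atol), \
        f"{op_name}: diverges from the PyTorch reference"
    print(f"{op_name}: matches PyTorch within atol={atol}")

def generated_softmax(x):
    # NumPy stand-in for an agent-generated kernel.
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

x = np.random.randn(4, 16).astype(np.float32)
differential_check("softmax", generated_softmax,
                   lambda t: torch.softmax(t, dim=-1), [x])
```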
Kernel performance and training validation: An accompanying AI-generated kernel suite shows mixed results: 1.54x faster than FlashAttention on NanoChat-style training (batch 32, seq 2048) but 0.67x on small-batch GQA prefill. End-to-end training on H100 and Blackwell GPUs converges correctly but runs 1.7-6.2x slower than PyTorch.
The Frankenstein composition effect: The paper identifies a key failure mode where individually correct generated subsystems compose into globally suboptimal designs - for example, a correctness-first autograd gate serializes execution and starves efficient backend kernels, highlighting challenges unique to AI-generated system software.
5. Reinforcement Learning via Self-Distillation
This paper introduces Self-Distillation Policy Optimization (SDPO), an on-policy RL algorithm that converts rich textual feedback from verifiable environments into dense credit assignment without requiring an external teacher model. SDPO uses the current model conditioned on feedback as a “self-teacher” to retrospectively identify mistakes in its own rollouts, substantially outperforming GRPO across scientific reasoning, tool use, and competitive programming.
Self-teacher for dense credit assignment: Instead of learning from sparse scalar rewards like GRPO, SDPO re-evaluates the model’s original attempt after conditioning on environment feedback (runtime errors, failed tests, or successful rollouts). This produces logit-level advantages at every token position, compared to GRPO’s constant per-rollout advantages. The approach requires only minor changes to standard RLVR pipelines by swapping out the advantage computation.
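A rough sketch of what that advantage computation could look like, assuming a Hugging Face-style causal LM and treating the advantage as the per-token log-probability gap between the feedback-conditioned and unconditioned passes (an interpretation of the summary, not the paper's exact formula):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def self_teacher_advantages(model, rollout_ids, plain_ctx, feedback_ctx):
    # rollout_ids, plain_ctx, feedback_ctx: 1-D LongTensors of token ids.
    # The same model scores its own rollout twice: once on the plain
    # context, once with environment feedback (runtime errors, failed
    # tests) prepended. The per-token log-prob gap gives dense credit
    # assignment, unlike GRPO's single scalar advantage per rollout.
    def rollout_logps(ctx_ids):
        ids = torch.cat([ctx_ids, rollout_ids]).unsqueeze(0)
        logits = model(ids).logits[0, :-1]   # logits[i] predicts token i+1
        logps = F.log_softmax(logits.float(), dim=-1)
        start = ctx_ids.numel() - 1          # logit predicting rollout[0]
        window = logps[start:start + rollout_ids.numel()]
        return window.gather(-1, rollout_ids.unsqueeze(-1)).squeeze(-1)

    return rollout_logps(feedback_ctx) - rollout_logps(plain_ctx)
```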
Strong gains on competitive programming: On LiveCodeBench v6 with Qwen3-8B, SDPO reaches 48.8% accuracy versus 41.2% for GRPO, surpassing Claude Sonnet 4 (40.5%) and Claude Opus 4 (39.7%) on the public leaderboard. SDPO achieves GRPO’s final accuracy in 4x fewer generations, with gains growing at larger model scales - suggesting self-teaching is an emergent capability.
Effective even without rich feedback: In standard RLVR environments with only scalar rewards, SDPO treats successful rollouts as implicit feedback for failed attempts, achieving 68.8% vs. 64.1% aggregate accuracy over GRPO on scientific reasoning and tool use benchmarks. On Chemistry with OLMo3-7B, SDPO reaches GRPO’s 5-hour accuracy in just 30 minutes.
Concise reasoning without verbosity: SDPO produces responses that are 3-7x shorter than GRPO while achieving higher accuracy, avoiding circular reasoning patterns and filler phrases. At test time, SDPO accelerates discovery of solutions on difficult tasks by 3x compared to best-of-k sampling, enabling effective test-time self-distillation on individual questions.
6. Self-Improving Pretraining
Self-Improving Pretraining is a new pretraining paradigm from Meta FAIR that replaces standard next-token prediction with sequence-level generation guided by an existing post-trained model acting as both a suffix rewriter and a suffix judge. The approach addresses quality, safety, and factuality issues at pretraining time rather than deferring them to post-training, yielding large gains across all three dimensions.
Suffix rewriting and judging framework: The method segments pretraining data into prefix-suffix chunks. A post-trained teacher model rewrites low-quality or unsafe suffixes into superior training targets, while a separate judge scores candidate completions (original suffixes, rewrites, and policy rollouts) to provide rewards for online RL training via online DPO or reward-filtered NLL.
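A toy sketch of the reward-filtered NLL variant, where `judge` is a hypothetical scoring callable and the best-scoring candidate above a threshold becomes the next-token-prediction target (online DPO would instead build preference pairs from the same scores):

```python
def pick_training_target(prefix, candidates, judge, threshold=0.5):
    # candidates: the original suffix, the teacher's rewrite, and one or
    # more policy rollouts; judge returns a scalar quality score.
    scored = [(judge(prefix, c), c) for c in candidates]
    score, best = max(scored, key=lambda t: t[0])
    # Reward-filtered NLL: train next-token prediction only on chunks
    # whose best candidate clears the threshold.
    return best if score >= threshold else None

toy_judge = lambda p, s: 0.9 if s == " is Paris." else 0.2  # toy scorer
print(pick_training_target(
    "The capital of France",
    [" is Pariss.", " is maybe Paris idk", " is Paris."],
    toy_judge,
))  # -> " is Paris."
```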
Strong continual pretraining gains: When applied to continual pretraining of Llama2 1.4B, the method achieves an 86.3% generation quality win rate over the baseline, a 36.2% relative improvement in factuality (42.3 to 57.6 average score), and an 18.5% relative improvement in safety (76.9 to 91.1 average score), while also improving standard evaluation benchmarks.
From-scratch pretraining improvements: Training from scratch on RedPajama yields a 31.1% absolute gain in generation quality win rate and improves safety evaluations from 85.2 to 97.5, demonstrating that embedding quality signals early in pretraining is highly effective.
Scaling with rollouts: Performance improves consistently with more rollouts during online DPO training (tested from 1 to 16), and the model naturally transitions from relying on suffix rewrites early in training to preferring its own high-quality rollouts as training progresses.
7. LingBot-World: Open-Source World Simulator
LingBot-World is an open-source world simulator that evolves a video generation model into an interactive, real-time environment engine. Built on a 28B-parameter Mixture-of-Experts architecture, it achieves high-fidelity dynamics across diverse domains with sub-second latency at 16 fps, outperforming Genie 3 and Mirage 2 in dynamic degree while being fully open-source.
Three-stage evolution pipeline: A progressive training strategy transforms a pretrained video model into an interactive simulator: Stage I establishes a general video prior via the Wan2.2 14B model, Stage II injects world knowledge and action control through MoE middle-training on 60-second sequences, and Stage III adapts to causal attention with few-step distillation for real-time inference.
Scalable data engine with hierarchical captioning: A hybrid data engine ingests real-world footage, game engine recordings, and Unreal Engine synthetic data. A three-layer captioning strategy (narrative, scene-static, and dense temporal) disentangles motion control from scene generation, enabling precise action-contingent dynamics learning.
Emergent spatial memory: Without explicit 3D representations, the model maintains structural integrity of landmarks after 60 seconds out of view, reasons about unobserved state evolution (vehicles continuing trajectories off-screen), and supports coherent generation up to 10 minutes. VBench evaluation shows 0.8857 dynamic degree versus 0.76 for Yume-1.5 and 0.72 for HY-World 1.5.
Versatile embodied AI applications: Beyond visual synthesis, the framework supports promptable world events (global weather/style shifts and local object injection via text), an action agent trained on Qwen3-VL-2B for autonomous exploration, and 3D reconstruction from generated videos validating geometric consistency.
8. Insight Agents: Multi-Agent System for Data Insights
Insight Agents introduces a hierarchical multi-agent system built on a plan-and-execute paradigm for delivering personalized business insights to e-commerce sellers. A manager agent performs OOD detection with a lightweight encoder-decoder model and uses BERT-based routing to coordinate two worker agents (a data presenter and an insight generator), achieving 90% accuracy with P90 latency below 15 seconds. Accepted at SIGIR 2025 and deployed for Amazon sellers in the US.
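A toy sketch of the manager's routing logic, treating the OOD detector and BERT router as precomputed inputs rather than real models (all names and thresholds here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Decision:
    route: str   # "data_presenter", "insight_generator", or "reject"
    reason: str

def manager_route(query: str, ood_score: float, intent: str,
                  ood_threshold: float = 0.8) -> Decision:
    # Plan-and-execute, step 1: gate out-of-domain queries, then route
    # the rest to the appropriate worker agent.
    if ood_score > ood_threshold:
        return Decision("reject", "query outside supported seller domains")
    if intent == "metric_lookup":
        return Decision("data_presenter", "fetch and present the data")
    return Decision("insight_generator", "analyze and explain the trend")

print(manager_route("Why did my sales drop last week?",
                    ood_score=0.12, intent="diagnosis"))
```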
9. Communication Methods in Multi-Agent RL
A systematic survey of 29 papers reviewing how agents coordinate in multi-agent reinforcement learning, covering fully connected message passing, implicit communication, attention-based selective methods, graph-based relational approaches, and role-based hierarchical frameworks. The analysis reveals that attention- and graph-based methods dominate recent research, while implicit communication is seeing renewed interest for its scalability in decentralized settings where explicit channels are infeasible.
10. Team of Rivals: Orchestrating Reliable AI Agents
This paper proposes organizing AI agents into corporate-style teams with strict role boundaries and opposing incentives (planners, executors, critics, experts) to achieve reliability through careful orchestration of imperfect components. A remote code executor separates reasoning from data transformations, preventing raw tool outputs from contaminating agent context windows. The system achieves over 90% internal error interception before user exposure while maintaining acceptable latency tradeoffs.
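A toy sketch of the role separation, with all four components as hypothetical callables; the point is the critic gate that intercepts errors before anything reaches the user:

```python
def run_with_rivals(task, planner, executor, critic, max_rounds=3):
    # Planner proposes, executor runs the step (in the paper, inside a
    # remote code executor so raw tool output never enters agent context),
    # and the critic must approve before the result reaches the user.
    plan = planner(task)
    for _ in range(max_rounds):
        result = executor(plan)
        verdict = critic(task, result)
        if verdict == "approve":
            return result
        plan = planner(f"{task}\ncritic feedback: {verdict}")  # revise plan
    raise RuntimeError("critic rejected all attempts; escalate to a human")

# Minimal stand-ins so the sketch runs end to end.
print(run_with_rivals(
    "sum column A",
    planner=lambda t: "df['A'].sum()",
    executor=lambda plan: 42,
    critic=lambda task, result: "approve",
))
```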