1. Text-to-LoRA
Introduces a hypernetwork-based approach for instantly generating LoRA adapters from natural language task descriptions, removing the need for conventional task-specific fine-tuning. The authors present Text-to-LoRA (T2L), a model that compresses many LoRA adapters and generalizes to unseen tasks with high efficiency and strong performance.
T2L is trained to generate low-rank adaptation matrices (LoRAs) for LLMs using only task descriptions, leveraging a hypernetwork to output LoRA weights in a single forward pass. It supports two training modes: reconstruction of pre-trained LoRAs and supervised fine-tuning (SFT) across multiple tasks.
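To make the mechanism concrete, below is a minimal PyTorch-style sketch of a hypernetwork that maps a task-description embedding to per-module LoRA factors in one forward pass. The layer sizes, module count, and head layout are illustrative assumptions, not T2L's actual architecture.

```python
# Minimal sketch of a T2L-style hypernetwork (illustrative sizes and names, not the paper's code).
import torch
import torch.nn as nn

class LoRAHyperNetwork(nn.Module):
    def __init__(self, emb_dim=768, hidden=256, d_model=1024, rank=8, n_modules=16):
        super().__init__()
        self.n_modules, self.d_model, self.rank = n_modules, d_model, rank
        self.trunk = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        # One head per LoRA factor; outputs are reshaped into low-rank matrices per target module.
        self.head_A = nn.Linear(hidden, n_modules * rank * d_model)
        self.head_B = nn.Linear(hidden, n_modules * d_model * rank)

    def forward(self, task_embedding):                    # (batch, emb_dim)
        h = self.trunk(task_embedding)
        A = self.head_A(h).view(-1, self.n_modules, self.rank, self.d_model)
        B = self.head_B(h).view(-1, self.n_modules, self.d_model, self.rank)
        return A, B                                       # per-module update: delta_W ~ B @ A

# Usage: embed a natural-language task description with a text encoder (e.g., GTE or
# Mistral embeddings, as in the paper), then generate the adapter in a single pass.
hypernet = LoRAHyperNetwork()
task_emb = torch.randn(1, 768)                            # stand-in for a description embedding
A, B = hypernet(task_emb)
print(A.shape, B.shape)                                   # torch.Size([1, 16, 8, 1024]), torch.Size([1, 16, 1024, 8])
```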
In benchmark experiments, SFT-trained T2L performs competitively in zero-shot adaptation, outperforming multi-task LoRA baselines and even task-specific LoRAs on some tasks (e.g., PIQA and Winogrande), showcasing generalization and compression benefits.
The authors test three architectural variants (L, M, and S) of decreasing size and thus increasing parameter efficiency. Ablations show that T2L scales well with the number of training tasks, and its performance is robust across different task description embeddings (e.g., from GTE or Mistral models).
Qualitative and visual analyses confirm that T2L produces task-specific and semantically meaningful adapters even for unseen tasks, with steerability controlled by how the task is described.
2. V-JEPA 2
Meta AI introduces V-JEPA 2, a scalable joint-embedding predictive architecture for self-supervised video learning, targeting the goal of building a generalist world model capable of understanding, predicting, and planning in the physical world. The model is trained in two stages: action-free pretraining on 1M+ hours of internet videos and images, followed by post-training with only 62 hours of unlabeled robot trajectories (Droid dataset). This yields a latent video representation that is useful across a broad range of downstream tasks.
Video understanding & QA: V-JEPA 2 achieves state-of-the-art performance on motion-centric tasks like Something-Something v2 (77.3% top-1 accuracy) and Epic-Kitchens-100 action anticipation (39.7 recall@5), and excels in multimodal QA when aligned with an 8B language model (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Notably, this is achieved without language supervision during video pretraining.
Self-supervised robot planning: With just 62 hours of robot video data, the team fine-tunes an action-conditioned model (V-JEPA 2-AC) on top of the frozen video encoder. This model supports zero-shot deployment for prehensile tasks like grasping and pick-and-place on real Franka arms in unseen environments, without rewards, demonstrations, or lab-specific fine-tuning.
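As a rough illustration of how an action-conditioned latent predictor can be used for such control, here is a hedged sketch of planning by rolling out candidate action sequences in representation space and picking the one whose predicted latent lands closest to a goal latent. The encoder, predictor, distance, and random-shooting search below are all stand-ins, not the paper's exact planner.

```python
# Hedged sketch of latent-space planning with an action-conditioned predictor.
# encoder/predictor are stand-in callables; the search below is simple random shooting.
import torch

def plan_action(encoder, predictor, current_obs, goal_obs,
                horizon=5, n_candidates=256, action_dim=7):
    z_t = encoder(current_obs)                    # latent of the current observation
    z_goal = encoder(goal_obs)                    # latent of the goal observation
    candidates = torch.randn(n_candidates, horizon, action_dim)
    costs = []
    for actions in candidates:
        z = z_t
        for a in actions:
            z = predictor(z, a)                   # predict the next latent given an action
        costs.append(torch.norm(z - z_goal, p=1)) # cost: distance of predicted latent to goal
    best = torch.stack(costs).argmin()
    return candidates[best, 0]                    # execute the first action, then replan

# Toy usage with placeholder modules:
enc = lambda obs: obs.mean(dim=-1)                # placeholder encoder
pred = lambda z, a: z + 0.0 * a.sum()             # placeholder action-conditioned predictor
first_action = plan_action(enc, pred, torch.randn(4, 16), torch.randn(4, 16))
```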
Architectural scale-up: V-JEPA 2 scales beyond its predecessor using four key improvements: expanded dataset (22M videos), larger ViT-g model (1B params), longer training (up to 252k iterations), and progressive spatiotemporal resolution. Combined, these yield significant gains across visual benchmarks (e.g., +4.0 average accuracy gain over baseline).
Comparison to other models: V-JEPA 2 outperforms image encoders like DINOv2 and PEcoreG on video QA, and is competitive on appearance tasks like ImageNet. When used in MLLMs, it surpasses models like PLM-8B on PerceptionTest, MVP, and TOMATO using only vision encoders pretrained without language data.
3. Reinforcement Pre-Training
This paper introduces Reinforcement Pre-Training (RPT), a new paradigm that bridges LLM pretraining and RL by reinterpreting next-token prediction as a reasoning task rewarded via verifiable correctness. Instead of relying on hand-curated annotations or costly human feedback, RPT applies RL on vast unannotated text corpora, assigning intrinsic rewards based on whether a predicted token matches the ground truth. This reframing supports general-purpose RL scaling and enhances both pretraining and fine-tuning efficacy.
Core method: At each token position in a text sequence, the model first generates a reasoning trace (chain-of-thought) and then predicts the next token. If the prediction is a valid prefix of the ground-truth continuation, a reward is assigned. Multiple rollouts are used per context, and the model is trained via on-policy RL.
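A minimal sketch of this reward, assuming a simple string-level check: the hypothetical prefix_reward below returns 1 when the predicted text is a prefix of the ground-truth continuation and 0 otherwise (the paper's byte- and token-boundary handling is more careful than this).

```python
# Sketch of a prefix-matching reward for next-token reasoning rollouts (illustrative only).
def prefix_reward(predicted: str, ground_truth_continuation: str) -> float:
    """1.0 if the prediction is a non-empty prefix of the ground-truth continuation, else 0.0."""
    pred = predicted.encode("utf-8")
    gt = ground_truth_continuation.encode("utf-8")
    return 1.0 if pred and gt.startswith(pred) else 0.0

# Multiple rollouts per context: each (reasoning trace, prediction) pair is scored independently.
rollouts = [" The", " A", " The quick"]
rewards = [prefix_reward(p, " The quick brown fox") for p in rollouts]
print(rewards)   # [1.0, 0.0, 1.0]
```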
Better than standard pretraining: RPT significantly outperforms standard next-token prediction and chain-of-thought reasoning baselines (without RL), achieving higher accuracy on tokens of varying difficulty and even rivaling larger models in performance. RPT-14B, for instance, matches or exceeds R1-Distill-Qwen-32B's accuracy on the OmniMATH benchmark.
Strong scaling laws: RPT exhibits clean power-law scaling with respect to training compute across difficulty levels, with prediction accuracy consistently improving as compute increases, and fitting closely to theoretical curves.
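For intuition, fitting such a curve amounts to a linear regression in log-log space; the snippet below does this on synthetic, clearly hypothetical numbers (the paper's exact functional form and constants may differ).

```python
# Generic power-law fit of accuracy vs. training compute on synthetic, illustrative data.
import numpy as np

compute = np.array([1e19, 3e19, 1e20, 3e20, 1e21])   # hypothetical training FLOPs
accuracy = np.array([0.30, 0.34, 0.38, 0.41, 0.45])  # hypothetical next-token accuracy

# Fit accuracy ~ a * C^b via linear regression on log-transformed values.
b, log_a = np.polyfit(np.log(compute), np.log(accuracy), 1)
print(f"accuracy ~= {np.exp(log_a):.3g} * C^{b:.3f}")
```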
Improves downstream RL and generalization: Fine-tuning RPT models with reinforcement learning on tasks with verifiable answers (e.g., Skywork-OR1) shows faster and stronger gains compared to models trained with standard objectives. Zero-shot evaluation on SuperGPQA and MMLU-Pro benchmarks reveals that RPT-14B in reasoning mode surpasses R1-Distill-Qwen-32B by a significant margin.
Promotes structured thinking: Analysis of reasoning traces reveals that RPT-14B employs more hypothesis generation, deduction, and reflective patterns compared to traditional problem-solving models, supporting the claim that RPT fosters deeper reasoning habits during training.
Editor Message
We are excited to announce the full release of our Advanced AI Agents course. Learn how to build agentic systems from scratch and how to optimize them.
Our subscribers can use code AGENTS30 for a limited-time 30% discount.
4. TableRAG
TableRAG tackles a core limitation of existing RAG approaches: their inability to reason effectively over heterogeneous documents that combine both unstructured text and structured tables. Typical RAG pipelines flatten tables and intermix them with surrounding text, losing essential structural information and hampering multi-hop reasoning. TableRAG overcomes this by introducing a hybrid system that integrates SQL-based symbolic execution with text retrieval in a unified, iterative reasoning framework.
TableRAG operates in four iterative stages: (1) context-sensitive query decomposition, (2) text retrieval, (3) SQL programming and execution, and (4) intermediate answer generation. This design allows it to preserve tabular structure and leverage both symbolic and neural reasoning paths.
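The loop below sketches these four stages; decompose, retrieve_text, run_sql, generate_answer, and is_final are hypothetical stubs standing in for an LLM-backed decomposer, a text retriever, a SQL engine over the tables, and an answer generator.

```python
# Sketch of the four-stage iterative TableRAG loop, with trivial stand-in helpers.

def decompose(question, context):
    return question if not context else f"{question} (given {context[-1][1]})"

def retrieve_text(subquery):
    return ["...retrieved passage..."]             # placeholder for dense/sparse text retrieval

def run_sql(subquery):
    return [("...table cell...",)]                 # placeholder for SQL execution over tables

def generate_answer(subquery, passages, rows, context):
    return "FINAL: 42"                             # placeholder for LLM answer generation

def is_final(answer):
    return answer.startswith("FINAL:")

def table_rag(question, max_steps=5):
    context, answer = [], ""
    for _ in range(max_steps):
        subquery = decompose(question, context)    # (1) context-sensitive query decomposition
        passages = retrieve_text(subquery)         # (2) text retrieval
        rows = run_sql(subquery)                   # (3) SQL programming and execution
        answer = generate_answer(subquery, passages, rows, context)  # (4) intermediate answer
        context.append((subquery, answer))
        if is_final(answer):
            break
    return answer

print(table_rag("Which director's films grossed the most in 2019?"))
```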
A new benchmark, HeteQA, was developed to evaluate heterogeneous reasoning across 304 multi-hop QA examples covering nine domains and five types of tabular operations (e.g., aggregation, filtering, grouping).
Experiments on HeteQA and existing benchmarks (HybridQA, WikiTableQuestions) show that TableRAG consistently outperforms prior methods like NaiveRAG, ReAct, and TableGPT2, achieving >10% gains in accuracy over the strongest baseline.
Ablations reveal that all major components of TableRAG contribute significantly. Notably, SQL execution is critical for nested reasoning tasks, while textual retrieval is crucial for entity and numeric references.
TableRAG achieves greater reasoning efficiency, solving over 90% of HeteQA tasks in five or fewer steps and exhibiting the lowest failure rate among evaluated methods. Its robustness holds across multiple LLM backbones (Claude, DeepSeek, Qwen).
5. Self-Adapting Language Models
The paper proposes SEAL, a framework that enables LLMs to adapt themselves through reinforcement learning by generating their own fine-tuning data and update directives, referred to as “self-edits.” This approach allows models to autonomously optimize their learning process without relying on separate adaptation modules or human supervision.
Key highlights:
Self-Edits as Adaptation Mechanism: Instead of updating weights directly with raw data, SEAL uses the model to generate self-edits, natural language instructions that might include restated facts, optimization parameters, or tool invocations. These are then used for supervised fine-tuning. The generation of self-edits is optimized via RL using downstream task performance as the reward.
Two Domains Evaluated:
Knowledge Incorporation: SEAL is evaluated on a no-context version of SQuAD. After RL training, finetuning on SEAL’s own synthetic data improves accuracy from 33.5% to 47.0%, outperforming GPT-4.1-generated data despite SEAL using a smaller model (Qwen2.5-7B).
Few-Shot Generalization: On ARC-AGI, SEAL learns to configure data augmentation and optimization settings, achieving a 72.5% adaptation success rate, up from 20% without RL and far beyond 0% for in-context learning.
Learning Framework: SEAL employs an outer RL loop (optimizing the self-edit generation policy) and an inner loop (applying the edits via supervised fine-tuning). The RL objective is approximated via filtered behavior cloning (ReSTEM), reinforcing only edits that lead to performance gains; a minimal sketch of this loop appears after this list.
Limitations and Forward View: While SEAL achieves compelling gains, it remains susceptible to catastrophic forgetting during sequential updates and incurs high computational cost due to repeated fine-tuning. Future directions include combining SEAL with continual learning strategies and extending it toward autonomous agents that update weights as part of long-horizon interactions.
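Here is the ReSTEM-style loop referenced above, as a minimal sketch; generate_self_edits, finetune_on, and evaluate are hypothetical placeholders for self-edit sampling, supervised fine-tuning, and downstream evaluation, not the paper's code.

```python
# Hedged sketch of SEAL-style filtered behavior cloning (ReSTEM) with placeholder stubs.
import copy
import random

def generate_self_edits(model, ctx, n):
    return [f"restated fact {i} about {ctx}" for i in range(n)]  # placeholder self-edit sampler

def finetune_on(model, data):
    return model                                   # placeholder: would run SFT and return new weights

def evaluate(model, ctx):
    return random.random()                         # placeholder: downstream task score (e.g., QA accuracy)

def seal_round(model, task_contexts, n_samples=4):
    """One round: sample self-edits, keep only those that improve the downstream score
    after fine-tuning, then behavior-clone the policy on the kept edits."""
    kept = []
    for ctx in task_contexts:
        base = evaluate(model, ctx)
        for edit in generate_self_edits(model, ctx, n_samples):
            candidate = finetune_on(copy.deepcopy(model), [edit])   # inner loop: SFT on the edit
            if evaluate(candidate, ctx) > base:                      # filter: keep edits that help
                kept.append((ctx, edit))
    return finetune_on(model, kept)                                  # outer loop: reinforce kept edits

updated = seal_round({"weights": None}, ["a passage to incorporate"])
```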
6. ComfyUI-R1
Introduces ComfyUI-R1, a 7B-parameter large reasoning model fine-tuned for automatic workflow generation in the ComfyUI ecosystem. Built on Qwen2.5-Coder and trained through a two-stage pipeline (supervised chain-of-thought reasoning followed by reinforcement learning), ComfyUI-R1 significantly outperforms prior state-of-the-art approaches that rely on commercial models like GPT-4o and Claude 3.5.
Key highlights:
Curated Workflow & Node KBs: The team collected and cleaned 27K community workflows down to 3.9K high-quality entries, each with JSON and code representations. They also assembled a node documentation KB using 7,238 nodes, some enhanced via Claude 3.5.
CoT + RL Training: The model first undergoes supervised fine-tuning on simulated long CoT reasoning sequences involving node selection, planning, and code generation. Then, reinforcement learning with a fine-grained rule-metric hybrid reward encourages format validity, graph correctness, and node accuracy (a sketch of such a hybrid reward follows this list).
Superior Performance: ComfyUI-R1 achieves 97% format validity and the highest node-level and graph-level F1 scores across benchmarks, beating all GPT-4o and Claude baselines. On the ComfyBench benchmark, it improves the execution pass rate to 67%, 11 percentage points above the previous best (ComfyAgent with GPT-4o).
Qualitative Improvements: Case studies show ComfyUI-R1 creates more faithful, complex, and executable workflows than prior multi-agent approaches. Notably, it better aligns image generation outputs with user instructions in terms of style, structure, and coherence.
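As a rough illustration of the rule-metric hybrid reward mentioned in the training bullet above, the sketch below combines a JSON-format check with node-level and edge-level F1 terms. The specific checks, field names (nodes, edges, type), and equal weighting are assumptions, not the paper's exact reward.

```python
# Illustrative rule + metric hybrid reward for generated workflows (assumed schema and weights).
import json

def set_f1(pred_items, ref_items):
    pred, ref = set(pred_items), set(ref_items)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

def workflow_reward(workflow_json: str, reference: dict) -> float:
    try:
        wf = json.loads(workflow_json)              # rule: output must parse as valid JSON
    except json.JSONDecodeError:
        return 0.0                                  # format failure gets zero reward
    fmt = 1.0
    nodes = set_f1([n["type"] for n in wf.get("nodes", [])],
                   [n["type"] for n in reference["nodes"]])
    edges = set_f1([tuple(e) for e in wf.get("edges", [])],
                   [tuple(e) for e in reference["edges"]])
    return (fmt + nodes + edges) / 3                # hybrid: format + node-level + graph-level terms

ref = {"nodes": [{"type": "KSampler"}, {"type": "VAEDecode"}], "edges": [[0, 1]]}
print(workflow_reward('{"nodes": [{"type": "KSampler"}], "edges": [[0, 1]]}', ref))
```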
7. Magistral
Mistral introduces Magistral, its first reasoning-focused LLM line, alongside a custom RL training stack that enables pure reinforcement learning from scratch. In contrast to prior approaches that rely on distillation from teacher models, Magistral trains directly using online RL with text-only data and custom reward shaping. The work yields two open models: Magistral Medium (based on Mistral Medium 3) and the open-sourced Magistral Small (24B), which is bootstrapped via SFT on Medium’s outputs, followed by RL.
Key insights:
Pure RL can rival distillation: Magistral Medium was trained without any reasoning traces and achieved a roughly 50-point gain in AIME-24 pass@1 over the base model. Across reasoning benchmarks like LiveCodeBench and GPQA, it performs on par with or better than DeepSeek-R1 despite lacking a distillation phase.
Multilingual and multimodal generalization emerge: Reinforcement learning on textual math/code data surprisingly enhances multimodal reasoning and preserves instruction-following and tool-calling capabilities. Multilingual reasoning is achieved via simple reward shaping that enforces keeping the chain-of-thought and the final output in the same language as the prompt.
Efficient async infrastructure: An asynchronous RL setup with NCCL weight broadcasting enables generators to roll out sequences continuously, receiving weight updates without blocking. This helps maintain on-policyness while maximizing GPU utilization.
Reward shaping and CoT format enforcement: Rewards are conditioned on correct formatting (<think> tags, boxed answers, markdown blocks), correctness (verified via SymPy and test cases), length penalties, and language consistency (checked with fastText). Failing the formatting checks results in an immediate zero reward; a minimal sketch of such a format-gated reward appears after this list.
Training heuristics and ablations: Analysis shows that reward and performance scale logarithmically with output length, with longer completions correlating with higher reward. PCA of training checkpoints shows that RL moves the weights within a low-dimensional subspace. Ablations on batch size, advantage normalization, and entropy (via ε_high) highlight stability trade-offs.
Open-source release: Magistral Small (24B) is released under Apache 2.0 and performs competitively under three setups (SFT-only, RL-only, and SFT+RL), with the combined SFT+RL recipe yielding the best results across math and code tasks.
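Below is a hedged sketch of the format-gated reward described in the reward-shaping bullet: formatting failures zero out the reward, correctness is checked symbolically, and soft penalties cover length and language drift. The thresholds, penalty values, and the language_id stub (standing in for a fastText classifier) are illustrative assumptions, not the paper's exact reward.

```python
# Illustrative format-gated reward with symbolic correctness checking and soft penalties.
import re
import sympy

def language_id(text: str) -> str:
    return "en"                                    # placeholder for a fastText language classifier

def format_gated_reward(completion: str, reference_answer: str,
                        prompt_lang: str = "en", max_len: int = 4096) -> float:
    # Formatting gate: require a <think>...</think> block and a boxed final answer.
    if not re.search(r"<think>.*</think>", completion, flags=re.S):
        return 0.0
    boxed = re.search(r"\\boxed\{([^}]*)\}", completion)
    if boxed is None:
        return 0.0                                 # formatting failure -> immediate zero reward
    # Correctness: symbolic comparison of the boxed answer against the reference.
    try:
        correct = sympy.simplify(sympy.sympify(boxed.group(1)) - sympy.sympify(reference_answer)) == 0
    except (sympy.SympifyError, TypeError):
        correct = False
    reward = 1.0 if correct else 0.0
    # Soft penalties: overly long outputs and language drift reduce the reward.
    if len(completion) > max_len:
        reward -= 0.1
    if language_id(completion) != prompt_lang:
        reward -= 0.1
    return max(reward, 0.0)

print(format_gated_reward("<think>2*21</think> The answer is \\boxed{42}.", "42"))   # 1.0
```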
8. Code Researcher
Code Researcher is a deep research agent designed to resolve complex bugs in large systems code by leveraging multi-step reasoning over code semantics, patterns, and commit history. It significantly outperforms existing coding agents like SWE-agent on benchmarks like kBenchSyz, achieving a 58% crash-resolution rate by effectively gathering and filtering global context before generating and validating patches.
9. LlamaRL
LlamaRL is a fully-distributed, asynchronous reinforcement learning framework designed for efficient large-scale LLM training (8B to 405B+ models). It achieves up to 10.7× speedup over DeepSpeed-Chat by combining co-located model offloading, asynchronous off-policy training (AIPO), and fast GPU-native weight sync (DDMA), while maintaining model quality across tasks like math reasoning.
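As a rough illustration of the off-policy correction such an asynchronous setup needs (rollouts come from slightly stale generator weights), here is a sketch of a truncated importance-weighted policy loss; it conveys the general idea rather than AIPO's exact objective.

```python
# Hedged sketch of an importance-weighted off-policy loss for asynchronous RL rollouts.
import torch

def off_policy_loss(logp_current, logp_behavior, advantages, clip=1.0):
    # Ratio between the learner policy and the stale generator policy that produced the
    # rollouts; truncating it bounds the variance introduced by asynchronous policy lag.
    ratio = torch.exp(logp_current - logp_behavior)
    weights = torch.clamp(ratio, max=clip)
    return -(weights * advantages).mean()

# Toy usage with per-token log-probabilities and advantages:
loss = off_policy_loss(torch.randn(8), torch.randn(8), torch.randn(8))
```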
10. Predicting a Cyclone’s Track with AI
Google DeepMind's Weather Lab debuts an interactive platform featuring an AI cyclone forecasting model that generates 50 ensemble scenarios up to 15 days ahead, outperforming traditional systems in track and intensity predictions. It integrates live and historical forecasts with baselines for evaluation and is now used by agencies like the U.S. National Hurricane Center to support early warnings.