🤖 AI Agents Weekly: DeepSWE, Cursor 1.2, Evaluating Multi-Agent Systems, Prover Agent, Top AI Devs News

DeepSWE, Cursor 1.2, Evaluating Multi-Agent Systems, Prover Agent, Top AI Devs News

Jul 05, 2025

In today’s issue:

  • DeepSWE-Preview is a 32B coding agent trained from scratch

  • How to evaluate multi-agent tool calling capabilities

  • Agentic RAG for personalized recommendation

  • Threats in LLM-powered AI agents’ workflows

  • Kyutai has open-sourced its high-performance TTS system and Unmute

  • Survey on the evaluation of LLM-based agents

  • Tencent has open-sourced Hunyuan-A13B

  • Prover Agent is a new AI agent for theorem-proving

  • HCI challenges and opportunities in interactive multi-agentic systems

  • Cursor 1.2 introduces agent to-do lists for better task planning

  • Top AI dev news, research papers, and much more.



Top Stories

DeepSWE

The Agentica and Together AI teams present DeepSWE-Preview, a 32B coding agent trained from scratch with reinforcement learning (RL) on top of Qwen3-32B, with no distillation or supervised fine-tuning. It achieves 42.2% Pass@1 and 71.0% Pass@16 on SWE-Bench-Verified, and 59.0% with test-time scaling, setting a new SOTA for open-weight coding agents.

  • DeepSWE is trained with rLLM, Agentica’s RL post-training system, on 4,500 real-world software engineering tasks from the R2E-Gym benchmark, using 64 H100s over six days. Tasks require navigating real codebases, editing code, and verifying fixes with test suites.

  • The team introduces GRPO++, a stabilized RL algorithm with innovations such as compact filtering, removal of reward normalization, and entropy-free training, which improve learning stability in long-horizon, multi-step agentic tasks (a rough sketch of the advantage computation follows this list).

  • The system uses hybrid test-time scaling (up to 16 rollouts plus a verifier LLM) to boost Pass@1 by ~17 points over the base model. Trajectory-level diversity and verification proved more impactful than simply scaling token length (see the second sketch below).

  • Emergent behaviors arose from pure RL training: agents learned to reason about edge cases, allocate thinking tokens by step complexity, and self-check against regression test suites, all without explicit instruction.

  • The full training stack (code, data, and logs) is open-sourced to support reproducibility and further community-led advances in scaling LLM agents with reinforcement learning.
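To make the “removal of reward normalization” tweak concrete, here is a minimal sketch of group-relative advantage estimation in the GRPO style. This is an illustration under assumptions, not the rLLM implementation: the function name and the optional std term are ours.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, normalize_std: bool = False) -> torch.Tensor:
    """Group-relative advantages for a batch of rollouts.

    rewards: (num_prompts, group_size), one scalar reward per rollout.
    Standard GRPO divides by the group std; the DeepSWE write-up reports
    dropping that normalization, so it is off by default here.
    """
    advantages = rewards - rewards.mean(dim=-1, keepdim=True)
    if normalize_std:
        advantages = advantages / (rewards.std(dim=-1, keepdim=True) + 1e-8)
    return advantages

# Example: 2 tasks x 4 rollouts each, with binary "tests pass" rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```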
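And here is a hypothetical best-of-N sketch of the hybrid test-time scaling step: sample several full agent trajectories per issue, then let a verifier score each candidate patch and keep the best one. The helpers `run_agent` and `verifier_score` are placeholder callables, not APIs from the DeepSWE release.

```python
from typing import Callable, List, Tuple

def best_of_n(issue: str,
              run_agent: Callable[[str], str],
              verifier_score: Callable[[str, str], float],
              n_rollouts: int = 16) -> Tuple[str, float]:
    """Generate n_rollouts candidate patches and return the highest-scoring one."""
    candidates: List[Tuple[str, float]] = []
    for _ in range(n_rollouts):
        patch = run_agent(issue)              # one full agentic rollout -> candidate patch
        score = verifier_score(issue, patch)  # verifier LLM (or test execution) rates the patch
        candidates.append((patch, score))
    return max(candidates, key=lambda c: c[1])
```

In practice the verifier could be an LLM judging the diff, a regression test run, or a mix of both; the point is that selection over diverse trajectories, rather than longer single generations, drives most of the gain.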

Blog
