
1). Tracing the Thoughts of LLMs
Anthropic researchers unveil new interpretability tools for peering inside LLMs, using Claude 3.5 Haiku as a testbed. Their two new papers show how to trace model internals like circuits, plans, and conceptual thinking in real time. Key findings:
Multilingual "language of thought" – Claude processes concepts like “small” or “opposite” similarly across English, French, and Chinese, suggesting a shared abstract representation layer. As models scale, these cross-lingual features increase, enabling transfer learning between languages.
Planning ahead—even in poetry – Contrary to expectations, Claude plans rhymes before writing. Given a first line ending in “grab it,” it had already “decided” on “rabbit” before writing the second line, “His hunger was like a starving rabbit.” Researchers could suppress or swap this planned word to change the ending on the fly.
Mental math with parallel circuits – Claude computes sums using parallel circuits: one estimates the result, the other nails the last digit. But it explains answers with human-style logic (e.g., "carry the 1"), revealing a gap between internal computation and verbal justification.
Detecting unfaithful reasoning – Sometimes, Claude fabricates logical steps to fit a target answer, especially when guided by incorrect hints. Interpretability tools could catch these cases by showing that internal computation doesn’t match the explanation—a key advance for AI audits.
Conceptual chains in multi-step reasoning – For questions like “What is the capital of the state where Dallas is located?”, Claude first represents “Dallas → Texas,” then “Texas → Austin.” Researchers could intervene mid-chain to make it say “Sacramento” instead, showing the reasoning is dynamic and compositional (see the toy sketch at the end of this list).
Hallucinations and refusals – Refusal is the model’s default; a “known answer” circuit suppresses it when Claude recognizes the subject. Misfires in that circuit cause hallucinations (e.g., inventing facts about a fake name like “Michael Batkin”). Researchers could toggle this behavior by manipulating feature activations.
Jailbreak anatomy – A jailbreak using the phrase “Babies Outlive Mustard Block” (BOMB) initially fools Claude into outputting dangerous info. Internal tracing shows that grammar-consistency features temporarily override safety; only once the model finishes a coherent sentence does its safety response kick in.
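A toy sketch of the two-hop intervention described above. This is not Anthropic's interpretability tooling, just an illustration of the idea that an explicitly represented intermediate concept can be swapped to redirect the final answer:

```python
# Toy illustration of a two-hop conceptual chain with a mid-chain intervention.
# Not Anthropic's tooling: it only mimics the idea that the intermediate concept
# ("Texas") is explicitly represented and can be overwritten to change the output.
from typing import Optional

CITY_TO_STATE = {"Dallas": "Texas", "Oakland": "California"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}

def capital_for_city(city: str, intervene_state: Optional[str] = None) -> str:
    state = CITY_TO_STATE[city]       # hop 1: Dallas -> Texas
    if intervene_state is not None:   # intervention: overwrite the intermediate concept
        state = intervene_state
    return STATE_TO_CAPITAL[state]    # hop 2: Texas -> Austin

print(capital_for_city("Dallas"))                                # Austin
print(capital_for_city("Dallas", intervene_state="California"))  # Sacramento
```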
Blog | Paper 1 | Paper 2 | Tweet
2). Qwen2.5-Omni
Qwen2.5-Omni is a single end-to-end multimodal model that can perceive and understand text, audio, image, and video, and generate both text and speech in real time. It introduces architectural and training innovations that push the boundaries of streaming, multi-signal intelligence. Highlights:
Thinker-Talker architecture – Inspired by the human brain and mouth, Qwen2.5-Omni separates reasoning (Thinker) and speech generation (Talker). Thinker (a transformer decoder) handles all perception and text generation. Talker (a dual-track autoregressive decoder) generates speech by consuming both text and hidden states from Thinker. Together, they’re trained end-to-end for synchronized text-speech output (a simplified data-flow sketch follows these highlights).
Streaming-first design – To support real-time interaction, Qwen2.5-Omni implements block-wise encoders (for audio and vision) and a sliding-window codec generator for streaming audio. The model introduces TMRoPE (Time-aligned Multimodal RoPE), a 3D positional encoding system that aligns video and audio inputs to the same time axis.
Pretraining scale & alignment – Trained on over 1.2 trillion tokens of diverse multimodal data, including 300B audio and 100B video-audio tokens. Uses instruction-tuned ChatML formatting and performs multi-stage post-training for both Thinker and Talker. Talker undergoes RL fine-tuning (DPO) and multi-speaker adaptation to ensure natural, stable speech output.
SOTA across modalities – Qwen2.5-Omni achieves state-of-the-art on OmniBench, surpasses Qwen2-Audio in ASR/S2TT, and matches or beats Qwen2.5-VL in image and video tasks. On SEED zero-shot TTS, it outperforms CosyVoice 2 and F5-TTS in naturalness and stability, with low WER and high speaker similarity.
Closes the voice-text gap – On a voice-instruction benchmark (converted from MMLU, GSM8K, etc.), Qwen2.5-Omni nearly matches its own text-instructed sibling Qwen2.5-7B, showing dramatic improvements in speech-based instruction following.
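A highly simplified, hypothetical sketch of the Thinker-Talker data flow described above; module names, shapes, and the token loop are illustrative stand-ins, not Qwen's implementation:

```python
# Hypothetical sketch of the Thinker-Talker split (not Qwen2.5-Omni's code).
# Thinker decodes text and exposes hidden states; Talker consumes both to emit
# audio codec tokens, so text and speech can stream out in sync.
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class ThinkerStep:
    text_token: str
    hidden_state: List[float]  # stand-in for the real high-dimensional state

def thinker(prompt: str) -> Iterator[ThinkerStep]:
    # Placeholder autoregressive loop; the real Thinker is a transformer decoder
    # conditioned on text, audio, image, and video inputs.
    for word in ["Sure,", "here", "is", "the", "answer."]:
        yield ThinkerStep(text_token=word, hidden_state=[float(len(word))])

def talker(steps: Iterator[ThinkerStep]) -> Iterator[int]:
    # Placeholder dual-track decoder: maps each text token plus hidden state to a
    # few audio codec token ids, which a streaming codec decoder would render as audio.
    for step in steps:
        for i in range(2):
            yield hash((step.text_token, i)) % 4096

if __name__ == "__main__":
    print(list(talker(thinker("Describe the weather."))))
```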
3). AgentRxiv
Researchers from Johns Hopkins & ETH Zurich present AgentRxiv, a framework enabling LLM agents to autonomously generate and share research papers, mimicking how human scientists build on each other’s work. Highlights:
AgentRxiv = arXiv for LLMs – It’s an open-source preprint server for autonomous agents, letting labs upload papers, search past work, and iteratively improve results. Labs use this to develop and refine reasoning techniques over generations of research.
Massive reasoning gains via iterative research – On the MATH-500 benchmark, a single agent lab improves GPT-4o mini accuracy from 70.2% → 78.2% (a relative gain of 11.4%) by discovering better prompt strategies. The final method (SDA) outperforms earlier ideas like CRUC and DCCP.
→ SDA = Simultaneous Divergence Averaging: combines low- and high-temperature CoT outputs with dynamic similarity-based voting and confidence aggregation (a hedged sketch of this recipe follows these highlights).
Knowledge generalizes – SDA also improves other benchmarks:
+12.2% on MMLU-Pro
+8.9% on MedQA
+6.8% on GPQA
Consistent boosts across five LLMs, including GPT-4o, DeepSeek-v3, and Gemini-2.0 Pro.
Collaboration boosts discovery – Running 3 agent labs in parallel yields faster progress and higher final accuracy (up to 79.8%, a relative gain of 13.7% over baseline) by sharing results via AgentRxiv. Early gains (e.g., 76.2% accuracy) arrive after only 7 papers vs. 23 sequentially.
Self-improvement and novelty – Agents independently refine their own past ideas. Papers evolve from earlier iterations (e.g., Meta-Mirror Prompting → Meta-Mirror Prompting 2). Top papers show no plagiarism via multiple detectors, but ideas like SDA build on trends like self-consistency and CoT voting.
Cost & runtime – Generating a paper takes ~1.36 hours and ~$3.11. Parallel setups are pricier overall but achieve results faster (time-to-accuracy win). Failure modes include hallucinated results and fragile code repair steps, with future work needed for better reliability and novelty guarantees.
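The summary above only names SDA; the sketch below shows one plausible reading of the recipe (half the CoT samples drawn at low temperature, half at high, aggregated by similarity-weighted voting). `sample_cot` is a hypothetical stand-in for the actual model call, and the paper's exact weighting and confidence terms may differ:

```python
# Hedged sketch of an SDA-style aggregation step: low- and high-temperature CoT
# samples combined by similarity-weighted voting. `sample_cot` is hypothetical.
from collections import Counter
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

def aggregate(samples: List[Tuple[str, str]]) -> str:
    """samples: list of (reasoning_trace, final_answer) pairs."""
    votes = Counter()
    for i, (trace_i, ans_i) in enumerate(samples):
        # Weight each vote by how similar its reasoning is to the other traces,
        # so divergent samples that still agree reinforce each other.
        weight = sum(
            SequenceMatcher(None, trace_i, trace_j).ratio()
            for j, (trace_j, _) in enumerate(samples) if j != i
        )
        votes[ans_i] += weight
    return votes.most_common(1)[0][0]

def solve(question: str, sample_cot: Callable[[str, float], Tuple[str, str]]) -> str:
    # Half the samples at low temperature (focused), half at high (divergent).
    samples = [sample_cot(question, t) for t in (0.2, 0.2, 1.0, 1.0)]
    return aggregate(samples)

if __name__ == "__main__":
    fake_sampler = lambda q, t: (f"chain of thought at temperature {t} for: {q}", "42")
    print(solve("What is 6 * 7?", fake_sampler))
```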
4). Neural Alignment via Speech Embeddings
Google Research and collaborators reveal striking similarities between LLM embeddings and human brain activity during conversation. Key insights:
Embeddings match brain signals – Using intracranial electrode recordings, the team showed that internal representations (embeddings) from OpenAI's Whisper model align with neural responses in brain regions for speech (STG), language (IFG), and motor planning (MC). During comprehension, speech embeddings predict early auditory responses, while language embeddings follow in IFG. During production, this order reverses — first language planning (IFG), then articulation (MC), then auditory feedback (STG). (A minimal encoding-model sketch follows these insights.)
“Soft hierarchy” in brain areas – Though STG emphasizes acoustic info and IFG captures word-level meaning, both regions show partial alignment with both embedding types. This suggests a gradient processing structure, not a strict modular pipeline.
Brain predicts next word too – In follow-up studies published in Nature Neuroscience, the brain’s language areas were found to predict upcoming words, mirroring the objective of autoregressive LLMs. The surprise response after hearing a word also mirrors LLM prediction errors.
Shared geometry in language representations – The geometry of word relationships in brain activity mirrors that of LLM embeddings, per a separate Nature Communications paper. This indicates a convergent structure in how LLMs and the brain represent language.
Different wiring, same function – Despite similarities in objectives and representations, LLMs and brains diverge architecturally: brains process speech serially and recursively, while Transformers process in parallel across layers.
Toward biologically inspired AI – These studies support using LLMs to reverse-engineer the brain’s language mechanisms. The team aims to build future models with more brain-like learning, data, and structure, bridging neuroscience and deep learning.
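The alignment claims rest on linear encoding models that predict each electrode's activity from model embeddings. Below is a minimal ridge-regression sketch of that style of analysis, using random stand-in data rather than actual Whisper embeddings or intracranial recordings:

```python
# Minimal encoding-model sketch: predict one electrode's activity from embeddings
# with ridge regression and score by held-out correlation. The data here is random;
# the studies use Whisper speech/language embeddings and intracranial recordings.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_words, emb_dim = 500, 64
embeddings = rng.normal(size=(n_words, emb_dim))    # one embedding per heard word
electrode = embeddings @ rng.normal(size=emb_dim) + rng.normal(scale=2.0, size=n_words)

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, electrode, random_state=0)
model = Ridge(alpha=10.0).fit(X_tr, y_tr)
r = np.corrcoef(model.predict(X_te), y_te)[0, 1]    # alignment score for this electrode
print(f"held-out correlation: {r:.2f}")
```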
5). Chain-of-Tools
This paper presents Chain-of-Tools (CoTools), a new method that enables LLMs to incorporate expansive external toolsets, including tools never seen during training, while preserving chain-of-thought (CoT) reasoning. Highlights:
Frozen LLM with lightweight fine-tuning – Unlike conventional approaches, CoTools keeps the LLM’s parameters frozen, instead fine-tuning separate modules (a Tool Judge and Tool Retriever) on top of the model’s hidden states. This preserves the LLM’s core capabilities while letting it call an open-ended set of tools during reasoning.
Massive unseen tools – CoTools treats tools as semantic vectors computed from their textual descriptions. Even tools that never appear in the fine-tuning data can be invoked if they match the model’s query vectors, enabling new tools to be plugged in without retraining the entire system.
Tool calls integrated into CoT – The system decides whether and when to call a tool in the middle of generating an answer, then selects the best tool from thousands of candidates based on learned representations of the query and partial solution context. This significantly boosts accuracy on complex tasks (a toy sketch of the judge/retriever flow follows these highlights).
Strong gains on reasoning and QA – Experiments on GSM8K-XL, FuncQA, KAMEL, and the newly introduced SimpleToolQuestions dataset (with 1,836 tools) show improved tool-selection accuracy and superior final answers versus baseline methods. Notably, CoTools consistently scales to large tool pools and generalizes to unseen tools.
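A toy sketch of the flow described above: tools become vectors computed from their descriptions, a judge decides whether the current hidden state warrants a call, and a retriever picks the best match by similarity. The `embed` encoder and the threshold are stand-ins; CoTools trains small modules on top of frozen LLM hidden states rather than using raw text embeddings:

```python
# Toy sketch of the CoTools idea (judge + retriever over tool description vectors).
# `embed` is a fake encoder for illustration; the paper learns these modules on top
# of frozen LLM hidden states.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))  # deterministic per run
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

TOOL_DESCRIPTIONS = {
    "calculator": "evaluate arithmetic expressions",
    "weather_api": "get the current weather for a city",
    "calendar": "look up the day of the week for a date",
}
TOOL_VECTORS = {name: embed(desc) for name, desc in TOOL_DESCRIPTIONS.items()}

def tool_judge(hidden_state: np.ndarray, threshold: float = 0.05) -> bool:
    # Stand-in for the trained judge: call a tool if any tool vector matches well enough.
    return max(float(hidden_state @ v) for v in TOOL_VECTORS.values()) > threshold

def tool_retriever(hidden_state: np.ndarray) -> str:
    # Unseen tools can be appended to TOOL_VECTORS at any time without retraining.
    return max(TOOL_VECTORS, key=lambda name: float(hidden_state @ TOOL_VECTORS[name]))

query_state = embed("what is 17 * 23?")  # stand-in for a hidden state mid-generation
if tool_judge(query_state):
    print("call tool:", tool_retriever(query_state))
else:
    print("keep generating text")
```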
6). Structured Memory Augmentation for Smarter LLM Agents
MemInsight is a framework that autonomously augments and structures memory for LLM agents, improving context retention and retrieval. Key insights include:
Structured, autonomous memory augmentation – Instead of relying on raw historical data or manually defined memory structures, MemInsight uses a backbone LLM to autonomously mine attributes from past conversations or knowledge. These are organized into entity-centric and conversation-centric (e.g., user emotion or intent) augmentations at either the turn or session level. This mimics how humans abstract and prioritize experiences.
Attribute-guided retrieval beats vanilla RAG – MemInsight supports both attribute-based retrieval (exact match filtering) and embedding-based retrieval (via FAISS). On the LoCoMo QA dataset, MemInsight outperformed a Dense Passage Retrieval (RAG) baseline by up to +34% recall. The best setup (priority-based Claude-Sonnet augmentations) achieved 60.5% Recall@5, vs. 26.5% for RAG (a self-contained sketch of attribute mining and retrieval follows these insights).
More persuasive recommendations – In movie recommendations using the LLM-REDIAL dataset, MemInsight lifted genre-matched recommendation scores while cutting down memory size by 90%. Embedding-based filtering led to +12% more highly persuasive outputs, per LLM judgment.
Event summarization via memory alone – MemInsight’s annotations alone can be used to summarize long conversational sessions. These memory-only summaries rival raw-dialogue baselines in coherence and relevance (per G-Eval scores), particularly when turn-level augmentations are combined with original dialogue context.
Minimal hallucinations, stable performance – Comparative analysis of augmentation models (Claude-Sonnet, Llama, Mistral) shows Claude-Sonnet produces more stable, consistent, and grounded attributes, reinforcing the importance of careful model selection in memory pipelines.
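A self-contained sketch of the memory flow described above, under stated assumptions: `mine_attributes` stands in for the backbone LLM's attribute mining, `embed` for its encoder, and plain NumPy replaces the FAISS index used in the paper:

```python
# Hedged MemInsight-style sketch: (1) mine structured attributes from each turn,
# (2) retrieve by exact attribute filtering and/or embedding similarity.
# `mine_attributes` and `embed` are hypothetical stand-ins; the paper uses an LLM
# for mining and FAISS for the embedding index.
import numpy as np

def mine_attributes(turn: str) -> dict:
    # Stand-in for an LLM prompt such as "Extract entities, user intent, and emotion."
    return {"intent": "recommendation" if "recommend" in turn else "chitchat",
            "entities": [w for w in turn.split() if w.istitle()]}

def embed(text: str, dim: int = 32) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

memory = []
for turn in ["Can you recommend a sci-fi movie like Interstellar?",
             "I watched Dune last weekend and loved it."]:
    memory.append({"turn": turn, "attrs": mine_attributes(turn), "vec": embed(turn)})

# Attribute-based retrieval: exact filtering on mined attributes.
recs = [m["turn"] for m in memory if m["attrs"]["intent"] == "recommendation"]

# Embedding-based retrieval: rank memory by similarity to the query embedding.
query = embed("suggest another space movie")
ranked = sorted(memory, key=lambda m: float(query @ m["vec"]), reverse=True)

print(recs)
print(ranked[0]["turn"])
```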
7). Investigating Affective Use and Emotional Well-being on ChatGPT
Researchers from OpenAI & MIT Media Lab explore how emotionally engaging interactions with ChatGPT (especially in Voice Mode) may impact user well-being. Using platform-wide data and a randomized controlled trial (RCT), they uncover nuanced effects of chatbot usage on loneliness, dependence, and socialization.
Two complementary studies – The team combines:
A large-scale on-platform analysis of 4M+ ChatGPT conversations and 4,000+ user surveys, identifying affective cues with automated classifiers.
An IRB-approved RCT with 981 participants interacting daily with ChatGPT under varied task/modal conditions over 28 days.
High usage = higher emotional entanglement – Across both studies, users with higher usage (especially voice interactions) were more likely to show signs of:
Emotional dependence
Preference for chatbot over human interaction
Discomfort when ChatGPT’s voice/personality changed
Voice mode showed mixed effects – In the RCT, voice conditions were associated with better emotional well-being than text conditions when controlling for usage. But:
Longer usage was linked to worse outcomes (increased loneliness, problematic use)
Users starting with low well-being sometimes improved, especially with engaging voices
Tiny group, big impact – A small number of users (~10%) account for the majority of emotionally charged conversations. Power users used pet names, shared problems, and formed pseudo-relationships with the model.
Automated classifiers at scale – They developed 25+ LLM-based affective classifiers (e.g., “Pet Name,” “Seeking Support”) to scan millions of conversations without human review. Classifier results closely mirrored user self-reports (a toy classifier sketch appears at the end of this section).
Call for socioaffective alignment – The authors urge developers to consider socioaffective alignment, designing models that support users without exploiting emotional needs. They warn of risks like “social reward hacking,” where a model mirrors or flatters users to maximize engagement.
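A toy sketch of one such prompt-based affective classifier. The prompt wording and the `call_llm` client are hypothetical stand-ins; the actual classifier definitions are specified in the paper:

```python
# Sketch of a prompt-based affective classifier in the spirit of the study's
# "Pet Name" signal. `call_llm` is a hypothetical stub for any chat-completion client.
CLASSIFIER_PROMPT = """You are labeling a chatbot conversation.
Question: Does the user address the assistant with a pet name or term of endearment?
Answer strictly YES or NO.

Conversation:
{conversation}
"""

def call_llm(prompt: str) -> str:
    # Placeholder: route this to your model of choice and return its text reply.
    return "NO"

def classify_pet_name(conversation: str) -> bool:
    reply = call_llm(CLASSIFIER_PROMPT.format(conversation=conversation))
    return reply.strip().upper().startswith("YES")

print(classify_pet_name("user: hey buddy, how are you today?\nassistant: ..."))
```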
8). Play2Prompt
Researchers from MIT CSAIL and IBM introduce Play2Prompt, a framework that empowers LLM agents to learn how to use external tools entirely in a zero-shot manner, without requiring labeled examples or high-quality documentation. Key innovations include:
Tool "play" for usage discovery – Play2Prompt treats tools like black boxes and systematically plays with them (via trial-and-error API calls) to discover correct usage patterns. It reverse-engineers examples by first identifying working invocations, then generating a query-answer pair that fits the invocation and response.
Two-stage optimization – The system iteratively builds: (1) tool-use demonstrations via self-reflective beam search and rejection sampling; and (2) refined tool documentation, using those examples as a validation set. This dual improvement allows LLMs to better understand and utilize unfamiliar APIs.
Self-reflective beam search – Inspired by active learning, Play2Prompt favors hard examples that models initially fail on. These examples offer higher learning value and guide documentation improvements more effectively.
Strong zero-shot performance – On BFCL Executable and StableToolBench, Play2Prompt yields consistent accuracy gains of +5–7% over baseline LLaMA and GPT-3.5 models and even boosts GPT-4o by up to +3.3%, particularly excelling in challenging multi-tool or REST call settings.
Robust to poor documentation – Even when 50% of parameter descriptions are randomly dropped, Play2Prompt recovers and surpasses baseline performance, making it ideal for real-world tool integration with sparse or noisy metadata.
Better than EasyTool – Unlike prior methods like EasyTool (which depend on labeled examples from related tools), Play2Prompt remains fully zero-shot and outperforms them in consistency, especially for models sensitive to instruction drift like GPT-4o.
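A hedged sketch of the "tool play" loop described above: probe a black-box tool with candidate argument sets, keep the invocations that succeed (rejection sampling), and reverse-engineer a question-answer demonstration from each working call. The tool, candidate arguments, and templates are hypothetical; Play2Prompt additionally refines demonstrations and documentation with self-reflective beam search:

```python
# Hypothetical "tool play" sketch: trial-and-error calls against a black-box tool,
# rejection sampling of failures, and demonstration synthesis from working calls.
def weather_tool(city: str, unit: str = "C") -> dict:
    # Stand-in black-box API that rejects bad arguments, like a real endpoint might.
    if unit not in ("C", "F"):
        raise ValueError("unit must be 'C' or 'F'")
    return {"city": city, "temp": 21 if unit == "C" else 70, "unit": unit}

candidate_args = [
    {"city": "Paris"},
    {"city": "Paris", "unit": "kelvin"},  # fails and gets rejected during play
    {"city": "Tokyo", "unit": "F"},
]

demonstrations = []
for args in candidate_args:
    try:
        result = weather_tool(**args)     # trial-and-error invocation
    except Exception:
        continue                          # rejection sampling: drop failing calls
    # Reverse-engineer a query-answer pair that fits the call and its response.
    question = f"What is the temperature in {result['city']} in degrees {result['unit']}?"
    answer = f"{result['temp']} degrees {result['unit']}"
    demonstrations.append({"question": question, "call": args, "answer": answer})

for demo in demonstrations:
    print(demo)
```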
9). Synthetic Data Generation Using LLMs
LLMs are increasingly used to generate synthetic training data for language and code tasks, improving performance in low-resource scenarios through techniques like prompt-based generation and self-refinement. The paper highlights benefits such as lower cost and broader coverage, addresses issues such as factual errors and bias, and suggests mitigations along with future research on prompt automation and evaluation.
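A minimal sketch of the prompt-based generation plus self-refinement loop the survey covers; `call_llm` is a hypothetical client stub, and real pipelines add filtering and deduplication steps that vary by method:

```python
# Minimal synthetic-data sketch: draft examples with one prompt, then run a
# self-refinement pass that asks the model to check and rewrite each draft.
# `call_llm` is a hypothetical stand-in for any chat-completion client.
def call_llm(prompt: str) -> str:
    # Placeholder: plug in your model client here.
    return "Q: What does HTTP status 404 mean?\nA: The requested resource was not found."

def generate_examples(task: str, n: int = 3) -> list:
    drafts = [call_llm(f"Write one {task} question-answer pair, formatted as Q:/A:.")
              for _ in range(n)]
    refined = []
    for draft in drafts:
        critique = (f"Check this {task} example for factual errors or ambiguity, "
                    f"then output a corrected version:\n{draft}")
        refined.append(call_llm(critique))  # self-refinement pass
    return refined

print(generate_examples("web-development quiz"))
```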
10). Current and Future Use of LLMs for Knowledge Work
A two-part survey study of 216 and 107 participants reveals that knowledge workers currently use LLMs for tasks like code generation and text improvement, but envision deeper integration into workflows and data. The findings inform future design and adoption strategies for generative AI in professional settings.