1). LLMs Surpass Human Experts in Predicting Neuroscience Results - proposes BrainBench, a benchmark for testing how well LLMs predict experimental outcomes in neuroscience; tunes an LLM on neuroscience literature to produce BrainGPT, which surpasses human experts at predicting neuroscience results; reports that when LLMs indicated high confidence in their predictions, their responses were more likely to be correct. (paper | tweet)
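The benchmark's forced-choice setup can be sketched as a perplexity comparison between two candidate abstracts, with the perplexity gap serving as a confidence signal; a minimal, model-free illustration (toy per-token log-probabilities stand in for a real LLM's scores):

```python
import math

def sequence_perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities (natural log)."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def choose_result(logprobs_original, logprobs_altered):
    """Forced choice: pick the abstract version the model finds less
    surprising; the perplexity gap acts as a confidence score."""
    ppl_orig = sequence_perplexity(logprobs_original)
    ppl_alt = sequence_perplexity(logprobs_altered)
    choice = "original" if ppl_orig < ppl_alt else "altered"
    confidence = abs(ppl_orig - ppl_alt)
    return choice, confidence
```

In practice the log-probabilities would come from scoring both abstract versions with the LLM; the correlation between the confidence gap and accuracy is what the paper highlights.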
2). Fugatto - a new generative AI sound model (presented by NVIDIA) that can create and transform any combination of music, voices, and sounds using text and audio inputs; the 2.5B-parameter model is capable of novel audio generation, such as making trumpets bark or saxophones meow. (paper | tweet)
3). o1 Replication Journey - Part 2 - shows that combining simple distillation from o1's API with supervised fine-tuning significantly boosts performance on complex math reasoning tasks; a base model fine-tuned on just tens of thousands of o1-distilled long-thought chains outperforms o1-preview on the American Invitational Mathematics Examination (AIME). (paper | tweet)
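A minimal sketch of how distilled long-thought chains might be packed into SFT examples; the `<thought>` delimiters and field names are illustrative assumptions, not the paper's exact format:

```python
def format_distilled_sample(problem, long_thought, final_answer):
    """Pack an o1-style long reasoning trace into a single SFT target:
    the model is trained to emit the full chain before the answer."""
    target = f"<thought>\n{long_thought}\n</thought>\n\nAnswer: {final_answer}"
    return {"prompt": problem, "completion": target}

# Example: one distilled math sample becomes one fine-tuning record.
sample = format_distilled_sample(
    "What is 2 + 2?",
    "Adding the two integers: 2 plus 2 equals 4.",
    "4",
)
```

A dataset of such records would then be fed to a standard supervised fine-tuning loop; the paper's point is that surprisingly few of them suffice.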
Editor Message
We’re excited to announce our new free course on LLM Evaluation. In this course, you will learn to test and evaluate your LLM applications using the latest tools and techniques, including LLM-as-a-judge metrics and production LLM monitoring. This course was done in collaboration with Comet and DAIR.AI.
4). LLM-Brained GUI Agents - presents a survey of LLM-brained GUI Agents, including techniques and applications. (paper | tweet)
5). High-Level Automated Reasoning - extends in-context learning through high-level automated reasoning; achieves state-of-the-art accuracy (79.6%) on the MATH benchmark with Qwen2.5-7B-Instruct, surpassing GPT-4o (76.6%) and Claude 3.5 (71.1%); rather than manually creating high-quality demonstrations, it shifts the focus to abstract thinking patterns; it introduces five atomic reasoning actions to construct chain-structured patterns, then uses Monte Carlo Tree Search to explore reasoning paths and construct thought cards to guide inference. (paper | tweet)
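A toy sketch of the MCTS-over-atomic-actions idea: search over short chains of reasoning actions and keep the most-visited chain as a "thought card". The action names, maximum depth, and stub reward are illustrative assumptions, not the paper's exact design:

```python
import math
import random

# Illustrative names for the five atomic reasoning actions; a stub reward
# function stands in for verifying a reasoning path against the problem.
ACTIONS = ["system_analysis", "one_step_thought", "chain_of_thought",
           "divide_and_conquer", "self_reflection"]
MAX_DEPTH = 3

class Node:
    def __init__(self, chain):
        self.chain = chain       # sequence of actions taken so far
        self.children = {}       # action -> Node
        self.visits = 0
        self.value = 0.0

def uct_select(node, c=1.4):
    """Pick the child maximizing the UCT score."""
    return max(node.children.values(),
               key=lambda ch: ch.value / (ch.visits + 1e-9)
               + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))

def rollout(chain, reward_fn):
    """Complete the chain randomly, then score it."""
    while len(chain) < MAX_DEPTH:
        chain = chain + [random.choice(ACTIONS)]
    return reward_fn(chain)

def mcts_thought_card(reward_fn, iterations=200):
    root = Node([])
    for _ in range(iterations):
        node, path = root, [root]
        # Selection: descend through fully-expanded nodes.
        while len(node.chain) < MAX_DEPTH and len(node.children) == len(ACTIONS):
            node = uct_select(node)
            path.append(node)
        # Expansion: add one untried action.
        if len(node.chain) < MAX_DEPTH:
            untried = [a for a in ACTIONS if a not in node.children]
            child = Node(node.chain + [random.choice(untried)])
            node.children[child.chain[-1]] = child
            node = child
            path.append(node)
        # Simulation and backpropagation.
        reward = rollout(list(node.chain), reward_fn)
        for n in path:
            n.visits += 1
            n.value += reward
    # Thought card: greedily follow the most-visited children.
    chain, node = [], root
    while node.children:
        node = max(node.children.values(), key=lambda ch: ch.visits)
        chain = node.chain
    return chain
```

In the paper the reward would come from validating the reasoning path, and the resulting thought cards are matched to new problems at inference time.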
6). Star Attention: Efficient LLM Inference over Long Sequences - introduces Star Attention, a two-phase attention mechanism that processes long sequences by combining blockwise-local attention for context encoding with sequence-global attention for query processing and token generation; achieves up to 11x faster inference speeds while maintaining 95-100% accuracy compared to traditional attention mechanisms by efficiently distributing computation across multiple hosts; a key innovation is the "anchor block" mechanism, where each context block is prefixed with the first block, enabling effective approximation of global attention patterns while reducing computational overhead. (paper | tweet)
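A single-head NumPy sketch of the phase-1 context encoding with the anchor-block trick: every block after the first attends to itself plus the anchor (first) block. No causal mask, multi-host distribution, or second query phase; shapes and block sizes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def blockwise_context_attention(q, k, v, block, anchor_len):
    """Blockwise-local attention where each non-first block is prefixed
    with the anchor block's keys/values, approximating global attention.
    Inputs are (seq, dim) arrays for a single head."""
    seq, dim = q.shape
    out = np.zeros_like(v)
    anchor_k, anchor_v = k[:anchor_len], v[:anchor_len]
    for start in range(0, seq, block):
        end = min(start + block, seq)
        if start == 0:
            keys, vals = k[start:end], v[start:end]
        else:
            keys = np.concatenate([anchor_k, k[start:end]])
            vals = np.concatenate([anchor_v, v[start:end]])
        scores = q[start:end] @ keys.T / np.sqrt(dim)
        out[start:end] = softmax(scores) @ vals
    return out
```

Because each block only attends to `anchor_len + block` positions instead of the full sequence, the cost per block is constant, which is where the reported speedups over dense attention come from.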
7). Survey on LLM-as-a-Judge - provides a comprehensive survey of LLM-as-a-Judge, including a deeper discussion on how to build reliable LLM-as-a-Judge systems. (paper | tweet)
8). TÜLU 3 - releases a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. (paper | tweet)
9). Generative Agent Simulations of 1,000 People - introduces a new agent architecture that uses LLMs to create behavioral simulations of real individuals, achieving 85% accuracy in replicating human responses on the General Social Survey and reducing demographic biases compared to traditional approaches. (paper | tweet)
10). Measuring Bullshit in Language Games Played by ChatGPT - proposes that LLM-based chatbots play the ‘language game of bullshit’; by asking ChatGPT to generate scientific articles on topics where it has no knowledge or competence, the authors were able to provide a reference set of how this “bullshit” is manifested. (paper | tweet)