1). Thinking LLMs - proposes a training method that equips LLMs with thinking abilities for general instruction following without human-annotated data; uses an iterative search and optimization procedure to explore thought generation, which lets the model learn to think without direct supervision; thought candidates are generated for each user instruction, and a judge model evaluates only the resulting responses to determine the best and worst ones; the corresponding full outputs are then used as chosen and rejected pairs for DPO (referred to as Thought Preference Optimization in the paper); reports superior performance on AlpacaEval and Arena-Hard. (paper | tweet)
2). Model Swarms - proposes a new collaborative search algorithm to adapt LLMs via swarm intelligence; a pool of LLM experts collaboratively moves in the weight space to optimize a utility function representing various adaptation objectives; experiments demonstrate that Model Swarms can flexibly adapt LLM experts to a single task, multi-task domains, reward models, and diverse human interests; improves over 12 model composition baselines by up to 21.0% across tasks and contexts. (paper | tweet)
3). First-Person Fairness in Chatbots - studies first-person fairness, meaning fairness towards the users interacting with ChatGPT; specifically, it measures biases, if any, associated with users’ names; it leverages a model powered by GPT-4o to analyze patterns and name sensitivity in the chatbot’s responses for different user names; claims that, overall, post-training significantly mitigates harmful stereotypes; also reports that open-ended tasks in domains like entertainment and art show the highest level of bias (i.e., a tendency to write stories with protagonists whose gender matches the gender inferred from the user’s name). (paper | tweet)
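To make the pair-construction step in the Thinking LLMs entry (item 1) concrete, here is a minimal sketch of how chosen and rejected examples could be assembled. It is an illustration under stated assumptions, not the paper's implementation: the `Candidate` class, the `full_output` format with `<thought>` tags, and the keyword-based `toy_judge` are all hypothetical stand-ins. The one property taken from the summary is that the judge scores only the response, while the full output (thought plus response) goes into the preference pair.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    thought: str   # internal reasoning; never shown to the judge
    response: str  # user-facing answer; the only part that is scored


def full_output(c: Candidate) -> str:
    # Hypothetical serialization of thought + response into one training string.
    return f"<thought>{c.thought}</thought>\n{c.response}"


def build_preference_pair(
    instruction: str,
    candidates: List[Candidate],
    judge: Callable[[str, str], float],
) -> Tuple[str, str]:
    """Return (chosen, rejected) full outputs for one instruction."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    # Score only the response part of each candidate.
    scored = [(judge(instruction, c.response), c) for c in candidates]
    best = max(scored, key=lambda s: s[0])[1]
    worst = min(scored, key=lambda s: s[0])[1]
    # The pair keeps the thoughts, so preference training can shape them indirectly.
    return full_output(best), full_output(worst)


def toy_judge(instruction: str, response: str) -> float:
    # Stand-in scorer (assumption): rewards responses that mention "Paris".
    return 1.0 if "Paris" in response else 0.0


candidates = [
    Candidate("Maybe it is Lyon?", "The capital is Lyon."),
    Candidate("France's capital is Paris.", "The capital is Paris."),
]
chosen, rejected = build_preference_pair(
    "What is the capital of France?", candidates, toy_judge
)
print(chosen)
print(rejected)
```

In a real pipeline the judge would be a model rather than a keyword check, and the resulting pairs would be passed to a DPO trainer; both of those pieces are outside this sketch.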
Editor Message
We’ve launched the first issue of the AI Agents Weekly to help AI researchers, developers, and enthusiasts keep track of all the top developments in AI Agents. Upgrade to paid to get access!
4). Introspection in LLMs - reports that LLMs can acquire knowledge through introspection that cannot be inferred from their training data; suggests that LLMs contain privileged information about themselves that can potentially lead to more interpretable and controllable systems; also reports that this introspection ability is limited and that models struggle to predict their behavior on tasks requiring reasoning over long outputs. (paper | tweet)
5). Janus - proposes a unified autoregressive framework for multimodal understanding and generation; it decouples visual encoding into independent pathways and leverages a single transformer architecture to improve flexibility and performance on both visual understanding and generation; claims to alleviate the trade-offs between these two vision tasks, which are common in methods that rely on a single visual encoder; surpasses previous unified models and matches or exceeds the performance of task-specific models. (paper | tweet)
6). Inference Scaling for Long-Context RAG - uses two strategies to investigate scaling laws for RAG: in-context learning (DRAG) and iterative prompting (IterDRAG); finds that RAG performance consistently improves with the expansion of the effective context length under optimal configurations; when optimally allocated, increasing inference computation can lead to nearly linear gains in long-context RAG performance; this leads to the development of a computation allocation model that can provide practical guidance for optimal computation allocation in long-context RAG scenarios. (paper | tweet)
7). Agent S - a new open agentic framework that enables autonomous interaction with computers through a GUI; Agent S tackles challenges such as acquiring knowledge, planning over long task horizons, and handling dynamic interfaces; it introduces experience-augmented hierarchical planning, which leverages both search and retrieval; leverages an agent-computer interface to perform reasoning and control GUI agents; evaluation on the OSWorld benchmark shows that Agent S outperforms the baseline by 9.37% in success rate (an 83.6% relative improvement) and achieves a new state-of-the-art. (paper | tweet)
8). Model Kinship for Merging LLMs - proposes model kinship to measure the degree of similarity between LLMs; model kinship is used to build a model merging strategy (Top-k Greedy Merging with Model Kinship) which yields better performance; the authors find that this new criterion can be used to effectively and continuously perform model merging. (paper | tweet)
9). On the Planning Abilities of OpenAI’s o1 Models - reports that o1-preview is particularly strong in self-evaluation and constraint-following; also mentions that these o1 models demonstrate bottlenecks in decision-making and memory management, which are more pronounced in spatial reasoning; in particular, the models produce redundant actions and struggle to generalize in spatially complex tasks. (paper | tweet)
10). CoTracker3 - proposes a new point tracking model and a new semi-supervised training recipe; enables usage of real videos without annotations during training by generating pseudo-labels using off-the-shelf teachers; the approach is simpler in architecture and training scheme, leading to better results while using 1000x less data. (paper | tweet)
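The two Agent S figures in item 7 (a 9.37% gain and an 83.6% relative improvement) are easy to misread, so here is a short worked calculation. It assumes the 9.37% is an absolute gain in percentage points of success rate; under that assumption the two numbers together imply the baseline and Agent S success rates. The variable names are illustrative, and the outputs are implied values, not figures quoted from the summary.

```python
# Figures as stated in the Agent S entry.
absolute_gain = 9.37    # assumed to be percentage points of success rate
relative_gain = 0.836   # 83.6% relative improvement over the baseline

# relative_gain = absolute_gain / baseline, so solve for the baseline.
baseline = absolute_gain / relative_gain
agent_s = baseline + absolute_gain

print(f"implied baseline success rate: {baseline:.2f}%")
print(f"implied Agent S success rate:  {agent_s:.2f}%")
```

Under this reading the baseline sits at roughly 11.2% and Agent S at roughly 20.6%, which shows how a large relative improvement can coexist with a modest absolute success rate.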