1). DBRX - a new 132B parameter open LLM that outperforms established open-source models on common benchmarks like MMLU and GSM8K; DBRX was pretrained on 12T tokens of text and code and uses a mixture-of-experts (MoE) architecture (a minimal routing sketch appears after this list); its inference is up to 2x faster than LLaMA2-70B, and it is about 40% of the size of Grok-1 in terms of both total and active parameter counts; there is also DBRX Instruct, which demonstrates strong performance in programming and mathematics; while DBRX is trained as a general-purpose LLM, it still surpasses CodeLLaMA-70B Instruct, a model built explicitly for code generation. (paper | tweet)
2). Grok-1.5 - xAI’s latest long-context LLM with improved understanding, reasoning, and problem-solving capabilities; Grok-1.5 achieved a 50.6% score on the MATH benchmark and a 90% score on the GSM8K benchmark; the model can process long contexts of up to 128K tokens and demonstrates powerful retrieval capabilities. (paper | tweet)
3). SEEDS - a generative AI model, based on diffusion models, that shows powerful capabilities for quantifying uncertainty in weather forecasting; it can generate a large ensemble of forecasts conditioned on as few as one or two forecasts from an operational numerical weather prediction system (a schematic sampling loop appears after this list). (paper | tweet)
4). LLMs for University-Level Coding Course - finds that the latest LLMs have not surpassed human proficiency in physics coding assignments; also finds that GPT-4 significantly outperforms GPT-3.5 and that prompt engineering can further enhance performance. (paper | tweet)
5). Mini-Gemini - a simple framework for enhancing multi-modality vision language models; specifically, visual tokens are enhanced through an additional visual encoder for high-resolution refinement without increasing the token count (sketched after this list); achieves top performance on several zero-shot benchmarks and even surpasses private models. (paper | tweet)
6). Long-form factuality in LLMs - investigates long-form factuality in open domains by generating a prompt set of questions spanning 38 topics; also proposes an LLM-based agent to perform automated evaluation for the task (a simplified version is sketched after this list); finds that such LLM agents can achieve superhuman rating performance while being more than 20 times cheaper than human annotators. (paper | tweet)
7). Agent Lumos - a unified framework for training open-source LLM-based agents; it consists of a modular architecture with a planning module that learns to generate subgoals and a grounding module trained to translate them into actions with tool usage (see the sketch after this list). (paper | tweet)
8). AIOS - an LLM agent operating system that integrates LLMs into operating systems as a brain; it optimizes resource allocation, handles context switching, enables concurrent execution of agents, provides tool services, and maintains access control for agents (a toy scheduler appears after this list). (paper | tweet)
9). FollowIR - a dataset containing an instruction-evaluation benchmark and a separate training set for teaching information retrieval models to follow real-world instructions; a FollowIR-7B model shows significant improvements (over 13%) after fine-tuning on the training set. (paper | tweet)
10). LLM2LLM - an iterative data augmentation strategy that leverages a teacher LLM to enhance a small seed dataset by generating additional examples that can be used to effectively fine-tune models; it significantly enhances the performance of LLMs in the low-data regime, outperforming both traditional fine-tuning and other data augmentation baselines (the core loop is sketched after this list). (paper | tweet)
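For readers new to MoE (item 1), here is a minimal sketch of top-k expert routing, the general mechanism behind architectures like DBRX's; the hidden sizes, expert count, and top-k value are illustrative placeholders, not DBRX's actual configuration.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Minimal top-k mixture-of-experts feed-forward layer (illustrative sizes)."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # token -> expert logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        weights, idx = torch.topk(self.router(x).softmax(dim=-1), self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e  # tokens routed to expert e at slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```

Only `top_k` of the experts run for each token, which is why a 132B-total-parameter model can have a much smaller active parameter count per forward pass.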
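The SEEDS item (3) describes drawing a large ensemble from a diffusion model conditioned on a couple of seed forecasts. The sketch below is a generic DDPM-style ancestral sampler, not SEEDS's actual model; `denoiser(x, t, cond)` is an assumed placeholder for a trained noise-prediction network, and the linear noise schedule is a textbook default.

```python
import torch

def sample_ensemble(denoiser, seed_forecasts, n_members=32, n_steps=100):
    """Schematic DDPM-style ancestral sampling conditioned on seed forecasts."""
    cond = torch.stack(seed_forecasts)               # one or two conditioning forecasts
    betas = torch.linspace(1e-4, 0.02, n_steps)      # standard linear schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    members = []
    for _ in range(n_members):
        x = torch.randn_like(cond[0])                # each member starts from fresh noise
        for t in reversed(range(n_steps)):
            eps = denoiser(x, t, cond)               # predicted noise given the condition
            mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
            x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
        members.append(x)
    return torch.stack(members)                      # (n_members, *field_shape)
```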
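Mini-Gemini's refinement (item 5) can be pictured as cross-attention in which the existing low-resolution visual tokens act as queries over features from an additional high-resolution encoder, so the token count never grows. A hedged sketch with illustrative dimensions and a hypothetical module name:

```python
import torch.nn as nn

class HighResRefiner(nn.Module):
    """Sketch: refine low-res visual tokens with a high-res encoder's features
    via cross-attention, keeping the token count unchanged (dims illustrative)."""
    def __init__(self, d_model=1024, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, lr_tokens, hr_feats):
        # lr_tokens: (B, N, D) queries; hr_feats: (B, M, D) keys/values, M >> N
        refined, _ = self.attn(lr_tokens, hr_feats, hr_feats)
        return lr_tokens + refined  # same N tokens, enriched with high-res detail
```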
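The factuality evaluation agent (item 6) can be approximated as: decompose a response into atomic facts, verify each against search results, and aggregate. A simplified sketch; `llm` and `search` are assumed placeholder callables, and the prompts are invented for illustration:

```python
def rate_long_form_response(response, llm, search):
    """Sketch of an LLM-agent factuality rater: split a response into atomic
    facts, check each against search evidence, report the supported fraction."""
    facts = llm(f"List each atomic factual claim in:\n{response}").splitlines()
    facts = [f.strip() for f in facts if f.strip()]
    supported = 0
    for fact in facts:
        query = llm(f"Write a search query to verify: {fact}")
        evidence = search(query)
        verdict = llm(
            f"Claim: {fact}\nEvidence: {evidence}\n"
            "Answer 'supported' or 'unsupported'."
        )
        supported += verdict.strip().lower().startswith("supported")
    return {"facts": len(facts), "supported": supported}
```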
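Lumos's two-module design (item 7), as a sketch: a planning call produces subgoals, and a grounding call maps each subgoal to a tool invocation. All callables and the `tool_name | argument` convention are hypothetical, not the paper's interface:

```python
def run_modular_agent(task, plan_llm, ground_llm, tools):
    """Sketch of a Lumos-style agent: plan subgoals, then ground each
    into an executable tool call (all callables are placeholders)."""
    subgoals = plan_llm(f"Decompose into subgoals:\n{task}").splitlines()
    results = []
    for goal in (g.strip() for g in subgoals if g.strip()):
        action = ground_llm(
            f"Subgoal: {goal}\nAvailable tools: {list(tools)}\n"
            "Respond as: tool_name | argument"
        )
        name, _, arg = action.partition("|")
        results.append(tools[name.strip()](arg.strip()))  # execute the chosen tool
    return results
```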
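AIOS's scheduling idea (item 8), reduced to a toy: interleave queued LLM calls across agents round-robin so that no single agent monopolizes the model. Purely illustrative, and not AIOS's actual API:

```python
from collections import deque

class AgentScheduler:
    """Toy scheduler in the spirit of AIOS: round-robin LLM access across agents."""
    def __init__(self, llm):
        self.llm = llm
        self.queues = {}  # agent_id -> deque of pending prompts

    def submit(self, agent_id, prompt):
        self.queues.setdefault(agent_id, deque()).append(prompt)

    def run(self):
        responses = {a: [] for a in self.queues}
        while any(self.queues.values()):
            for agent_id, q in self.queues.items():  # one call per agent per pass
                if q:
                    responses[agent_id].append(self.llm(q.popleft()))
        return responses
```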
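Finally, the LLM2LLM loop (item 10), sketched under assumed `student`/`teacher` interfaces: fine-tune on the current data, find seed examples the student still misses, and have the teacher synthesize similar ones:

```python
def llm2llm_augment(seed_data, student, teacher, n_rounds=3):
    """Sketch of an LLM2LLM-style loop. `student.finetune`/`student.predict`
    and `teacher(prompt)` are placeholder interfaces, not the paper's code."""
    data = list(seed_data)  # (question, answer) pairs
    for _ in range(n_rounds):
        student.finetune(data)
        # augment only from failures on the original seed set, not on synthetic data
        hard = [(q, a) for q, a in seed_data if student.predict(q) != a]
        for q, _ in hard:
            new_q = teacher(f"Write a new question testing the same skill as: {q}")
            data.append((new_q, teacher(f"Answer concisely: {new_q}")))
    return data
```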