1). Genie - a foundation model trained from internet videos and with the ability to generate a variety of action-controllable 2D worlds given an image prompt; Genie has 11B parameters and consists of a spatiotemporal video tokenizer, an autoregressive dynamic model, and a scalable latent action model; the latent action space enables training agents to imitate behaviors from unseen video which is promising for building more generalist agents. (paper | tweet)
2). Mistral Large - a new LLM with strong multilingual, reasoning, maths, and code generation capabilities; features include: 1) 32K tokens context window, 2) native multilingual capacities, 3) strong abilities in reasoning, knowledge, maths, and coding benchmarks, and 4) function calling and JSON format natively supported. (paper | tweet)
3). The Era of 1-bit LLMs - introduces a high-performing and cost-effective 1-bit LLM variant called BitNet b1.58 where every parameter is a ternary {-1, 0, 1}; given the same model size and training tokens, BitNet b1.58 can match the perplexity and task performance of a full precision Transformer LLM (i.e., FP16); the benefits of this 1-bit LLM are significantly better latency, memory, throughout, and energy consumption. (paper | tweet)
4). Dataset for LLMs - a comprehensive overview (180+ pages) and analysis of LLM datasets. (paper | tweet)
5). LearnAct - explores open-action learning for language agents through an iterative learning strategy that creates and improves actions using Python functions; on each iteration, the proposed framework (LearnAct) expands the action space and enhances action effectiveness by revising and updating available actions based on execution feedback; the LearnAct framework was tested on Robotic planning and AlfWorld environments; it improves agent performance by 32% in AlfWorld compared to ReAct+Reflexion. (paper | tweet)
6). EMO - a new framework for generating expressive video by utilizing a direct audio-to-video synthesis approach; by leveraging an Audio2Video diffusion model it bypasses the need for intermediate 3D models or facial landmarks; EMO can produce convincing speaking videos and singing videos in various styles while outperforming existing methods in terms of expressiveness and realism. (paper | tweet)
7). On the Societal Impact of Open Foundation Models - a position paper with a focus on open foundation models and their impact, benefits, and risks; proposes a risk assessment framework for analyzing risk and explains why the marginal risk of open foundation models is low in some cases; it also offers a more grounded assessment of the societal impact of open foundation models. (paper | tweet)
8). StarCoder 2 - a family of open LLMs for code with three different sizes (3B, 7B, and 15B); the 15B model was trained on 14 trillion tokens and 600+ programming languages with a context window of 16K token and employing a fill-in-the-middle objective; it matches 33B+ models on many evaluation like code completion, code reasoning, and math reasoning aided through PAL. (paper | tweet)
9). LLMs on Tabular Data - an overview of LLMs for tabular data tasks including key techniques, metrics, datasets, models, and optimization approaches; it covers limitations and unexplored ideas with insights for future research directions. (paper | tweet)
10). PlanGPT - shows how to leverage LLMs and combine multiple approaches like retrieval augmentation, fine-tuning, tool usage, and more; the proposed framework is applied to urban and spatial planning but there are a lot of insights and practical tips that apply to other domains. (paper | tweet)