1). Movie Gen - a set of foundation models to generate high-quality, 1080p HD videos, including different aspect ratios and synchronized audio; the 30B parameter model supports a context length of 73K video tokens, which enables generation of 16-second videos at 16fps; it also presents a 13B parameter video-to-audio generation model and a novel video editing model that’s attained via post-training; achieves state-of-the-art performance on tasks such as text-to-video synthesis, video personalization, video-to-audio generation and more. (paper | tweet)
2). Were RNNs All We Needed? - revisits RNNs and shows that by removing the hidden states from input, forget, and update gates RNNs can be efficiently trained in parallel; this is possible because with this change architectures like LSTMs and GRUs no longer require backpropagate through time (BPTT); they introduce minLSTMs and minGRUs that are 175x faster for a 512 sequence length. (paper | tweet)
3). LLMs Know More Than They Show - finds that the "truthfulness" information in LLMs is concentrated in specific tokens; this insight can help enhance error detection performance and further mitigate some of these issues; they also claim that internal representations can be used to predict the types of errors the LLMs are likely to make. (paper | tweet)
4). Architecture Search Framework for Inference-Time Techniques - introduces a modular framework for building and optimizing LLMs by combining multiple inference-time techniques; this approach reframes the challenge of LLM system design as a hyperparameter optimization problem; tested on benchmarks including MT-Bench and CodeContests, Archon surpasses leading models such as GPT-4o and Claude 3.5 Sonnet, achieving a 15.1% average accuracy improvement. (paper | tweet)
5). RATIONALYST - a model for process-supervision of reasoning that enables generalization across diverse reasoning tasks; this process is achieved with pre-training on a collection of 79k rationales from the Pile and a combination of reasoning datasets with minimal human intervention; fine-tuned from LLaMa-3-8B, the proposed model improves the accuracy of reasoning by an average of 3.9% on 7 reasoning benchmarks. (paper)
6). An Analysis of o1-preview - reports that large reasoning models like o1-preview, while improving on more difficult tasks, display similar qualitative trends as previous LLMs; o1 is sensitive to the probability of examples and tasks, performing better and requiring fewer “thinking tokens” in high-probability settings than in low-probability ones. (paper | tweet)
7). FRAMES - a unified framework to evaluate an LLM’s ability to provide factual responses, assess retrieval capabilities, and the reasoning required to generate final responses; includes multi-hop questions that require the integration of information from multiple sources; reports that state-of-the-art LLMs struggle on the task and only achieve 40% accuracy with no retrieval; the proposed multi-step retrieval approach improves performance to 66% accuracy. (paper | tweet)
8). Not All LLM Reasoners Are Created Equal - investigates in depth the grade-school math problem-solving capabilities of LLMs; reports that LLMs show a significant gap in reasoning; finds that LLMs display a huge performance difference when solving compositional pairs and solving questions independently. (paper | tweet)
9). Evaluation of o1 - provides a comprehensive evaluation of OpenAI's o1-preview LLM; shows strong performance across many tasks such as competitive programming, generating coherent and accurate radiology reports, high school-level mathematical reasoning tasks, chip design tasks, anthropology and geology, quantitative investing, social media analysis, and many other domains and problems. (paper | tweet)
10). Designing Priors for Better Few-Shot Image Synthesis - training generative models like GAN with limited data is difficult; current Implicit Maximum Likelihood Estimation approaches (IMLE) have an inadequate correspondence between latent code selected for training and those selected during inference; the proposed approach, RS-IMLE, changes the prior distribution for training which improves test-time performance and leads to higher quality image generation. (paper | tweet)