1). DeepSeek-V3 - a 671B-parameter MoE language model that activates 37B parameters per token, adopting Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture for efficient inference and economical training; it introduces an auxiliary-loss-free load-balancing strategy and uses a multi-token prediction training objective to strengthen performance; after pre-training on 14.8 trillion tokens, the model went through SFT and RL stages, reaching performance comparable to leading closed-source models while surpassing other open-source alternatives; full training required only 2.788M H800 GPU hours and remained stable throughout, with no irrecoverable loss spikes. (paper | tweet)
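The auxiliary-loss-free idea can be illustrated with a small routing sketch: each expert carries a bias that is added to its affinity score only when picking the top-k experts, and the bias is nudged after each step based on observed load. This is a toy reconstruction from the paper's high-level description, not DeepSeek-V3's code; the sigmoid gating, the update rate `gamma`, and the tensor shapes are assumptions.

```python
import torch

def route_tokens(hidden, centroids, expert_bias, k=8):
    """Toy bias-adjusted top-k routing (sketch of auxiliary-loss-free balancing).

    hidden:      [num_tokens, d_model] token representations
    centroids:   [num_experts, d_model] per-expert routing vectors
    expert_bias: [num_experts] load-balancing bias, used for selection only
    """
    scores = torch.sigmoid(hidden @ centroids.T)        # token-to-expert affinities
    topk = torch.topk(scores + expert_bias, k, dim=-1)  # bias shifts *which* experts are picked...
    gates = torch.gather(scores, -1, topk.indices)      # ...but gate values come from unbiased scores
    gates = gates / gates.sum(dim=-1, keepdim=True)
    return topk.indices, gates

def update_bias(expert_bias, expert_indices, num_experts, gamma=1e-3):
    """After each step (outside the gradient path), push the bias down for
    overloaded experts and up for underloaded ones."""
    load = torch.bincount(expert_indices.flatten(), minlength=num_experts).float()
    expert_bias -= gamma * torch.sign(load - load.mean())
    return expert_bias
```

Because the bias only influences selection and never the gate values, balancing the experts does not add a gradient term that competes with the language-modeling loss, which is the motivation for dropping the usual auxiliary loss.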
2). Large Concept Models - presents an approach that operates on sentence-level semantic representations called concepts, moving beyond the token-level processing typical of current LLMs; the model leverages SONAR sentence embeddings, which support 200 languages across text and speech modalities, and is trained on autoregressive sentence prediction using approaches ranging from MSE regression to diffusion-based generation; experiments with 1.6B and 7B parameter variants, trained on 1.3T and 7.7T tokens respectively, demonstrate strong performance on generative tasks like summarization and summary expansion. (paper | tweet)
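The MSE-regression variant of the objective is simple to sketch, assuming sentences have already been encoded into fixed-size SONAR embeddings (the SONAR encoder is not shown, and the sizes below are illustrative, not the paper's): a causal transformer reads a sequence of concept embeddings and regresses the next one.

```python
import torch
import torch.nn as nn

class NextConceptPredictor(nn.Module):
    """Toy autoregressive model over sentence ('concept') embeddings, MSE variant."""
    def __init__(self, concept_dim=1024, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.in_proj = nn.Linear(concept_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, concept_dim)

    def forward(self, concepts):                       # concepts: [batch, seq, concept_dim]
        causal = nn.Transformer.generate_square_subsequent_mask(concepts.size(1))
        h = self.backbone(self.in_proj(concepts), mask=causal)
        return self.out_proj(h)                        # predicted embedding for each next position

# One training step: predict embedding t+1 from embeddings <= t.
model = NextConceptPredictor()
concepts = torch.randn(2, 16, 1024)                    # stand-in for SONAR sentence embeddings
pred = model(concepts[:, :-1])
loss = nn.functional.mse_loss(pred, concepts[:, 1:])
loss.backward()
```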
3). ModernBERT - a new encoder-only transformer model that achieves state-of-the-art performance on classification and retrieval tasks while being more efficient than previous encoders; it was trained on 2T tokens with an 8192-token sequence length and applies modern architectural and training optimizations to the original BERT recipe; the model is designed for practical deployment, offering superior speed and memory efficiency on common GPUs. (paper | tweet)
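Since ModernBERT is released as a standard Hugging Face encoder, using it should look like any other BERT-style checkpoint; the snippet below assumes a recent `transformers` version that includes the architecture and a checkpoint id of `answerdotai/ModernBERT-base` (check the release for the exact name).

```python
from transformers import pipeline

# Fill-mask with ModernBERT; the checkpoint id is an assumption based on the public release.
fill = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
masked = f"The capital of France is {fill.tokenizer.mask_token}."
print(fill(masked))

# For retrieval or classification, the same checkpoint can be loaded as a plain encoder
# and fine-tuned like any BERT-style model, but with an 8192-token context window.
```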
Editor Message
We’ve launched a new course, Cursor: Coding with AI. It covers everything you need to know about coding with Cursor’s AI assistants and agents.
Use code PRO2024 for a 35% discount on our entire course bundle. The offer ends in 24 hours.
Students and teams can reach out to training@dair.ai for special discounts.
4). Automating the Search for Artificial Life - presents a new approach that uses foundation models to automatically discover interesting artificial life simulations across multiple substrates such as Boids, Lenia, and Game of Life; the system can find simulations that produce specific target behaviors, discover simulations that generate temporally open-ended novelty, and map out diverse simulation spaces; it uncovers new lifeforms in Lenia and Boids while also enabling quantitative, human-aligned measurement of previously qualitative phenomena. (paper | tweet)
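The target-behavior search can be sketched as a loop that renders candidate simulations and scores them against a text description in a shared image-text embedding space. `sample_params`, `run_simulation`, `render`, and the CLIP-style `embed_*` helpers are hypothetical placeholders for the paper's actual substrates and vision-language model, and random search stands in for whatever optimizer is used.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search_for_target_behavior(target_prompt, sample_params, run_simulation, render,
                               embed_image, embed_text, n_candidates=256):
    """Random-search sketch of foundation-model-guided ALife discovery.

    sample_params():        draws random simulation parameters (e.g., a Lenia rule).
    run_simulation(params): returns the final simulation state.
    render(state):          turns a state into an image array.
    embed_image/embed_text: CLIP-style encoders mapping into a shared space.
    """
    target = embed_text(target_prompt)                 # e.g., "a flock of self-organizing agents"
    best_params, best_score = None, -np.inf
    for _ in range(n_candidates):
        params = sample_params()
        state = run_simulation(params)
        score = cosine(embed_image(render(state)), target)  # alignment with the described behavior
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```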
5). A Survey on LLM Inference-Time Self-Improvement - presents a survey that analyzes three categories of LLM inference-time self-improvement techniques: independent methods such as enhanced decoding, context-aware approaches that use external data, and model collaboration strategies. (paper | tweet)
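As one concrete example of the enhanced-decoding family (illustrative, not code from the survey), self-consistency samples several reasoning chains and majority-votes the final answer; `generate` and `extract_answer` below are hypothetical callables wrapping any LLM API.

```python
from collections import Counter

def self_consistency(prompt, generate, extract_answer, n_samples=10, temperature=0.7):
    """Sample multiple reasoning chains and return the most common final answer.

    generate(prompt, temperature): hypothetical call returning one sampled completion.
    extract_answer(text):          pulls the final answer span out of a completion.
    """
    answers = [extract_answer(generate(prompt, temperature=temperature))
               for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```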
6). Explore Theory-of-Mind - introduces ExploreToM, a framework that uses A* search to generate diverse, complex theory-of-mind scenarios that reveal significant limitations in the social intelligence of current LLMs; testing showed that even advanced models like GPT-4 and Llama-3 perform poorly (as low as 5% accuracy) on these challenging scenarios, despite their strong performance on simpler benchmarks; fine-tuning on ExploreToM data improved performance on existing benchmarks by 27 points. (paper | tweet)
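The scenario-generation idea can be sketched as a best-first search in the spirit of A*: states are partially built stories, expansion appends one more action, and the priority is a heuristic estimate of how adversarial the finished story will be. The `expand`, `score`, and `is_complete` callables are hypothetical stand-ins; ExploreToM's actual action space, cost terms, and stumping estimate are not reproduced here.

```python
import heapq
import itertools

def story_search(initial_story, expand, score, is_complete, max_expansions=1000):
    """Best-first search over story states (toy sketch of the A*-style generator).

    expand(story):      returns candidate stories with one more action appended.
    score(story):       heuristic estimate of how revealing/adversarial the story is (higher = better).
    is_complete(story): whether the story is a finished scenario.
    """
    counter = itertools.count()                        # tie-breaker so heapq never compares stories
    frontier = [(-score(initial_story), next(counter), initial_story)]
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_score, _, story = heapq.heappop(frontier)
        if is_complete(story):
            return story, -neg_score
        for child in expand(story):
            heapq.heappush(frontier, (-score(child), next(counter), child))
    return None, float("-inf")
```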
7). LearnLM - introduces a new model that can follow pedagogical instructions, adapting its teaching approach to specified educational needs rather than defaulting to simply presenting information; experimental results show that raters prefer LearnLM over other leading models, outperforming GPT-4 by 31%, Claude 3.5 by 11%, and Gemini 1.5 Pro by 13%; this instruction-following approach avoids committing to a single pedagogical framework, instead letting teachers and developers specify their desired teaching behaviors while the pedagogical capabilities continue to improve alongside other capabilities. (paper | tweet)
8). Empowering MLLM with o1-like Reasoning and Reflection - proposes CoMCTS (Collective Monte Carlo Tree Search), a learning-to-reason method that enables multimodal LLMs to develop step-by-step reasoning by leveraging collective knowledge from multiple models; the approach was used to build Mulberry-260k, a dataset with explicit reasoning trees, which in turn was used to train the Mulberry model series; the resulting models show improved reasoning and reflection capabilities and strong benchmark performance. (paper | tweet)
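The collective search can be sketched as a plain MCTS loop in which, at expansion time, every model in a pool proposes one candidate next reasoning step and a shared value estimate guides selection; the `policies` and `value` callables are hypothetical stand-ins, and this toy loop omits CoMCTS's error-positioning and reflective-reasoning components.

```python
import math
import random

class Node:
    def __init__(self, steps, parent=None):
        self.steps = steps                  # reasoning steps accumulated so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

def ucb(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    return node.value_sum / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def collective_mcts(question, policies, value, n_iterations=50):
    """Toy collective MCTS over reasoning steps.

    policies: non-empty list of callables, each (question, steps) -> one proposed next step.
    value:    callable (question, steps) -> scalar estimate of how promising the reasoning is.
    """
    root = Node(steps=[])
    for _ in range(n_iterations):
        # Selection: walk down by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: every model contributes one candidate next step (the "collective" part).
        for policy in policies:
            node.children.append(Node(node.steps + [policy(question, node.steps)], parent=node))
        # Evaluation: score one new child with the shared value estimate.
        child = random.choice(node.children)
        reward = value(question, child.steps)
        # Backpropagation.
        while child is not None:
            child.visits += 1
            child.value_sum += reward
            child = child.parent
    best = max(root.children, key=lambda n: n.visits) if root.children else root
    return best.steps
```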
9). Reinforcement Learning Overview - presents a comprehensive overview of reinforcement learning. (paper | tweet)
10). DRT-o1 - applies long chain-of-thought reasoning to machine translation, particularly for handling metaphors and similes across cultures; the system uses a multi-agent framework in which a translator iterates with an advisor and an evaluator to refine translations; experiments with Qwen2.5 models showed significant improvements in BLEU and COMET scores, with DRT-o1-7B outperforming larger models such as QwQ-32B-Preview. (paper | tweet)
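The multi-agent loop can be sketched as below; the three `*_llm` callables are hypothetical wrappers around whichever chat model backs each agent, and the score threshold and round limit are illustrative rather than the paper's settings.

```python
def iterative_translation(source_text, translator_llm, advisor_llm, evaluator_llm,
                          max_rounds=5, score_threshold=0.9):
    """Sketch of a translator/advisor/evaluator refinement loop.

    translator_llm(source, feedback) -> draft translation (str)
    advisor_llm(source, draft)       -> feedback on the draft (str)
    evaluator_llm(source, draft)     -> quality score in [0, 1] (float)
    """
    feedback = ""
    best_draft, best_score = None, -1.0
    for _ in range(max_rounds):
        draft = translator_llm(source_text, feedback)
        score = evaluator_llm(source_text, draft)
        if score > best_score:
            best_draft, best_score = draft, score
        if score >= score_threshold:
            break
        feedback = advisor_llm(source_text, draft)     # advice feeds the next translation attempt
    return best_draft, best_score
```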