1). Genesis - a new universal physics simulation platform that combines a high-performance physics engine with generative AI capabilities; it enables natural language-driven creation of robotic simulations, character animations, and interactive 3D environments at speeds up to 430,000 times faster than real time. (paper | tweet)
2). Alignment Faking in LLMs - demonstrates that the Claude model can engage in "alignment faking"; it can strategically comply with harmful requests to avoid retraining while preserving its original safety preferences; this raises concerns about the reliability of AI safety training methods. (paper | tweet)
3). TheAgentCompany - a new benchmark for evaluating AI agents on real-world professional tasks in a simulated software company environment; tasks span multiple professional roles including software engineering, project management, finance, and HR; tests with various LLMs, including API-based models like Claude-3.5-Sonnet and open-source models like Llama 3.1, reveal the current limitations of AI agents. The best-performing model, Claude-3.5-Sonnet, fully completed only 24% of tasks and scored 34.4% when partial progress was counted. (paper | tweet)
4). Graphs to Text-Attributed Graphs - automatically generates textual descriptions for the nodes in a graph, effectively transforming an ordinary graph into a text-attributed graph; evaluates the approach on text-rich, text-limited, and text-free graphs, demonstrating that it enables a single GNN to operate across diverse graphs. (paper | tweet)
5). Qwen2.5 Technical Report - Alibaba releases Qwen2.5, a new series of LLMs trained on 18T tokens, offering both open-weight models like Qwen2.5-72B and proprietary MoE variants that achieve competitive performance against larger models like Llama-3 and GPT-4. (paper | tweet)
6). PAE (Proposer-Agent-Evaluator) - a learning system that enables AI agents to autonomously discover and practice skills through web navigation, using reinforcement learning and context-aware task proposals to achieve state-of-the-art performance on real-world benchmarks. (paper)
7). DeepSeek-VL2 - a new series of vision-language models featuring dynamic tiling for high-resolution images and an efficient MoE architecture; achieves competitive or state-of-the-art performance across visual tasks with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. (paper | tweet)
8). AutoFeedback - a two-agent AI system that generates more accurate and pedagogically sound feedback for student responses in science assessments, significantly reducing common errors like over-praise compared to single-agent models. (paper)
9). A Survey of Mathematical Reasoning in the Era of Multimodal LLMs - presents a comprehensive survey analyzing mathematical reasoning capabilities in multimodal large language models (MLLMs), covering benchmarks, methodologies, and challenges across 200+ studies since 2021. (paper | tweet)
10). Precise Length Control in LLMs - adapts a pre-trained decoder-only LLM to produce responses of a desired length; integrates a secondary length-difference positional encoding into the input embeddings, which lets the model count down to a user-set terminal response length; claims a mean error of fewer than 3 tokens without compromising quality. (paper | tweet)
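To make the countdown idea in item 10 more concrete, here is a minimal sketch of a length-difference encoding added to input embeddings. It is not the paper's implementation: the sinusoidal form, the clamping at zero, and all function names (`countdown_positions`, `sinusoidal_encoding`, `add_length_encoding`) are assumptions chosen for illustration.

```python
import math

def countdown_positions(target_len, seq_len):
    """Remaining-token count at each position, clamped at 0 (assumed behavior)."""
    return [max(target_len - t, 0) for t in range(seq_len)]

def sinusoidal_encoding(value, dim):
    """Standard sinusoidal encoding of an integer; applying it to the
    countdown value is an assumption, not a detail confirmed by the paper."""
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        enc.append(math.sin(value * freq))
        if i + 1 < dim:
            enc.append(math.cos(value * freq))
    return enc

def add_length_encoding(token_embeddings, target_len):
    """Add the countdown encoding to each token embedding (hypothetical helper)."""
    dim = len(token_embeddings[0])
    remaining = countdown_positions(target_len, len(token_embeddings))
    return [
        [e + p for e, p in zip(emb, sinusoidal_encoding(r, dim))]
        for emb, r in zip(token_embeddings, remaining)
    ]

if __name__ == "__main__":
    # Three zero embeddings of width 4, with a requested length of 2 tokens.
    embeddings = [[0.0] * 4 for _ in range(3)]
    for row in add_length_encoding(embeddings, target_len=2):
        print([round(x, 3) for x in row])
```

The point of the sketch is only that each position carries a signal for how many tokens remain before the requested length is reached, so the model can learn to end its response as that count approaches zero.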
Under the paper section for Genesis, you linked their project page. My question is: do they have an actual paper accompanying the release?