⚡AI Agents Weekly: o3, Gemini 2.0 Flash Thinking, Multi AI Agents in Production, Generative Physics Engine
o3, Gemini 2.0 Flash Thinking, Multi AI Agents in Production, Generative Physics Engine
In today’s issue:
OpenAI announced o3
Google announces Gemini 2.0 Flash Thinking
A generative and universal physics engine for robotics and AI applications
CrewAI Report on Multi AI Agents in Production
A new benchmark for evaluating AI agents on real-world professional tasks
LLM alignment faking, AI agents for feedback and discovery, and more.
Top Stories
OpenAI Announces o3
OpenAI announces a new frontier model called o3. Here's a summary of the main takeaways from the announcement:
OpenAI announces two new models o3 and o3-mini.
These models are not publicly available yet. Sam Altman, CEO of OpenAI, mentioned that they plan to launch o3-mini around the end of January and the full o3 model shortly after that. A safety program application is available for those who want to test the models for safety.
o3 is significantly better at coding. At the most aggressive high test-time compute setting, o3 achieves an Elo score of 2727 on the Codeforces.
o3 achieves 96.7% accuracy on AIME 2024, and 87.7% on GPQA Diamond. Claims state-of-the-art performances on both these benchmarks.
o3 gets over 25% on what OpenAI researchers consider one of the toughest benchmarks out there (EpochAI Frontier Math). The previous SoTA was 2.0%.
o3 achieves SoTA on ARC-AGI. When o3 is asked to think longer with high-compute, it scores 87.5%.
o3-mini will support reasoning with adaptive thinking time with three different modes (low, medium, and high). o3-mini performance on Codeforces with significant cost-efficiency compared to o1.
o3-mini (low) achieves comparable latency to gpt-4o on math benchmarks.
o3-mini will come with all the API features as o1.
Keep reading with a 7-day free trial
Subscribe to NLP Newsletter to keep reading this post and get 7 days of free access to the full post archives.