⚡AI Agents Weekly: o3, Gemini 2.0 Flash Thinking, Multi AI Agents in Production, Generative Physics Engine
o3, Gemini 2.0 Flash Thinking, Multi AI Agents in Production, Generative Physics Engine
In today’s issue:
- OpenAI announced o3 
- Google announces Gemini 2.0 Flash Thinking 
- A generative and universal physics engine for robotics and AI applications 
- CrewAI Report on Multi AI Agents in Production 
- A new benchmark for evaluating AI agents on real-world professional tasks 
- LLM alignment faking, AI agents for feedback and discovery, and more. 
Top Stories
OpenAI Announces o3
OpenAI announces a new frontier model called o3. Here's a summary of the main takeaways from the announcement:
- OpenAI announces two new models o3 and o3-mini. 
- These models are not publicly available yet. Sam Altman, CEO of OpenAI, mentioned that they plan to launch o3-mini around the end of January and the full o3 model shortly after that. A safety program application is available for those who want to test the models for safety. 
- o3 is significantly better at coding. At the most aggressive high test-time compute setting, o3 achieves an Elo score of 2727 on the Codeforces. 
- o3 achieves 96.7% accuracy on AIME 2024, and 87.7% on GPQA Diamond. Claims state-of-the-art performances on both these benchmarks. 
- o3 gets over 25% on what OpenAI researchers consider one of the toughest benchmarks out there (EpochAI Frontier Math). The previous SoTA was 2.0%. 
- o3 achieves SoTA on ARC-AGI. When o3 is asked to think longer with high-compute, it scores 87.5%. 
- o3-mini will support reasoning with adaptive thinking time with three different modes (low, medium, and high). o3-mini performance on Codeforces with significant cost-efficiency compared to o1. 
- o3-mini (low) achieves comparable latency to gpt-4o on math benchmarks. 
- o3-mini will come with all the API features as o1. 
Keep reading with a 7-day free trial
Subscribe to AI Newsletter to keep reading this post and get 7 days of free access to the full post archives.


