NLP Newsletter


🤖AI Agents Weekly: CodeScientist, Nova Act, Awesome MCP Servers, AWS MCP Servers

CodeScientist, Nova Act, Awesome MCP Servers, AWS MCP Servers

Apr 05, 2025

In today’s issue:

  • AI2 releases CodeScientist

  • Letta AI releases open format for stateful agents

  • OpenHands LM is a new 32B open-source, local coding agent

  • DAIR.AI shared a new guide on getting started with MCP

  • AWS releases AWS MCP Servers

  • Nova Act: Amazon’s browser-native agent model

  • Hugging Face releases YourBench

  • Evaluating AI Agents on replicating AI research

  • Awesome MCP Servers is a curated list of MCP servers

  • AI dev news and much more



Top Stories


CodeScientist

A flow diagram of how CodeScientist generates ideas, executes experiments, and reports results.

Researchers at AI2 release CodeScientist, a system that autonomously generates and tests scientific hypotheses via code-based experimentation. It’s among the first to produce validated discoveries with minimal human input. Key ideas:

  • Code-first scientific agent – CodeScientist reviews research papers and assembles experiments using vetted Python code blocks (e.g., for analysis, simulation). It follows a five-step pipeline: Ideation → Planning → Code Execution → Reporting → Meta-Analysis.

  • Validated AI discoveries – From 50 AI research papers on agents and virtual environments, CodeScientist proposed 19 findings. Of these, 6 were judged scientifically sound and novel. Examples:

    • Confidence ≠ Accuracy – LLM self-assessed confidence in simulations often mismatched actual accuracy.

    • Simpler state = better prediction – Using binary rather than free-text state representations improved model reliability.

    • Graph memory helps – Agents with graph-structured memory outperformed baselines in a scientific simulation game.

  • Human-guided autonomy – Full automation is possible, but brief human feedback (e.g., ranking ideas) significantly boosts output quality. Human-in-the-loop interaction improves idea selection and experiment debugging.

  • Challenges remain – Despite successes, over half the generated experiments fail due to code errors, not scientific flaws. Peer review is still needed to verify results, and current systems lack deep methodological rigor.
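The five-step pipeline above can be sketched as a simple sequence of stages. This is a minimal, hypothetical illustration of that flow; all function and class names are invented for clarity and are not AI2's actual API.

```python
# Hypothetical sketch of a CodeScientist-style pipeline. The stage names
# mirror the five steps described above (Ideation -> Planning -> Code
# Execution -> Reporting -> Meta-Analysis); everything else is illustrative.
from dataclasses import dataclass, field


@dataclass
class Experiment:
    idea: str
    plan: str = ""
    results: dict = field(default_factory=dict)
    report: str = ""


def ideate(papers):
    # Stage 1: propose candidate hypotheses from a corpus of papers.
    return [Experiment(idea=f"hypothesis derived from {p}") for p in papers]


def plan(exp):
    # Stage 2: assemble an experiment plan from vetted code blocks.
    exp.plan = f"plan for: {exp.idea}"
    return exp


def execute(exp):
    # Stage 3: run the generated experiment code. In practice this is the
    # fragile step: the article notes over half of runs fail on code errors.
    exp.results = {"status": "ok", "metric": 0.0}
    return exp


def report(exp):
    # Stage 4: summarize results for human or automated review.
    exp.report = f"{exp.idea}: {exp.results['status']}"
    return exp


def meta_analyze(experiments):
    # Stage 5: aggregate findings across all successfully executed runs.
    return [e.report for e in experiments if e.results.get("status") == "ok"]


papers = ["paper-1", "paper-2"]
finished = [report(execute(plan(e))) for e in ideate(papers)]
print(meta_analyze(finished))
```

A human-in-the-loop variant, as the article suggests, would insert a ranking step between `ideate` and `plan` so a reviewer can prioritize or discard hypotheses before any code runs.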

Paper | Blog | GitHub

© 2025 elvis