Discussion about this post

Doggy FP

Lot of RL this week

Christopher

I really enjoyed the SFR-DeepResearch and LiveMCP-101 papers. My only gripe is that neither released working code, which would have made evaluation easier.

This is especially frustrating with LiveMCP-101. imo a benchmark is only useful if it can be run on an ongoing basis to test new inputs. This one is timely and would have been very useful if they'd released the code. Really odd that they didn't, again imo.
