I really enjoyed the SFR-DeepResearch and LiveMCP-101 papers. My only gripe is that neither released working code to make evaluation easier.
This is especially frustrating with LiveMCP-101. Imo a benchmark is only useful if it can be used on an ongoing basis to test new inputs, and this benchmark is timely; it would have been very useful if they'd released the code. Really odd that they didn't, again imo.
ParaThinker aims to solve one of the biggest challenges I personally run into with LLMs: the model digging itself into a reasoning rabbit hole because it keeps conditioning on its own reasoning tokens, or on prior conversational output in multi-turn settings, sitting in the context.
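To make the contrast concrete, here's a minimal sketch of the general idea of parallel reasoning paths (this is my illustration, not ParaThinker's actual mechanism, and `generate` is a hypothetical placeholder you'd swap for your own model client): rather than one long chain where every step conditions on the tokens the model has already committed to, you sample several independent paths from the clean prompt and aggregate their final answers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def generate(prompt: str, seed: int) -> str:
    """Return a final answer from one independent reasoning path.

    Placeholder: in a real setup this would call a model with
    temperature > 0 so different seeds explore different trajectories.
    """
    raise NotImplementedError("plug in your model client here")


def parallel_answer(prompt: str, n_paths: int = 8) -> str:
    """Sample n_paths reasoning paths from the *original* prompt.

    No path ever sees another path's (or its own earlier) reasoning
    tokens, which is what sidesteps the rabbit-hole effect; the final
    answers are then combined by simple majority vote.
    """
    with ThreadPoolExecutor(max_workers=n_paths) as pool:
        answers = list(pool.map(lambda s: generate(prompt, seed=s), range(n_paths)))
    return Counter(answers).most_common(1)[0][0]
```

The key design point is that each path starts from the untouched prompt, so a bad early reasoning step only poisons one path instead of the whole generation.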
Lots of RL this week.