AgentSearchBench

A Benchmark for AI Agent Search in the Wild

University College London, AI Centre
*Equal Contribution

Introduction

We introduce AgentSearchBench, a large-scale benchmark for agent search built from nearly 10,000 real-world agents sourced from public platforms, including the GPT Store, Google Cloud Marketplace, and AgentAI Platform. By drawing from real ecosystems, the benchmark captures practical challenges such as capability overlap and inconsistent documentation. It formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, where agent relevance is assessed through execution-grounded performance signals rather than textual similarity.
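
To make the retrieval framing concrete, the following is a minimal sketch of how a ranked list of agents could be scored against execution-grounded relevance. It assumes graded relevance scores per agent and uses NDCG@k as the ranking metric; the metric choice, the function names, and the agent IDs are illustrative assumptions, not the benchmark's official evaluation code.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k gains (log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_agent_ids: List[str], relevance: Dict[str, float], k: int) -> float:
    """NDCG@k for one task.

    ranked_agent_ids: agents in the order a search system returned them.
    relevance: hypothetical execution-grounded score per agent (0.0 if absent).
    """
    gains = [relevance.get(agent_id, 0.0) for agent_id in ranked_agent_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    if ideal == 0.0:
        return 0.0  # no relevant agent exists for this task
    return dcg_at_k(gains, k) / ideal


# Hypothetical example: agent "a" solved the task when executed, agent "b" did not.
relevance = {"a": 1.0, "b": 0.0}
perfect = ndcg_at_k(["a", "b"], relevance, k=2)
swapped = ndcg_at_k(["b", "a"], relevance, k=2)
```

Under these assumptions, a system that ranks a textually similar but non-performing agent first is penalized, which is the distinction between execution-grounded relevance and textual similarity.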

Tasks are generated by first creating concrete, executable queries from agent documentation, then grouping and abstracting these into broader high-level descriptions. We ground relevance in real performance by executing candidate agents on each task and evaluating their outputs with an LLM judge. Multiple quality controls ensure benchmark reliability, including task filtering and judge-to-human alignment validation.
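
The labeling step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: `run_agent` and `judge_score` are hypothetical callables standing in for agent execution and the LLM judge, and the 0.5 threshold is an arbitrary placeholder.

```python
from typing import Callable, Dict, List


def build_relevance_labels(
    task: str,
    agent_ids: List[str],
    run_agent: Callable[[str, str], str],      # hypothetical: (agent_id, task) -> output
    judge_score: Callable[[str, str], float],  # hypothetical: (task, output) -> score in [0, 1]
) -> Dict[str, float]:
    """Execute every candidate agent on the task and score its output."""
    labels: Dict[str, float] = {}
    for agent_id in agent_ids:
        output = run_agent(agent_id, task)
        labels[agent_id] = judge_score(task, output)
    return labels


def relevant_agents(labels: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Agents whose judged score meets the (assumed) threshold, in sorted order."""
    return sorted(a for a, score in labels.items() if score >= threshold)


# Stand-in stubs so the sketch runs without any real agent or judge.
def demo_run_agent(agent_id: str, task: str) -> str:
    return f"{agent_id} answered: {task}"


def demo_judge_score(task: str, output: str) -> float:
    return 1.0 if output.startswith("agent-a") else 0.2


labels = build_relevance_labels(
    "Summarize this PDF", ["agent-a", "agent-b"], demo_run_agent, demo_judge_score
)
```

Passing the executor and judge in as callables keeps the labeling loop independent of any particular agent platform or judge model.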



AgentSearchBench Dataset

In total, we crawl 9,759 real-world AI agents. From these, we build a validation set of 3,211 tasks and a test set of 798 tasks, spanning single-agent queries, multi-agent queries, and high-level task descriptions.

Split        Total   Task Description   Single-Agent Task Query   Multi-Agent Task Query
Validation   3,211   259                2,452                     500
Test         798     65                 633                       100

Downloading the Dataset

We release the AgentBase dataset, the benchmark tasks (validation and test splits), and over 60K raw agent responses.

Alternatively, you can access the data from Google Drive.


Leaderboard

Related Papers

BibTeX

@article{wu2026agentsearchbench,
      title={AgentSearchBench: A Benchmark for AI Agent Search in the Wild}, 
      author={Bin Wu and Arastun Mammadli and Xiaoyu Zhang and Emine Yilmaz},
      year={2026},
      eprint={2604.22436},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
}