We introduce AgentSearchBench, a large-scale benchmark for agent search built from nearly 10,000 real-world agents sourced from public platforms including the GPT Store, Google Cloud Marketplace, and AgentAI Platform. By drawing from real ecosystems, the benchmark captures practical challenges such as capability overlap and inconsistent documentation. It formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, where agent relevance is assessed through execution-grounded performance signals rather than textual similarity.
Tasks are generated by first creating concrete, executable queries from agent documentation, then grouping and abstracting these into broader high-level descriptions. We ground relevance in real performance by executing candidate agents on each task and evaluating their outputs with an LLM judge. Multiple quality controls ensure benchmark reliability, including task filtering and judge-to-human alignment validation.
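To make the idea of execution-grounded relevance concrete, the following is a minimal sketch of how judge scores for one task could be turned into relevance labels and used to score a ranked list of agents. Everything in it is an assumption for illustration: the agent IDs, the score values, the [0, 1] score scale, the 0.5 relevance threshold, and the choice of recall@k and nDCG@k as metrics are hypothetical and are not taken from the benchmark itself.

```python
from math import log2
from typing import Dict, List, Set

# Hypothetical judge scores for ONE task: agent_id -> score in [0, 1].
# In the benchmark these would come from executing each candidate agent
# and grading its output with an LLM judge; the values here are made up.
judge_scores: Dict[str, float] = {
    "agent_a": 0.9,
    "agent_b": 0.2,
    "agent_c": 0.7,
    "agent_d": 0.0,
}


def relevant_agents(scores: Dict[str, float], threshold: float = 0.5) -> Set[str]:
    """Binarize judge scores; the threshold is an illustrative assumption."""
    return {agent_id for agent_id, s in scores.items() if s >= threshold}


def recall_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant agents that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount (1-based rank)."""
    return sum(g / log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranking: List[str], scores: Dict[str, float], k: int) -> float:
    """nDCG@k using the raw judge scores as graded gains."""
    gains = [scores.get(agent_id, 0.0) for agent_id in ranking]
    ideal_dcg = dcg_at_k(sorted(scores.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# A ranking that a purely text-similarity retriever might return (made up),
# and the ranking implied by the execution-grounded scores.
text_ranking = ["agent_b", "agent_a", "agent_c", "agent_d"]
oracle_ranking = sorted(judge_scores, key=judge_scores.get, reverse=True)

relevant = relevant_agents(judge_scores)
text_recall = recall_at_k(text_ranking, relevant, k=2)
text_ndcg = ndcg_at_k(text_ranking, judge_scores, k=4)
oracle_ndcg = ndcg_at_k(oracle_ranking, judge_scores, k=4)
```

In this toy setup, a ranking that places a textually similar but poorly performing agent first is penalized, which is the gap between textual similarity and execution-grounded relevance that the paragraph above describes.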
In total, we crawl 9,759 real-world AI agents. From these, we build a validation set of 3,211 tasks and a test set of 798 tasks, spanning single-agent queries, multi-agent queries, and high-level task descriptions.
| Split | Total | Task Description | Single-Agent Task Query | Multi-Agent Task Query |
|---|---|---|---|---|
| Validation | 3,211 | 259 | 2,452 | 500 |
| Test | 798 | 65 | 633 | 100 |
Snapshot of Topics Covered by Task Description.
Agent Diversity of AgentBase Dataset.
Number of Relevant Agents Per Different Task Types.
Agent Performance (Score Entropy) Across Different Task Types.
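The page does not define "score entropy" here. One plausible reading, offered only as an assumption, is the Shannon entropy of the distribution of discrete judge scores that agents receive on a task: low entropy means agents perform similarly, high entropy means their performance is spread across score levels. A minimal sketch under that assumption, with a hypothetical function name and made-up integer scores:

```python
from collections import Counter
from math import log2
from typing import Iterable


def score_entropy(scores: Iterable[int]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of scores.

    Assumes scores are discrete labels (e.g. integer judge ratings);
    this definition is an illustrative guess, not the benchmark's own.
    """
    counts = Counter(scores)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values())


# All agents get the same score: no spread in performance.
uniform_performance = score_entropy([1, 1, 1, 1])
# Agents split evenly between two score levels: one bit of entropy.
split_performance = score_entropy([0, 1, 0, 1])
```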
We release the AgentBase dataset, the benchmark tasks (validation and test splits), and over 60K raw agent responses.
Here are some examples from the validation set:
Task Description (AgentSearchBench-Tasks)
Single-Agent Task Query (AgentSearchBench-Tasks)
Multi-Agent Task Query (AgentSearchBench-Tasks)
Sample Agent from GPT Store (AgentSearchBench-Agents)
Sample Agent Response (AgentSearchBench-Responses)
The following paper is related to AgentSearchBench:
@article{wu2026agentsearchbench,
title={AgentSearchBench: A Benchmark for AI Agent Search in the Wild},
author={Bin Wu and Arastun Mammadli and Xiaoyu Zhang and Emine Yilmaz},
year={2026},
eprint={2604.22436},
archivePrefix={arXiv},
primaryClass={cs.AI},
}