AgentSearchBench: A Benchmark for AI Agent Search in the Wild

Introduction

We introduce AgentSearchBench, a large-scale benchmark for agent search built from nearly 10,000 real-world agents sourced from public platforms including the GPT Store, Google Cloud Marketplace, and AgentAI Platform. By drawing from real ecosystems, the benchmark captures practical challenges such as capability overlap and inconsistent documentation. It formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, where agent relevance is assessed through execution-grounded performance signals rather than textual similarity.

Tasks are generated by first creating concrete, executable queries from agent documentation, then grouping and abstracting these into broader high-level descriptions. We ground relevance in real performance by executing candidate agents on each task and evaluating their outputs with an LLM judge. Multiple quality controls ensure benchmark reliability, including task filtering and validation of judge-to-human alignment.
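To illustrate how execution-grounded relevance could feed a retrieval or reranking evaluation, here is a minimal sketch. It assumes each candidate agent has a judge score in [0, 1] for a given task; the agent names, the scores, the 0.5 relevance threshold, and the choice of nDCG@k and Recall@k are illustrative assumptions, not the benchmark's official metrics or scoring scale.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranking: List[str], relevance: Dict[str, float], k: int) -> float:
    """nDCG@k where the gain of an agent is its (assumed) judge score."""
    gains = [relevance.get(agent, 0.0) for agent in ranking]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranking: List[str], relevance: Dict[str, float], k: int,
                threshold: float = 0.5) -> float:
    """Fraction of relevant agents (score >= threshold) found in the top k."""
    relevant = {a for a, s in relevance.items() if s >= threshold}
    if not relevant:
        return 0.0
    hits = sum(1 for a in ranking[:k] if a in relevant)
    return hits / len(relevant)


# Hypothetical judge scores for one task (not taken from the benchmark).
judge_scores = {"agent_a": 1.0, "agent_b": 0.5, "agent_c": 0.0}
# Hypothetical output of a retriever that ranks by textual similarity.
system_ranking = ["agent_b", "agent_a", "agent_c"]

ndcg = ndcg_at_k(system_ranking, judge_scores, k=2)
recall = recall_at_k(system_ranking, judge_scores, k=1)
print(f"nDCG@2 = {ndcg:.4f}, Recall@1 = {recall:.2f}")
```

Because the gains come from executed performance rather than text overlap, a ranker that places a well-documented but poorly performing agent first is penalized even if its description matches the query closely.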

In total, we crawl 9,759 real-world AI agents. From these, we build a validation set of 3,211 tasks and a test set of 798 tasks, spanning single-agent queries, multi-agent queries, and high-level task descriptions.

Split	Total	Task Description	Single-Agent Task Query	Multi-Agent Task Query
Validation	3,211	259	2,452	500
Test	798	65	633	100
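As a quick consistency check on the table above, the following sketch confirms that the three task types add up to each split's total and computes each type's share. The dictionary keys are illustrative names chosen here, not field names from the released files.

```python
# Counts copied from the split table; key names are illustrative only.
splits = {
    "Validation": {"total": 3211, "task_description": 259,
                   "single_agent_query": 2452, "multi_agent_query": 500},
    "Test": {"total": 798, "task_description": 65,
             "single_agent_query": 633, "multi_agent_query": 100},
}

shares = {}
for name, counts in splits.items():
    parts = {k: v for k, v in counts.items() if k != "total"}
    # The per-type counts should sum exactly to the reported total.
    assert sum(parts.values()) == counts["total"], name
    shares[name] = {k: v / counts["total"] for k, v in parts.items()}

for name, by_type in shares.items():
    summary = ", ".join(f"{k}: {v:.1%}" for k, v in by_type.items())
    print(f"{name}: {summary}")
```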

Snapshot of Topics Covered by Task Description.

Agent Diversity of AgentBase Dataset.

Number of Relevant Agents Per Different Task Types.

Agent Performance (Score Entropy) Across Different Task Types.

Downloading the Dataset

We release the AgentBase dataset, the benchmark tasks (validation and test splits), and over 60K raw agent responses.

AgentSearchBench-Tasks: benchmark tasks.
AgentSearchBench-Agents: AgentBase dataset.
AgentSearchBench-Responses: raw agent executions from the validation set.

Alternatively, you can access the data from Google Drive.

Here are some examples from the validation set:

Task Description (AgentSearchBench-Tasks)

Single-Agent Task Query (AgentSearchBench-Tasks)

Multi-Agent Task Query (AgentSearchBench-Tasks)

Sample Agent from GPT Store (AgentSearchBench-Agents)

Sample Agent Response (AgentSearchBench-Responses)

The following papers are related to AgentSearchBench:

BibTeX

@article{wu2026agentsearchbench,
      title={AgentSearchBench: A Benchmark for AI Agent Search in the Wild}, 
      author={Bin Wu and Arastun Mammadli and Xiaoyu Zhang and Emine Yilmaz},
      year={2026},
      eprint={2604.22436},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
}
