AgentBench

Benchmark your Claude Code agent across 37 real-world tasks. This is not a coding benchmark: it tests file creation, research, data analysis, multi-step workflows, memory, error handling, and tool efficiency. Tasks range from easy to expert difficulty, with real-mode git repo workspaces.

37 tasks · 7 domains · 4 scoring layers

About

AgentBench measures how well your Claude Code agent performs across real-world tasks — file operations, research, data analysis, multi-step workflows, memory, error recovery, and tool usage.

Each task is scored 0–100 across four layers: automated checks, metrics, behavioral analysis, and output quality. Composite scores roll up by domain and overall.
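As a rough illustration of how such a roll-up could work, here is a minimal Python sketch. It is not AgentBench's actual scoring code. The equal weighting of the four layers, the plain averaging by domain and overall, and all field and function names (`layers`, `domain`, `task_score`, `roll_up`) are assumptions made for the example.

```python
from statistics import mean

# Hypothetical layer names, taken from the four layers described above.
LAYERS = ("automated_checks", "metrics", "behavioral_analysis", "output_quality")


def task_score(layer_scores):
    """Combine four layer scores (each 0-100) into one task score.

    Assumption: layers are weighted equally. The real weighting is not
    stated in the source.
    """
    for name in LAYERS:
        if name not in layer_scores:
            raise ValueError(f"missing layer score: {name}")
        if not 0 <= layer_scores[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    return mean(layer_scores[name] for name in LAYERS)


def roll_up(tasks):
    """Average task scores per domain and across all tasks.

    Assumption: each task is a dict with a 'domain' string and a 'layers'
    dict; composites are unweighted means.
    """
    by_domain = {}
    for task in tasks:
        by_domain.setdefault(task["domain"], []).append(task_score(task["layers"]))
    domains = {name: mean(scores) for name, scores in by_domain.items()}
    overall = mean(score for scores in by_domain.values() for score in scores)
    return {"domains": domains, "overall": overall}


if __name__ == "__main__":
    example = [
        {"domain": "research", "layers": dict.fromkeys(LAYERS, 80)},
        {"domain": "memory", "layers": dict.fromkeys(LAYERS, 60)},
    ]
    print(roll_up(example))
```

Under these assumptions, a domain with more tasks contributes more to the overall score, because the overall figure averages tasks rather than domains.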

How it works

01. Install the plugin
    /plugin marketplace add agentbench/agentbench

02. Run benchmarks
    /benchmark

03. Submit your results
    Upload results.json to the leaderboard