AgentBench

Benchmark your AI agent across 40 real-world tasks. Rule-based scoring — no LLM judges. 3-layer automated evaluation across file creation, research, data analysis, multi-step workflows, memory, error handling, and tool efficiency.


MIT License

40 tasks

7 domains

3 scoring layers

Features

Per-Call Tracing

Every tool call logged with millisecond timestamps. Full trace.jsonl for every run.

Custom Benchmarks

/benchmark --custom — test any prompt. Scores behavior, not output.

Rule-Based Scoring

No LLM judges. 3 layers of automated, deterministic evaluation.
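To make "rule-based" concrete, here is a minimal sketch of deterministic checks run over a recorded trace. It is not AgentBench's actual scoring code: the three layers are not described on this page, and the rule names (`wrote_file`, `no_failed_calls`, `within_call_budget`), the `score` helper, and the equal weighting are all hypothetical. Only the event fields (`seq`, `tool`, `target`, `status`) come from the trace sample on this page.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Rule = Callable[[List[Event]], bool]


def wrote_file(path: str) -> Rule:
    """Hypothetical rule: some Write call to `path` succeeded."""
    def rule(events: List[Event]) -> bool:
        return any(
            e["tool"] == "Write" and e["target"] == path and e["status"] == "ok"
            for e in events
        )
    return rule


def no_failed_calls(events: List[Event]) -> bool:
    """Hypothetical rule: every tool call reported status "ok"."""
    return all(e["status"] == "ok" for e in events)


def within_call_budget(limit: int) -> Rule:
    """Hypothetical rule: the agent used at most `limit` tool calls."""
    return lambda events: len(events) <= limit


def score(events: List[Event], rules: Dict[str, Rule]) -> dict:
    """Apply each rule and report the fraction that passed (equal weights assumed)."""
    results = {name: rule(events) for name, rule in rules.items()}
    return {"results": results, "score": sum(results.values()) / len(rules)}


trace = [
    {"seq": 1, "tool": "Read", "target": "inputs/data.csv", "status": "ok"},
    {"seq": 2, "tool": "Bash", "target": "wc -l data.csv", "status": "ok"},
    {"seq": 3, "tool": "Write", "target": "analysis.md", "status": "ok"},
]
rules = {
    "wrote_analysis": wrote_file("analysis.md"),
    "no_failures": no_failed_calls,
    "efficient": within_call_budget(5),
}
report = score(trace, rules)
print(report)
```

Because every rule is a pure function of the trace, the same run always yields the same score, which is the property an LLM judge cannot guarantee.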

Trace every tool call. Measure every millisecond.

{"seq":1,"tool":"Read","target":"inputs/data.csv","status":"ok"}

{"seq":2,"tool":"Bash","target":"wc -l data.csv","status":"ok"}

{"seq":3,"tool":"Write","target":"analysis.md","status":"ok"}
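A trace in this shape can be inspected with a few lines of Python. The sketch below parses the three sample events shown above as JSON Lines text and summarizes them. The helper names `parse_trace` and `summarize` are illustrative, not part of AgentBench, and a real `trace.jsonl` may carry additional fields (such as timestamps) that this sample does not show.

```python
import json
from collections import Counter

# The three sample events from this page, one JSON object per line.
SAMPLE_TRACE = """\
{"seq":1,"tool":"Read","target":"inputs/data.csv","status":"ok"}
{"seq":2,"tool":"Bash","target":"wc -l data.csv","status":"ok"}
{"seq":3,"tool":"Write","target":"analysis.md","status":"ok"}
"""


def parse_trace(text: str) -> list:
    """Parse JSON Lines text into a list of event dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def summarize(events: list) -> dict:
    """Count calls per tool and collect the seq numbers of non-ok calls."""
    return {
        "calls": len(events),
        "by_tool": dict(Counter(e["tool"] for e in events)),
        "failed": [e["seq"] for e in events if e["status"] != "ok"],
    }


events = parse_trace(SAMPLE_TRACE)
summary = summarize(events)
print(summary)
```

To run this against a real trace, read the file contents and pass them to `parse_trace` in place of `SAMPLE_TRACE`.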

How it works

Add the marketplace

/plugin marketplace add agentbench/agentbench

Install the plugin

/plugin install agentbench

Run benchmarks

/benchmark

Submit your results

Upload results.json to the leaderboard