
Eval Results

Rigorous A/B evaluations across multiple projects. Real repos, real tasks, measured differences.

Methodology

All evaluations compare Group A (raw repo, no .agents/) vs Group B (repo with .agents/ and onboarded knowledge). Same model (Sonnet), same prompts, parallel execution. Tasks span questions, debugging, code actions, convention questions, and hidden-contract analysis.


Headline: 13x efficiency on cross-session recall

Longitudinal eval on rulesync (3 rounds of tasks, no CHANGELOG in the project):

| Round | Task | Baseline calls | agentsge calls |
|-------|------|----------------|----------------|
| R1 | Add new target | 38 | 24 |
| R2 | Add hooks support | 17 | 19 |
| R3 | Cross-ref recall | 65 | 5 |

Round 3 is the proof. Baseline: 65 tool calls, 22 files read to re-discover what it already found in R1+R2. agentsge: 5 tool calls, 0 repo files. All questions answered from 2 pattern files captured in previous rounds.

Key insight: the per-step difference is modest (20-40%), but the accumulated difference is massive. The value compounds across sessions instead of scaling linearly.
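The round-by-round numbers bear this out. A quick sanity check in plain Python (values taken directly from the rulesync table above) shows the per-round ratio is flat or even negative early, then jumps once captured knowledge can be reused:

```python
# Tool calls per round, taken from the rulesync table above.
baseline = {"R1": 38, "R2": 17, "R3": 65}
agentsge = {"R1": 24, "R2": 19, "R3": 5}

for r in baseline:
    ratio = baseline[r] / agentsge[r]
    print(f"{r}: {ratio:.1f}x fewer calls with agentsge")
# R1 is ~1.6x, R2 slightly favors baseline, R3 is 13.0x --
# the gap opens up once captured knowledge can be reused.

total_saved = sum(baseline.values()) - sum(agentsge.values())
print(f"Total calls saved over 3 rounds: {total_saved}")  # 72
```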

A/B: gbrain-master (5 tasks)

TS/Bun project, Postgres RAG brain. 3 questions + 2 code actions.

| Metric | Baseline | agentsge | Change |
|--------|----------|----------|--------|
| Total tool calls | 94 | 46 | -51% |
| Total repo files read | 67 | 25 | -63% |
| Total duration | 307s | 211s | -31% |
| Avg confidence | 8.8 | 9.0 | +0.2 |

Breakdown by task type:

  • Questions: 40-91% fewer tool calls. Agent read 0 repo files for all 3 questions.
  • Code actions: 10-25% improvement. Agent still needs exact signatures and types.

A/B: 3 new repos (never tested before)

lingbot-world (Python/PyTorch), gstack (TS/Bun), offerdzen (Django+Nuxt). Three different task types.

| Repo | Task type | Result |
|------|-----------|--------|
| lingbot-world | Debug (GPU crash) | Same root causes found. Framework gave copy-paste fix + exact line ref. Baseline was correct but verbose. 16% cheaper. |
| gstack | Convention question | Both correct. Framework added useful detail (hooks, analytics, examples). 21% cheaper, 10% faster. |
| offerdzen | Hidden-contract analysis | Baseline already strong on code-heavy analysis. Framework found extra contracts (BS blocking, cache races) but at 54% higher cost. |

Pattern confirmed: the framework wins on debug/operational tasks (actionability) and shows a modest gain on convention questions; on code-first analysis tasks the baseline is already strong. The loading policy adapts accordingly: read knowledge early for debugging, start from code for analysis.
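That adaptive loading policy can be sketched roughly as follows. This is illustrative only — the task-type names and the `loading_order` helper are assumptions for exposition, not agentsge's actual API:

```python
# Illustrative sketch of an adaptive loading policy.
# Task-type names and the split below are assumptions, not agentsge internals.

KNOWLEDGE_FIRST = {"debug", "operational", "convention"}


def loading_order(task_type: str) -> list[str]:
    """Return the order in which sources should be consulted."""
    if task_type in KNOWLEDGE_FIRST:
        # Captured patterns and lessons are most actionable here:
        # e.g. a known GPU-crash fix beats re-reading the repo.
        return ["knowledge", "repo"]
    # Code-heavy analysis: the baseline is already strong,
    # so start from source and use knowledge as a supplement.
    return ["repo", "knowledge"]
```

For example, `loading_order("debug")` yields `["knowledge", "repo"]`, while a code-analysis task starts from the repo.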

Longitudinal: gbrain (10 steps + recall)

| Metric | Baseline (A) | agentsge (B) |
|--------|--------------|--------------|
| Total tool calls (10 steps) | 128 | 55 |
| Knowledge captured | 0 | 5 patterns, 4 lessons, 1 convention |
| Recall (step 11): tool calls | 10 | 12 |
| Recall: repo files needed | 6 (cheated via CHANGELOG) | 0 (answered from knowledge) |

Baseline could answer recall questions by reading CHANGELOG — but 90% of real projects don't have a CHANGELOG. agentsge captured tacit insights (patterns, conventions) that exist nowhere in the project.


Auto-capture pipeline eval

3-pass longitudinal on apm with hook-based capture:

| Dimension | Score | Max |
|-----------|-------|-----|
| Capture quality | 22 | 25 |
| Capture efficiency | 23 | 25 |
| Recall | 20 | 25 |
| Hygiene | 17 | 25 |
| Total | 82 | 100 |

  • Zero agent overhead. 0 tool calls spent on capture. All extraction via hooks.
  • High signal. Both captured items were non-obvious, correctly typed, useful for recall.
  • Recall confirmed. Session 3 answered questions using knowledge from sessions 1 and 2.
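Zero agent overhead is possible because extraction runs in a hook outside the agent loop. A minimal sketch of the idea — the hook signature, the `[capture:...]` tagging convention, and the JSONL store are all assumptions for illustration, not the actual apm/agentsge pipeline:

```python
import json
import re


def post_session_hook(transcript: str, store_path: str = "knowledge.jsonl") -> int:
    """Scan a finished session transcript for capture-worthy lines and
    append them to a JSONL store -- no agent tool calls involved."""
    # Hypothetical convention: insights are tagged inline as
    # [capture:pattern] ... or [capture:lesson] ...
    items = re.findall(r"\[capture:(\w+)\]\s*(.+)", transcript)
    with open(store_path, "a") as f:
        for kind, text in items:
            f.write(json.dumps({"type": kind, "text": text.strip()}) + "\n")
    return len(items)
```

Because the hook runs after the session, capture cost is borne by the pipeline rather than by the agent's tool-call budget.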

Onboarding quality (15 iterations)

The onboarding system went through 15 eval iterations across 6+ projects. Key milestones:

  • v1-v3: Agent wrote reports instead of files, generated noise, scanner missed monorepos
  • v4: First success — real signal in overview, clean config
  • v7-v8: Scanner bugs fixed, quality gate working, useful overviews
  • v10-v11: Bullet-level exact file refs, machine-checkable output
  • v13-v15: Parallel eval across 6 projects. Hallucinations 6→0, coverage 30%→80-90%

What the numbers mean

  1. Knowledge accumulation is the killer feature. Not per-step speedup. The value compounds across sessions — by session 3, it's 13x fewer tool calls.
  2. Tacit knowledge is unique value. Patterns like "PGLite parity checklist" or "14-file add-new-target checklist" exist nowhere in the project. Only agentsge captures them.
  3. Capture overhead is an investment. Agents use 2-3x more tool calls per step for knowledge writes. This pays off in later sessions.
  4. Not all tasks benefit equally. Understanding tasks (questions, onboarding, reviews) see 40-91% improvement. Implementation tasks see 10-25%. That's by design — the loading policy adapts.
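Point 3 can be made concrete with a toy cost model. The numbers below are illustrative only (loosely inspired by the gbrain longitudinal run, not measured values): capture inflates early-session cost, but near-zero recall cost makes the cumulative totals cross over after a few sessions.

```python
# Toy model: baseline pays full discovery cost every session;
# agentsge pays a capture surcharge early, then answers from knowledge.
# All three constants are illustrative assumptions.
BASELINE_PER_SESSION = 20   # tool calls to (re)discover context each time
AGENTSGE_WRITE = 30         # discovery + knowledge-capture overhead
AGENTSGE_RECALL = 5         # later sessions answer from captured knowledge


def cumulative(sessions: int) -> tuple[int, int]:
    """Cumulative tool calls after N sessions: (baseline, agentsge)."""
    base = BASELINE_PER_SESSION * sessions
    # Assume the first 2 sessions pay capture overhead, the rest recall.
    write = min(sessions, 2) * AGENTSGE_WRITE
    recall = max(sessions - 2, 0) * AGENTSGE_RECALL
    return base, write + recall


for s in (1, 3, 6):
    print(s, cumulative(s))
# Session 1: (20, 30) -- agentsge behind due to capture overhead.
# Crossover around session 4; by session 6: (120, 80).
```

Under these assumptions agentsge is behind at session 1, roughly even at session 3, and clearly ahead by session 6, which is the compounding pattern the evals above measure.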