# Eval Results
Rigorous A/B evaluations across multiple projects. Real repos, real tasks, measured differences.
## Methodology
All evaluations compare Group A (raw repo, no .agents/) vs Group B (repo with .agents/ and onboarded knowledge). Same model (Sonnet), same prompts, parallel execution. Tasks span questions, debugging, code actions, convention questions, and hidden-contract analysis.
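The protocol can be pictured as a paired harness: identical tasks and prompts, run in parallel, differing only in whether `.agents/` is present. This sketch is purely illustrative; the `run_agent` interface is an assumption for the example, not the real eval code:

```python
from concurrent.futures import ThreadPoolExecutor

# The five task types covered by the evals.
TASKS = ["question", "debugging", "code action", "convention", "hidden-contract"]

def run_agent(task: str, with_agents_dir: bool) -> dict:
    # Placeholder for a real agent invocation (assumed interface):
    # same model, same prompt; only the repo state differs.
    return {"task": task, "group": "B" if with_agents_dir else "A"}

def run_eval():
    """Run Group A (raw repo) and Group B (.agents/ onboarded) in parallel."""
    with ThreadPoolExecutor() as pool:
        group_a = list(pool.map(lambda t: run_agent(t, False), TASKS))
        group_b = list(pool.map(lambda t: run_agent(t, True), TASKS))
    return group_a, group_b
```

The point of the pairing is that every observed delta (tool calls, files read, duration) is attributable to the knowledge layer, not to model or prompt variance.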
## Headline: 13x efficiency on cross-session recall
Longitudinal eval on rulesync (3 rounds of tasks, no CHANGELOG in the project):
| Round | Task | Baseline calls | agentsge calls |
|---|---|---|---|
| R1 | Add new target | 38 | 24 |
| R2 | Add hooks support | 17 | 19 |
| R3 | Cross-ref recall | 65 | 5 |
Round 3 is the proof. Baseline: 65 tool calls and 22 files read to re-discover what it had already found in R1+R2. agentsge: 5 tool calls, 0 repo files. All questions were answered from 2 pattern files captured in previous rounds.
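The 13x headline falls directly out of the R3 row; a quick arithmetic check:

```python
baseline_calls = {"R1": 38, "R2": 17, "R3": 65}
agentsge_calls = {"R1": 24, "R2": 19, "R3": 5}

# Round 3: baseline re-discovers everything from scratch;
# agentsge answers from patterns captured in R1+R2.
r3_ratio = baseline_calls["R3"] / agentsge_calls["R3"]
print(f"R3 efficiency: {r3_ratio:.0f}x")  # 65 / 5 = 13x
```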
## A/B: gbrain-master (5 tasks)
TS/Bun project, Postgres RAG brain. 3 questions + 2 code actions.
| Metric | Baseline | agentsge | Change |
|---|---|---|---|
| Total tool calls | 94 | 46 | -51% |
| Total repo files read | 67 | 25 | -63% |
| Total duration | 307s | 211s | -31% |
| Avg confidence | 8.8 | 9.0 | +0.2 |
Breakdown by task type:
- Questions: 40-91% fewer tool calls. Agent read 0 repo files for all 3 questions.
- Code actions: 10-25% improvement. Agent still needs exact signatures and types.
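The change column can be re-derived from the raw baseline/agentsge pairs; a quick check of the reported percentages:

```python
# (baseline, agentsge) pairs from the gbrain-master table.
metrics = {
    "tool_calls": (94, 46),
    "files_read": (67, 25),
    "duration_s": (307, 211),
}

for name, (baseline, agentsge) in metrics.items():
    change = (agentsge - baseline) / baseline * 100
    print(f"{name}: {change:+.0f}%")  # -51%, -63%, -31%
```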
## A/B: 3 new repos (never tested before)
lingbot-world (Python/PyTorch), gstack (TS/Bun), offerdzen (Django+Nuxt). Three different task types.
| Repo | Task type | Result |
|---|---|---|
| lingbot-world | Debug (GPU crash) | Same root causes found. Framework gave copy-paste fix + exact line ref. Baseline was correct but verbose. 16% cheaper. |
| gstack | Convention question | Both correct. Framework added useful detail (hooks, analytics, examples). 21% cheaper, 10% faster. |
| offerdzen | Hidden-contract analysis | Baseline already strong on code-heavy analysis. Framework found extra contracts (BS blocking, cache races) but at 54% higher cost. |
## Longitudinal: gbrain (10 steps + recall)
| Metric | Baseline (A) | agentsge (B) |
|---|---|---|
| Total tool calls (10 steps) | 128 | 55 |
| Knowledge captured | 0 | 5 patterns, 4 lessons, 1 convention |
| Recall (step 11): tool calls | 10 | 12 |
| Recall: repo files needed | 6 (cheated via CHANGELOG) | 0 (answered from knowledge) |
Baseline could answer recall questions by reading CHANGELOG — but 90% of real projects don't have a CHANGELOG. agentsge captured tacit insights (patterns, conventions) that exist nowhere in the project.
## Auto-capture pipeline eval
3-pass longitudinal on apm with hook-based capture:
| Dimension | Score | Max |
|---|---|---|
| Capture quality | 22 | 25 |
| Capture efficiency | 23 | 25 |
| Recall | 20 | 25 |
| Hygiene | 17 | 25 |
| Total | 82 | 100 |
- Zero agent overhead. 0 tool calls spent on capture. All extraction via hooks.
- High signal. Both captured items were non-obvious, correctly typed, useful for recall.
- Recall confirmed. Session 3 answered questions using knowledge from sessions 1 and 2.
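Hook-based capture means extraction runs outside the agent loop entirely, which is where the zero-overhead number comes from. A minimal sketch of the idea, with an assumed transcript shape and knowledge-file layout (not the actual agentsge internals):

```python
import json
from pathlib import Path

# Assumed knowledge-store layout, for illustration only.
KNOWLEDGE_DIR = Path(".agents/knowledge")

def post_session_hook(transcript: list[dict]) -> int:
    """Runs after the session, outside the agent loop: scan the finished
    transcript for capture-worthy items and append them to the knowledge
    store. Costs the agent zero tool calls."""
    KNOWLEDGE_DIR.mkdir(parents=True, exist_ok=True)
    captured = 0
    for event in transcript:
        # Illustrative heuristic; the real pipeline scores candidates
        # for signal before committing them.
        if event.get("type") == "lesson" and event.get("non_obvious"):
            out = KNOWLEDGE_DIR / f"{event['kind']}s.jsonl"
            with out.open("a") as f:
                f.write(json.dumps(event) + "\n")
            captured += 1
    return captured
```

Because capture is a post-hoc transcript scan rather than an in-loop action, the agent's tool-call budget is untouched, matching the "zero agent overhead" result above.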
## Onboarding quality (15 iterations)
The onboarding system went through 15 eval iterations across 6+ projects. Key milestones:
- v1-v3: Agent wrote reports instead of files, generated noise, scanner missed monorepos
- v4: First success — real signal in overview, clean config
- v7-v8: Scanner bugs fixed, quality gate working, useful overviews
- v10-v11: Bullet-level exact file refs, machine-checkable output
- v13-v15: Parallel eval across 6 projects. Hallucinations 6→0, coverage 30%→80-90%
## What the numbers mean
- Knowledge accumulation, not per-step speedup, is the killer feature. The value compounds across sessions: by session 3, it's 13x fewer tool calls.
- Tacit knowledge is unique value. Patterns like "PGLite parity checklist" or "14-file add-new-target checklist" exist nowhere in the project. Only agentsge captures them.
- Capture overhead is an investment. Agents use 2-3x more tool calls per step for knowledge writes. This pays off in later sessions.
- Not all tasks benefit equally. Understanding tasks (questions, onboarding, reviews) see 40-91% improvement. Implementation tasks see 10-25%. That's by design — the loading policy adapts.