# Eval Results
Rigorous A/B evaluations across multiple projects. Real repos, real tasks, measured differences.
## Methodology
All evaluations compare Group A (raw repo, no .agents/) vs Group B (repo with .agents/ and onboarded knowledge). Same model (Sonnet), same prompts, parallel execution. Tasks span questions, debugging, code actions, convention questions, and hidden-contract analysis.
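The protocol can be pictured as a paired harness: identical tasks and prompts, run in parallel, differing only in whether `.agents/` is present. This sketch is purely illustrative; the `run_agent` interface is an assumption for the example, not the real eval code:

```python
from concurrent.futures import ThreadPoolExecutor

# The five task types covered by the evals.
TASKS = ["question", "debugging", "code action", "convention", "hidden-contract"]

def run_agent(task: str, with_agents_dir: bool) -> dict:
    # Placeholder for a real agent invocation (assumed interface):
    # same model, same prompt; only the repo state differs.
    return {"task": task, "group": "B" if with_agents_dir else "A"}

def run_eval():
    """Run Group A (raw repo) and Group B (.agents/ onboarded) in parallel."""
    with ThreadPoolExecutor() as pool:
        group_a = list(pool.map(lambda t: run_agent(t, False), TASKS))
        group_b = list(pool.map(lambda t: run_agent(t, True), TASKS))
    return group_a, group_b
```

The point of the pairing is that every observed delta (tool calls, files read, duration) is attributable to the knowledge layer, not to model or prompt variance.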
## Headline: 13x efficiency on cross-session recall
Longitudinal eval on rulesync (3 rounds of tasks, no CHANGELOG in the project):
| Round | Task | Baseline calls | agentsge calls |
|---|---|---|---|
| R1 | Add new target | 38 | 24 |
| R2 | Add hooks support | 17 | 19 |
| R3 | Cross-ref recall | 65 | 5 |
Round 3 is the proof. Baseline: 65 tool calls and 22 files read to re-discover what it had already found in R1+R2. agentsge: 5 tool calls, 0 repo files. All questions were answered from 2 pattern files captured in previous rounds.
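The 13x headline falls directly out of the R3 row; a quick arithmetic check:

```python
baseline_calls = {"R1": 38, "R2": 17, "R3": 65}
agentsge_calls = {"R1": 24, "R2": 19, "R3": 5}

# Round 3: baseline re-discovers everything from scratch;
# agentsge answers from patterns captured in R1+R2.
r3_ratio = baseline_calls["R3"] / agentsge_calls["R3"]
print(f"R3 efficiency: {r3_ratio:.0f}x")  # 65 / 5 = 13x
```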
## A/B: gbrain-master (5 tasks)
TS/Bun project, Postgres RAG brain. 3 questions + 2 code actions.
| Metric | Baseline | agentsge | Change |
|---|---|---|---|
| Total tool calls | 94 | 46 | -51% |
| Total repo files read | 67 | 25 | -63% |
| Total duration | 307s | 211s | -31% |
| Avg confidence | 8.8 | 9.0 | +0.2 |
Breakdown by task type:
- Questions: 40-91% fewer tool calls. Agent read 0 repo files for all 3 questions.
- Code actions: 10-25% improvement. Agent still needs exact signatures and types.
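The change column can be re-derived from the raw baseline/agentsge pairs; a quick check of the reported percentages:

```python
# (baseline, agentsge) pairs from the gbrain-master table.
metrics = {
    "tool_calls": (94, 46),
    "files_read": (67, 25),
    "duration_s": (307, 211),
}

for name, (baseline, agentsge) in metrics.items():
    change = (agentsge - baseline) / baseline * 100
    print(f"{name}: {change:+.0f}%")  # -51%, -63%, -31%
```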
## A/B: 3 new repos (never tested before)
lingbot-world (Python/PyTorch), gstack (TS/Bun), offerdzen (Django+Nuxt). Three different task types.
| Repo | Task type | Result |
|---|---|---|
| lingbot-world | Debug (GPU crash) | Same root causes found. Framework gave copy-paste fix + exact line ref. Baseline was correct but verbose. 16% cheaper. |
| gstack | Convention question | Both correct. Framework added useful detail (hooks, analytics, examples). 21% cheaper, 10% faster. |
| offerdzen | Hidden-contract analysis | Baseline already strong on code-heavy analysis. Framework found extra contracts (BS blocking, cache races) but at 54% higher cost. |
## Longitudinal: gbrain (10 steps + recall)
| Metric | Baseline (A) | agentsge (B) |
|---|---|---|
| Total tool calls (10 steps) | 128 | 55 |
| Knowledge captured | 0 | 5 patterns, 4 lessons, 1 convention |
| Recall (step 11): tool calls | 10 | 12 |
| Recall: repo files needed | 6 (cheated via CHANGELOG) | 0 (answered from knowledge) |
Baseline could answer recall questions by reading CHANGELOG — but 90% of real projects don't have a CHANGELOG. agentsge captured tacit insights (patterns, conventions) that exist nowhere in the project.
## Auto-capture pipeline eval
3-pass longitudinal on apm with hook-based capture:
| Dimension | Score | Max |
|---|---|---|
| Capture quality | 22 | 25 |
| Capture efficiency | 23 | 25 |
| Recall | 20 | 25 |
| Hygiene | 17 | 25 |
| Total | 82 | 100 |
- Zero agent overhead. 0 tool calls spent on capture. All extraction via hooks.
- High signal. Both captured items were non-obvious, correctly typed, useful for recall.
- Recall confirmed. Session 3 answered questions using knowledge from sessions 1 and 2.
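Hook-based capture means extraction runs outside the agent loop entirely, which is where the zero-overhead number comes from. A minimal sketch of the idea, with an assumed transcript shape and knowledge-file layout (not the actual agentsge internals):

```python
import json
from pathlib import Path

# Assumed knowledge-store layout, for illustration only.
KNOWLEDGE_DIR = Path(".agents/knowledge")

def post_session_hook(transcript: list[dict]) -> int:
    """Runs after the session, outside the agent loop: scan the finished
    transcript for capture-worthy items and append them to the knowledge
    store. Costs the agent zero tool calls."""
    KNOWLEDGE_DIR.mkdir(parents=True, exist_ok=True)
    captured = 0
    for event in transcript:
        # Illustrative heuristic; the real pipeline scores candidates
        # for signal before committing them.
        if event.get("type") == "lesson" and event.get("non_obvious"):
            out = KNOWLEDGE_DIR / f"{event['kind']}s.jsonl"
            with out.open("a") as f:
                f.write(json.dumps(event) + "\n")
            captured += 1
    return captured
```

Because capture is a post-hoc transcript scan rather than an in-loop action, the agent's tool-call budget is untouched, matching the "zero agent overhead" result above.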
## Onboarding quality (15 iterations)
The onboarding system went through 15 eval iterations across 6+ projects. Key milestones:
- v1-v3: Agent wrote reports instead of files, generated noise, scanner missed monorepos
- v4: First success — real signal in overview, clean config
- v7-v8: Scanner bugs fixed, quality gate working, useful overviews
- v10-v11: Bullet-level exact file refs, machine-checkable output
- v13-v15: Parallel eval across 6 projects. Hallucinations 6→0, coverage 30%→80-90%
## What the numbers mean
- Knowledge accumulation, not per-step speedup, is the killer feature. The value compounds across sessions: by session 3, it's 13x fewer tool calls.
- Tacit knowledge is unique value. Patterns like "PGLite parity checklist" or "14-file add-new-target checklist" exist nowhere in the project. Only agentsge captures them.
- Capture overhead is an investment. Agents use 2-3x more tool calls per step for knowledge writes. This pays off in later sessions.
- Not all tasks benefit equally. Understanding tasks (questions, onboarding, reviews) see 40-91% improvement. Implementation tasks see 10-25%. That's by design — the loading policy adapts.