Note (session 28, 2026-06-04): P@10 = 0.330 (302 tasks, 17 repos, 8 languages). Corpus expanded with saleor (e-commerce) and calcom (scheduling). 277 equiv classes across 32 files with multi-phrase gate. Domain equiv classes (e-commerce, scheduling) proved the highest-leverage move: saleor +99.6%, calcom +497%. Embeddings confirmed dead neutral (session 23). See
docs/research/session-21-measurement-calibration.md for the measurement calibration narrative.
Cross-System Context Retrieval Benchmark
- Running results: bench/cross-system/FINDINGS.md
- Study overview: bench/EVALUATION-OVERVIEW.md
- Implementation: bench/cross-system/
Current Status (2026-06-04)
Implementation Progress
| Component | Status | Notes |
|---|---|---|
| Benchmark harness | Done | harness_test.go, metrics, normalization, dot-bounded matching (session 21), task memory disabled (session 23) |
| Evaluation corpus (17 repos) | Done | kubernetes, VS Code, flask, cargo, django, spark-java, ocelot, kafka, caddy, fastapi, ripgrep, jekyll, terraform, rails, saleor, calcom, cross-cutting |
| Task fixtures (302 total) | Done | 8 languages (Go, Python, TypeScript, Rust, Java, C#, Ruby, TOML) |
| Ground truth validation | Done | 98% match rate, validate-fixtures tool |
| knowing adapter | Done | P@10=0.330 (session 28, honest: no task memory, no embeddings), 277 equiv classes with multi-phrase gate |
| grep adapter | Done | P@10=0.015 (baseline, honest matching) |
| codegraph adapter | Done | P@10=0.087 (honest matching) |
| GitNexus adapter | Done | P@10=0.055 (honest matching) |
| Gortex adapter | Done | P@10=0.052 (honest matching) |
| Aider adapter | Done | P@10=0.023 (honest matching) |
| codebase-memory adapter | Evaluated | Timed out on large repos |
| Diagnostic tools | Done | debug-seeds, debug-fts, debug-walk, bench-task, failure-analysis |
| Statistical analysis | Done | Wilcoxon, Cohen's d, bootstrap CI, 70+ experiments |
| SWE-bench integration | Done | 10 fixtures; finding: fault localization != context retrieval |
| Embedding re-ranker | Disabled | Net negative (session 19). Gap-fill also neutral (session 23). |
| Failure analysis | Done | 80% noise, 10.5% related_name, 8.6% test symbols (session 28 zero-task audit) |
| Corpus DB packaging | Done | Per-repo tarballs for release assets (corpus-setup.sh package/restore) |
Key Results (Session 28, honest cold-start, 302 tasks, 17 repos)
| System | P@10 | R@10 | NDCG@10 | MRR | vs knowing |
|---|---|---|---|---|---|
| knowing | 0.330 | 0.460 | 0.526 | 0.568 | baseline |
| codegraph (19K stars) | 0.087 | - | - | - | 0.26x |
| GitNexus (40K stars) | 0.055 | - | - | - | 0.17x |
| Gortex | 0.052 | - | - | - | 0.16x |
| Aider (~20K stars) | 0.023 | - | - | - | 0.07x |
| grep | 0.015 | - | - | - | 0.05x |
Competitive ratios: 3.79x codegraph, 6.00x GitNexus, 6.35x Gortex, 14.3x Aider, 22.0x grep. All measurements use honest matching (dot-bounded, session 21) with task memory disabled (session 23). 4 runs: 0.328, 0.331, 0.330, 0.330.
Per-Repo Performance (Session 28, honest cold-start, 302 tasks)
| Repo | Language | P@10 | Tasks | Key mechanism |
|---|---|---|---|---|
| Ripgrep | Rust | 0.464 | 11 | No framework classes (defensibility) |
| Terraform | Go | 0.440 | 20 | Terraform equiv classes |
| Kafka | Java | 0.437 | 19 | Kafka equiv classes |
| Jekyll | Ruby | 0.425 | 20 | Jekyll + Ruby enrichment |
| Kubernetes | Go | 0.423 | 13 | K8S equiv classes |
| Caddy | Go | 0.410 | 20 | Caddy equiv classes |
| Calcom | TypeScript | 0.409 | 11 | Scheduling equiv classes (+497% from baseline) |
| Saleor | Python | 0.527 | 11 | E-commerce equiv classes (+99.6%) |
| Flask | Python | 0.328 | 18 | Flask equiv classes |
| Rails | Ruby | 0.325 | 20 | Rails equiv classes |
| FastAPI | Python | 0.315 | 20 | FastAPI equiv classes |
| Ocelot | C# | 0.280 | 20 | Ocelot equiv classes |
| Cross-cutting | Multi | 0.263 | 8 | Multi-repo tasks |
| Cargo | Rust | 0.263 | 19 | Cargo equiv classes |
| Spark-Java | Java | 0.250 | 20 | Spark-Java equiv classes |
| VS Code | TypeScript | 0.200 | 19 | VS Code equiv classes + adaptive retrieval |
| Django | Python | 0.176 | 33 | Django equiv classes (+126%), 13 zeros (SWE-bench vocab gaps) |
Key Findings
- Framework equivalence classes are the breakthrough (+57% P@10, session 23). 277 concept-to-symbol mappings with forced injection solve the vocabulary gap.
- Domain equiv classes are the highest-leverage move (session 28). E-commerce (saleor +99.6%) and scheduling (calcom +497%) both cracked most zeros on first try. 14 new classes across 2 domain files.
- Multi-phrase gate prevents single-word flooding (session 28, +9.6%). isStrongEquivMatch requires >= 2 phrases or a multi-word phrase.
- RWR (graph traversal) is the structural foundation, providing reachability-based ranking.
- Inheritance propagation was the first major gain (+29% in session 13).
- Embeddings are dead neutral on cold start (session 23, 3 runs confirmed).
- Task memory contaminates benchmarks. 26K stale entries discovered in session 23. Must be disabled for honest A/B measurement.
- P@10 is reachability + vocabulary determined. Graph structure provides reachability; equiv classes provide vocabulary bridging. Both are necessary.
- Zero-task audit (session 28): 80% noise (wrong neighborhood), 10.5% related_name, 8.6% test symbols. Sibling dedup tested as harmful (global -0.009, pkg-scoped -0.006). Test penalty sweep: within noise (±0.030 on 20 tasks).
- Test file detection for all 7 corpus languages (session 28). Ruby/Java/C# added. Rails +10.8%.
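The multi-phrase gate mentioned above can be illustrated with a minimal Python sketch. Only the name isStrongEquivMatch and the rule ">= 2 phrases or a multi-word phrase" come from this document; the function signature, the word-boundary matching, and the e-commerce phrases in the usage note are assumptions made for illustration, not the project's implementation.

```python
import re


def _norm(text: str) -> str:
    # Lowercase and reduce to space-separated alphanumeric words.
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def is_strong_equiv_match(task: str, phrases: list[str]) -> bool:
    """Hypothetical gate: fire an equivalence class only on strong evidence.

    Strong evidence means either one multi-word phrase of the class occurs
    in the task, or at least two distinct single-word phrases occur.
    """
    padded = f" {_norm(task)} "
    matched = {q for q in map(_norm, phrases) if q and f" {q} " in padded}
    if any(" " in q for q in matched):
        return True  # a multi-word phrase is specific enough on its own
    return len(matched) >= 2  # otherwise require two independent hits
```

With a hypothetical class `["checkout", "cart", "order line"]`, the task "Fix checkout total rounding" would not fire the class (one single-word hit), while "Fix checkout when cart is empty" would.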
Remaining Work
| Item | Priority | Effort | Impact |
|---|---|---|---|
| More domain repos | Medium | 1 session each | Redash (BI), Discourse (forums). Each adds ~0.003-0.005 aggregate. |
| AI-generated evaluation corpus | Medium | 1 session | LLM-generated tasks for statistical coverage |
| Equiv class auto-discovery | Low | 1 session | Automatic extraction from graph structure |
| Enricher repo hash mismatch | Low | 1 session | Bug: enrich lsp computes different repo hash than index |
1. Motivation
AI coding agents spend 30-60% of their context window on orientation: finding the right code to read before making changes. The quality of this context directly determines task success. Multiple systems now compete to serve this context: knowledge graphs, repo maps, code search engines, and raw text search.
No rigorous, reproducible benchmark exists comparing these systems on the actual use case: "given a coding task, which system retrieves the most relevant symbols in the fewest tokens?"
This benchmark answers that question with:
- Fixed evaluation corpus (specific repos, specific tasks, specific ground truth)
- Formal metrics with statistical significance testing
- Fairness controls that prevent home-field advantage for any system
- Reproducible methodology anyone can run
The goal is publishable data that honestly shows where knowing wins, where it loses, and where systems are equivalent.
2. Systems Under Test
Seven systems covering the primary architectural approaches to code context retrieval were attempted. Evaluated: knowing, codegraph, GitNexus, Gortex, Aider, grep (see Key Results). Attempted but timed out on large repos: codebase-memory.
2.1 knowing (content-addressed graph)
Invocation:
# Index the target repo
knowing index --repo <path> --module <module-path>
# Retrieve context
knowing context-for-task --task "<description>" --budget <tokens> --format json
What to capture:
- symbols[] array with qualified names, scores, distances
- tokens_used (actual token count)
- Wall-clock latency (index time + query time separately)
- Session state (first query vs repeated query on same repo)
Configuration:
- Token budget: match across all systems (benchmark uses 5000 tokens for fair comparison; product default is 50000)
- Format: JSON (for automated parsing; not GCF, to avoid format advantage)
- No pre-existing feedback (cold start unless measuring learning curve)
2.2 GitNexus (knowledge graph MCP)
Invocation:
# Index
gitnexus index <path>
# Query via MCP (simulated tool call)
# Tool: search_codebase
# Input: { "query": "<task description>", "limit": 20 }
What to capture:
- Returned symbols/code snippets
- Token count of response (tiktoken cl100k_base)
- Wall-clock latency
- Graph construction time

Configuration:
- Default settings (no tuning for specific repos)
- Native Tree-sitter parsing (not WASM mode)
- LadybugDB backend (default)
Installation: npm install -g gitnexus (verify version at benchmark time)
2.3 Aider repo-map (PageRank on reference graph)
Invocation:
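# Aider's repo-map is internal; extract it via the Python API.
# Keyword names follow aider's RepoMap.get_repo_map; verify against the pinned version.
# repo_path, model, io_instance, all_files and extract_identifiers are supplied by the adapter.
from aider.repomap import RepoMap

rm = RepoMap(
    map_tokens=5000,     # match other systems' budget
    root=repo_path,
    main_model=model,    # needed for token counting
    io=io_instance,
)
repo_map_text = rm.get_repo_map(
    chat_files=[],              # no files in chat (cold start)
    other_files=all_files,      # all repo files as candidates
    mentioned_fnames=set(),
    mentioned_idents=extract_identifiers(task_description),
)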
What to capture:
- Full repo-map text (tree-context format)
- Token count (tiktoken, Aider's own counting)
- Which files/symbols appear in the map
- Wall-clock generation time
Configuration:
- map_tokens=5000 (match other systems' budget)
- No files pre-loaded in chat (cold start)
- Mentioned identifiers extracted from task description (Aider's normal flow)
Installation: pip install aider-chat (pin version)
2.4 Sourcegraph / SCIP-based indexers
Invocation:
# Generate SCIP index
scip-go # or scip-typescript, scip-python, etc.
# Query via Sourcegraph CLI or API
src search -json "<symbols from task>"
# Alternative: use scip CLI directly for symbol lookup
scip snapshot --from index.scip --format json
What to capture:
- Symbols returned with definitions and references
- Precision of cross-file references (compiler-accurate)
- Token count of formatted output
- Index generation time

Configuration:
- Language-appropriate SCIP indexer
- Local mode (no Sourcegraph instance required for SCIP)
- For Sourcegraph API comparison: use sourcegraph.com search on public repos
Note: SCIP provides precise navigation, not task-oriented retrieval. The benchmark adapter must translate a task description into symbol queries (extract identifiers, search for definitions). This represents the "expert user with precise tools" baseline.
2.5 Raw grep/ripgrep baseline
Invocation:
# Extract keywords from task description (simple: split on spaces, filter stopwords)
keywords=$(echo "$task" | extract_keywords)
# Search
for kw in $keywords; do
rg -n --type <lang> "$kw" <repo_path> | head -20
done
What to capture:
- Lines returned per keyword
- Unique files touched
- Token count (4 tokens/line estimate, verified with tiktoken)
- Which ground-truth symbols appear in output
- Wall-clock time

Configuration:
- Keywords: task description split on whitespace, stopwords removed, CamelCase split
- Per-keyword limit: 20 lines (simulates agent grep behavior)
- Total budget: stop when cumulative tokens exceed 5000
- File type filter: match target language
Adapter script: bench/cross-system/adapters/grep_baseline.py
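The keyword rule above (split on whitespace, drop stopwords, split CamelCase) could look roughly like the following. This is a sketch under stated assumptions, not the contents of grep_baseline.py: the stopword list, the choice to keep the unsplit identifier alongside its parts, and the function name `extract_keywords` are all illustrative.

```python
import re

# Illustrative stopword list; the benchmark's actual list is not given here.
STOPWORDS = {"a", "an", "the", "to", "for", "of", "in", "on", "and", "or", "is", "with"}

# Matches acronym runs (HTTP), capitalized or lowercase words (Server, pod), digits.
_CAMEL_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def extract_keywords(task: str) -> list[str]:
    """Return search keywords in first-seen order, deduplicated case-insensitively."""
    keywords: list[str] = []
    seen: set[str] = set()
    for raw in task.split():
        token = re.sub(r"[^A-Za-z0-9_]", "", raw)  # strip punctuation
        if not token:
            continue
        parts = _CAMEL_PART.findall(token)
        # Keep the full identifier, plus its CamelCase parts when it has several.
        candidates = [token] + (parts if len(parts) > 1 else [])
        for cand in candidates:
            key = cand.lower()
            if key in STOPWORDS or key in seen:
                continue
            seen.add(key)
            keywords.append(cand)
    return keywords
```

Keeping both `PodStatus` and its parts `Pod` and `Status` is one possible design choice: the full identifier gives precise hits, and the parts give grep a chance when the code uses a different compound.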
2.6 CodeGraphContext (CGC)
Invocation:
# Index
cgc index <path> --backend kuzu
# Query via MCP tool
# Tool: search_symbols
# Input: { "query": "<task description>", "limit": 20 }
# Or via CLI
cgc search "<keywords>" --limit 20
What to capture:
- Returned symbols with metadata
- Token count of response
- Wall-clock latency
- Index time and database size

Configuration:
- KuzuDB backend (default, embedded)
- Default Tree-sitter parsing
- No SCIP enhancement (baseline comparison)
Installation: pip install codegraphcontext (pin version)
3. Evaluation Corpus
3.1 Repository Selection
The original corpus design used 9 repositories (the evaluated corpus has since grown to 17; see Implementation Progress), chosen for diversity along these axes:
| Repo | Language | Size (LOC) | Why |
|---|---|---|---|
| kubernetes/kubernetes | Go | ~3.5M | Large, well-structured, deep call chains |
| microsoft/vscode | TypeScript | ~1M | Large, classes/services/DI/inheritance |
| pallets/flask | Python | ~30K | Small, clear package boundaries, well-documented |
| rust-lang/cargo | Rust | ~200K | Medium, strong type system, module hierarchy |
| django/django | Python | ~350K | Large framework, cross-package dependencies |
| apache/kafka | Java | ~800K | Enterprise Java, deep class hierarchies |
| sparklemotion/spark-java | Java | ~14K | Small Java web framework |
| ThreeMammals/Ocelot | C# | ~50K | .NET API gateway, C# coverage |
| vercel/next.js | TypeScript | ~500K | Large TS framework, module boundaries |
Exclusion: The knowing repo itself is NOT in the evaluation corpus. This prevents home-field advantage from fixtures tuned to knowing's own structure.
Version pinning: Each repo is pinned to a specific commit SHA at benchmark
creation time. Document in bench/cross-system/corpus/repos.yaml:
repos:
  - name: kubernetes
    url: https://github.com/kubernetes/kubernetes
    commit: <sha>
    language: go
    module: k8s.io/kubernetes
  - name: vscode
    url: https://github.com/microsoft/vscode
    commit: <sha>
    language: typescript
  - name: flask
    url: https://github.com/pallets/flask
    commit: <sha>
    language: python
  - name: cargo
    url: https://github.com/rust-lang/cargo
    commit: <sha>
    language: rust
  - name: django
    url: https://github.com/django/django
    commit: <sha>
    language: python
3.2 Ground Truth Tasks
Each repository gets ~20 tasks (167 total across 9 repos), distributed across 3 difficulty tiers:
| Tier | Tasks/repo | Characteristics |
|---|---|---|
| Easy (single-package) | 8 | All relevant symbols in one package/module |
| Medium (cross-package) | 8 | Symbols span 2-4 packages |
| Hard (cross-system) | 4 | Symbols span 5+ packages, require deep traversal |
3.3 Task Sources
Tasks are derived from three sources to ensure realism:
Source A: SWE-bench instances (40 tasks)
Select 40 tasks from SWE-bench that target our corpus repos (django, flask). For each:
1. Use the issue description as the task query
2. Use the gold patch's modified symbols as ground truth
3. Include any symbols the patch imports or calls that were not previously imported
This gives realistic "developer needs context for this issue" scenarios with objectively correct ground truth (the symbols the fix actually used).
Source B: Manual expert labeling (40 tasks)
For repos not in SWE-bench (kubernetes, VS Code, cargo), create tasks manually:
1. Pick a recent merged PR (last 6 months)
2. Write a task description from the PR title/description (before seeing the diff)
3. Label ground truth from the PR's actual symbol modifications and their immediate callers/callees
This simulates "developer reads the issue, asks for context before implementing."
Source C: Synthetic cross-cutting tasks (20 tasks)
Create tasks that stress cross-package retrieval:
- "Refactor error handling across the HTTP stack"
- "Add tracing to all database operations"
- "Update authentication to support OAuth2"
Ground truth: manually trace which symbols would need modification, using the repo's actual architecture. Two independent labelers; inter-rater agreement required (see Section 7).
3.4 Task Fixture Format
# bench/cross-system/corpus/tasks/kubernetes/easy/01-add-pod-status-field.yaml
id: "k8s-easy-01"
repo: kubernetes
commit: <sha> # must match repos.yaml
source: "manual" # or "swe-bench" or "synthetic"
source_ref: "https://github.com/kubernetes/kubernetes/pull/12345" # if from PR
difficulty: easy
task: "Add a new condition type to PodStatus for tracking init container readiness"
ground_truth:
- pkg/apis/core/types.PodConditionType
- pkg/apis/core/types.PodStatus
- pkg/apis/core/types.PodCondition
- pkg/kubelet/status/status_manager.SetPodStatus
- staging/src/k8s.io/api/core/v1/types.PodConditionType
tags: [single-package, type-extension, api-types]
notes: "Derived from PR #12345. Ground truth = symbols modified + direct callers."
4. Metrics
4.1 Primary Metrics
Precision@K
Fraction of returned results that are actually relevant.
Precision@K = |{relevant} ∩ {returned top-K}| / K
Measured at K = 5, 10, 20. K=10 is the primary comparison point (matches knowing's existing eval). K=5 captures "first screen" quality. K=20 captures deeper retrieval.
Recall@K
Fraction of ground-truth symbols found in the top-K results.
Recall@K = |{relevant} ∩ {returned top-K}| / |{ground-truth}|
Same K values. Note: Recall@K can exceed 1.0 if the system returns multiple symbols matching the same ground-truth entry (via substring matching).
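The two definitions above translate directly into code. The sketch below assumes exact set membership as the relevance test (the benchmark itself uses the matching rule from Section 5.4) and divides precision by K even when fewer than K results are returned, as the formula states; function names are illustrative.

```python
def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """|relevant ∩ top-K| / K. Short result lists are still divided by K."""
    hits = sum(1 for symbol in ranked[:k] if symbol in relevant)
    return hits / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """|relevant ∩ top-K| / |ground truth|; 0.0 when ground truth is empty."""
    if not relevant:
        return 0.0
    hits = sum(1 for symbol in ranked[:k] if symbol in relevant)
    return hits / len(relevant)
```

For example, if 3 of the top 10 results are among 5 ground-truth symbols, P@10 is 3/10 and R@10 is 3/5. With exact matching on deduplicated results, recall cannot exceed 1.0; that only happens under the looser substring matching noted above.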
NDCG@K (Normalized Discounted Cumulative Gain)
Rewards systems that rank relevant symbols higher.
DCG@K = Σ_{i=1}^{K} rel(i) / log2(i + 1)
IDCG@K = DCG for the ideal ranking
NDCG@K = DCG@K / IDCG@K
Where rel(i) = 1 if result at rank i is relevant, 0 otherwise. NDCG@10 is
the primary ordering metric.
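A minimal sketch of the NDCG@K computation with binary relevance follows. One assumption is labeled explicitly: the ideal ranking is taken to place min(|ground truth|, K) relevant symbols at the top, which is a common convention but is not spelled out in this document.

```python
import math


def dcg_at_k(rels: list[int], k: int) -> float:
    """Sum of rel(i) / log2(i + 1) over ranks i = 1..K."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))


def ndcg_at_k(rels: list[int], num_relevant: int, k: int) -> float:
    """rels[i] is 1 if the result at rank i+1 is relevant, else 0.

    num_relevant is the size of the ground-truth set, used to build the
    ideal ranking (assumption: all ground-truth symbols ranked first).
    """
    ideal = [1] * min(num_relevant, k)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(rels, k) / idcg if idcg > 0 else 0.0
```

As a worked case, results [relevant, irrelevant, relevant] with two ground-truth symbols give DCG@3 = 1 + 1/log2(4) = 1.5 and IDCG@3 = 1 + 1/log2(3) ≈ 1.631, so NDCG@3 ≈ 0.920.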
Token Efficiency
Relevant symbols per token consumed.
TokenEfficiency = |{relevant} ∩ {returned}| / tokens_consumed
Where tokens_consumed is the total token count of the system's output
(measured via tiktoken cl100k_base). This penalizes verbose systems that
return relevant results buried in noise.
4.2 Secondary Metrics
Mean Reciprocal Rank (MRR)
MRR = 1/|Q| * Σ_{q∈Q} 1/rank_q
Where rank_q is the position of the first relevant result for query q.
Captures "how quickly does the developer get oriented?"
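The MRR formula can be sketched as below. The treatment of queries with no relevant result (contributing 0 to the sum) is an assumption; the document does not say how such queries are handled.

```python
def mean_reciprocal_rank(per_query_rels: list[list[int]]) -> float:
    """Average of 1/rank of the first relevant result across queries.

    Each inner list holds binary relevance flags in rank order.
    Queries with no relevant result contribute 0 (assumed convention).
    """
    if not per_query_rels:
        return 0.0
    total = 0.0
    for rels in per_query_rels:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(per_query_rels)
```

Three queries whose first relevant results sit at rank 2, rank 1, and nowhere yield (1/2 + 1 + 0) / 3 = 0.5.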
Time to Context (TTC)
Wall-clock seconds from query submission to complete response. Measured as:
- TTC_cold: first query on a freshly indexed repo
- TTC_warm: subsequent query on an already-indexed repo
- TTC_index: one-time indexing cost (amortized across queries)
F1@K
Harmonic mean of Precision@K and Recall@K.
F1@K = 2 * (P@K * R@K) / (P@K + R@K)
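As a small worked example, the helper below evaluates the formula and guards the zero-denominator case. Applying it to the aggregate knowing figures reported in Key Results (P@10 = 0.330, R@10 = 0.460) gives roughly 0.384; note that this is the F1 of the aggregate means, which is generally not the same as the mean of per-task F1 scores. The function name is illustrative.

```python
def f1_at_k(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    denom = precision + recall
    if denom == 0:
        return 0.0
    return 2 * precision * recall / denom


# F1 of the aggregate means from Key Results (not a mean of per-task F1).
aggregate_f1 = f1_at_k(0.330, 0.460)
```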
4.3 Longitudinal Metrics (Learning Curve)
Only applicable to systems with feedback/learning mechanisms (knowing, potentially GitNexus):
Precision Delta After Feedback
Run the same task set twice:
1. Round 1: cold start (no prior feedback)
2. Between rounds: record feedback for correct results
3. Round 2: same tasks, measure improvement
LearningGain = P@10_round2 - P@10_round1
4.4 Staleness Metrics
Staleness Recovery Time
- Index the repo at commit C1
- Introduce a code change (simulate commit C2: rename a function, add a file)
- Query a task that depends on the changed code
- Measure: does the system detect the stale context? How quickly does it update?
StalenessRecovery = time from code change to correct context response
Systems without incremental update (Aider, grep) get TTC_cold as their recovery time.
5. Methodology
5.1 Environment
All benchmarks run on the same machine to eliminate hardware variance:
- Apple M-series (M2 Pro or better) or equivalent Linux (8+ cores, 32GB RAM)
- All repos cloned locally (no network latency for file access)
- Each system gets a warm filesystem cache (read all files once before timing)
- Three runs per measurement; report median
5.2 Execution Protocol
For each (system, repo, task) triple:
1. Clone repo at pinned commit (or verify existing clone matches)
2. Clear any system-specific caches/databases
3. INDEX PHASE:
   - Start timer
   - Run system's indexing command
   - Stop timer -> index_time
4. QUERY PHASE (cold):
   - Start timer
   - Submit task description to system
   - Collect response
   - Stop timer -> query_time_cold
5. QUERY PHASE (warm, 3 repetitions):
   - Start timer
   - Submit same task description again
   - Collect response
   - Stop timer -> query_time_warm (take median)
6. PARSE PHASE:
   - Extract symbols from response (system-specific parser)
   - Normalize to qualified names
   - Match against ground truth
   - Compute metrics
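The timing part of the protocol (index, one cold query, three warm queries with the median reported) can be sketched as follows. The `index` and `query` callables stand in for a system adapter and are hypothetical; the real harness lives in harness_test.go and may be structured differently.

```python
import statistics
import time
from typing import Callable


def timed(fn: Callable[[], object]) -> tuple[object, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def run_task(
    index: Callable[[], object],
    query: Callable[[], object],
    warm_reps: int = 3,
) -> dict:
    """Time the index phase, one cold query, and warm_reps warm queries."""
    _, index_time = timed(index)
    response, query_time_cold = timed(query)  # response kept for the parse phase
    warm = [timed(query)[1] for _ in range(warm_reps)]
    return {
        "response": response,
        "index_time": index_time,
        "query_time_cold": query_time_cold,
        "query_time_warm": statistics.median(warm),
    }
```

Cache clearing and repo checkout (steps 1 and 2) are left to the caller, since they are system-specific.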
5.3 Symbol Normalization
Different systems return symbols in different formats. Normalize all to:
<package_path>.<TypeName>.<MethodName>
Rules:
- Strip leading module paths (e.g., k8s.io/kubernetes/ prefix)
- Preserve package-relative paths (e.g., pkg/kubelet/status/status_manager)
- Functions: package.FuncName
- Methods: package.Type.Method
- Types: package.TypeName
Normalization code lives in bench/cross-system/normalize.go.
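To make the rules concrete, here is a minimal Python sketch of two of them: stripping a leading module path and flattening Go receiver syntax into `package.Type.Method`. It is not a port of normalize.go; the function name, the `module_prefixes` parameter, and the receiver-handling step are assumptions for illustration.

```python
import re


def normalize_symbol(symbol: str, module_prefixes: tuple = ()) -> str:
    """Normalize a symbol name to a package-relative qualified form."""
    s = symbol.strip()
    # Strip the first matching leading module path, e.g. "k8s.io/kubernetes/".
    for prefix in module_prefixes:
        p = prefix.rstrip("/") + "/"
        if s.startswith(p):
            s = s[len(p):]
            break
    # Go receiver syntax: pkg.(*Type).Method -> pkg.Type.Method
    return re.sub(r"\(\*?(\w+)\)", r"\1", s)
```

For instance, `k8s.io/kubernetes/pkg/kubelet/status.(*manager).SetPodStatus` with prefix `k8s.io/kubernetes` would become `pkg/kubelet/status.manager.SetPodStatus`, while an already package-relative name is returned unchanged.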
5.4 Ground Truth Matching
A returned symbol matches a ground-truth entry if:
- The normalized ground-truth string is a substring of the normalized result, OR
- The normalized result is a substring of the normalized ground-truth string
This handles:
- Partial qualification (store.NodesByName matches internal/store.SQLiteStore.NodesByName)
- Over-qualification (system returns full path, ground truth uses short form)
Matching code reuses knowing's existing isRelevant() logic from bench/context-relevance/.
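The bidirectional substring rule as stated can be written in a few lines. This sketch covers only that literal rule; it is not a copy of isRelevant(). Two caveats follow from the text itself: plain containment does not by itself match `store.NodesByName` against `internal/store.SQLiteStore.NodesByName` (the type name sits in between), so that case presumably relies on extra component-wise logic, and the status table notes that the harness later moved to dot-bounded matching (session 21).

```python
def is_match(result: str, truth: str) -> bool:
    """True when either normalized string contains the other.

    Empty strings never match, to avoid the trivial "" in s == True case.
    """
    if not result or not truth:
        return False
    return truth in result or result in truth


def count_hits(results: list[str], ground_truth: list[str]) -> int:
    """Number of returned symbols that match at least one ground-truth entry."""
    return sum(1 for r in results if any(is_match(r, t) for t in ground_truth))
```

Because several returned symbols can match the same ground-truth entry under this rule, `count_hits` can exceed the number of ground-truth entries, which is how Recall@K can exceed 1.0.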
5.5 Token Counting
All systems' output is measured with tiktoken (cl100k_base encoding):
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
tokens = len(enc.encode(system_output_text))
This provides a uniform cost metric regardless of how each system formats output.
5.6 Statistical Significance
For system comparisons, report:
- Mean and standard deviation across all tasks
- Per-tier breakdown (easy/medium/hard)
- Paired Wilcoxon signed-rank test (p < 0.05) for each system pair
- Effect size (Cohen's d) for practical significance
- 95% confidence intervals via bootstrap (1000 resamples)

A difference is only claimed as "significant" if both hold:
1. p < 0.05 on paired test
2. Cohen's d > 0.3 (at least small effect size)
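The effect-size, bootstrap, and decision steps can be sketched with the standard library alone. Assumptions are labeled: Cohen's d is computed on paired differences (mean difference over the standard deviation of differences, one of several variants), the bootstrap uses a simple percentile interval, and the p-value is expected to come from a paired Wilcoxon test computed elsewhere (for example `scipy.stats.wilcoxon`). The actual stats.go may differ on each of these points.

```python
import random
import statistics


def cohens_d_paired(a: list[float], b: list[float]) -> float:
    """Paired-samples effect size: mean(diff) / stdev(diff)."""
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    return statistics.mean(diffs) / sd if sd > 0 else 0.0


def bootstrap_ci(a: list[float], b: list[float], resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean paired difference."""
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    diffs = [x - y for x, y in zip(a, b)]
    means = sorted(
        statistics.mean(rng.choices(diffs, k=len(diffs))) for _ in range(resamples)
    )
    lo_idx = max(0, round((alpha / 2) * resamples))
    hi_idx = min(resamples - 1, round((1 - alpha / 2) * resamples) - 1)
    return means[lo_idx], means[hi_idx]


def is_significant(p_value: float, d: float) -> bool:
    """Both criteria from the text: p < 0.05 and |d| > 0.3."""
    return p_value < 0.05 and abs(d) > 0.3
```

Here `a` and `b` are per-task scores (for example P@10) for two systems on the same tasks, in the same order.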
6. Fairness Controls
6.1 No Home-Field Advantage
- knowing's own repo is excluded from the corpus
- No fixtures derived from knowing's existing eval (those test knowing-specific structure)
- All systems get the same task descriptions verbatim (no rephrasing for any system)
- Ground truth is derived from actual code changes, not from knowing's graph structure
6.2 Configuration Fairness
Each system uses its recommended defaults:
- No system-specific tuning for benchmark repos
- No custom configuration files beyond what <system> init generates
- Token budgets matched across all systems (5000 tokens primary, also test 2000 and 10000)
- If a system has no budget control, capture its full output but measure metrics at matched K
6.3 Cold Start vs Warm Start
Report all three:
- Cold start: No prior indexing, no feedback, no session history
- Warm start: After indexing (but before any task-specific feedback)
- Learned: After one round of feedback (only for systems that support it)
This prevents penalizing systems that require upfront investment (indexing) while also measuring that investment's payoff.
6.4 Query Formulation Fairness
The task description is passed verbatim to all systems. Systems that require different query formats (e.g., keyword extraction for grep) use a fixed, documented adapter that:
- Extracts keywords via the same algorithm for all keyword-based systems
- Does not use system-specific query optimization
- Is published as part of the benchmark code
6.5 Version Pinning
All system versions are pinned and documented:
# bench/cross-system/versions.yaml
systems:
  knowing: v0.10.1 # 38 edge types, embedding re-ranker, density-adaptive
  codegraph: latest (npm)
  gitnexus: latest (npm)
  gortex: latest (go install)
  grep/ripgrep: 14.x.x
  aider: latest (pip) # timed out
  codebase-memory: latest (npm) # timed out
6.6 Independent Ground Truth Verification
Ground truth labels are verified by a second reviewer who did NOT create the original labels. Disagreements are resolved by examining the actual PR diff and documenting the resolution.
7. Ground Truth Labeling Protocol
7.1 Who Labels
- Primary labeler: Developer familiar with the target repo (reads code, understands architecture)
- Verifier: Second developer who independently reviews labels against the source PR/issue
7.2 Labeling Criteria
A symbol is "ground truth relevant" if an expert developer would need to see its definition or signature to accomplish the task. Specifically:
Include:
- Symbols directly modified by the task's solution
- Symbols called by the modified code that the developer needs to understand
- Type definitions the developer needs to see to write correct code
- Interface definitions that constrain the implementation

Exclude:
- Standard library symbols (os.Open, fmt.Sprintf, etc.)
- Test helpers and mock implementations
- Symbols the developer would already know from the task description
- Transitive dependencies more than 2 hops from modified code
7.3 Labeling Process
For each task:
1. Read the task description (issue/PR title + body)
2. Read the gold-standard solution (PR diff)
3. List all symbols modified in the diff
4. For each modified symbol:
   a. Add its direct callers (1 hop) that provide necessary context
   b. Add type definitions it references
   c. Add interface definitions it must satisfy
5. Remove standard library symbols
6. Remove test-only symbols (unless the task is about tests)
7. Cap at 15 symbols per task (forces prioritization of most important)
8. Record confidence: HIGH (obviously needed) or MEDIUM (helpful but not critical)
7.4 Inter-Rater Agreement
Measure Cohen's kappa between primary labeler and verifier:
- kappa > 0.8: strong agreement, labels are reliable
- 0.6 <= kappa <= 0.8: moderate agreement, discuss disagreements
- kappa < 0.6: weak agreement, re-examine labeling criteria
Target: kappa > 0.75 before running the benchmark.
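For reference, Cohen's kappa for two raters making include/exclude decisions can be computed as below. The sketch assumes both raters judge the same ordered list of candidate symbols; how that candidate pool is built (for example, the union of symbols either rater considered) is not specified in this document and is left to the caller.

```python
def cohens_kappa(rater_a: list[bool], rater_b: list[bool]) -> float:
    """Cohen's kappa for two raters with binary (include/exclude) labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty candidate list")
    n = len(rater_a)
    # Observed agreement: fraction of candidates labeled identically.
    p_observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal include rate.
    pa = sum(rater_a) / n
    pb = sum(rater_b) / n
    p_chance = pa * pb + (1 - pa) * (1 - pb)
    if p_chance == 1.0:
        return 1.0  # both raters gave one identical label to everything
    return (p_observed - p_chance) / (1 - p_chance)
```

With 10 candidates, 8 identical decisions, and both raters including half of them, observed agreement is 0.8 and chance agreement is 0.5, so kappa = (0.8 - 0.5) / (1 - 0.5) = 0.6, which is below the 0.75 target.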
7.5 Label Storage Format
# bench/cross-system/corpus/tasks/kubernetes/easy/01-add-pod-status-field.yaml
ground_truth:
  - symbol: "pkg/apis/core/types.PodConditionType"
    confidence: HIGH
    reason: "Type being extended"
  - symbol: "pkg/kubelet/status/status_manager.SetPodStatus"
    confidence: MEDIUM
    reason: "Caller that validates conditions"
labeler: "developer-A"
verifier: "developer-B"
agreement: 0.85 # Cohen's kappa for this task
disputes:
  - symbol: "pkg/apis/core/validation.ValidatePodStatus"
    resolution: "included"
    rationale: "Developer must understand validation to add a new condition"
8. Analysis Framework
8.1 Primary Comparisons
Table 1: Overall Performance (primary result)
| System | P@5 | P@10 | P@20 | R@10 | NDCG@10 | MRR | TokenEff |
|-------------|------|------|------|------|---------|------|----------|
| knowing | | | | | | | |
| GitNexus | | | | | | | |
| Aider | | | | | | | |
| Sourcegraph | | | | | | | |
| grep | | | | | | | |
| CGC | | | | | | | |
Table 2: Per-Tier Breakdown
| System | Easy P@10 | Easy R@10 | Med P@10 | Med R@10 | Hard P@10 | Hard R@10 |
|----------|-----------|-----------|----------|----------|-----------|-----------|
| ... | | | | | | |
Table 3: Per-Repo Breakdown
Shows whether any system has a language-specific advantage.
Table 4: Token Efficiency at Fixed Recall
For each system, find the minimum token budget needed to achieve R@10 >= 0.5. Lower is better.
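One way to derive the Table 4 value is to measure R@10 at each tested budget and take the smallest budget that reaches the target. The sketch below assumes a discrete set of budgets (Section 6.2 lists 2000, 5000, and 10000) and a caller-supplied `recall_at` function; both the function name and that interface are hypothetical.

```python
from typing import Callable, Iterable, Optional


def min_budget_for_recall(
    budgets: Iterable[int],
    recall_at: Callable[[int], float],
    target: float = 0.5,
) -> Optional[int]:
    """Smallest tested budget whose measured recall meets the target.

    Returns None when no tested budget reaches the target, which should be
    reported as such rather than extrapolated.
    """
    for budget in sorted(budgets):
        if recall_at(budget) >= target:
            return budget
    return None
```

Scanning in ascending order avoids assuming that recall is monotonic in the budget; if it is not, the function still returns the first budget that meets the target.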
8.2 Visualizations
- Precision-Recall curves (one per system, overlaid): vary K from 1 to 50
- Token efficiency scatter: X = tokens consumed, Y = recall achieved (one point per task)
- Radar chart: 6 axes (P@10, R@10, NDCG@10, TokenEff, TTC, MRR), one polygon per system
- Per-task heatmap: rows = tasks (sorted by difficulty), columns = systems, color = P@10
- Learning curve (knowing only): P@10 over 5 feedback rounds
- Box plots: distribution of P@10 scores per system (shows variance, not just mean)
- Statistical significance matrix: pairwise p-values between all system pairs
8.3 Failure Analysis
For each system, categorize failures:
- Vocabulary miss: task uses different words than the code (e.g., "authentication" vs "auth")
- Depth miss: relevant symbol is >2 hops from any keyword match
- Noise overwhelm: relevant symbols exist in results but below K cutoff
- Language gap: system doesn't support the target language well
- Scale failure: system degrades on large repos (>1M LOC)
Document the top-3 failure modes per system. This informs improvement priorities.
8.4 Output Format
Results are written to:
bench/cross-system/results/
  run-<timestamp>/
    raw/
      knowing-kubernetes-easy-01.json
      gitnexus-kubernetes-easy-01.json
      ...
    aggregated/
      overall.csv
      per-tier.csv
      per-repo.csv
      per-task.csv
    analysis/
      significance-tests.json
      failure-analysis.md
    FINDINGS.md # auto-generated summary
8.5 Interpretation Guidelines
When reporting results:
- Never claim "X is better than Y" without statistical significance
- Report effect sizes alongside p-values
- Acknowledge when systems solve different problems (SCIP provides navigation, not discovery)
- Note any system that was disadvantaged by the benchmark design
- Separate "retrieval quality" from "system maturity" in conclusions
9. Iteration Protocol
9.1 Using Results to Improve knowing
After each benchmark run:
- Identify failure categories (Section 8.3) for knowing specifically
- Prioritize by impact: which failure mode, if fixed, would improve the most tasks?
- Implement the fix in knowing's context engine
- Re-run the benchmark on the same corpus (no fixture changes between iterations)
- Record the delta in the experiment log
9.2 When to Update Ground Truth
Ground truth fixtures are updated ONLY when:
- A labeling error is discovered (wrong symbol, incorrect match criteria)
- The pinned repo commit changes (new benchmark version)
- Inter-rater agreement reveals ambiguity requiring clarification
Ground truth is NEVER updated to make any system look better.
9.3 Benchmark Versioning
bench/cross-system/CHANGELOG.md
## v1.0 (initial)
- 7 repos, ~117 tasks, 7 systems
- Pinned commits: [list SHAs]
## v1.1 (if ground truth corrections needed)
- Corrected 3 task labels per Section 7.4 review
- No system version changes
9.4 Comparative Learning
When another system outperforms knowing on a category:
1. Analyze WHY (what retrieval signal do they use that we don't?)
2. Document in bench/cross-system/analysis/competitive-lessons.md
3. Assess feasibility of adopting the technique
4. If adopted, re-run to verify improvement
10. Implementation Plan
Phase 1: Infrastructure (1 week)
Goal: Benchmark harness that can run any adapter against any fixture.
| Task | Effort | Output |
|---|---|---|
| Create bench/cross-system/ directory structure | 1h | Directory layout |
| Write repos.yaml with 5 pinned repos | 2h | Corpus definition |
| Implement symbol normalization (normalize.go) | 4h | Shared normalization |
| Implement metric computation (metrics.go) | 4h | P@K, R@K, NDCG, MRR, TokenEff |
| Write adapter interface (adapter.go) | 2h | Common interface for all systems |
| Implement knowing adapter | 2h | Calls ForTask directly |
| Implement grep baseline adapter | 3h | Shell-out to rg, parse results |
| Write result aggregation and FINDINGS.md generation | 4h | Auto-report |
| Statistical significance tests (Wilcoxon, bootstrap CI) | 4h | stats.go |
Deliverables:
- bench/cross-system/harness_test.go (main entry point)
- bench/cross-system/adapters/ (one file per system)
- bench/cross-system/metrics/ (computation)
- bench/cross-system/normalize.go (symbol normalization)
Phase 2: Ground Truth (2 weeks)
Goal: ~117 labeled tasks across 7 repos with inter-rater verification.
| Task | Effort | Output |
|---|---|---|
| Select 40 SWE-bench instances for django/flask | 8h | 40 fixture YAMLs |
| Label 40 manual tasks for kubernetes/VS Code/cargo | 16h | 40 fixture YAMLs |
| Create 20 synthetic cross-cutting tasks | 8h | 20 fixture YAMLs |
| Second-reviewer verification pass | 12h | Verified labels with kappa scores |
| Resolve disagreements, document | 4h | Final ground truth |
Deliverables:
- bench/cross-system/corpus/tasks/<repo>/<tier>/*.yaml (100 files)
- bench/cross-system/corpus/labeling-report.md (inter-rater agreement)
Phase 3: Adapters (1 week)
Goal: All 6 systems integrated and producing parseable output.
| Task | Effort | Output |
|---|---|---|
| Implement GitNexus adapter | 4h | MCP tool call + response parser |
| Implement Aider repo-map adapter | 6h | Python bridge to extract repo-map |
| Implement SCIP/Sourcegraph adapter | 6h | Index generation + symbol lookup |
| Implement CGC adapter | 4h | MCP tool call + response parser |
| Verify all adapters on 3 test fixtures | 4h | Smoke test passing |
Deliverables:
- bench/cross-system/adapters/gitnexus.go
- bench/cross-system/adapters/aider.py (Python, called via subprocess)
- bench/cross-system/adapters/scip.go
- bench/cross-system/adapters/cgc.go
Phase 4: Execution and Analysis (1 week)
Goal: Full benchmark run with publishable results.
| Task | Effort | Output |
|---|---|---|
| Clone all 7 repos at pinned commits | 1h | Local corpus |
| Run full benchmark (all systems x all tasks) | 8h | Raw results |
| Generate analysis (tables, charts, significance) | 4h | FINDINGS.md |
| Failure analysis per system | 8h | failure-analysis.md |
| Write summary narrative | 4h | Publishable conclusions |
Deliverables:
- bench/cross-system/results/run-<timestamp>/FINDINGS.md
- bench/cross-system/results/run-<timestamp>/analysis/
Total Estimated Effort
| Phase | Duration | Person-hours |
|---|---|---|
| Phase 1: Infrastructure | 1 week | 26h |
| Phase 2: Ground Truth | 2 weeks | 48h |
| Phase 3: Adapters | 1 week | 24h |
| Phase 4: Execution | 1 week | 25h |
| Total | 5 weeks | 123h |
11. Prior Art and References
11.1 Existing Benchmarks in Code Intelligence
| Benchmark | What it measures | Relevance |
|---|---|---|
| SWE-bench | End-to-end issue resolution by AI agents | Source of realistic tasks + ground truth patches |
| RepoEval | Repository-level code completion | Measures context retrieval for completion (narrower than our use case) |
| CrossCodeEval | Cross-file code completion | Validates that cross-file context improves completion; our benchmark measures context quality directly |
| RepoBench | Repository-level benchmarking | Similar structure; we adapt their multi-level difficulty approach |
| DevBench | Full software development lifecycle | Broader scope; our benchmark isolates the retrieval step |
| SWE-bench Verified | Human-verified subset of SWE-bench | Use this subset for highest-quality ground truth |
11.2 What We Build On
- SWE-bench task format: We adopt their issue description as task query and gold patch as ground truth source. We extend by extracting individual symbols from patches rather than measuring pass/fail on the whole issue.
- RepoEval's repo selection: Their approach of selecting repos by size/language/domain diversity informs our corpus selection.
- CrossCodeEval's cross-file analysis: Their finding that cross-file context significantly improves LLM performance validates our benchmark's focus on retrieval quality.
11.3 How This Benchmark Differs
Existing benchmarks measure end-to-end task completion (does the agent solve the issue?). This benchmark isolates the retrieval step (does the context system surface the right symbols?). This distinction matters because:
- End-to-end benchmarks confound retrieval quality with LLM capability
- Context retrieval is the only variable we can control (LLM is fixed)
- Retrieval quality is measurable independently of generation quality
- Results are actionable: they tell us exactly which symbols each system misses
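Because retrieval is scored on its own, each task reduces to comparing a ranked symbol list against the ground-truth set. The sketch below uses the standard definitions of P@K, R@K, and reciprocal rank. It assumes symbols are already normalized, compared by exact string equality, and free of duplicates in the ranking; the harness's actual matching rules may differ, and the function names are illustrative.

```go
package main

import "fmt"

// precisionAtK is the fraction of the top k results that are relevant.
// It divides by k even when fewer than k results were returned.
func precisionAtK(ranked []string, relevant map[string]bool, k int) float64 {
	if k <= 0 {
		return 0
	}
	hits := 0
	for i, s := range ranked {
		if i >= k {
			break
		}
		if relevant[s] {
			hits++
		}
	}
	return float64(hits) / float64(k)
}

// recallAtK is the fraction of relevant symbols found in the top k results.
func recallAtK(ranked []string, relevant map[string]bool, k int) float64 {
	if k <= 0 || len(relevant) == 0 {
		return 0
	}
	hits := 0
	for i, s := range ranked {
		if i >= k {
			break
		}
		if relevant[s] {
			hits++
		}
	}
	return float64(hits) / float64(len(relevant))
}

// reciprocalRank is 1/rank of the first relevant result, or 0 if none.
// MRR is the mean of this value over all tasks.
func reciprocalRank(ranked []string, relevant map[string]bool) float64 {
	for i, s := range ranked {
		if relevant[s] {
			return 1 / float64(i+1)
		}
	}
	return 0
}

func main() {
	// Illustrative symbols, not taken from the corpus.
	ranked := []string{"auth.Login", "util.Noise", "auth.Session", "db.Other"}
	relevant := map[string]bool{"auth.Login": true, "auth.Session": true, "auth.Token": true}
	fmt.Printf("P@4=%.3f R@4=%.3f RR=%.3f\n",
		precisionAtK(ranked, relevant, 4),
		recallAtK(ranked, relevant, 4),
		reciprocalRank(ranked, relevant))
}
```

The per-symbol hit test is what makes results actionable: the relevant symbols that never appear in `ranked` are the ones a system misses.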
12. Directory Structure
bench/cross-system/
  README.md                     # Quick start and overview
  harness_test.go               # Main benchmark entry point
  metrics/
    precision.go                # P@K computation
    recall.go                   # R@K computation
    ndcg.go                     # NDCG computation
    mrr.go                      # MRR computation
    token_efficiency.go         # TokenEff computation
    stats.go                    # Significance tests, bootstrap CI
  adapters/
    adapter.go                  # Interface definition
    knowing.go                  # knowing adapter
    gitnexus.go                 # GitNexus adapter
    aider.py                    # Aider repo-map adapter (Python)
    scip.go                     # SCIP/Sourcegraph adapter
    cgc.go                      # CodeGraphContext adapter
    grep_baseline.go            # ripgrep baseline adapter
  normalize.go                  # Symbol normalization
  corpus/
    repos.yaml                  # Pinned repository definitions
    tasks/
      kubernetes/
        easy/                   # 8 fixtures
        medium/                 # 8 fixtures
        hard/                   # 4 fixtures
      typescript/
        easy/
        medium/
        hard/
      flask/
        easy/
        medium/
        hard/
      cargo/
        easy/
        medium/
        hard/
      django/
        easy/
        medium/
        hard/
    labeling-report.md          # Inter-rater agreement scores
  results/
    run-<timestamp>/
      raw/                      # Per-system, per-task JSON
      aggregated/               # CSV summaries
      analysis/
        significance-tests.json
        failure-analysis.md
        competitive-lessons.md
      FINDINGS.md               # Auto-generated report
  versions.yaml                 # Pinned system versions
  CHANGELOG.md                  # Benchmark versioning
13. Running the Benchmark
Quick Start (single system, single repo)
# Run knowing against flask corpus only
GOWORK=off go test ./bench/cross-system/ -v \
-run TestCrossSystem/knowing/flask \
-timeout 10m
Full Run (all systems, all repos)
# Ensure all repos are cloned
./bench/cross-system/scripts/clone-corpus.sh
# Ensure all systems are installed
./bench/cross-system/scripts/verify-systems.sh
# Full benchmark (takes ~2 hours)
GOWORK=off go test ./bench/cross-system/ -v \
-count=3 \
-timeout 4h
Two-System Comparison
# Compare knowing vs grep baseline only
GOWORK=off go test ./bench/cross-system/ -v \
-run "TestCrossSystem/(knowing|grep)" \
-timeout 30m
Regenerate Analysis
# Re-analyze existing raw results without re-running systems
GOWORK=off go test ./bench/cross-system/ -v \
-run TestAnalyzeResults \
-timeout 5m
14. Success Criteria
The benchmark is considered successful (regardless of which system wins) if:
- Reproducibility: Two independent runs produce results within 5% of each other
- Discrimination: At least one system pair shows statistically significant difference
- Coverage: Results span the full range (no system gets 0% or 100% on all tasks)
- Fairness validation: No system's authors object to the methodology after review
- Actionability: Results identify at least 3 concrete improvements for knowing's engine
Confirmed Outcomes (26 runs)
Based on 26 iterative benchmark runs across 167 tasks:
- knowing wins on precision (P@10=0.242, 1.79x the nearest competitor codegraph)
- knowing wins on recall (R@10=0.362, only system with full-corpus recall data)
- knowing wins on token efficiency (GCF format, graph-aware packing)
- knowing wins on scalability (18s index on kubernetes, 200MB RAM vs 14GB for Gortex)
- grep wins on time to first result (no indexing overhead, but 18.6x less precise)
- codegraph is the strongest competitor (P@10=0.135) but fails on 60/167 tasks
- Aider and codebase-memory both timed out on the 30-min limit
- GitNexus cannot index enterprise repos (killed at >60 min on kubernetes)
- Embedding re-ranker was the biggest single improvement: +17% P@10, +18.3% R@10
- Dense graph repos benefit most from re-ranker (Kubernetes +92.8%)
- Session 15 regressions (VS Code -16%, Ocelot -30.8%) resolved in session 16: both show 0% P@10 delta
15. Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| System X not installable/broken | Medium | Drops one comparison | Document as "unable to evaluate," proceed with others |
| Ground truth disagreement > 25% | Low | Unreliable results | Re-examine criteria, bring in third labeler |
| Token counting inconsistency | Medium | Unfair comparison | Single tiktoken path for ALL systems |
| System requires paid API | Medium | Cannot reproduce freely | Document cost; provide cached results for verification |
| Repo too large to index in time | Low | Benchmark takes days | Set 30-min timeout per system per repo; skip with documented reason |
| knowing loses badly | Medium | Uncomfortable results | Publish honestly; losing is data; use to prioritize improvements |
16. Ethical Considerations
- All evaluated systems are used within their license terms
- GitNexus (PolyForm NC) is used for research/evaluation (permitted under NC)
- No private/proprietary code in the corpus (all repos are public)
- Results will be published with methodology, enabling contestation
- If any system's maintainers provide corrections to our integration, we incorporate them