Roadmap
What's shipped is in the changelog. This document covers what's next.
Current State (v0.15.0, 2026-06-04)
P@10 = 0.330 cold start (302 tasks, 17 repos, 8 languages). Honest measurement: no task memory, no embeddings. 38 edge types. 23 extractors. 263 equivalence classes across 30 files with multi-phrase gate. GCF default output format. 70+ experiments across 28 sessions.
Session 28 results (per-repo, all honest, 291 tasks):

| Repo | P@10 | Tasks |
|------|------|-------|
| Ripgrep | 0.464 | 11 |
| Terraform | 0.440 | 20 |
| Kafka | 0.437 | 19 |
| Jekyll | 0.425 | 20 |
| Kubernetes | 0.423 | 13 |
| Caddy | 0.410 | 20 |
| Flask | 0.328 | 18 |
| Rails | 0.325 | 20 |
| FastAPI | 0.315 | 20 |
| Ocelot | 0.280 | 20 |
| Saleor | 0.264 | 11 |
| Cross-cutting | 0.263 | 8 |
| Cargo | 0.263 | 19 |
| Spark-Java | 0.250 | 20 |
| VS Code | 0.200 | 19 |
| Django | 0.176 | 33 |
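For readers new to the metric, here is a minimal sketch of how a per-task P@10 and a task-averaged aggregate could be computed. The function names, the sample symbols, and the choice to keep the denominator fixed at k are assumptions for illustration, not the benchmark harness's actual code.

```go
package main

import "fmt"

// precisionAtK returns the number of ground-truth hits among the first k
// ranked symbols, divided by k.
// Assumption: the denominator stays k even when fewer than k results are
// returned, so short result lists are not rewarded.
func precisionAtK(ranked []string, truth map[string]bool, k int) float64 {
	if k <= 0 {
		return 0
	}
	hits := 0
	for i, sym := range ranked {
		if i >= k {
			break
		}
		if truth[sym] {
			hits++
		}
	}
	return float64(hits) / float64(k)
}

// meanScore averages per-task scores into one aggregate number.
func meanScore(scores []float64) float64 {
	if len(scores) == 0 {
		return 0
	}
	sum := 0.0
	for _, s := range scores {
		sum += s
	}
	return sum / float64(len(scores))
}

func main() {
	// Hypothetical task: two of the four returned symbols are ground truth.
	ranked := []string{"Router.Handle", "parseConfig", "Server.Start", "logf"}
	truth := map[string]bool{"parseConfig": true, "Server.Start": true, "Server.Stop": true}
	fmt.Printf("P@10 = %.3f\n", precisionAtK(ranked, truth, 10))
}
```

Under this definition a zero-scoring task is one where none of the top 10 symbols is in the ground truth.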
Key breakthroughs (sessions 23-28):
1. Framework equivalence classes with forced injection (session 23, +57%). 263 classes across 30 files. Multi-phrase gate added session 28 (+9.6%): isStrongEquivMatch prevents single-word flooding.
2. Code pattern keyword extraction (session 28): extractCodePatterns detects method calls, Class.method paths in task descriptions.
3. GCF default output format (session 27): 84% fewer tokens than JSON, 100% LLM comprehension at 500 symbols. Standalone gcf-go library.
4. Zero-task audit cycle: use bench-task to diagnose each zero, add defensible equiv classes, verify per-repo before full corpus.
5. Adaptive retrieval for massive repos (>200K nodes): falls back to direct FTS + contains-edge expansion when RWR produces flat results.
6. Language scoping: Lang field restricts framework classes to matching repos. detectRepoLanguage() from node QN patterns.
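The multi-phrase gate in item 1 can be sketched as follows. This is an illustration of the stated rule (at least two matched phrases, or one multi-word phrase), not the actual isStrongEquivMatch code; the function name, the case-insensitive dedup, and the sample phrases are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// isStrongMatchSketch reports whether the phrases of one equivalence class
// that matched a task description are strong enough to allow injection:
// either one matched phrase contains more than one word, or at least two
// distinct phrases matched.
func isStrongMatchSketch(matched []string) bool {
	distinct := make(map[string]bool)
	for _, p := range matched {
		p = strings.ToLower(strings.TrimSpace(p))
		if p == "" {
			continue
		}
		if len(strings.Fields(p)) > 1 {
			return true // a multi-word phrase is specific enough on its own
		}
		distinct[p] = true
	}
	return len(distinct) >= 2
}

func main() {
	fmt.Println(isStrongMatchSketch([]string{"command"}))            // single generic word
	fmt.Println(isStrongMatchSketch([]string{"command", "palette"})) // two phrases
	fmt.Println(isStrongMatchSketch([]string{"command palette"}))    // one multi-word phrase
}
```

With this rule a task that only mentions "command" no longer triggers framework injection, which is the flooding case the gate was added to prevent.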
Critical findings (sessions 23-28):
1. Task memory contaminated all prior measurements (26K stale entries). Disabled in adapter.
2. Embeddings confirmed dead neutral (3 runs: 0.176/0.175/0.176).
3. Re-indexing without LSP produces fewer edges than original (tree-sitter only). Corpus DBs already fully enriched, re-indexing provides no benefit.
4. Keyword extraction fix and path boost are dead ends (both net negative).
5. Single-word equiv phrases (e.g., "command") can trigger framework injection that floods top-10 with infrastructure symbols. Fixed by multi-phrase gate (session 28).
Immediate Priorities
| # | Item | Why | Effort | Expected Impact |
|---|---|---|---|---|
| 5b | ~~Incremental RWR phase 2: per-package cache keys~~ | SHIPPED (session 26). computeRWRCacheHash includes per-package Merkle roots. Unchanged packages keep cached walks. See Shipped table. | - | - |
| 6 | ~~Delta context packing~~ | SHIPPED (session 27). See Shipped table below. | - | - |
| 10 | AI-generated evaluation corpus | LLM generates tasks + ground truth, DB-validated. Hybrid: hand-curated for regression, AI-generated for coverage. | Medium | Eval credibility |
| 11 | More equiv class coverage | Message queues (RabbitMQ, Redis), cloud SDKs (AWS, GCP), build systems (Make, Gradle), observability (OpenTelemetry, Prometheus). | Ongoing | Incremental P@10 |
| 12 | Zero-task audit cycle | 57 zero-scoring tasks (19.6%). 80% noise (wrong neighborhood), 10.5% related_name (right concept, wrong sibling), 8.6% test symbols. Django 13 zeros, Rails 6, VS Code 5. Each cracked zero adds ~0.003 to aggregate. | Medium | +0.01-0.02 P@10 |
| 13 | ~~Sibling ranking by blast radius~~ | TESTED HARMFUL (session 28). Both global and package-scoped leaf-name dedup regressed aggregate (-0.009 and -0.006). Common names too frequent within packages. See Tested Neutral/Harmful table. | - | - |
| 14 | Enricher repo hash mismatch | enrich lsp computes a different repo hash than index, causing "no snapshot found" when enriching a previously indexed DB. Workaround: run index without -no-enrich (single pass). Root cause: enricher and indexer use different inputs to hash the repo identity. | Low | Bug fix |
| 15 | More framework-using repos | Cal.com (TypeScript scheduling, indexed session 28), Redash (Python BI), Discourse (Ruby forum). Domain equiv classes generalize. | Medium | Eval credibility + P@10 |
| 16 | Extract benchmark to standalone repo (CRET) | Session 26 audit: 18 files clean, 5 trivial decouples, ~2 hours. Context packing benchmark also extractable. | Low | Credibility |
Shipped (sessions 23-28)
| Item | Session | Result |
|---|---|---|
| Multi-phrase equiv gate | 28 | isStrongEquivMatch: require >= 2 phrases or multi-word phrase for framework injection. Fixes VSCODE_COMMAND flooding. P@10 0.293 -> 0.321 (+9.6%). |
| Code pattern keyword extraction | 28 | extractCodePatterns detects method calls, Class.method paths, dotted paths with underscores in task descriptions. Injected as Compounds before standard extraction. |
| GCF default format | 27 | All MCP context tools emit GCF by default. 84% fewer tokens than JSON, 100% LLM comprehension at 500 symbols. gcf-go extracted as standalone library. |
| Fixture cleanup (17 removed) | 27-28 | 8 unresolvable ground truth + 9 ripgrep dependency crate fixtures. 308 -> 291 tasks. |
| Delta context packing (#6) | 27 | Structural diff on pack_root mismatch. 81.2% token savings at 96.6% symbol overlap (re-query benchmark). DiffPacks + EncodeDelta + MCP wiring. bench/delta-packing/ proves it. |
| Cross-task vocab validation (#3a) | 26 | Django +41.4%, corpus 0.0% (safe). Noise filter, soft RRF, confidence weighting. Mechanism #13. |
| Incremental RWR (#5 + #5b) | 26 | Merkle-cached walks with per-package cache keys. computeRWRCacheHash includes per-package Merkle roots; unchanged packages keep cached walks. Django cold 3.9s -> warm 1.9s (2x). P@10 correctness verified. debug-rwr-cache CLI. |
| Vocab association expiration (#3c) | 26 | Per-package Merkle roots. persistPackageRoots at index time, LoadPackageRoots + PackageRootForSymbol at query time. Both engine and MCP wired. Migration 022. |
| Platform deploy (#8) | 26 | api.blackwell-systems.com live. SQLite key store, admin CLI (create/list/revoke), hardened systemd, Cloudflare Tunnel, firewall. $12/mo. |
| Context packing benchmark | 26 | 4 strategies, 300 tasks. File-grouped tested negative (-10.8% P@10). Density-ranked confirmed optimal. |
| CRET extraction audit | 26 | 18 files clean, 5 trivial, ~2 hours. Proposal updated. |
| Doc cleanup | 26 | Renamed EVALUATION-OVERVIEW, deleted AGENT-EFFICIENCY-STUDY, updated 10+ docs |
| CI fix (merkle-diff, eval, scoped FTS) | 26 | Short-mode skip for flaky bench tests, removed eval gate from CI |
| Fix missing inheritance edges | 24 | 5,581 phantom extends edges eliminated on Django |
| Wire remaining 5 resolvers | 24 | All 7 wired. Kafka/Java 596K, Django/Python 58K, Cargo/Rust 27K edges |
| Process framework classes first | 23 | Already solved via equivSeen bypass |
| Add framework-using repos | 24 | saleor/saleor added, P@10=0.236, equiv classes validated on app code |
| Blog post | 24 | Updated with v0.13.0 numbers, methodology, honest findings |
| v0.13.0 release | 24 | P@10=0.278, 300 tasks, 16 repos, 8 languages |
| Task-noise correlation (implicit feedback) | 24 | Moved to engine, +5.9% P@10 on Django. Task memory disabled (neutral). Sole learning mechanism. |
| Equiv class auto-discovery | 24 | Deferred (language scoping sufficient) |
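The incremental RWR row above relies on per-package Merkle roots to decide which cached walks survive a re-index. A minimal sketch of that comparison follows, assuming roots are available as a map from package path to a root string; the function name and the sample data are hypothetical, and the real cache key computation (computeRWRCacheHash) is not shown here.

```go
package main

import (
	"fmt"
	"sort"
)

// stalePackages compares per-package Merkle roots from the previous and the
// current index. It returns the packages whose cached walks can no longer be
// reused: packages that are new, removed, or whose root changed.
func stalePackages(prev, curr map[string]string) []string {
	var stale []string
	for pkg, root := range curr {
		if old, ok := prev[pkg]; !ok || old != root {
			stale = append(stale, pkg)
		}
	}
	for pkg := range prev {
		if _, ok := curr[pkg]; !ok {
			stale = append(stale, pkg)
		}
	}
	sort.Strings(stale) // deterministic order for logging and tests
	return stale
}

func main() {
	prev := map[string]string{"auth": "r1", "db": "r2", "http": "r3"}
	curr := map[string]string{"auth": "r1", "db": "r9", "queue": "r4"}
	fmt.Println(stalePackages(prev, curr))
}
```

Every package not in the returned list keeps its cached walks, which is where the warm-run speedup would come from.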
Tested Neutral or Harmful (sessions 13-28)
| Approach | Result | Session | Details |
|---|---|---|---|
| Sibling dedup by leaf name (global and package-scoped) | Harmful | 28 | Penalize duplicate method/field names in ranked results. Global: -0.009 (0.311 vs 0.320). Package-scoped (same directory): -0.006 (0.314 vs 0.320). Common names (Close, String, Error, Init) appear legitimately across unrelated types within the same package. Rails improved (+10.8%) but other repos regressed more. The 54 related_name misses are too rare to justify a universal penalty. |
| Test penalty sweep (0.01-0.50, 12 values on Rails) | Noise | 28 | Rails P@10 range 0.330-0.370 across 12 penalty values (0.01, 0.03, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.50). No consistent peak: 0.05 scored 0.330 and 0.360 on two runs of the same value. Variance (+-0.030 on 20 tasks) dominates the signal. Full corpus neutral at 0.15 (0.319, 0.322) and 0.30 (0.318, 0.320). Default set to 0.15 (reasonable, not optimized). BENCH_TEST_PENALTY env var for future sweeps. |
| File-grouped packing | Harmful | 26 | Packing benchmark showed +15% GT coverage via substring matching, but P@10 dropped -10.8% on Django. Budget wasted on low-value siblings from same file. Density-ranked remains optimal. |
| Proximity-weighted scoring (BFS distance in ranking) | Neutral/Harmful | 24 | BFS hop distance from seeds as ranking component. P@10 0.278 on full corpus (neutral). On enriched saleor: 0.182 vs 0.200 without (slightly harmful). Problem is packing, not scoring. |
| Task memory compounding (keyword -> symbol recall) | Neutral | 24 | Django 5 rounds: P@10 0.194, 0.189, 0.197, 0.194, 0.192 (-0.3%). Previous "+11.5% round-over-round" was stale accumulated entries. Memory records (keywords -> top-5 symbols) but pipeline already finds those symbols. Boost is redundant. |
| Embeddings as Channel 3 (seed source) | Neutral | 15 | Three models find same symbols as BM25. Architecture was wrong. |
| Blended re-rank (weight > 0.0) | Harmful | 15 | Pure re-rank (weight=0.0) wins P@10/R@10. Blending preserves MRR but sacrifices recall. |
| Call-chain seeding | Neutral | 14 | Callees already reachable via RWR traversal; diffuses probability mass |
| Hub dampening | Neutral | 14 | No effect on VS Code (0.095 unchanged at any threshold) |
| BFS depth reduction | Neutral | 14 | No effect (depth 2/3/4 all produce same P@10) |
| Expanded framework thesaurus ("backend"->"base") | Harmful | 14 | Too noisy for BM25 |
| accesses_field for P@10 | Neutral | 15 | Fields already reachable via call edges. Adds graph completeness, not retrieval. |
| ~~LSP enrichment for P@10~~ | Revised: strongly positive | 13, 17 | Session 13 found neutral (tested confidence upgrades only). Session 17: Python enrichment +0.040 P@10 (django+flask). Go enrichment: k8s 0.000 -> 0.159 (192K new edges, 169K phantom nodes). Enrichment creates phantom external nodes; type_hint_of edges connect functions to those nodes. Moved to Tested Positive table. |
| Coherence-aware packing (CoherenceBonus=0.3) | Harmful (-1.8%) | 16 | Greedy density packing already near-optimal. File-based coherence adds noise. |
| Bidirectional inheritance edges | Harmful (-2.5%) | 16 | Reverse inherits add noise without new reachability. Django zeros are vocabulary gaps. |
| BM25 gap injection (no embedding filter) | Harmful (-1.4%) | 16 | Raw BM25 candidates too noisy. Displaces good graph results. |
| Seed count sweep (10-50 on Django) | Neutral | 16 | 10 and 15 and 25 seeds all produce P@10=0.222-0.228. Confirms parameter irrelevance. |
| Gap injection parameter sweep (15 configs) | Neutral | 16 | Threshold 0.1-0.5, maxgap 3-10 all produce P@10=0.225-0.228 on Django. Parameters don't matter. |
| Density-adaptive RWR alpha (0.15 on dense) | Neutral | 17 | Alpha=0.15 on dense repos (flask 5.9, cargo 13.5, kafka 12.5): P@10 0.280 vs baseline 0.278. Within run variance. |
| Density-adaptive inherits weight (1.0 on deep) | Neutral | 17 | Boosted implements/overrides/extends to 1.0 on repos >1.5% inherits edges. Django +0.009, kafka+flask -0.008. Net neutral. |
| Interface type hint propagation (post-processing) | Neutral | 17 | Connect type_hint_of targets to sibling implementors. Edge structure mismatch: type_hint_of and implements share 0 target hashes on Java/Python. Go (k8s): 393 edges on 523K, P@10 neutral. Needs extractor-level fix. |
| Disconnection rate adaptive seeding | Neutral | 20 | Measured disconnection rate (% zero-inbound nodes) across 12 repos: 0.2% (kafka) to 22.7% (caddy). Added seed bonus proportional to rate. Only flask/spark affected (+2 seeds). Django 0.261 (baseline 0.256), flask 0.337 (0.347), spark 0.260 (0.255). All within variance. Redundant with node count thresholds; confirms seed quantity doesn't move P@10. |
| Porter stemming in FTS5 | Neutral (-0.003) | 20 | Added porter tokenizer to FTS5 so "validates" matches "Validator". Django +0.006, cargo -0.009. Full corpus 0.264 vs 0.267 baseline. Stemming expands BM25 recall but brings in noisy seeds that dilute RWR on dense graphs. |
| Django framework equiv classes (session 20) | Harmful (-0.011) | 20 | SUPERSEDED by session 23 approach. Session 20 used broad phrases without forced injection. Session 23 uses specific framework concepts + forced injection bypassing RWR: Django +99%, Terraform +133%. Key difference: weight 0.9 + source "framework" triggers direct ranked-list injection. |
| Keyword extraction (promote Components) | Harmful (terraform 0.120->0.035) | 23 | Promoting generic nouns from task descriptions into BM25 floods large repos with irrelevant matches. |
| Path boost (5 variants) | All harmful | 23 | Hard reorder (0.140), soft +3 (0.115), selective (0.120), selective +1 (0.095), post-RWR (0.095). Path terms are core domain vocabulary matching most symbols. |
| Embeddings gap-fill | Dead neutral | 23 | 3 runs: 0.176/0.175/0.176 with and without. Previous "gap-fill works" was task memory contamination. |
| SCIP ingestion for Rust | Rejected | 20 | rust-analyzer SCIP on cargo: 124K edges, all connecting project code to external types (stdlib, serde, dependencies). Zero project-internal edges that tree-sitter didn't already find. Unfiltered: P@10 = 0.150 (-0.127). Filtered (project-only): +0 edges, identical to baseline. SCIP's value is cross-crate resolution, but those targets are always external. Macro-expanded edges (derive Serialize) create impl edges TO external types, not between project symbols. Dead end for P@10. |
| Graph pruning / ghost edges | Neutral | 20 | Three configs on cargo: (1) exclude similar_to: 0.245 (-0.014, reachability lost). (2) exclude references: 0.268 (+0.009, noise removed). (3) ghost references at 0.05 weight: 0.264 (+0.005). Full corpus ghost: 0.264 (-0.003). Density-adaptive ghost (threshold 5.0): 0.264 (-0.003). Per-repo wins cancel losses. Pruning/ghosting is edge weight tuning, which 57 experiments confirm doesn't move aggregate P@10. |
What Works (sessions 23-28)
| Approach | Result | Session | Details |
|---|---|---|---|
| Framework equiv classes + forced injection | +57% (0.176 -> 0.278) | 23 | 263 classes across 30 files. High-confidence framework matches (weight >= 0.9) bypass RWR and inject directly into ranked results. Django +126%, Terraform +238%. |
| Multi-phrase equiv gate | +9.6% (0.293 -> 0.321) | 28 | isStrongEquivMatch requires >= 2 phrases matched or multi-word phrase. Prevents single generic words (e.g., "command") from flooding top-10 with framework hub symbols. |
| Code pattern keyword extraction | Contributes to 0.330 | 28 | extractCodePatterns detects method calls, Class.method paths, dotted paths with underscores. Fires before standard word extraction as Phase 1.5 in extractKeywordSet. |
| Language scoping | Prevents regressions | 23 | Lang field restricts framework classes to matching repos. detectRepoLanguage() from node QN file extensions. |
| Adaptive retrieval (>200K nodes) | VS Code +43% | 23 | When RWR produces flat results on massive repos, falls back to direct FTS + contains-edge expansion. |
| equivSeen injection bypass | Fixes silent failures | 23 | Framework injection checks happen before dedup, so lower-weight classes can't block framework targets. |
Closed Paths (session 23, honest measurement)
Not re-testing: Embeddings (dead neutral, 3 runs confirmed), keyword extraction (net negative on large repos), path boost (5 variants all harmful), BM25 query broadening (floods results with noise), file-grouped packing (session 26: packing benchmark showed +15% GT coverage but P@10 dropped -10.8% on Django; budget wasted on low-value siblings from same file). These were rejected with clean measurement (no task memory contamination).
Enrichment Performance
gopls on-demand package loading dominates enrichment time on large Go repos. The two-phase warmup (didOpen + retry) solved the "zero upgrades" problem. Both Go repos are now fully enriched:
- Terraform: 82K new edges discovered, 73K phantom nodes, 12 min total
- Kubernetes: 192K new edges discovered, 169K phantom nodes, 58 min gopls (root module only). Sub-modules (30 staging packages) are intentionally excluded from indexing: staging code is dependency code that dilutes RWR (-20% P@10 when included). Multi-module enrichment infrastructure works but has nothing to enrich since staging files aren't indexed.
The persistent daemon (#3) is the real fix for repeat runs; everything else works around the cold start.
| # | Item | What it does | Expected Impact | Effort |
|---|---|---|---|---|
| 1 | Per-package gopls for single-module repos | Spawn one gopls per top-level package directory, each loads only its subtree. Already implemented for go.work repos (multi-module enrichment). Extend to single-module repos by synthetically partitioning. | 3-5x faster on large repos (parallel init, each instance loads fewer packages) | Medium |
| 2 | Lazy/streaming LSP requests | Fire LSP requests immediately without waiting for gopls to fully initialize. gopls queues and answers as packages load. Early requests may timeout (10s per-symbol limit), later ones succeed. Currently the enricher blocks on the first response, which waits for full init. | Eliminates init wait; trades some skipped symbols for 5-10 min wall clock savings | Low |
| 3 | Persistent gopls daemon (-remote mode) | Run gopls as a persistent background process that stays warm between enrichment runs. Second enrichment of the same repo is near-instant (workspace already loaded). | Near-zero init on repeat runs. Requires daemon lifecycle management. | Medium |
| 4 | Incremental enrichment via CLI | Expose RunScoped(changedFiles) through knowing enrich lsp --files <list>. Only enrich symbols in changed files. Already implemented in the enricher (used by daemon mode), but the CLI always runs full enrichment. | 10-100x faster for incremental changes (enrich 5 files vs 2,000) | Low |
| 5 | Parallel git blame | git blame runs per-file sequentially (~40% of index time on large repos). Parallelize across files since blame is read-only. Or: batch blame using git log --follow for recent authorship. | 2-4x faster authorship extraction | Low |
| 6 | Node.js heap size for tsserver | Set NODE_OPTIONS="--max-old-space-size=8192" when spawning tsserver. Default heap (~4GB) causes GC thrashing on large TypeScript repos (vscode: 34 min enrichment, majority in GC). More heap = less GC = faster enrichment. | 2-3x faster TS enrichment on large repos | Low |
| 7 | Deno LSP for TypeScript | Use deno lsp (Rust-based) instead of tsserver for TypeScript enrichment. No GC, no Node.js heap limits. Add as alternative in enrichment config detection (check for deno on PATH, prefer over tsserver). Test on vscode to compare enrichment time and edge quality. | Potentially 5-10x faster TS enrichment | Low |
| 8 | Import-based phantom nodes for Go (skip gopls) | Parse Go import statements and generate phantom stub nodes for stdlib/dependency types without running gopls. Now that gopls enrichment works (k8s: +0.159 P@10), the value proposition changed: this is a fast fallback for environments without gopls, not the primary path. gopls discovers 192K edges + 169K phantoms on k8s; import parsing would get only the phantoms. | Fast fallback for Go enrichment without gopls | Low (deprioritized) |
| 9 | ~~Wire remaining in-process resolvers~~ | SHIPPED session 24. See item 2 above. | Shipped | See item 2 |
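Item 5 (parallel git blame) amounts to a bounded worker pool over files. The sketch below shows one way to do that; the blame function is injected so the example runs without a git checkout, and all names here are hypothetical. A real blameFunc could shell out to git blame (for example with --line-porcelain) and parse the output.

```go
package main

import (
	"fmt"
	"sync"
)

// blameFunc computes authorship for one file. It is injected so the pool can
// be exercised without running git.
type blameFunc func(path string) (string, error)

type blameResult struct {
	Path   string
	Author string
	Err    error
}

// blameAll runs blame over paths with at most `workers` concurrent calls.
// Results keep the input order; each goroutine writes only to its own index,
// so no mutex is needed around the results slice.
func blameAll(paths []string, workers int, blame blameFunc) []blameResult {
	if workers < 1 {
		workers = 1
	}
	results := make([]blameResult, len(paths))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				author, err := blame(paths[i])
				results[i] = blameResult{Path: paths[i], Author: author, Err: err}
			}
		}()
	}
	for i := range paths {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// Stand-in for a real git blame call.
	fake := func(path string) (string, error) { return "author-of-" + path, nil }
	for _, r := range blameAll([]string{"a.go", "b.go", "c.go"}, 2, fake) {
		fmt.Println(r.Path, r.Author)
	}
}
```

Because blame only reads the repository, the worker count can be tuned to available cores without coordination between workers.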
Storage Backend (P0 Performance)
Current: SQLite (single-writer, FTS5 deferred to background). Extraction is parallel (GOMAXPROCS workers, producer-consumer pipeline), but all DB writes funnel through one goroutine. Performance pragmas: synchronous=NORMAL, mmap_size=256MB, cache_size=64MB, busy_timeout=5000, temp_store=MEMORY. Multi-row batch INSERTs (edges: 100/statement, nodes: 99/statement, files: 249/statement) reduce per-row overhead.
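As an illustration of the multi-row batching described above, the sketch below builds one INSERT statement carrying several rows and splits a row count into batches. The table and column names are hypothetical, and this is not the project's actual writer code.

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatchInsert returns an INSERT statement with one "(?, ?, ...)" group
// per row, so a single statement binds rows*len(cols) parameters.
// It returns "" when there is nothing to insert.
func buildBatchInsert(table string, cols []string, rows int) string {
	if rows < 1 || len(cols) == 0 {
		return ""
	}
	marks := make([]string, len(cols))
	for i := range marks {
		marks[i] = "?"
	}
	group := "(" + strings.Join(marks, ", ") + ")"
	groups := make([]string, rows)
	for i := range groups {
		groups[i] = group
	}
	return fmt.Sprintf("INSERT INTO %s (%s) VALUES %s",
		table, strings.Join(cols, ", "), strings.Join(groups, ", "))
}

// chunkSizes splits total rows into batches of at most batch rows each.
func chunkSizes(total, batch int) []int {
	if total <= 0 || batch <= 0 {
		return nil
	}
	var sizes []int
	for total > 0 {
		n := batch
		if total < n {
			n = total
		}
		sizes = append(sizes, n)
		total -= n
	}
	return sizes
}

func main() {
	// Hypothetical edge table with three columns, batched 100 rows at a time.
	fmt.Println(chunkSizes(250, 100))
	fmt.Println(buildBatchInsert("edges", []string{"src", "dst", "type"}, 2))
}
```

The full-size statement can be prepared once and reused for every full batch; only the final, shorter batch needs a second statement.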
Options under evaluation
| Backend | Parallel writes | Query model | Deployment | Status |
|---|---|---|---|---|
| SQLite sharded by package | Yes (one file per package) | Cross-package queries need federation | Multiple files | Prototype next |
| DuckDB | Yes (appender API) | SQL, columnar scans | Single file, CGO | Evaluate |
| BadgerDB/Pebble | Yes (LSM concurrent memtable) | Key-value (custom query layer) | Single dir, pure Go | Evaluate |
| SQLite + deferred FTS | No (serial) | SQL + FTS5 | Single file | Shipped (current) |
Sharding by package (leading candidate)
Packages are already the unit of Merkle computation, cache invalidation, diffing, and RWR scoring. One SQLite file per package means:
- Parallel writes: each extraction worker writes to its own package's DB
- No contention: workers never touch the same file
- Package-scoped queries are local reads
- Delete a package = delete the file
- Merkle computation per-package is already isolated
- Cross-package queries (blast radius, transitive callers) federate across shards
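A rough sketch of the sharding idea, under stated assumptions: each package maps to one shard file named by a truncated hash of its path, and an edge is stored in the shard of the package that owns its source symbol. The type, function names, and directory layout are hypothetical; the prototype may choose differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// Edge is a simplified edge record for this sketch.
type Edge struct {
	SrcPkg string // package that owns the source symbol
	Src    string
	Dst    string
}

// shardPath maps a package path to its own database file. Hashing the
// package path keeps the file name flat and filesystem-safe.
func shardPath(root, pkg string) string {
	sum := sha256.Sum256([]byte(pkg))
	return filepath.Join(root, hex.EncodeToString(sum[:8])+".db")
}

// partitionByPackage groups edges by owning package so that each extraction
// worker can write one group to one shard without touching another file.
func partitionByPackage(edges []Edge) map[string][]Edge {
	out := make(map[string][]Edge)
	for _, e := range edges {
		out[e.SrcPkg] = append(out[e.SrcPkg], e)
	}
	return out
}

func main() {
	edges := []Edge{
		{SrcPkg: "internal/engine", Src: "ForTask", Dst: "RWR"},
		{SrcPkg: "internal/store", Src: "PutEdge", Dst: "exec"},
		{SrcPkg: "internal/engine", Src: "RWR", Dst: "EdgesFrom"},
	}
	for pkg, group := range partitionByPackage(edges) {
		fmt.Printf("%s -> %s (%d edges)\n", pkg, shardPath("shards", pkg), len(group))
	}
}
```

Deleting a package then reduces to removing the file returned by shardPath, while cross-package queries have to open every shard they touch.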
Current performance (v0.6.0 + optimizations)
| Repo | Files | Edges | Extraction | Total (with deferred FTS) |
|---|---|---|---|---|
| knowing (84K LOC) | 448 | 25K | 0.4s | 1.7s |
| flask (15K LOC) | 97 | 9K | 0.04s | 0.3s |
| cargo (150K LOC) | 979 | 79K | 0.2s | 5.5s |
| kubernetes (3.5M LOC) | 4,877 | 705K (268K ast + 287K lsp + 150K other) | 18.6s extraction + 58 min enrichment | ~22s queryable (enrichment async) |
Cross-Repo Query Architecture
The context engine (ForTask, ExplainSymbol, RWR, HITS, BM25) has no repo-scoping anywhere in its query path. If multiple repos exist in the same database, cross-repo queries work with zero code changes. The challenge is the storage model: the roster currently assigns each repo its own SQLite file.
Two approaches are under evaluation:
Option A: Unified Database (shared graph)
All repos index into a single ~/.knowing/knowing.db. The roster tracks metadata (paths, URLs) but not separate DB files.
Pros:
- Zero engine changes. ForTask, BM25, RWR, FTS5 all work unchanged on the merged graph.
- Cross-repo edges resolve naturally (source and target in same DB).
- One FTS5 index covers all vocabulary. BM25 ranks across all repos in a single query.
- Simplest implementation (~30 LOC change: roster defaults to shared DB).
- Single snapshot chain covers all repos (Merkle diff shows cross-repo changes).
- knowing remove already deletes by repo_hash within a shared DB.
Cons:
- No isolation between projects. A personal side-project and work monorepo share one graph.
- Larger single file (5 repos x 30K edges = 150K edges, still trivial for SQLite, but conceptually messy).
- Can't delete a repo by deleting a file (must use knowing remove which does SQL DELETE).
- If the shared DB corrupts, all repos are affected.
- Users may not want their repos' symbols showing up when querying from a different project.
Mitigation: Add --isolated flag to knowing add for repos that should stay separate. Default to shared for most workflows.
Option B: Federated Store (query-time merge)
A FederatedStore wrapper implements GraphStore over N underlying SQLiteStores. The primary store (current repo) receives writes; all roster stores are opened read-only for queries.
    type FederatedStore struct {
        primary *SQLiteStore   // writes go here
        others  []*SQLiteStore // read-only roster DBs
    }
Query federation strategy per method:
- NodesByName: query all stores, concat results, dedup by hash
- SearchBM25Nodes: query all stores, merge by score, take top-N
- EdgesFrom/EdgesTo: query all stores, concat (cross-repo edges live in source DB)
- GetNode: try primary first, then others (hash-based lookup)
- FeedbackBoosts: query all stores, merge maps
- Write methods (PutNode, PutEdge, RecordFeedback): primary only
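The SearchBM25Nodes strategy above (query all stores, merge by score, take top-N) together with hash-based dedup could look roughly like the sketch below. The Hit type, the searcher interface, and the in-memory store are stand-ins invented for illustration. The sketch assumes higher scores are better; SQLite FTS5's bm25() returns lower values for better matches, so raw FTS5 scores would need to be negated or otherwise mapped first.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is one search result, identified by its content hash.
type Hit struct {
	Hash  string
	Score float64 // assumption: higher is better
}

// searcher is the small slice of a store interface this sketch needs.
type searcher interface {
	SearchBM25(query string, limit int) []Hit
}

// federatedSearch queries every store, keeps the best score per hash, and
// returns the top-N hits ordered by score (ties broken by hash).
func federatedSearch(stores []searcher, query string, topN int) []Hit {
	if topN <= 0 {
		return nil
	}
	best := make(map[string]Hit)
	for _, s := range stores {
		for _, h := range s.SearchBM25(query, topN) {
			if cur, ok := best[h.Hash]; !ok || h.Score > cur.Score {
				best[h.Hash] = h
			}
		}
	}
	merged := make([]Hit, 0, len(best))
	for _, h := range best {
		merged = append(merged, h)
	}
	sort.Slice(merged, func(i, j int) bool {
		if merged[i].Score != merged[j].Score {
			return merged[i].Score > merged[j].Score
		}
		return merged[i].Hash < merged[j].Hash
	})
	if len(merged) > topN {
		merged = merged[:topN]
	}
	return merged
}

// memStore is an in-memory stand-in for a per-repo store; it ignores the
// query and returns its first `limit` hits.
type memStore []Hit

func (m memStore) SearchBM25(query string, limit int) []Hit {
	if limit > len(m) {
		limit = len(m)
	}
	return m[:limit]
}

func main() {
	primary := memStore{{Hash: "h1", Score: 3.0}, {Hash: "h2", Score: 1.0}}
	other := memStore{{Hash: "h2", Score: 2.0}, {Hash: "h3", Score: 2.5}}
	for _, h := range federatedSearch([]searcher{primary, other}, "router", 2) {
		fmt.Println(h.Hash, h.Score)
	}
}
```

This merge compares scores across stores as-is, which is exactly the approximation the per-DB FTS5 caveat refers to.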
Pros:
- Per-repo isolation by default. Each repo is a separate file with independent lifecycle.
- knowing remove is just closing and deleting a file.
- No corruption propagation between repos.
- Each repo can be backed up, synced, or deleted independently.
- No storage model change; existing per-repo DBs work as-is.
- Users opt-in to cross-repo by having multiple repos in their roster. No surprise data mixing.
Cons:
- N queries per method call (latency scales linearly with roster size). 3-5 repos: negligible (<5ms). 20+ repos: needs parallel goroutines.
- FTS5 indexes are per-DB; BM25 merge is approximate (scores from different corpus sizes aren't directly comparable without normalization).
- RWR adjacency map must load edges from all stores, making the first query slower.
- Cross-repo edges are split: source DB has the edge, target DB has the target node. GetNode must check multiple stores to resolve targets.
- Medium implementation effort (~200 LOC new type + method-by-method federation logic).
- Feedback recorded in the primary DB may reference nodes in other DBs (works, but feedback is stored asymmetrically).
- Community detection runs per-DB (Louvain on isolated subgraphs); cross-repo communities won't form.
Comparison
| Dimension | Unified DB | Federated Store |
|---|---|---|
| Implementation effort | ~30 LOC | ~200 LOC |
| Engine changes required | None | None (same interface) |
| Query latency | 1 query | N queries, merged |
| FTS5 quality | Unified corpus, accurate IDF | Per-corpus IDF, approximate merge |
| Cross-repo edges | Free (same table) | Resolved via multi-store lookup |
| Community detection | Cross-repo communities form naturally | Per-repo communities only |
| RWR walk | Seamless cross-repo | Cross-repo via edge concat |
| Isolation | None by default (opt-in via --isolated) | Full by default |
| Corruption blast radius | All repos | Single repo |
| Storage management | One file to manage | N files, cleaner lifecycle |
| knowing remove | SQL DELETE (fast) | Close + delete file (instant) |
| Feedback compounding | Cross-repo (symbol used in repo B helps repo A) | Asymmetric (feedback in primary only) |
Decision
Not yet decided. The choice depends on real usage patterns:
- If most users work across 2-3 related repos (monorepo splits, frontend+backend): unified DB wins on simplicity and quality.
- If users have many unrelated projects and want clean separation: federated store wins on isolation.
- Both can coexist: unified by default with federated as the advanced mode, or vice versa.
Current status: per-repo isolation (no cross-repo queries). First real user who hits the limitation decides the approach.
Operational
| Item | Description | Priority |
|---|---|---|
| Cross-repo context_for_task | Search across ALL indexed repos simultaneously, not just one. Real projects span multiple repos (monorepo patterns, microservices). Merge results from all repos into one ranked list. See "Cross-Repo Query Architecture" section above. | P2 |
| Incremental context ("next page") | After an agent gets initial context, allow requesting the NEXT N symbols not yet seen. Avoids re-querying with bigger budget and getting duplicates. Session-stateful cursor. | P2 |
| Staleness annotations on MCP responses | When returning context, annotate symbols whose source files changed since last index. Agents know which results might be outdated without calling knowing stale separately. | P2 |
| CLI --format gcf output | knowing context only supports json/xml/markdown. Adding gcf/gcb for direct agent consumption without MCP. | P3 |
| knowing daemon install-service | Generate launchd plist (macOS) or systemd user unit (Linux). | P3 |
| Per-repo config (.knowing.yaml) | Excludes, local overrides, workspace membership. | P3 |
Diagnostic Tools (for dense-graph investigation)
These tools are needed to investigate and resolve the dense-graph dilution problem
(VS Code 87K nodes, P@10 drops from 0.163 to 0.084 with correct extraction).
See docs/research/dense-graph-dilution-analysis.md for full investigation plan.
| # | Tool | What it enables | Effort |
|---|---|---|---|
| 1 | Query-time edge exclusion | BENCH_EXCLUDE_EDGES=similar_to filters edges during RWR without reindexing. Enables rapid hypothesis testing (test each edge type's contribution in seconds, not minutes). Add type filter to adjacency map loading. | Low (5 lines) |
| 2 | Hub analysis tool | Reports top-N nodes by in-degree for a given DB. Identifies probability sinks that absorb RWR mass on dense graphs. Answers: "which nodes accumulate walk probability regardless of query?" | Low (30 lines) |
| 3 | RWR score distribution tool | For a given task, reports score distribution (min, max, median, p90, gap between rank-1 and rank-50). Diagnoses whether the walk is diffusing (flat distribution) or focused (steep dropoff). | Low (20 lines) |
| 4 | Top-10 comparison tool | For a given task, shows top-10 results from two different DBs (or configs) side-by-side. Answers: "which new nodes pushed correct results out of the top 10?" | Medium (50 lines) |
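Tools 1 and 2 are small enough to sketch together. The code below filters edges by type from a BENCH_EXCLUDE_EDGES value and then reports the top-N nodes by in-degree. The Edge and Hub types and all function names are hypothetical, and accepting a comma-separated list in the env var is an assumption (the table only shows a single value).

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"
)

// Edge is a simplified edge record for this sketch.
type Edge struct{ Src, Dst, Type string }

// parseExcluded turns a value such as "similar_to,references" into a set.
func parseExcluded(spec string) map[string]bool {
	set := make(map[string]bool)
	for _, t := range strings.Split(spec, ",") {
		if t = strings.TrimSpace(t); t != "" {
			set[t] = true
		}
	}
	return set
}

// filterEdges drops edges whose type is in the excluded set.
func filterEdges(edges []Edge, excluded map[string]bool) []Edge {
	kept := make([]Edge, 0, len(edges))
	for _, e := range edges {
		if !excluded[e.Type] {
			kept = append(kept, e)
		}
	}
	return kept
}

// Hub is a node with its inbound edge count.
type Hub struct {
	Node     string
	InDegree int
}

// topHubs returns the n nodes with the highest in-degree (ties by name).
func topHubs(edges []Edge, n int) []Hub {
	if n < 0 {
		n = 0
	}
	deg := make(map[string]int)
	for _, e := range edges {
		deg[e.Dst]++
	}
	hubs := make([]Hub, 0, len(deg))
	for node, d := range deg {
		hubs = append(hubs, Hub{Node: node, InDegree: d})
	}
	sort.Slice(hubs, func(i, j int) bool {
		if hubs[i].InDegree != hubs[j].InDegree {
			return hubs[i].InDegree > hubs[j].InDegree
		}
		return hubs[i].Node < hubs[j].Node
	})
	if n < len(hubs) {
		hubs = hubs[:n]
	}
	return hubs
}

func main() {
	edges := []Edge{
		{"a", "log", "calls"}, {"b", "log", "calls"},
		{"c", "log", "similar_to"}, {"a", "b", "calls"},
	}
	kept := filterEdges(edges, parseExcluded(os.Getenv("BENCH_EXCLUDE_EDGES")))
	for _, h := range topHubs(kept, 2) {
		fmt.Printf("%s in-degree=%d\n", h.Node, h.InDegree)
	}
}
```

Running the hub report before and after excluding an edge type gives a quick read on whether that type is what turns a node into a probability sink.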
Benchmarking Roadmap
14 benchmark harnesses exist today (see bench/README.md). The following gaps remain for a complete competitive evaluation story.
P1: Would convince someone to adopt knowing
| Benchmark | What it proves | Status | Effort |
|---|---|---|---|
| SWE-bench integration | knowing + Claude solves N% more SWE-bench tasks than Claude alone. The definitive "does graph context help real agent work?" | Not started | High (full eval harness, 300 tasks, automated agent loop) |
| Real-session replay | Replay 10+ real claudewatch session transcripts. Measure: context calls saved, symbols used that came from knowing, tasks where knowing provided the critical symbol. | Not started (implicit feedback tracker now exists for attribution) | High (transcript parser, attribution detection, manual annotation) |
P2: Proves production readiness
| Benchmark | What it proves | Status | Effort |
|---|---|---|---|
| Query latency p50/p95/p99 | Instrument all 28 MCP tool handlers. Report latency distribution per tool across 1000 calls. | Single number (2ms cached) exists; need per-tool distribution | Medium |
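For the latency distribution, a nearest-rank percentile over recorded samples is probably sufficient. The sketch below is one such helper; the function name and the millisecond sample values are made up for illustration, and other percentile definitions (for example linear interpolation) would give slightly different numbers.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank percentile of samples for p in
// [0, 100]. The input slice is copied before sorting, so callers keep their
// original ordering.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical per-call latencies in milliseconds for one tool handler.
	latencies := []float64{2.1, 1.9, 2.0, 35.0, 2.2, 2.4, 1.8, 2.0, 2.3, 6.5}
	for _, p := range []float64{50, 95, 99} {
		fmt.Printf("p%.0f = %.1f ms\n", p, percentile(latencies, p))
	}
}
```

Reporting the three percentiles per tool, rather than one cached-path number, is what would expose slow handlers and cold-cache outliers.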
P3: Completeness and rigor
| Benchmark | What it proves | Status | Effort |
|---|---|---|---|
| Ruby benchmark repo (Rails) | Adds 7th language to corpus. Rails mirrors Django: heavy framework conventions, deep class hierarchy, method_missing magic. Tests whether retrieval improvements generalize to Ruby. Candidates: Rails (large, MVC), Devise (auth, focused), Sidekiq (jobs, moderate). Requires 10-20 task fixtures with ground truth symbols. | Not started | Medium (fixture curation is the bottleneck) |
| Multi-language extraction coverage | For each of the 24 extractors: number of node types extracted, edge types produced, lines of test coverage. Comparison vs Sourcegraph SCIP, GitNexus, tree-sitter-graph. | Not started | Low (automated count + table) |
| Grafana scale validation | Full retrieval quality measurement on 714K-edge production graph (not just latency). P@10 with Grafana-specific task fixtures. | Latency test exists (grafana_scale_test.go); no retrieval quality measurement | Medium |
| Graph integrity under load | Spawn 10 concurrent indexers on overlapping repos. Run knowing fsck after. Proves content-addressing prevents corruption under concurrency. | Not started (fsck bench exists for single-indexer correctness) | Medium |
| Concurrent query performance | 100 parallel context_for_task calls on a 100K-edge graph. Measure throughput (queries/sec), latency degradation, and WAL checkpoint behavior. | Not started | Medium |
| Cross-repo retrieval quality | P@10 for tasks that span repo boundaries (e.g., "which frontend components call this backend endpoint?"). | Needs cross-repo implementation first | Medium |
Standalone Publication: Code Retrieval Evaluation Toolkit (CRET)
Extract knowing's benchmarking infrastructure as the SWE-bench equivalent for code context retrieval. Full proposal: docs/proposals/code-retrieval-eval-toolkit.md.
Status: Not started. Prerequisite complete (Aider comparison done, Run 19-22).
Release Infrastructure
| Item | Description | Status |
|---|---|---|
| Corpus DB tarball in releases | Attach corpus-dbs-vX.Y.Z.tar.gz to each GitHub release as a separate asset (not bundled with binaries). Contains all 12 pre-built benchmark DBs with enrichment + pre-embedded vectors (1.6GB). Enables instant corpus restore via make corpus-restore TARBALL=... instead of 30+ min rebuild. DBs are gitignored and can't be recovered from git; losing them means hours of re-indexing + re-enrichment + re-embedding. | HIGH PRIORITY |
| Corpus DB integrity check | CI job that runs knowing fsck on each corpus DB after release to verify no corruption. | Not started |
Not yet benchmarked (tracked for completeness)
- Proof verification throughput: N proofs/sec verified (currently 1.2µs each = ~800K/sec theoretical)
- Snapshot chain walk cost: O(chain_length) for history queries
- FTS5 rebuild cost vs graph size: scaling curve for the deferred FTS rebuild
- Language-specific P@10 breakdown: already have per-repo numbers; need per-language aggregate
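The per-language aggregate in the last bullet could be derived from the existing per-repo rows roughly as follows. This is a sketch under two assumptions: each repo is assigned a single primary language (the mapping below is illustrative), and repos are weighted by task count rather than averaged equally. The sample rows reuse figures from the Session 28 per-repo table.

```go
package main

import "fmt"

// RepoResult is one row of a per-repo P@10 table.
type RepoResult struct {
	Repo  string
	Lang  string // assumed primary language; the mapping is illustrative
	P10   float64
	Tasks int
}

// perLanguage returns the task-weighted mean P@10 for each language.
func perLanguage(rows []RepoResult) map[string]float64 {
	sum := map[string]float64{}
	tasks := map[string]int{}
	for _, r := range rows {
		sum[r.Lang] += r.P10 * float64(r.Tasks)
		tasks[r.Lang] += r.Tasks
	}
	out := make(map[string]float64, len(sum))
	for lang, s := range sum {
		if tasks[lang] > 0 {
			out[lang] = s / float64(tasks[lang])
		}
	}
	return out
}

func main() {
	rows := []RepoResult{
		{"Flask", "python", 0.328, 18},
		{"FastAPI", "python", 0.315, 20},
		{"Django", "python", 0.176, 33},
		{"Ripgrep", "rust", 0.464, 11},
		{"Cargo", "rust", 0.263, 19},
	}
	for lang, p := range perLanguage(rows) {
		fmt.Printf("%s: %.3f\n", lang, p)
	}
}
```

Task weighting keeps a large fixture set such as Django's from being diluted by smaller repos; an unweighted mean per repo would be the other defensible choice.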
Retrieval Pipeline
Current results: see bench/cross-system/FINDINGS.md. P@10=0.189 cold (277 tasks, 14 repos, 8 languages). 2.17x vs codegraph, 3.44x vs GitNexus, 3.63x vs Gortex, 12.6x vs grep. Query latency 2ms on k8s (with adjacency cache). Embedding gap-fill adds 220ms (cached vectors). Focused seed selection + cluster-aware gap-fill: +6.0% over previous high. Equivalence classes: +4%.
Key findings: (1) 32-config parameter sweep proved P@10 is reachability-determined; ranking parameters are irrelevant. (2) Embedding re-ranker was initially measured at +17% but session 19 per-repo A/B test showed it was net negative (9/13 repos hurt). (3) Session 23 confirmed both re-ranker AND gap-fill seeds are neutral on cold start (P@10 identical with/without, 3 runs). Previous "+11% gap-fill" was task memory contamination. Embeddings disabled by default.
Retrieval Improvements
| # | Item | Why | Status |
|---|---|---|---|
| 7 | ~~Bidirectional inheritance edges~~ | Tested session 16: Django -2.5%, Flask -1.5%. Reverse inherits edges add noise without new reachability. Django's 42% zero-rate is vocabulary gaps, not connectivity gaps. | Rejected |
| 9 | ~~Density-adaptive RWR alpha~~ | Tested session 17: alpha=0.15 on dense repos (flask 5.9, cargo 13.5, kafka 12.5). P@10 0.280 vs baseline 0.278 (+0.002, within run variance). Neutral. Confirms parameter tuning doesn't move the metric. | Rejected |
| 9 | ~~Density-adaptive inherits weight~~ | Tested session 17: boosted implements/overrides/extends to 1.0 on repos with >1.5% inherits edges. Django +0.009, kafka+flask -0.008. Net neutral. | Rejected |
| 10 | Adaptive seed count by structural richness | % of type nodes with contains edges indicates how productive type seeds are. High % (>60%): fewer seeds needed (types reach methods). Low %: more seeds needed to compensate. | To test |
| 11 | Community count adaptive walk | Many small communities: community-scoped RWR is effective. Few large communities: unconstrained walk is better. Threshold currently hardcoded; should adapt to detected modularity. | Experiment |
| 12 | FTS hit rate channel balancing | Adaptive RRF weights per query based on channel result counts. Likely neutral: parameter sweep (32 configs) and entry point channel (session 19) both confirmed P@10 is reachability-determined, not ranking-determined. RRF weight changes reshuffle seed order but don't change what's reachable. | Low priority (likely neutral) |
| 13 | ~~Disconnection rate adaptive seeding~~ | Tested session 20: measured disconnection rate across 12 repos (0.2% kafka to 22.7% caddy). No repo exceeds 30%. Implemented as seed bonus: int(rate * 20) extra seeds. Only flask and spark see any change (+2 seeds, 15->17) because all other repos already exceed the bonus via node count thresholds. Results: django 0.261 (baseline 0.256), flask 0.337 (baseline 0.347), spark 0.260 (baseline 0.255). All within variance. Redundant with existing node count thresholds. Seed count doesn't move P@10 (confirmed by prior 32-config sweep). | Neutral |
| 14 | ~~Hub node dampening (H1)~~ | Re-tested session 17 on enriched graphs (BENCH_HUB_DAMPEN=50). P@10 = 0.219 vs 0.220 baseline. Still neutral. Edge weights already handle high-degree nodes. | Rejected |
| 15 | ~~Entry point seed channel~~ | Tested session 19: route_handler/service nodes as Channel 6 in RRF (weight 1.5x, keyword-filtered, cached). Django +10% without embeddings (0.250 -> 0.275), but neutral on full corpus with embeddings (0.264 vs 0.266 baseline). Embedding re-ranker already captures what entry point seeding provides. Route handlers have phantom targets (handles_route -> external), limiting RWR reachability from entry points. | Neutral |
| 16 | More equivalence concepts | Only add when a specific task fixture exposes a gap. Must respect Run 22 constraint (no single-word phrases, no generic targets). | On-demand |
| 16b | Rust equivalence classes | Cargo at 0.216 with rust-analyzer enrichment. Zero Rust-specific equiv classes. Macro vocabulary gap: task says "serialize", ground truth is Serialize::serialize. Candidates: serde (serialize/deserialize/from_str), tokio (async/spawn/runtime), error handling (thiserror/anyhow/Result/From), derive traits (Clone/Debug/Default), web (axum/Router/handler/extract). 10-15 classes. Also: SCIP ingestion (rust-analyzer scip .) would capture proc-macro expanded code that tree-sitter misses. | On-demand |
| 17a | ~~Gap-fill threshold tuning~~ | Tested < 3, < 8, < 10 vs baseline < 5. All within variance (+-0.005). Threshold doesn't matter: tasks with 0-4 and 0-9 candidates are largely the same set. Neutral. | Rejected |
| 17b | Graduated gap-fill weight | Binary activation (on/off at threshold 5) could be graduated: lower weight (0.5) when BM25 found 3-4 seeds, full weight only when BM25 found 0-1. Proportional intervention instead of binary. | Experiment |
| 17c | ~~Embedding re-ranker with code-tuned model~~ | Session 20: tested jina-embeddings-v2-base-code (code-tuned) as re-ranker. Django P@10 = 0.258 (vs 0.261 no re-rank, 0.256 no embeddings). Round 2: 0.253 (vs 0.267 baseline). Code-tuned model does not fix the re-ranker architecture problem. The issue is not model quality; cosine similarity cannot capture structural relevance regardless of training data. Three models tested across sessions 15-20 (nomic-text, jina-code, bge-small), all neutral or harmful as re-rankers. Re-ranker architecture is closed. | Rejected |
| 19a | Parallel benchmark P@10 variance | BENCH_PARALLEL=1 has +-0.009 P@10 variance vs sequential (0.264 stable). SHIPPED: (4) PreloadVectors: eager vector cache at init, round 1 25 min -> 5.3 min. (1) Shared ONNX Embedder: single session, less memory. Tested, didn't help: semaphore (4 concurrent repos): 0.238, worse than unbounded. Shared embedder alone: 0.255, didn't reduce variance. Root cause: non-deterministic goroutine scheduling affects RWR walk convergence, not I/O or ONNX contention. Per-task scores differ between sequential and parallel on identical inputs (cargo-easy-001: 0.40 seq vs 0.00 parallel). Remaining ideas: (5) Serialize HNSW index to file. (6) PRAGMA mmap_size. (2) Pre-compute query embeddings for benchmark-only. None address the root cause. Sequential remains official scoring mode; parallel is for iteration speed only. | Investigated, open |
| 20 | sqlite-vec integration | Replace brute-force cosine with sqlite-vec ANN for persistent search. Current brute-force from SQLite works but scales O(n). sqlite-vec would give O(log n) queries. Pure Go option: viant/sqlite-vec. | Infrastructure (not urgent: brute-force is fast enough for current corpus sizes) |
| 22 | More corpus repos | Every enriched repo at 0.200+ lifts the aggregate. Candidates: celery (Python, 80K LOC), Spring Boot (Java/Kotlin). Target: 16+ repos, 300 tasks. | Corpus expansion |
| 22a | Homebrew corpus repo (blocked) | 278K LOC Ruby, 8,476 nodes, density 15.2. Tree-sitter P@10 = 0.275 (no embeddings). 20 fixtures written. Blocked on Ruby LSP enrichment. Investigated extensively (session 19): (1) ruby-lsp's composed bundle uses bundle exec which fails when project has BUNDLE_DISABLE_SHARED_GEMS/BUNDLE_PATH in .bundle/config (Homebrew-specific). Even with gem in Gemfile + lockfile + vendor/bundle, bundler 4.0 can't find the executable. (2) BUNDLE_GEMFILE="" bypasses bundler but ruby-lsp produces zero semantic edges (syntax only). (3) solargraph too slow (9+ min on 23K LOC Jekyll, timeout on 278K LOC Homebrew). (4) .bundle/config rename: ruby-lsp caches composed bundle state. Root cause: ruby-lsp requires functioning bundler context for semantic resolution, Homebrew's bundler config is incompatible. Unblock path: try on a Ruby repo without custom bundler config (Discourse, Sidekiq), or wait for ruby-lsp --use-launcher flag to mature. | Blocked |
| 23 | Fixture quality review | Manual review of 60 agent-created fixtures (caddy, ocelot, fastapi). Agent ground truth may include technically correct but practically unhelpful symbols. Tuning fixture quality is higher ROI than code changes. A wrong ground truth symbol penalizes the system unfairly. Will be partially obsoleted by AI-generated evaluation corpus (#5 in Immediate Priorities). | Quality |
| 18 | Feedback parameter sweep (warm-start) | Session boost (0.20), task memory formula (0.5+score*0.4), decay (7-day linear), top-N (5) are untuned. Only affects real-user compounding. | When users exist |
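The graduated gap-fill weight in row 17b can be sketched as a small step function. The 1.0 weight for 0-1 BM25 seeds, the 0.5 weight for 3-4 seeds, and the cutoff at 5 come from the table; the value for exactly 2 seeds is not specified there, so the 0.75 midpoint below is an assumption, as is the function name.

```go
package main

import "fmt"

// gapFillWeight maps the number of seeds BM25 found to a gap-fill weight.
// 0-1 seeds -> 1.0 and 3-4 seeds -> 0.5 follow row 17b; the weight for
// exactly 2 seeds (0.75) is an assumed midpoint. At 5 or more seeds
// gap-fill stays off, matching the existing binary threshold.
func gapFillWeight(bm25Seeds int) float64 {
	switch {
	case bm25Seeds <= 1:
		return 1.0
	case bm25Seeds == 2:
		return 0.75
	case bm25Seeds < 5:
		return 0.5
	default:
		return 0
	}
}

func main() {
	for seeds := 0; seeds <= 6; seeds++ {
		fmt.Printf("bm25 seeds=%d -> gap-fill weight %.2f\n", seeds, gapFillWeight(seeds))
	}
}
```

Given that the threshold sweep in row 17a was neutral, this is best read as a description of the experiment's shape rather than an expected win.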
Continuous Adaptation (moat, not P@10)
The adaptive infrastructure is knowing's core differentiator. Competitors use fixed strategies. knowing observes its own graph and adjusts retrieval automatically. Seven mechanisms ship today (PreferTypeSeeds, adaptive seed count, equiv classes, gap-fill, task memory, Merkleized feedback, LSP phantom nodes).
Honest assessment (session 20): All five items below are parameter optimization in different flavors. 51 experiments across sessions 8-20 have proven that P@10 is reachability-determined: only new edges or new seed sources move the metric. Seed count sweeps (32 configs), gap threshold sweeps (15 configs), edge weight sweeps, and disconnection-rate seeding all produced zero variance. These items are valuable for product differentiation ("self-adapting") and user experience on diverse codebases, but they will not move P@10 on the benchmark.
| # | Item | What adapts | Priority |
|---|---|---|---|
| 30 | Graph topology features for seed strategy | Disconnection rate, path length, clustering coefficient shape walk strategy. Partially tested (#48: disconnection rate alone was redundant with node count). | Moat (won't move P@10) |
| 27 | Per-query confidence estimation | Estimate seed quality pre-RWR, adjust gap-fill aggressiveness. But gap threshold sweep (15 configs) was neutral. | Moat (won't move P@10) |
| 26 | Continuous density-proportional seeding | Smooth function replacing threshold steps. But seed count sweep (10-50) was zero variance. | Moat (won't move P@10) |
| 28 | Learned edge weights from ground truth | Train optimal RWR weights from 277-task corpus. But 32-config parameter sweep was zero variance. | Moat (won't move P@10) |
| 29 | Feedback-driven per-repo thresholds | System discovers own parameters from task memory. Parameters don't matter, but UX improves. | Moat (requires users) |
| 25 | ~~Co-change edges from git history~~ | Tested session 20: full redesign with proper concurrency (writeMu, atomic stats, producer-consumer). Deepened all 12 corpus clones to 200+ commits. Three configs tested: (1) min=1 cap=50: Django +0.013, k8s +0.042, but cargo -0.066. Full corpus 0.263 (-0.004). (2) min=1 cap=5: cargo -0.018. (3) min=2 cap=5: cargo -0.004, full corpus 0.267 (exactly baseline). Per-repo wins and losses cancel out. Bulk refactor commits create O(n^2) noisy pairs that dilute RWR on dense graphs; filtering the noise also filters the signal. | Neutral |
Next-generation retrieval (beyond incremental experiments)
55 experiments across sessions 8-20 exhausted the incremental path (adding edges, tuning parameters, swapping models). Session 21 broke through the 0.267 ceiling with focused seed selection (#36): cluster seeds by package path and concentrate the walk in the dominant structural neighborhood. Combined with cluster-aware gap-fill, P@10 = 0.283 (+6.0% over 0.267). The insight: seed quality (structural cohesion) matters more than seed quantity. 57 experiments proved count doesn't matter, but cohesion was an untested dimension.
| # | Item | Approach | Why it might work |
|---|---|---|---|
| 31 | Query-time LLM symbol prediction | Ask an LLM to predict likely symbol names from the task description before retrieval. "In Django, a field validator would be clean, BaseValidator.__call__." Inject predictions as high-confidence seeds. | Solves the vocabulary gap with intelligence instead of string matching. The "find" half done by reasoning, not BM25. Trade-off: adds LLM latency and cost. Could be optional (local model or API). |
| 32 | ~~Per-repo graph pruning~~ | Tested session 20 (#56). Three configs on cargo: exclude similar_to (0.245, reachability lost), exclude references (0.268, +0.009), ghost references at 0.05 weight (0.264). Full corpus ghost edges: 0.264 (-0.003). Density-adaptive ghost (threshold 5.0): same. Per-repo wins cancel losses. | Neutral |
| 33 | Two-phase retrieval (search-walk-search) | Phase 1: current BM25+RWR finds a neighborhood (~500 nodes). Phase 2: run BM25 again within that neighborhood only, re-seeding with the most relevant matches. First walk finds the structural area; second search finds specific symbols within it. No ML required, uses existing infrastructure. Most practical next step. | The core problem: 15-25 seeds dilute the walk across the graph. Two-phase narrows the search space before the final ranking. Phase 1 answers "what area of the code?" Phase 2 answers "which specific symbols?" |
| 34 | Ground truth expansion | Current 277 tasks may have incomplete ground truth. If the system finds useful symbols that aren't in the ground truth, it's penalized unfairly. Systematic review: for each zero-scoring task, examine what the system actually returns and judge relevance independently. | Free P@10 if ground truth is wrong. Session 20 confirmed fixtures are valid (symbols exist, are connected), but relevance of returned symbols was not reviewed. The system might be returning contextually useful symbols that aren't in the curated ground truth. |
| 35 | Query-conditioned walk | Weight edges differently per query during RWR, not just by edge type. "Validate request body" amplifies edges toward validators, attenuates edges toward serializers. The walk becomes query-aware. Could use query keywords to boost edges whose target node names match, or train a lightweight model to predict per-query edge relevance. | The fundamental bottleneck: RWR walks blind from seeds. It doesn't know what it's looking for. On a graph with density 13.5, each step splits probability 13 ways. Query-conditioned edges focus the walk toward the answer. |
| 36 | ~~Focused seed selection~~ | SHIPPED session 21 (#58). Cluster RRF candidates by package path, promote largest cluster. Combined with cluster-aware gap-fill (embedding seeds filtered to dominant package). Full corpus: 0.283 vs 0.267 (+0.016, +6.0%). Django: 0.275 vs 0.253 (+8.7%). First experiment to break the session 20 ceiling. | Shipped |
| 37 | Learned scoring from ground truth | Train a lightweight model (logistic regression, small NN) on the 277-task corpus. Features: BM25 rank, node degree, path distance to nearest type node, edge type distribution, package depth. Predict: is this candidate ground truth? Even a simple model could outperform the hand-tuned RankSymbols formula. | We have labeled data (277 tasks with ground truth) that we only use for evaluation, never for training. Risk: overfitting to the benchmark repos. Mitigation: leave-one-repo-out cross-validation. |
| 38 | Per-query edge type selection | For a "middleware" query, prefer calls/implements edges. For a "configuration" query, prefer configures/imports. Map query concepts to edge type weight profiles. Hand-curated profiles (like equiv classes) or learned from the corpus. | Different task types traverse different parts of the graph. A query about "error handling" should walk along throws/catches edges; a query about "routing" should walk along handles_route edges. Current RWR uses fixed weights for all queries. |
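To make the query-conditioned walk (item 35) concrete, here is a toy sketch: a random walk with restart computed by power iteration, where edges whose target name contains a query keyword are boosted before the walk. The graph type, the naive substring match, and the boost factor are all assumptions for illustration; none of this mirrors knowing's actual RWR code.

```go
package main

import (
	"fmt"
	"strings"
)

// Edge is a directed, weighted edge to a named target node.
type Edge struct {
	To     string
	Weight float64
}

// Graph maps a node name to its outgoing edges.
type Graph map[string][]Edge

// conditioned returns a copy of g in which every edge whose target name
// contains one of the query keywords has its weight multiplied by boost.
func conditioned(g Graph, keywords []string, boost float64) Graph {
	out := make(Graph, len(g))
	for from, edges := range g {
		for _, e := range edges { // e is a copy, so g is left untouched
			name := strings.ToLower(e.To)
			for _, kw := range keywords {
				if strings.Contains(name, strings.ToLower(kw)) {
					e.Weight *= boost
					break
				}
			}
			out[from] = append(out[from], e)
		}
	}
	return out
}

// rwr runs random walk with restart by power iteration.
// alpha is the restart probability; mass on dangling nodes returns to the seeds.
func rwr(g Graph, seeds []string, alpha float64, iters int) map[string]float64 {
	restart := make(map[string]float64, len(seeds))
	for _, s := range seeds {
		restart[s] += 1 / float64(len(seeds))
	}
	score := make(map[string]float64, len(restart))
	for n, v := range restart {
		score[n] = v
	}
	for i := 0; i < iters; i++ {
		next := make(map[string]float64, len(score))
		for n, v := range restart {
			next[n] += alpha * v
		}
		for n, mass := range score {
			total := 0.0
			for _, e := range g[n] {
				total += e.Weight
			}
			if total == 0 {
				for s, v := range restart {
					next[s] += (1 - alpha) * mass * v
				}
				continue
			}
			for _, e := range g[n] {
				next[e.To] += (1 - alpha) * mass * e.Weight / total
			}
		}
		score = next
	}
	return score
}

func main() {
	g := Graph{
		"handle_request": {{"validate_body", 1}, {"serialize_response", 1}},
	}
	q := conditioned(g, []string{"validate", "request", "body"}, 3)
	for node, s := range rwr(q, []string{"handle_request"}, 0.15, 20) {
		fmt.Printf("%-20s %.3f\n", node, s)
	}
}
```

With the boost applied, the validator receives a larger share of the walk than the serializer; without it the two siblings tie, which is the "walks blind" behavior the table describes.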
Edge Type Expansion
38 edge types shipped. See Edge Types Reference and CHANGELOG for full details. Recent additions: accesses_field (36th, 6 languages), reads_env (37th, supply chain), executes_process (38th, supply chain).
Remaining failure analysis (sessions 13-14):
- Django: 117/192 ground truth symbols unreachable. Root cause: framework base classes referenced by type hint and interface contract, not direct call.
- Kubernetes: 71/116 unreachable. Root cause: interface-heavy architecture where functions accept interfaces but ground truth is concrete implementations.
- Kafka: 50/93 unreachable. Root cause: consumer/producer patterns referenced via type parameters and configuration.
P2: Structural edges
| Category | Items | Status |
|---|---|---|
| Runtime | runtime_queries, runtime_connects_to | Planned |
| Configuration | configures (config key to symbol that reads it) | Planned |
| Agent workflow | suggested_for_task / used_by_agent | Planned |
Observability Ingestion
Beyond OTLP traces (shipped), these observability signals map to graph edges. The pattern: any system that records "X talked to Y" at runtime becomes a runtime_* edge. Static analysis says what CAN happen. Runtime signals say what DID happen. The diff is where findings live.
| Signal Source | Edge Types | What It Enables | Priority |
|---|---|---|---|
| Database query logs (pg_stat_statements, slow query log) | queries_table, writes_table, reads_table | "Change this table schema, what code breaks?" | P2 |
| HTTP access logs (nginx, ALB, API gateway) | runtime_serves, frequency metadata | Dead route detection without full APM | P2 |
| Message queue metrics (Kafka consumer lag, SQS depth) | runtime_consumes, runtime_produces | Verify static pub/sub edges against reality | P2 |
| Error tracking (Sentry, Bugsnag) | runtime_throws, error frequency | Prioritize blast radius by error-prone paths | P3 |
| Container orchestration (K8s events) | runs_on, colocated_with | Infrastructure topology in the graph | P4 |
| Service mesh (Envoy, Istio, Consul) | runtime_connects_to | Compare declared vs actual service topology | P4 |
| Continuous profiling (pprof) | hot_path, duration metadata | Weight blast radius by performance impact | P4 |
Key insight: Static edge with no runtime observation = dead code candidate. Runtime observation with no static edge = undocumented dependency. Both agree = high-confidence relationship.
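The three-way classification in the key insight reduces to set membership. A minimal sketch, assuming edges are compared by a plain (From, To) pair; the type and constant names are hypothetical, and a real comparison would likely also key on edge type:

```go
package main

import "fmt"

// Edge is a directed relationship between two symbols or services.
type Edge struct{ From, To string }

// Finding labels an edge by which evidence supports it.
type Finding string

const (
	DeadCodeCandidate      Finding = "dead code candidate"     // static only
	UndocumentedDependency Finding = "undocumented dependency" // runtime only
	HighConfidence         Finding = "high confidence"         // both agree
)

// classify compares statically extracted edges with runtime-observed edges.
func classify(static, observed []Edge) map[Edge]Finding {
	seen := make(map[Edge]bool, len(observed))
	for _, e := range observed {
		seen[e] = true
	}
	out := make(map[Edge]Finding, len(static)+len(observed))
	for _, e := range static {
		if seen[e] {
			out[e] = HighConfidence
		} else {
			out[e] = DeadCodeCandidate
		}
	}
	for _, e := range observed {
		if _, ok := out[e]; !ok {
			out[e] = UndocumentedDependency
		}
	}
	return out
}

func main() {
	static := []Edge{{"api.Handler", "db.Query"}, {"api.Handler", "legacy.Export"}}
	observed := []Edge{{"api.Handler", "db.Query"}, {"api.Handler", "cache.Get"}}
	for e, f := range classify(static, observed) {
		fmt.Printf("%s -> %s: %s\n", e.From, e.To, f)
	}
}
```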
Underexploited Capabilities
| Item | Next step |
|---|---|
| Edge event log | Temporal queries: "when did this dependency appear?" |
| Leiden algorithm | Add via community registry when a Go implementation exists |
Phase 4: Remaining Items
| Feature | Status |
|---|---|
| Federated sync (exchange roots, transfer only differing branches) | Planned |
| Merkle-based bisection (binary search on snapshot chain) | Planned |
| Lazy materialization (load only visited subtrees; triggered at ~1M+ edges) | Planned |
Cross-Repo Validation
Tier 1: Synthetic Multi-Repo Fixture (built)
3 Go modules at test/cross-repo/. Cross-repo edge resolution verified. Remaining dogfooding tests:
- knowing prove across repos
- knowing prove-absent across repos
- knowing audit across repos
- knowing export to knowing-viz with cross-repo edges
- blast_radius on module-a function showing callers in B and C
- Incremental invalidation across repos
Tier 1.5: Java Monolith + Frontend (cross-language validation)
Target: Spring PetClinic (Java REST API) + React/Vue frontend consuming it.
What it validates:
- Cross-language HTTP edges: TypeScript fetch() → Java @GetMapping resolution
- Java extractor correctness: Spring Boot annotations, layered architecture (Controller → Service → Repository)
- API contract detection: Which frontend components consume which backend endpoints
- Runtime vs static comparison: Spin up service, generate OTLP traces, compare observed vs extracted edges
- Full-stack test scope: Change Java service → knowing surfaces which frontend tests to run
- Dead endpoint detection: REST endpoints defined but never called (static or runtime evidence)
- Breaking change prevention: "You're removing /api/users but 5 frontend components call it"
Why useful:
- Knowing is heavily validated on Go (dogfooding itself), less on Java/TypeScript
- REST API consumption edges aren't validated cross-language yet
- Enables full-stack test selection (backend change → frontend tests)
- Realistic monolith structure (50K LOC, deep call hierarchies, framework-heavy)
Effort: Low (4-8 hours to set up, index, validate)
Priority: After session memory persistence (Priority #2). Useful once we have real users requesting Java/cross-language support.
Tier 2: Grafana Ecosystem (scale validation)
Grafana + Loki + Tempo + Mimir (~1.3M LOC, 4 repos). Validates cross-repo at realistic scale. Run manually, not in CI.
Production Scale: Permanent Runtime Record
The endgame: knowing with continuous OTLP trace ingestion alongside static analysis. After a year:
- Static edges: ~150K (stable)
- Runtime edges: millions (every observed call path)
- Snapshot chain: 365+ daily snapshots
Git-Inspired Optimizations
Derived from a deep dive into git's C implementation (pack-objects, commit-graph, refs, bitmaps, merge-ort, shallow clones).
Medium (1-3 days):
| Capability | Git Pattern | Why |
|---|---|---|
| Filter-based graph materialization | list-objects-filter.c | Push predicates into SQL queries; context retrieval skips irrelevant subgraphs (2-5x speedup) |
| Persistent named snapshot refs | refs/packed-backend.c | knowing tag stable, knowing diff stable..latest; stored in snapshot_refs table |
| Bloom filters for package changes | commit-graph bloom filters | Per-snapshot bloom filter of changed packages; eliminates edge_events scan during diff |
| Snapshot-graph acceleration file | commit-graph binary format | Binary file with fanout+hashes+metadata avoids N SQL queries for chain walking |
| String interning for package paths | merge-ort strmap | Pointer equality for hot-path comparisons; reduce allocation pressure |
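As an illustration of the Bloom filter row, the sketch below builds a per-snapshot filter of changed package paths so a diff can skip snapshots that definitely did not touch a package. It uses double hashing over two FNV variants from the standard library; the sizes, type name, and method names are assumptions, not the planned on-disk format.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Bloom is a fixed-size Bloom filter: no false negatives, some false positives.
type Bloom struct {
	bits []uint64
	m    uint64 // number of bits
	k    uint64 // probes per key
}

// NewBloom allocates a filter with mBits bits and k probes per key.
func NewBloom(mBits, k uint64) *Bloom {
	if mBits == 0 {
		mBits = 64
	}
	if k == 0 {
		k = 1
	}
	return &Bloom{bits: make([]uint64, (mBits+63)/64), m: mBits, k: k}
}

// hashes derives two base hashes; the second is forced odd so probes advance.
func hashes(key string) (uint64, uint64) {
	a := fnv.New64a()
	a.Write([]byte(key))
	b := fnv.New64()
	b.Write([]byte(key))
	return a.Sum64(), b.Sum64() | 1
}

// Add records a changed package path.
func (f *Bloom) Add(key string) {
	h1, h2 := hashes(key)
	for i := uint64(0); i < f.k; i++ {
		pos := (h1 + i*h2) % f.m
		f.bits[pos/64] |= uint64(1) << (pos % 64)
	}
}

// MayContain reports false only if the key was definitely never added.
func (f *Bloom) MayContain(key string) bool {
	h1, h2 := hashes(key)
	for i := uint64(0); i < f.k; i++ {
		pos := (h1 + i*h2) % f.m
		if f.bits[pos/64]&(uint64(1)<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	changed := NewBloom(1024, 3)
	changed.Add("internal/graph")
	changed.Add("internal/retrieval")
	for _, pkg := range []string{"internal/graph", "cmd/knowing"} {
		fmt.Printf("%s may have changed: %v\n", pkg, changed.MayContain(pkg))
	}
}
```

A "false" answer lets the diff skip the edge_events scan for that snapshot; a "true" answer still requires the scan, since false positives are possible.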
Architectural (3-5 days):
| Capability | Git Pattern | Why |
|---|---|---|
| EWAH edge-reachability bitmaps | pack-bitmap.c | One bit per edge per snapshot; Diff = XOR + popcount instead of O(E) scan; blast_radius via precomputed reachability |
| XOR-compressed bitmap chains | stored_bitmap.xor | Store consecutive snapshot bitmaps as XOR deltas; 100 snapshots in <10KB vs 125KB |
| Delta-compressed snapshot packs | diff-delta.c, Rabin fingerprint | Sliding-window delta over edge groups; 40-60% smaller sync payloads |
| Promisor nodes (lazy cross-repo) | shallow.c promisor semantics | Record cross-repo edge targets as "promisor" nodes; fetch full data on-demand from source DB |
| Three-way graph merge | merge-ort.c staged computation | Federated sync with conflict awareness: confidence_conflict, provenance_conflict, type_conflict |
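The "Diff = XOR + popcount" idea and the XOR delta chain can both be shown on plain, uncompressed bitmaps. This sketch deliberately skips EWAH compression and assumes both snapshots share one edge-ID space and bitmap length; the type and function names are illustrative.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Bitmap holds one bit per edge ID for a single snapshot (uncompressed).
type Bitmap []uint64

// NewBitmap allocates a bitmap able to hold nEdges bits.
func NewBitmap(nEdges int) Bitmap { return make(Bitmap, (nEdges+63)/64) }

// Set marks an edge as present in the snapshot.
func (b Bitmap) Set(id int) { b[id/64] |= uint64(1) << (uint(id) % 64) }

// Xor returns a ^ b word by word. Both bitmaps must have the same length.
func Xor(a, b Bitmap) Bitmap {
	out := make(Bitmap, len(a))
	for i := range a {
		out[i] = a[i] ^ b[i]
	}
	return out
}

// Popcount returns the number of set bits.
func Popcount(b Bitmap) int {
	n := 0
	for _, w := range b {
		n += bits.OnesCount64(w)
	}
	return n
}

func main() {
	prev, cur := NewBitmap(128), NewBitmap(128)
	prev.Set(1)
	prev.Set(70)
	cur.Set(70)
	cur.Set(100)

	delta := Xor(prev, cur) // bits set where the snapshots disagree
	fmt.Println("edges added or removed:", Popcount(delta))

	// Storing only the delta is enough: prev ^ delta reproduces cur.
	restored := Xor(prev, delta)
	fmt.Println("round-trip ok:", Popcount(Xor(restored, cur)) == 0)
}
```

Consecutive snapshots usually differ in few edges, so the delta bitmaps are mostly zero, which is what makes the compressed chain small.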
What's Needed at Scale
| Capability | Why |
|---|---|
| Lazy materialization | Load only visited subtrees at millions of edges |
| Merkle bisection | O(log N) snapshot search instead of O(N) |
| Parallel tree hashing | Concurrent bottom-up hash computation for 1M+ edge trees. Current computeMerkleRoot is single-threaded; goroutine pool pattern for leaf-level parallelism. |
| Partitioned storage | Static and runtime edges have different lifecycles |
| Runtime edge compaction | Collapse observation history |
| Federated sync | CI instance + production instance exchange diffs |
| Drift alerts | Static analysis vs production traffic divergence |
| Dashboard | Real-time runtime graph visualization |
| Automated compliance reports | Scheduled knowing audit with diff against prior |
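For the parallel tree hashing row, one way the goroutine pool pattern could look is sketched below: leaves are hashed by a fixed number of workers, and the upper levels are then folded sequentially. This is not the existing computeMerkleRoot; the SHA-256 choice, the pairwise fold, and the handling of an odd trailing node are assumptions made to keep the example self-contained.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// Hash is a SHA-256 digest.
type Hash = [sha256.Size]byte

// hashLeaves hashes every leaf using a fixed pool of workers.
// Results are written by index, so output order matches input order.
func hashLeaves(leaves [][]byte, workers int) []Hash {
	if workers < 1 {
		workers = 1
	}
	out := make([]Hash, len(leaves))
	idx := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range idx {
				out[i] = sha256.Sum256(leaves[i])
			}
		}()
	}
	for i := range leaves {
		idx <- i
	}
	close(idx)
	wg.Wait()
	return out
}

// merkleRoot folds a level pairwise until one hash remains.
// An odd trailing node is promoted unchanged (an assumption; some
// schemes duplicate it instead).
func merkleRoot(level []Hash) Hash {
	if len(level) == 0 {
		return Hash{}
	}
	for len(level) > 1 {
		next := make([]Hash, 0, (len(level)+1)/2)
		for i := 0; i < len(level); i += 2 {
			if i+1 == len(level) {
				next = append(next, level[i])
				continue
			}
			pair := make([]byte, 0, 2*sha256.Size)
			pair = append(pair, level[i][:]...)
			pair = append(pair, level[i+1][:]...)
			next = append(next, sha256.Sum256(pair))
		}
		level = next
	}
	return level[0]
}

func main() {
	leaves := [][]byte{[]byte("edge-1"), []byte("edge-2"), []byte("edge-3")}
	root := merkleRoot(hashLeaves(leaves, 4))
	fmt.Printf("root: %x\n", root)
}
```

Because each worker writes only its own index, the root is identical regardless of worker count or scheduling, which matters for content-addressed snapshots.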
Commercial Angle
| Offering | Revenue model |
|---|---|
| knowing Cloud | Managed hosting, per-service pricing |
| Compliance reporting | Automated quarterly audit reports with proofs |
| Federated sync service | Org-wide intelligence sharing |
| Drift detection | Alerts on static/runtime divergence |
| Enterprise dashboard | Cross-repo visualization, team analytics |
Git Design Audit (open items from docs/architecture/git-design-audit.md)
All CRITICAL and HIGH items shipped (session 12). Remaining are LOW priority.
| # | Item | Priority | Effort | Verdict |
|---|---|---|---|---|
| 9.2 | MaxOpenConns(1) on SQLite | Do now | 1 line | Free perf. Single writer, no reason for connection pool. |
| 5.2 | Incremental snapshot computation | Do eventually | 3h | Real speedup on large repos. Compute snapshot from changed files only. |
| 7.1 | Named snapshot refs (snapshot_refs table) | Do eventually | 4h | Needed for knowing tag v1.0 and diff-mode supply chain product. |
| 7.2 | Reflog table | Only if 7.1 ships | 2h | Audit trail for ref mutations. Pointless without named refs. |
| 5.1 | ReconstructEdgeSet from event log | Skip | 1 week | Over-engineering. SQLite has the full edge table. Nobody replays events. |
| 2.3 | Edge observation column split | Skip | 1 day | Premature optimization. No repo has hit row-size bottleneck. |
| 10.1 | Merkle-diff sync protocol | Not yet | 2 weeks | Zero users need multi-machine sync. Build when someone asks. |
| 10.2 | knowing export / knowing import | Maybe | 1 week | Useful for platform API. But cp knowing.db works today. |
Git-Inspired (Not Yet Built)
| Item | Priority | Effort |
|---|---|---|
| Proposed graph overlay (staging area) | P2 | Medium |
| Delta-compressed snapshots | P3 | High |
| N-way hierarchical diff | P3 | Medium |
| Rerere (enrichment conflict resolution) | P4 | Low |
| Transfer protocol (federated sync) | P4 | High |
| Replace/grafts (edge correction) | P4 | Medium |