Architecture Documentation
This is the full architecture specification for knowing, split into focused subdocuments. Each covers one area of the system.
New to knowing? Start with the Introduction, which builds understanding from zero and assumes no background. It covers the problem, content-addressing, hierarchical Merkle trees, proofs, and learning with worked examples.
Reading Order
If you are new to knowing, read these in order:
- Concepts (foundational vocabulary)
- System Overview (component map, two-tier extraction, edge types)
- Extraction Pipeline (tree-sitter, 23 extractors, post-processing)
- Enrichment Pipeline (LSP enrichment, phantom nodes, gopls warmup)
- Retrieval Pipeline (seeds, RWR, HITS, scoring, gap-fill seeds)
- Data Flow (end-to-end trace of a single commit)
- Edge Types (38 edge types with RWR weights)
- Embedding Re-ranker (disabled, confirmed neutral, vector cache architecture)
- Context Engine (ForTask/ForFiles/ForPR entry points, scoring formula)
- Wire Formats (GCF, binary, JSON codec system)
- Design Principles (goals, architectural planes, MCP tool split)
- Deep Dives (15 foundational architecture decisions)
- Merkle Tree Algorithms (hierarchical roots, subgraph caching)
- Git Design Audit (gap analysis vs. git's reference implementation)
Documents
| Document | What it covers |
|---|---|
| concepts.md | Content-addressed storage, Merkle DAG, domain primitives (Node, Edge, Hash, Snapshot, Provenance), event sourcing, staleness, artifact boundary |
| system-overview.md | Component map, language-agnostic graph model, two-tier extraction (tree-sitter + LSP), 26 extractor packages, parallel indexer (8 workers, 1.8s), multi-language auto-detection, edge type taxonomy |
| data-flow.md | End-to-end trace of a single commit: git detection, indexing, snapshot computation (hierarchical Merkle), LSP enrichment |
| concurrency.md | Daemon goroutine architecture, RWMutex coordination, channel buffer sizes, SQLite WAL mode |
| runtime-traces.md | OTLP trace ingestion, span-to-edge mapping, confidence scoring, production observability edges |
| context-engine.md | Retrieval pipeline: 5-channel seed fusion, RWR, HITS, BM25, knapsack packing, 115 equivalence classes, concept thesaurus |
| wire-formats.md | GCF (84% token savings), binary, JSON codec, format comprehension eval |
| cli-commands.md | All CLI commands: index, export (with --algorithm flag), watch, why, enrich (blame, coverage), mcp, init, fsck |
| data-model.md | SQLite schema, 17 migrations, identity vs metadata layers, cross-repo edges, Merkle tree storage, per-repo isolation, GraphStore interface, why SQLite. |
| design-principles.md | Nine design goals, three architectural planes, MCP tool split, artifact boundary |
| deep-dives.md | 15 foundational architecture decisions with rationale and retrofit cost |
| merkle-algorithms.md | 13 Merkle tree algorithms: hierarchical roots, subgraph caching, incremental recompute, context packs, proofs, federated sync, semantic change classification, bisection. Phase 1+2+3 shipped. |
| merkle-proofs.md | Merkle proof format, generation/verification, CLI (knowing prove/knowing verify), performance (72 µs generate, 1.2 µs verify), use cases (audit, CI gates, federated trust). |
| adr-hierarchical-merkle.md | Architecture decision record: why the hierarchical Merkle tree changes knowing's identity from integrity mechanism to performance architecture. |
| git-design-audit.md | Systematic audit of knowing's content-addressed design against git's reference implementation: 10 areas, 23 recommendations, severity-ranked. |
| cross-repo.md | Per-repo isolation model, content-addressed cross-repo identity, roster infrastructure, module mapping, phantom external nodes, limitations, and architectural proofs from the cross-repo fixture test. |
| semantic-pr-diff.md | Relationship-level PR diff: design, output format, implementation (internal/diff/), MCP tools (snapshot_diff, semantic_diff, pr_impact), CLI (knowing audit-diff), GitHub Actions workflow. |
| extraction-pipeline.md | Tree-sitter extraction: 23 extractors, multi-dispatch, post-processing (9 steps), producer-consumer pipeline, content-addressed hashing, incremental indexing, CLI usage. |
| enrichment-pipeline.md | LSP enrichment: three phases (readiness, upgrade, discovery), phantom nodes, two-phase gopls warmup, multi-module Go support, per-symbol timeout, performance characteristics. |
| retrieval-pipeline.md | Full retrieval reference: keyword extraction, 5-channel RRF seed fusion, RWR (parameters, edge weights, adjacency cache), HITS, scoring formula, gap-fill seeds, budget packing, session/task memory. |
| embedding-reranker.md | Embedding architecture: both re-ranker and gap-fill confirmed neutral on cold start (session 23). nomic-embed-text model, pure Go ONNX, SQLite vector cache. Infrastructure preserved, disabled by default. |
| edge-types.md | Full catalog of 38 edge types with RWR weights, categories, and provenance. |
| equivalence-classes.md | Equivalence class system: 263 concepts across 4 layers (seed, universal, language-specific, framework). |
| context-packing.md | Context packing: density-ranked greedy knapsack, token estimation, persistent pack cache, staleness detection. |
| hooks-integration.md | Git hooks integration: post-commit, post-checkout, pre-push hooks for daemon change detection. |
| wire-formats-guide.md | Practical guide to wire format usage and integration. |
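
For readers who want a feel for one of the techniques named above before opening the subdocuments, here is a minimal sketch of a density-ranked greedy knapsack, the approach listed under context-packing.md. It is illustrative only: the `Item` type, its fields, and the `PackGreedy` function are hypothetical names chosen for this example, not knowing's actual API, and the real packer may differ in tie-breaking, token estimation, and caching.

```go
package main

import (
	"fmt"
	"sort"
)

// Item is a hypothetical candidate for a context pack (not knowing's real type).
type Item struct {
	ID     string
	Score  float64 // relevance score assigned by retrieval
	Tokens int     // estimated token cost of including the item
}

// PackGreedy ranks candidates by score per token (density) and adds each
// one, in that order, if it still fits in the remaining token budget.
func PackGreedy(items []Item, budget int) []Item {
	ranked := make([]Item, 0, len(items))
	for _, it := range items {
		if it.Tokens > 0 { // skip items without a usable cost estimate
			ranked = append(ranked, it)
		}
	}
	// Stable sort keeps the input order for items with equal density.
	sort.SliceStable(ranked, func(i, j int) bool {
		di := ranked[i].Score / float64(ranked[i].Tokens)
		dj := ranked[j].Score / float64(ranked[j].Tokens)
		return di > dj
	})

	var packed []Item
	remaining := budget
	for _, it := range ranked {
		if it.Tokens <= remaining {
			packed = append(packed, it)
			remaining -= it.Tokens
		}
	}
	return packed
}

func main() {
	items := []Item{
		{ID: "a", Score: 9, Tokens: 300},
		{ID: "b", Score: 4, Tokens: 100},
		{ID: "c", Score: 5, Tokens: 250},
		{ID: "d", Score: 1, Tokens: 50},
	}
	for _, it := range PackGreedy(items, 450) {
		fmt.Println(it.ID, it.Tokens)
	}
}
```

With a budget of 450 tokens, this sketch selects b, a, and d: c has the same density as d but no longer fits once b and a are packed. Greedy packing by density is not guaranteed to be optimal, but it is simple and runs in O(n log n).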