Architecture Documentation

The full architecture specification for knowing, split into focused subdocuments. Each covers one area of the system.

New to knowing? Start with the Introduction, which builds understanding from zero and assumes no background. It covers the problem, content-addressing, hierarchical Merkle trees, proofs, and learning with worked examples.

Reading Order

If you are new to knowing, read these in order:

Concepts (foundational vocabulary)
System Overview (component map, two-tier extraction, edge types)
Extraction Pipeline (tree-sitter, 23 extractors, post-processing)
Enrichment Pipeline (LSP enrichment, phantom nodes, gopls warmup)
Retrieval Pipeline (seeds, RWR, HITS, scoring, gap-fill seeds)
Data Flow (end-to-end trace of a single commit)
Edge Types (38 edge types with RWR weights)
Embedding Re-ranker (disabled, confirmed neutral, vector cache architecture)
Context Engine (ForTask/ForFiles/ForPR entry points, scoring formula)
Wire Formats (GCF, binary, JSON codec system)
Design Principles (goals, architectural planes, MCP tool split)
Deep Dives (15 foundational architecture decisions)
Merkle Tree Algorithms (hierarchical roots, subgraph caching)
Git Design Audit (gap analysis vs. git's reference implementation)

Documents

Document	What it covers
concepts.md	Content-addressed storage, Merkle DAG, domain primitives (Node, Edge, Hash, Snapshot, Provenance), event sourcing, staleness, artifact boundary
system-overview.md	Component map, language-agnostic graph model, two-tier extraction (tree-sitter + LSP), 26 extractor packages, parallel indexer (8 workers, 1.8s), multi-language auto-detection, edge type taxonomy
data-flow.md	End-to-end trace of a single commit: git detection, indexing, snapshot computation (hierarchical Merkle), LSP enrichment
concurrency.md	Daemon goroutine architecture, RWMutex coordination, channel buffer sizes, SQLite WAL mode
runtime-traces.md	OTLP trace ingestion, span-to-edge mapping, confidence scoring, production observability edges
context-engine.md	Retrieval pipeline: 5-channel seed fusion, RWR, HITS, BM25, knapsack packing, 115 equivalence classes, concept thesaurus
wire-formats.md	GCF (84% token savings), binary, JSON codec, format comprehension eval
cli-commands.md	All CLI commands: index, export (with --algorithm flag), watch, why, enrich (blame, coverage), mcp, init, fsck
data-model.md	SQLite schema, 17 migrations, identity vs metadata layers, cross-repo edges, Merkle tree storage, per-repo isolation, GraphStore interface, rationale for choosing SQLite
design-principles.md	Nine design goals, three architectural planes, MCP tool split, artifact boundary
deep-dives.md	15 foundational architecture decisions with rationale and retrofit cost
merkle-algorithms.md	13 Merkle tree algorithms: hierarchical roots, subgraph caching, incremental recompute, context packs, proofs, federated sync, semantic change classification, bisection. Phase 1+2+3 shipped.
merkle-proofs.md	Merkle proof format, generation/verification, CLI (`knowing prove`/`knowing verify`), performance (72 µs to generate, 1.2 µs to verify), use cases (audit, CI gates, federated trust).
adr-hierarchical-merkle.md	Architecture decision record: why the hierarchical Merkle tree changes knowing's identity from integrity mechanism to performance architecture.
git-design-audit.md	Systematic audit of knowing's content-addressed design against git's reference implementation: 10 areas, 23 recommendations, severity-ranked.
cross-repo.md	Per-repo isolation model, content-addressed cross-repo identity, roster infrastructure, module mapping, phantom external nodes, limitations, and architectural proofs from the cross-repo fixture test.
semantic-pr-diff.md	Relationship-level PR diff: design, output format, implementation (`internal/diff/`), MCP tools (`snapshot_diff`, `semantic_diff`, `pr_impact`), CLI (`knowing audit-diff`), GitHub Actions workflow.
extraction-pipeline.md	Tree-sitter extraction: 23 extractors, multi-dispatch, post-processing (9 steps), producer-consumer pipeline, content-addressed hashing, incremental indexing, CLI usage.
enrichment-pipeline.md	LSP enrichment: three phases (readiness, upgrade, discovery), phantom nodes, two-phase gopls warmup, multi-module Go support, per-symbol timeout, performance characteristics.
retrieval-pipeline.md	Full retrieval reference: keyword extraction, 5-channel RRF seed fusion, RWR (parameters, edge weights, adjacency cache), HITS, scoring formula, gap-fill seeds, budget packing, session/task memory.
embedding-reranker.md	Embedding architecture: both re-ranker and gap-fill confirmed neutral on cold start (session 23). nomic-embed-text model, pure Go ONNX, SQLite vector cache. Infrastructure preserved, disabled by default.
edge-types.md	Full catalog of 38 edge types with RWR weights, categories, and provenance.
equivalence-classes.md	Equivalence class system: 263 concepts across 4 layers (seed, universal, language-specific, framework).
context-packing.md	Context packing: density-ranked greedy knapsack, token estimation, persistent pack cache, staleness detection.
hooks-integration.md	Git hooks integration: post-commit, post-checkout, pre-push hooks for daemon change detection.
wire-formats-guide.md	Practical guide to wire format usage and integration.