ADR: Hierarchical Merkle Tree
Date: 2026-05-18 (updated 2026-05-19)
Status: Shipped (Phases 1-3, Phase 4 partially shipped). Implementation extracted to merkle-strata library (v0.4.0).
Impact: Foundational. Changes the role of the Merkle tree from integrity mechanism to performance architecture.
Context
knowing's identity since day one has been content-addressed: every node, edge, and snapshot is a SHA-256 hash. Snapshots are Merkle roots over sorted edge hashes. This gave us integrity, deterministic identity, and efficient equality checks.
But the original Merkle tree was flat: sort all edge hashes, build a binary tree, compare roots. Diffing two snapshots required O(edges) work (building hash sets and scanning for differences). Caching query results was keyed to the global root, so any change anywhere invalidated everything.
The flat model scaled linearly with graph size and couldn't answer the question every downstream system actually asks: "did the part I care about change?"
Decision
Replace the flat Merkle tree with a hierarchical structure organized by semantic boundaries:
repo root
package roots (one per Go/TS/Python package)
edge-type roots (calls, imports, implements, references, throws)
edge leaf hashes
The repo root is still a single hash (backward compatible). But now intermediate roots exist at every level, and each root can be compared independently.
Consequences
The Merkle tree becomes the query engine, not just an integrity check.
Before: "Did anything change?" (compare one root, then scan all edges to find what) After: "Did package X change?" (compare one package root). "Did call edges change?" (compare one edge-type root). "Is my cached blast_radius still valid?" (compare the subgraph root for the relevant packages).
Performance: - Diff: O(packages) with semantically meaningful output (package names, edge types). Scales with edge count: 249x at 10K, 516x at 50K, 565x at 100K edges (vs flat linear scan). Validated on Grafana (249K edges, 3552 packages): diff remains microseconds. - Subgraph root lookup: 65ns (knowing), 1.5us (Grafana scale, 10 packages). - Build cost: 1.4-1.7x slower than flat construction (additional intermediate root computation). Amortized over every subsequent diff and cache lookup.
What this unlocked (shipped):
- Content-addressed context packs: ContextBlock.PackRoot = hash(task_normalized + sorted(selected_node_hashes)). Same task + same graph = same PackRoot. Enables cache lookup, dedup (93-99% byte savings), cross-session replay.
- Community Merkle roots: each Louvain community carries a root over its packages. Disjoint roots = disjoint work.
- Subgraph caching (93x speedup): context_for_task, blast_radius, test_scope keyed to package subgraph roots. Unchanged code = cached result.
- Daemon invalidation: file save changes one package; only that package's caches invalidated.
- Semantic change classification: behavioral, structural, runtime_drift, metadata_only.
- Merkle proofs (shipped): knowing prove (72us), knowing verify (1.2us), knowing prove-absent. Offline verifiable.
- Merkleized feedback validity (v0.5.0): feedback records store SubgraphRoot; expires automatically when code changes.
- Generation numbers on snapshots: O(1) ancestry checks without chain walking.
- Incremental RWR cache (session 26): per-package Merkle roots in RWR cache keys. Unchanged packages keep cached walks. Django cold 3.9s -> warm 1.9s (2x). Package roots persisted to notes table during indexing.
- Vocab association expiration (session 26): per-package roots anchored to learned vocab associations. When a package changes, only that package's associations expire. Same PackageRoots infrastructure.
Remaining (planned): - Federated sync: exchange roots, descend only differing branches. - EWAH bitmaps for reachability (from git deep dive). - Bloom filters for per-snapshot package changes. - Delta context packing: Merkle diff of two context packs (80-90% token savings on subsequent queries).
What changed about knowing's identity:
Content-addressing was always the foundation. But it was primarily an integrity and identity mechanism: "prove this graph state is what you think it is." With the hierarchical tree, content-addressing becomes a computation architecture: "the structure of the identity itself determines what's cheap to compute." The hash tree doesn't just prove state; it organizes every downstream operation.
No competitor uses hierarchical Merkle trees over code relationship graphs. Most code intelligence tools don't even have content-addressed snapshots. The ones that do (if any) use flat hashes. The hierarchical structure is a moat because it requires the content-addressing to be architectural from the start; it can't be bolted on.
Implementation
internal/snapshot/hierarchical.go:HierarchicalTree,BuildHierarchicalTree(delegates togithub.com/blackwell-systems/merkle-stratav0.4.0 viastrata.BuildMultiLevelwithWithPrefix([]byte("merkle\x00"))),DiffHierarchicalTrees,DiffHierarchicalTreesWithOptions(withDiffOptions:PackageFilter,MaxChanges),SubgraphRoot,EdgeTypeRoot,ContextPackRootinternal/snapshot/merkle.go:BuildMerkleTreedelegates tostrata.Build;combineHashesretained for proof.go compatibilityinternal/snapshot/proof.go:binaryProofTreegenerates self-paired proof steps for single-element top-level roots (matchingcomputeTreeRootdomain separation in merkle-strata v0.4.0)internal/snapshot/manager.go:ComputeSnapshotbuilds the hierarchical tree (flat tree was dropped);extractPackagePathnow returns an error on malformed names; setsSnapshot.Generationfor O(1) ancestry;persistPackageRootsstores htree.PackageRoots to notes table;LoadPackageRootsreads them back;PackageRootForSymbolextracts package from QNinternal/snapshot/gc.go:GarbageCollectFullwith reachability sweep andGCStatsreturn typeinternal/snapshot/verify.go: integrity verification functions used byknowing fsckinternal/store/sqlite.go: in-process LRU cache (50K entries) onGetNode/GetEdge;IntegrityCheckmethod forPRAGMA integrity_checkinternal/store/gc.go:DeleteNodesNotIn,DeleteEdgesNotInimplementationsinternal/daemon/lockfile.go: daemon lockfile to prevent multiple instancesinternal/types/types.go: hash domain prefixes (node\0,edge\0,snapshot\0,merkle\0)internal/types/verify.go:VerifyNodeHash,VerifyEdgeHashinternal/community/:Algorithminterface, registry, Louvain and label propagation implementationscmd/knowing/fsck.go:knowing fsckCLI commandbench/merkle-diff/: benchmark harness with auto-generatedFINDINGS.mdandFINDINGS-context-packs.mddocs/architecture/merkle-algorithms.md: full specification of 13 algorithms across 4 phases
Alternatives Considered
-
Keep flat tree, add a separate package-level index. Would require maintaining two data structures that can drift. The hierarchical tree is one structure that serves both purposes.
-
Use database indexes instead of Merkle trees for caching. Database queries are O(log N) per lookup; Merkle root comparison is O(1). At scale, the constant factor matters enormously for hot-path operations like "is this cached?"
-
Skip the tree entirely, use git commit hashes as cache keys. Loses semantic granularity. A commit that touches one file invalidates everything. The Merkle tree knows which packages were affected.