Data Model

knowing uses SQLite in WAL mode as its sole storage backend. Every table serves one of two roles: identity layer (content-addressed entities whose hashes form the Merkle tree) or metadata layer (derived state that never affects Merkle computation).

Storage Architecture

SQLite database (one per repo, at ~/.knowing/repos/<safe-name>.db)
├── Identity layer (affects Merkle tree)
│   ├── repos          canonical repo identity
│   ├── files          content-addressed file records
│   ├── nodes          content-addressed symbol declarations
│   ├── edges          content-addressed relationships
│   ├── edge_events    append-only mutation log
│   └── snapshots      Merkle root chain
│
├── Metadata layer (never affects Merkle tree)
│   ├── graph_notes    general-purpose key/value annotations
│   ├── feedback       symbol usefulness signals
│   ├── task_memory    passive retrieval learning (disabled)
│   ├── vocab_associations  learned keyword -> symbol mappings
│   ├── route_symbols  runtime route-to-symbol mappings
│   └── schema_version migration tracking
│
└── Search layer
    ├── nodes_fts          FTS5 full-text index over nodes
    └── nodes_fts_content  backing content table for FTS

The distinction matters: modifying a note, recording feedback, or updating task memory never changes any hash and never invalidates any Merkle proof or cache key. The identity layer is the audit surface; the metadata layer is the learning surface.

Tables

repos

Tracked repositories. One row per indexed repo.

CREATE TABLE repos (
    repo_hash    BLOB PRIMARY KEY,   -- SHA-256(canonical repo URL)
    repo_url     TEXT NOT NULL,       -- canonical URL or local path
    last_commit  TEXT,                -- git commit hash from most recent index
    last_indexed INTEGER              -- unix timestamp
);

files

Content-addressed source files. Hash includes repo, path, and content.

CREATE TABLE files (
    file_hash    BLOB PRIMARY KEY,   -- SHA-256(repo_hash || path || content_hash)
    repo_hash    BLOB NOT NULL REFERENCES repos(repo_hash),
    path         TEXT NOT NULL,       -- relative to repo root
    content_hash BLOB NOT NULL        -- SHA-256(raw file bytes)
);

nodes

Content-addressed symbol declarations. Identity depends on (repo, package, name, kind), not physical location.

CREATE TABLE nodes (
    node_hash      BLOB PRIMARY KEY, -- SHA-256("node\0" || repo || package || name || kind)
    file_hash      BLOB NOT NULL REFERENCES files(file_hash),
    qualified_name TEXT NOT NULL,     -- "repoURL://package/path.SymbolName"
    kind           TEXT NOT NULL,     -- function, type, method, interface, const, var, service, route, external, file, package
    line           INTEGER,
    signature      TEXT,              -- function/method signature
    doc            TEXT,              -- extracted doc comment
    last_author    TEXT,              -- git blame: last author
    last_commit_at INTEGER,           -- git blame: last commit timestamp
    coverage_pct   REAL,              -- test coverage percentage
    indexed_at     INTEGER DEFAULT 0  -- unix timestamp of last index run
);

Moving a function between files does not change its hash (identity is logical, not physical). Renaming it creates a new node (old node's edges become stale, detectable via snapshot diff).

Node kinds: function, type, method, interface, const, var are the standard source symbol kinds. service is used for microservice declarations (Protobuf services, Docker Compose services). route is used for HTTP/API route declarations (OpenAPI, Serverless, CloudFormation). external with file_hash=EmptyHash are phantom nodes representing stdlib or external symbols (created by the Go tree-sitter extractor for inferred stdlib targets and by the LSP enricher for remaining dangling targets). file and package are structural nodes emitted by the Go tree-sitter extractor for file-level and package-level declarations. Phantom nodes make the graph complete: every edge has both a source and a target, and knowing fsck reports zero dangling errors on a correctly indexed repo.

edges

Content-addressed relationships. Identity includes provenance, so the same structural relationship observed by different methods produces distinct edges.

CREATE TABLE edges (
    edge_hash         BLOB PRIMARY KEY, -- SHA-256("edge\0" || source || target || type || provenance)
    source_hash       BLOB NOT NULL REFERENCES nodes(node_hash),
    target_hash       BLOB NOT NULL REFERENCES nodes(node_hash),
    edge_type         TEXT NOT NULL,    -- calls, imports, implements, references, etc.
    confidence        REAL NOT NULL DEFAULT 1.0,
    provenance        TEXT NOT NULL DEFAULT 'ast_resolved',
    callsite_line     INTEGER,          -- source location of the call/reference
    callsite_col      INTEGER,
    callsite_file     TEXT,
    observation_count INTEGER NOT NULL DEFAULT 0,  -- total observations in current window (0 for static edges)
    last_observed     INTEGER NOT NULL DEFAULT 0,  -- unix timestamp of last observation (0 for static edges)
    indexed_at        INTEGER DEFAULT 0
);

Edge types (38 total): calls, imports, implements, references, handles_route, depends_on, deploys, exposes, configures, publishes, subscribes, connects_to, throws, extends, overrides, decorates, owned_by, tests, authored_by, documents, consumes_endpoint, implements_rpc, consumes_rpc, gated_by_flag, deployed_by, tested_by, runtime_calls, runtime_rpc, runtime_produces, runtime_consumes, contains, member_of, similar_to, co_tested_with, type_hint_of, accesses_field, reads_env, executes_process.

Provenance tiers (ordered by confidence): ast_resolved (1.0), scip_resolved (0.95), lsp_resolved (0.9), runtime_observed (0.8), ast_inferred (0.7), otel_trace (0.2-0.95 based on observation count).

edge_events

Append-only log of edge mutations. Powers snapshot diff and temporal queries.

CREATE TABLE edge_events (
    event_id      INTEGER PRIMARY KEY AUTOINCREMENT,
    edge_hash     BLOB NOT NULL,
    event_type    TEXT NOT NULL,     -- "added" or "removed"
    snapshot_hash BLOB NOT NULL,
    source_commit TEXT NOT NULL,
    indexer_ver   TEXT NOT NULL,
    timestamp     INTEGER NOT NULL,
    source_hash   BLOB,             -- full edge data for removed-edge diffs (migration 013)
    target_hash   BLOB,
    edge_type     TEXT,
    confidence    REAL,
    provenance    TEXT
);

The source_hash through provenance columns (migration 013) store full edge data so that removed-edge diffs work without joining back to the edges table (removed edges are deleted from edges). NULL for pre-migration events.

snapshots

Point-in-time graph state. Each snapshot is a hierarchical Merkle root computed from all edges, tied to a git commit. Snapshots form a singly-linked chain via parent_hash.

CREATE TABLE snapshots (
    snapshot_hash BLOB PRIMARY KEY, -- ComputeSnapshotHash(hierarchical_merkle_root)
    parent_hash   BLOB REFERENCES snapshots(snapshot_hash),
    repo_hash     BLOB NOT NULL REFERENCES repos(repo_hash),
    commit_hash   TEXT NOT NULL,
    timestamp     INTEGER NOT NULL,
    node_count    INTEGER NOT NULL,
    edge_count    INTEGER NOT NULL,
    generation    INTEGER NOT NULL DEFAULT 0  -- chain depth: parent.Generation + 1 (migration 015)
);

Generation numbers enable O(1) ancestry checks: "is snapshot A an ancestor of B?" reduces to A.Generation < B.Generation when A is on B's chain. This prunes chain walks during diff and GC operations.

graph_notes

General-purpose metadata layer (Phase 3 F1). Attaches key/value pairs to any content-addressed object without affecting Merkle computation. Composite primary key: one value per key per object.

CREATE TABLE graph_notes (
    object_hash BLOB    NOT NULL,   -- any hash: node, edge, snapshot, community, pack root
    key         TEXT    NOT NULL,   -- "community_id", "context_pack", "quality_score", etc.
    value       TEXT    NOT NULL,   -- opaque to the store; callers may use JSON
    updated_at  INTEGER NOT NULL,
    PRIMARY KEY (object_hash, key)
);

Current uses: - community_id: persisted community assignments for incremental detection - context_pack: persisted context blocks for cross-session replay - rwr_cache: cached RWR walk results keyed by seed+weight+Merkle root hash (incremental RWR, session 26) - package_roots: per-package Merkle roots persisted during indexing for vocab/feedback expiration (session 26) - quality_score: node quality annotations for ranking calibration

BatchPutNotes wraps multiple inserts in a single prepared-statement transaction (21x faster than individual PutNote calls). SaveChangedAssignments writes only the delta (5.0x e2e speedup).

feedback

Symbol usefulness signals from agent sessions. As of migration 014, feedback records store the neighborhood_root (SubgraphRoot of the symbol's package at feedback time) to enable merkleized expiration: feedback becomes invalid when the symbol's package changes (detected via SubgraphRoot mismatch).

CREATE TABLE feedback (
    id               INTEGER PRIMARY KEY AUTOINCREMENT,
    symbol_hash      BLOB NOT NULL,
    session_id       TEXT NOT NULL,
    useful           INTEGER NOT NULL,  -- 1 = relevant, 0 = noise
    timestamp        INTEGER NOT NULL,
    neighborhood_root BLOB,             -- SubgraphRoot of symbol's package (migration 014)
    keyword_cluster   BLOB              -- keyword cluster hash for scoped feedback (migration 020)
);
CREATE INDEX idx_feedback_neighborhood ON feedback(neighborhood_root);
CREATE INDEX idx_feedback_cluster ON feedback(keyword_cluster);

The keyword_cluster column (migration 020) scopes feedback to keyword clusters, preventing cross-task interference. The cluster is derived from sorted primary keywords of the task. Noise demotion for "checkout" queries doesn't affect "order" queries.

vocab_associations

Learned keyword -> symbol associations from agent usage (migration 021). When an agent uses a symbol after a context_for_task query, the association is recorded. After 2+ observations, the association becomes a learned equivalence class with soft RRF injection (confidence-weighted, not forced).

CREATE TABLE vocab_associations (
    keyword        TEXT NOT NULL,
    symbol_name    TEXT NOT NULL,
    symbol_hash    BLOB NOT NULL,
    count          INTEGER DEFAULT 1,
    last_seen      INTEGER NOT NULL,
    subgraph_root  BLOB,              -- per-package Merkle root at recording time (migration 022)
    UNIQUE(keyword, symbol_hash)
);
CREATE INDEX idx_vocab_keyword ON vocab_associations(keyword);

The subgraph_root column (migration 022) ties each association to the symbol's package state at recording time. When querying, associations where subgraph_root doesn't match the current package Merkle root are filtered out. This provides per-package expiration: when package A changes, only associations for symbols in package A expire.

task_memory (disabled)

Historical passive retrieval learning. Confirmed neutral in session 24 (task memory records keywords -> symbols that the pipeline already finds). Creation and recording disabled in the MCP server. Table preserved for backward compatibility.

CREATE TABLE task_memory (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    keywords    TEXT NOT NULL,
    symbol_hash BLOB NOT NULL,
    score       REAL NOT NULL,
    timestamp   INTEGER NOT NULL
);

nodes_fts

FTS5 full-text index for BM25 search over symbol names, docstrings, and signatures. BM25 weights: symbol_name=10x, concepts=5x, file_path=4x, qualified_name=3x, doc=3x, signature=1x.

CREATE VIRTUAL TABLE nodes_fts USING fts5(
    symbol_name, concepts, qualified_name, signature, file_path, doc,
    content='nodes_fts_content',
    content_rowid='rowid',
    tokenize="unicode61 tokenchars '_' remove_diacritics 2"
);

The symbol_name column (migration 016) stores the terminal identifier extracted by extractSymbolName, which strips repo URL, package path, and file extension prefix. The concepts column (migration 017) stores CamelCase-split tokens from file names and parent directories (e.g., "commandLineParser.ts" becomes "command Line Parser commandLineParser"), bridging the vocabulary gap between developer terminology and symbol names. RebuildFTSForPackages scopes rebuild to changed packages (2.9x faster than full rebuild).

The FTS index is backed by a separate content table (nodes_fts_content) that maps rowids to node hashes for result lookup:

CREATE TABLE nodes_fts_content (
    rowid          INTEGER PRIMARY KEY AUTOINCREMENT,
    node_hash      BLOB NOT NULL,
    symbol_name    TEXT NOT NULL DEFAULT '',
    concepts       TEXT NOT NULL DEFAULT '',
    qualified_name TEXT NOT NULL,
    signature      TEXT NOT NULL DEFAULT '',
    file_path      TEXT NOT NULL DEFAULT ''
);

route_symbols

Runtime route-to-symbol mappings. Maps HTTP routes, RPC methods, and message queue topics to graph nodes, enabling the trace ingestor to resolve OpenTelemetry spans to graph symbols.

CREATE TABLE route_symbols (
    service_name  TEXT NOT NULL,
    route_pattern TEXT NOT NULL,
    node_hash     BLOB NOT NULL,
    mapping_type  TEXT NOT NULL,   -- "http", "rpc", "messaging"
    created_at    INTEGER NOT NULL,
    PRIMARY KEY (service_name, route_pattern, mapping_type)
);

Merkle Tree (Computed, Not Stored)

The hierarchical Merkle tree is computed in memory from the edge table, not stored in SQLite. Tree construction delegates to github.com/blackwell-systems/merkle-strata v0.4.0: BuildMerkleTree calls strata.Build and BuildHierarchicalTree calls strata.BuildMultiLevel. The merkle\0 hash domain prefix is passed via strata.WithPrefix. BuildHierarchicalTree constructs:

repo_root = merkle(sorted(package_roots))
  package_root = merkle(sorted(edge_type_roots for this package))
    edge_type_root = merkle(sorted(edge_hashes of this type in this package))
      leaf = edge_hash

This is recomputed on every ComputeSnapshot call. The tree lives in memory for the lifetime of the daemon process. Package roots and edge-type roots are keyed by string ("package_path" and "package_path:edge_type").

The tree is not stored because: 1. It's fast to compute (~3ms for 12K edges) 2. Storing it would add a sync problem (tree must match edges exactly) 3. The SubgraphCache and notes table handle persistence of derived results

When the tree needs to be larger than memory (lazy materialization), the roots and structure would move to SQLite. This is not needed at current scale.

Migration History

#	File	What it adds
001	initial_schema.sql	repos, files, nodes, edges, edge_events, snapshots, all indexes
002	add_dangling_edge_support.sql	(index for dangling edge queries)
003	add_callsite_columns.sql	callsite_line, callsite_col, callsite_file on edges
004	add_runtime_columns.sql	observation_count, last_observed on edges; route_symbols table
005	add_feedback_table.sql	feedback table
006	add_fts5_index.sql	nodes_fts virtual table + content table
007	add_node_doc.sql	doc column on nodes
008	add_task_memory.sql	task_memory table
009	add_blame_columns.sql	last_author, last_commit_at on nodes
010	add_coverage_column.sql	coverage_pct on nodes
011	add_indexed_at.sql	indexed_at on nodes and edges
012	add_notes.sql	graph_notes table
013	add_edge_event_data.sql	source_hash, target_hash, edge_type, confidence, provenance on edge_events (removed-edge diffs)
014	add_neighborhood_root.sql	neighborhood_root on feedback (merkleized expiration)
015	snapshot_generation.sql	generation column on snapshots (O(1) ancestry checks)
016	fts_symbol_name.sql	Adds symbol_name column to FTS content table; recreates FTS5 virtual table with 4 columns (symbol_name, qualified_name, signature, file_path)
017	fts_concepts_column.sql	Adds concepts column to FTS content table; stores CamelCase-split file/module names as searchable concepts; recreates FTS5 virtual table with 5 columns
018	fts_doc_column.sql	Adds doc column to FTS content/virtual table for docstring-based BM25 retrieval; recreates FTS5 with 6 columns (symbol_name, concepts, qualified_name, signature, file_path, doc)
019	add_embeddings.sql	Embeddings table for vector cache (keyed by node_hash + model)
020	add_feedback_cluster.sql	keyword_cluster column on feedback table for per-cluster scoping; prevents cross-task interference
021	add_vocab_associations.sql	vocab_associations table for learned keyword -> symbol mappings
022	add_vocab_subgraph_root.sql	subgraph_root column on vocab_associations for per-package Merkle expiration

Migrations run automatically on NewSQLiteStore. Each runs in its own transaction. Schema version is tracked in schema_version table (current version: 22). No rollback/down migrations.

Per-Repo Isolation

Each repository gets its own SQLite database at ~/.knowing/repos/<safe-name>.db. This means: - Community detection operates on one repo's data (no cross-repo noise) - RWR, HITS, BM25 scores are repo-scoped - Databases can be backed up, shared, or deleted independently - Cross-repo edges are planned via a separate resolution layer

The roster (~/.knowing/roster.json) maps repo paths to database paths.

Cross-Repo Edges

When repo A calls a function in repo B, the edge's target_hash points to a node that lives in repo B's database, not repo A's. This creates a "dangling edge" in repo A's graph: an edge whose target does not exist locally.

The resolution pipeline (internal/resolver/): 1. DanglingEdges() finds all edges whose target hash has no matching node. 2. AllRepos() lists all known repos. 3. For each dangling edge, search other repos' node tables for a matching qualified name. 4. If found, retarget the edge to the correct node hash in the foreign repo. 5. The edge is resolved; the cross-repo relationship is now explicit.

This works because content-addressing provides global identity without coordination: SHA-256("node\0" + repoURL + package + name + kind) produces the same hash regardless of which machine computes it. Two indexers running on different repos produce matching hashes for the same symbol.

Current state: - Cross-repo edges are resolved within a single knowing instance that has indexed multiple repos. - Each repo has its own database; the resolver queries across databases. - ModuleToRepoURL mapping (from go.mod) helps the Go extractor target cross-repo calls correctly at extraction time. - Federated sync (exchanging edges between separate knowing instances) is planned (Phase 4 roadmap).

Multi-Repo Indexing Workflow

# Register and index multiple repos
knowing add ./repo-a
knowing add ./repo-b

# Each gets its own database
# ~/.knowing/repos/repo-a.db
# ~/.knowing/repos/repo-b.db

# Cross-repo edges are resolved when both repos are indexed
# The MCP server can query across repos via cross_repo_callers tool

Identity Agreement

Two knowing instances that have never communicated will produce the same hash for the same symbol, as long as they use the same canonical repo URL. This is why canonicalization is core infrastructure: github.com/org/repo and https://github.com/org/repo.git must resolve to the same canonical identity for cross-repo edges to match.

ExtractPackagePath (the canonical package path extractor) and ComputeNodeHash (the canonical node hash) are the two functions that define cross-repo identity. Both are deterministic, and both use the "node\0" domain prefix to prevent cross-type collisions.

GraphStore Interface

All database access goes through types.GraphStore (39 methods). SQLiteStore is the sole implementation. The interface exists so: - Tests can use mock stores - Future backends (Pebble, remote) can implement the same interface - The daemon, MCP server, CLI, and context engine all consume the same abstraction

Non-interface methods on SQLiteStore (accessed via type assertion): - BatchPutNodes, BatchPutEdges, BatchPutFiles: bulk insert in single transaction - BatchPutNotes: bulk note insert (21x faster) - RebuildFTS, RebuildFTSForPackages: FTS index management - SearchBM25Nodes: full-text search - IntegrityCheck: PRAGMA integrity_check - UpdateNodeBlame, UpdateNodeCoverage: enrichment stamping - CommunitiesForNodes: batch community_id lookups - TruncateGraph: delete all nodes, edges, and edge_events (used by reindex) - DeleteRepoData: atomic eviction of all data for a repo (files, nodes, edges, edge_events, snapshots, feedback, task_memory, graph_notes) in a single transaction; returns DeleteRepoResult with counts of deleted rows per table - InvalidateCache: clears in-process node/edge caches - DB(): raw access for task memory and feedback queries

Why SQLite

Single file: the database IS the artifact. Copy it, share it, verify it.
WAL mode: concurrent readers during daemon indexing.
Embedded: no external service to configure, manage, or secure.
PRAGMA integrity_check: filesystem-level corruption detection built in.
Pure Go driver (modernc.org/sqlite): no CGo, cross-compiles to all platforms.
Fast enough: 98ms fsck on 7,224 nodes + 24,936 edges. 72us proof generation. 42ns cache lookup.

Performance Pragmas

On connection open, NewSQLiteStore sets:

Pragma	Value	Rationale
`journal_mode`	WAL	Concurrent readers, no blocking on writes
`synchronous`	NORMAL	Safe with WAL (fsync on checkpoint, not every commit)
`mmap_size`	256 MB	Memory-mapped I/O for read-heavy workloads
`cache_size`	64 MB (negative KB)	Large page cache for hot-path traversals
`busy_timeout`	5000 ms	Retry on lock contention instead of immediate SQLITE_BUSY
`temp_store`	MEMORY	Temp tables and indexes kept in memory