Most AI memory systems are a vector store with a chat wrapper. Embed the text, store the vector, retrieve by cosine similarity, inject into the prompt. The retrieval is stateless. The ranking is fixed. The system never learns whether what it retrieved actually helped.
We built something different. This is the engineering story.
Two processes, one memory pool
KongCode runs as two cooperating processes. A long-lived daemon owns the graph database, the embedding model, and the learned reranker, all in one memory pool. A thin per-session client forwards calls to the daemon over a Unix domain socket using JSON-RPC 2.0. Multiple Claude Code sessions share the same daemon. One BGE-M3 model in RAM, not N copies.
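To make the client side concrete, here is a minimal sketch of one JSON-RPC 2.0 call over a Unix domain socket. The newline-delimited framing, the socket path, and any method name you pass (for example a hypothetical `memory.search`) are assumptions for illustration, not KongCode's actual wire format.

```python
import json
import socket


def build_request(method: str, params: dict, request_id: int) -> bytes:
    """Encode a JSON-RPC 2.0 request as one newline-terminated line (assumed framing)."""
    payload = {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    return (json.dumps(payload) + "\n").encode("utf-8")


def parse_response(line: bytes, request_id: int):
    """Decode a JSON-RPC 2.0 response; return its result or raise on an error object."""
    message = json.loads(line.decode("utf-8"))
    if message.get("jsonrpc") != "2.0" or message.get("id") != request_id:
        raise ValueError("unexpected JSON-RPC response")
    if "error" in message:
        raise RuntimeError(message["error"].get("message", "JSON-RPC error"))
    return message["result"]


def call(sock_path: str, method: str, params: dict, request_id: int = 1):
    """Thin client: connect to the daemon socket, send one request, read one reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(sock_path)
        sock.sendall(build_request(method, params, request_id))
        with sock.makefile("rb") as reader:
            return parse_response(reader.readline(), request_id)
```

The point of the split is that the client stays this small: all state, including the embedding model, lives on the other end of the socket.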
The daemon survives plugin updates, MCP restarts, and Claude Code crashes. When a newer version connects, the old daemon waits for active sessions to finish before handing off. When no clients are attached, it idles out after 6 seconds and frees RAM. This lifecycle handling alone goes beyond what most tutorials cover.
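The idle-out rule can be sketched as a small tracker that counts attached clients and reports when the timeout has elapsed. The class and method names are hypothetical, and the injectable clock is only there to make the logic testable; the real daemon's bookkeeping may look quite different.

```python
import time


class IdleTracker:
    """Tracks attached clients and reports when the daemon may exit (illustrative)."""

    def __init__(self, idle_timeout_s: float = 6.0, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self.clock = clock
        self.clients = 0
        self.idle_since = clock()  # no clients yet, so idle from startup

    def attach(self) -> None:
        """A session connected: the daemon is no longer idle."""
        self.clients += 1
        self.idle_since = None

    def detach(self) -> None:
        """A session disconnected: start the idle timer if it was the last one."""
        self.clients = max(0, self.clients - 1)
        if self.clients == 0:
            self.idle_since = self.clock()

    def can_hand_off(self) -> bool:
        """An old daemon hands off to a newer one only once no sessions remain."""
        return self.clients == 0

    def should_exit(self) -> bool:
        """True once no client has been attached for the full timeout."""
        if self.idle_since is None:
            return False
        return self.clock() - self.idle_since >= self.idle_timeout_s
```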
The graph
SurrealDB 3.0, running embedded. 37 tables across five structural pillars: agent, project, task, artifact, and concept. Nine knowledge tables store everything the system learns: episodic memories, skills, reflections, causal chains, monologue traces, identity chunks, conversation turns, core directives, and session metadata. Three tables track retrieval quality. Four tables handle identity and soul graduation. Seven handle system maintenance.
28 relation tables connect them. The edge vocabulary is not arbitrary. Five structural edges (decomposes, elaborates, contextualizes, enables, extends). Six mechanism edges (explained_by, prerequisite_for, mechanism_for, identification_for, supported_by, necessitates). Five tension edges (contrasts_with, tempered_by, fails_when, complemented_by, corrects). Five implication edges. Three provenance edges (derived_from, cites, supersedes). Plus five more for turn-level wiring: responds_to, tool_result_of, mentions, about_concept, artifact_mentions.
Every edge has declared IN/OUT type constraints. A schema-edge integrity guard validates every graph relation against its declared types at PR time. The invariants are enforced, not hoped for.
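A guard of this kind can be sketched in a few lines. The edge names below come from the vocabulary above, but the IN/OUT table pairings, the `table:id` record format check, and the function names are assumptions made for illustration.

```python
# Hypothetical IN/OUT constraints: edge name -> (allowed source tables, allowed target tables).
EDGE_TYPES = {
    "responds_to": ({"turn"}, {"turn"}),
    "about_concept": ({"turn", "memory"}, {"concept"}),
    "supersedes": ({"concept"}, {"concept"}),
}


def table_of(record_id: str) -> str:
    """Return the table part of a 'table:id' record identifier."""
    table, sep, _ = record_id.partition(":")
    if not sep or not table:
        raise ValueError(f"malformed record id: {record_id!r}")
    return table


def validate_edge(edge: str, source_id: str, target_id: str) -> list[str]:
    """Return a list of violations; an empty list means the relation is valid."""
    if edge not in EDGE_TYPES:
        return [f"unknown edge type: {edge}"]
    allowed_in, allowed_out = EDGE_TYPES[edge]
    problems = []
    source_table = table_of(source_id)
    target_table = table_of(target_id)
    if source_table not in allowed_in:
        problems.append(f"{edge}: source table {source_table!r} not allowed")
    if target_table not in allowed_out:
        problems.append(f"{edge}: target table {target_table!r} not allowed")
    return problems
```

Run against every relation a change touches, a check like this fails the build whenever an edge connects tables it was never declared for.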
Embeddings and reranking
BGE-M3 runs locally in the daemon via node-llama-cpp native bindings. 1024-dimensional vectors. GGUF-quantized, approximately 420MB on disk. A persistent L2 embedding cache (SHA256-keyed, model-version-aware) survives daemon restarts so the same text is never re-embedded. Seven HNSW indexes cover turns, concepts, memories, identity chunks, artifacts, monologues, and skills.
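The cache idea is simple enough to sketch: hash the model version together with the text, and only call the model on a miss. SQLite as the backing store, the float32 blob encoding, and the names `cache_key` and `EmbeddingCache` are assumptions here; the article only states that the cache is persistent, SHA256-keyed, and model-version-aware.

```python
import hashlib
import sqlite3
import struct


def cache_key(text: str, model_version: str) -> str:
    """SHA-256 over model version and text, so a model change invalidates old entries."""
    digest = hashlib.sha256()
    digest.update(model_version.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()


class EmbeddingCache:
    """Persistent text -> vector cache (illustrative; storage layout is assumed)."""

    def __init__(self, path: str, model_version: str):
        self.model_version = model_version
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS embeddings (key TEXT PRIMARY KEY, vector BLOB NOT NULL)"
        )

    def get_or_embed(self, text: str, embed) -> list:
        """Return the cached vector, or call embed(text) once and store the result."""
        key = cache_key(text, self.model_version)
        row = self.db.execute("SELECT vector FROM embeddings WHERE key = ?", (key,)).fetchone()
        if row is None:
            vector = embed(text)
            blob = struct.pack(f"<{len(vector)}f", *vector)  # little-endian float32
            self.db.execute("INSERT INTO embeddings (key, vector) VALUES (?, ?)", (key, blob))
            self.db.commit()
        else:
            blob = row[0]
        return list(struct.unpack(f"<{len(blob) // 4}f", blob))
```

Pointing `path` at a file on disk is what lets entries outlive a daemon restart.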
The reranker is where it gets interesting. Most systems use fixed weights for retrieval scoring: recency times importance times similarity, with coefficients someone tuned by hand. We started there. Then we replaced it.
ACAN (Attentive Cross-Attention Network) is a 130K-parameter learned reranker. It takes 7 input features per candidate: an attention logit computed from projected query and key embeddings, recency, importance, access count, graph-neighbor bonus, historical utilization, and reflection boost. It outputs a single score. The attention matrices are 1024x64 (query and key projections), plus a 7-element weight vector and a bias scalar.
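The shapes above are enough to sketch the forward pass for a single candidate in plain Python. The 1/sqrt(d) scaling of the attention logit, the linear combination of features, and the function names are assumptions; the article specifies the matrices and the seven features but not the exact arithmetic.

```python
import math

EMBED_DIM = 1024   # BGE-M3 vector size (from the article)
ATTN_DIM = 64      # projection width (from the article)
NUM_FEATURES = 7   # attention logit plus six scalar signals


def parameter_count(embed_dim=EMBED_DIM, attn_dim=ATTN_DIM, num_features=NUM_FEATURES) -> int:
    """Two projection matrices, one feature weight vector, one bias scalar."""
    return 2 * embed_dim * attn_dim + num_features + 1


def project(vector, matrix):
    """Multiply a length-d row vector by a d x k matrix stored as a list of rows."""
    width = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(width)]


def acan_score(query, key, w_q, w_k, signals, weights, bias):
    """Score one candidate; `signals` holds the six non-attention features in order."""
    q = project(query, w_q)
    k = project(key, w_k)
    # Scaled dot-product logit; the 1/sqrt(d) scaling is an assumption.
    logit = sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
    features = [logit, *signals]
    if len(features) != len(weights):
        raise ValueError("feature and weight lengths differ")
    return sum(w * f for w, f in zip(weights, features)) + bias
```

With the stated dimensions, `parameter_count()` comes to 131,080, which matches the roughly 130K figure.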
ACAN ships dormant. It activates after 5,000 labeled retrieval outcomes accumulate. It trains in a background worker thread: 80 epochs, 20% validation split, early stopping with patience of 8, learning rate decay. When a sibling MCP session retrains the model, other sessions hot-reload the weights by checking the file mtime. The reranker improves over time, from the system's own retrieval history, without any manual tuning.
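The activation gate and the stopping rule can be sketched independently of the model itself. The constants are the ones quoted above; the function names, the `run_epoch` callback, and the unshuffled tail split are assumptions for illustration.

```python
ACTIVATION_THRESHOLD = 5_000  # labeled retrieval outcomes needed before training
MAX_EPOCHS = 80
PATIENCE = 8


def should_train(labeled_outcomes: int) -> bool:
    """The reranker stays dormant until enough labeled outcomes exist."""
    return labeled_outcomes >= ACTIVATION_THRESHOLD


def split_train_validation(samples, validation_fraction=0.2):
    """Hold out the last 20% for validation (assumes samples are already shuffled)."""
    cut = len(samples) - round(len(samples) * validation_fraction)
    return samples[:cut], samples[cut:]


def train_with_early_stopping(run_epoch, max_epochs=MAX_EPOCHS, patience=PATIENCE):
    """run_epoch(epoch) trains one epoch and returns the validation loss.

    Stops after `patience` consecutive epochs without improvement and
    returns (best_epoch, best_loss).
    """
    best_epoch, best_loss = -1, float("inf")
    for epoch in range(max_epochs):
        loss = run_epoch(epoch)
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_loss
```

In a real training loop, `run_epoch` would also apply the learning rate decay and checkpoint the weights whenever the validation loss improves.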
The extraction pipeline
Every session generates raw conversation. The daemon extracts structured knowledge from it in the background, triggered every ~4K tokens or 3 turns. Nine extraction types, quality-gated:
- Causal chains: cause-effect patterns from debugging sessions
- Monologue traces: doubts, insights, tradeoffs, realizations. Episodic reasoning moments.
- Corrections: user correcting the agent. Highest signal in the system.
- Concepts: technical facts worth remembering
- Decisions: choices with rationale
- Preferences: user workflow and style signals
- Artifacts: files created, modified, or read
- Skills: multi-step procedures that worked
- Resolved memories: the daemon marks issues done when they are mentioned as fixed
Low-confidence extractions are skipped. The same conversation may yield different extractions depending on signal strength. Volume is not the goal. Fidelity is.
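The trigger and the quality gate described in this section can be sketched as follows. The token and turn budgets come from the text; the 0.6 confidence threshold, the "whichever comes first" reading of the trigger, and the dictionary shape of a candidate are assumptions.

```python
TOKEN_BUDGET = 4_000   # "~4K tokens" from the article
TURN_BUDGET = 3
MIN_CONFIDENCE = 0.6   # hypothetical threshold; the real gate is not specified


class ExtractionTrigger:
    """Fires when either the token budget or the turn budget is reached."""

    def __init__(self, token_budget=TOKEN_BUDGET, turn_budget=TURN_BUDGET):
        self.token_budget = token_budget
        self.turn_budget = turn_budget
        self.tokens = 0
        self.turns = 0

    def record_turn(self, token_count: int) -> bool:
        """Record one turn; return True (and reset) when an extraction pass is due."""
        self.tokens += token_count
        self.turns += 1
        if self.tokens >= self.token_budget or self.turns >= self.turn_budget:
            self.tokens = 0
            self.turns = 0
            return True
        return False


def quality_gate(candidates, min_confidence=MIN_CONFIDENCE):
    """Keep only extractions whose confidence clears the threshold."""
    return [c for c in candidates if c["confidence"] >= min_confidence]
```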
Decay and correction
Every concept has a stability field. Default 1.0. When a belief is superseded by a correction, the old concept's stability decays by a factor of 0.4 with a floor of 0.15. It does not get deleted. It loses retrieval priority and stops competing with the correction. The original belief stays queryable. You can trace why the system used to believe what it believed.
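As a minimal sketch, assuming "decays by a factor of 0.4" means the stability is multiplied by 0.4 on each supersession, the update is a single clamped multiplication. The function name is hypothetical.

```python
DECAY_FACTOR = 0.4      # from the article; multiplicative reading is an assumption
STABILITY_FLOOR = 0.15  # from the article


def decay_stability(stability: float) -> float:
    """Apply one supersession: multiply by the factor, never dropping below the floor."""
    return max(STABILITY_FLOOR, stability * DECAY_FACTOR)
```

Under this reading, a concept starting at the default 1.0 drops to 0.4 after one correction, to 0.16 after a second, and then stays at the 0.15 floor.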
Memories that the system thinks might be useful but the user ignores get deprioritized on a Fibonacci schedule. Each ignore pushes the next surfacing attempt further out (1, 1, 2, 3, 5 intervals). User engagement resets the counter to zero. The system learns which knowledge is worth resurfacing and which is noise.
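The backoff can be sketched with a counter and a Fibonacci lookup. What counts as one "interval" (a session, a turn, a time window) is not stated, so it is left abstract here; the class and function names are hypothetical.

```python
def fibonacci_interval(ignore_count: int) -> int:
    """Intervals to wait after `ignore_count` consecutive ignores: 1, 1, 2, 3, 5, ..."""
    if ignore_count < 1:
        raise ValueError("ignore_count must be at least 1")
    a, b = 1, 1
    for _ in range(ignore_count - 1):
        a, b = b, a + b
    return a


class SurfacingSchedule:
    """Tracks consecutive ignores for one memory (illustrative)."""

    def __init__(self):
        self.ignore_count = 0

    def on_ignored(self) -> int:
        """The user ignored the memory; return how many intervals to wait next."""
        self.ignore_count += 1
        return fibonacci_interval(self.ignore_count)

    def on_engaged(self) -> None:
        """The user engaged with the memory; reset the backoff."""
        self.ignore_count = 0
```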
Zero cloud on the read path
Embedding: local. Reranking: local. Graph queries: local. Extraction triggers: local. The only external calls are to the LLM provider during the extraction pipeline, because extraction requires language understanding. Every read, every retrieval, every rerank runs on your machine with no network dependency.
98.2% Recall@5 on LongMemEval. 728 tests passing. The full architecture is on github.com/42U/kongcode.