What KongCode learned between 0.4.0 and 0.5.0

KongCode v0.4.0 was the milestone tag: subagent tracking, MCP tools, bootstrap migration, 446 passing tests. The plumbing existed. The feedback loops did not. Between v0.4.0 and v0.5.0 we shipped six patch releases, and almost none of them added a feature. They unbroke a different way the metrics were lying. 0.5.0 is the first version where KongCode can grade itself.

Six tags in two days, one per fix. Here is what each one taught us, in order.

v0.4.1: be present before you measure

First bug after the milestone tag: fresh installs were not registering with Claude Code at all. The MCP main() ran initialization before accepting the stdio transport, so the handshake landed on a closed pipe and the plugin silently failed to attach. Existing dev installs worked because they were already attached from a prior session. Any new user got a no-op.

Fix in PR #4: connect the transport, then run initialize. One call-order swap. The lesson is dumber than the patch: a learning system that does not register with its host learns nothing, and looks fine while doing it. Test counts are not enough. You need at least one fresh-install integration check that proves the plugin actually attached.

v0.4.2: the input pipe was empty

Two production hook handlers, UserPromptSubmit and PostToolUse, were reading the wrong fields off the payload. The code asked for payload.user_prompt; the actual key was payload.prompt. Same story for tool_result versus payload.tool_response. The handlers ran without errors and wrote rows. The rows had nothing in them.

At the same time, the Stop hook never fed assistant transcripts into evaluateRetrieval. The function existed. Nobody was calling it with real text. We added a transcript reader that pulls the assistant turn from the Claude Code JSONL and routes it into the evaluator. Until that wire was connected, every retrieval outcome was a zero.

This is the kind of bug a unit test cannot catch, because the unit test stubs out the payload. The handler reads payload.prompt in the test and the test passes. The handler reads payload.user_prompt in production and silently drops it. Hook contracts need a living fixture, not a hand-written one.

v0.4.3: stop counting the dead

Each Claude Code restart left zombie sibling MCPs running. They held the same SQLite file open and wrote alongside the live process. Metrics looked plausible, then drifted, because two or three half-alive instances were stamping rows on top of each other.

v0.4.3 added a stale-MCP reaper: on startup, sweep zombie siblings with SIGTERM. Opt-out flag KONGCODE_KEEP_SIBLINGS=1 for users who run multiple intentional instances. After the reaper, the unit of measurement is one process, not a wavefront of half-dead ones. Without that guarantee, no downstream aggregation is honest.

v0.4.4: a quality signal that meant something

The retrieval-utility score was using a max-based formula: take the strongest of three lexical match types and call that the score. It was loud and brittle. A single trigram hit dragged the score up; a near miss across two of the three got nothing. We replaced it with a weighted blend:

util = 0.6 * (keyTerm + trigram)
     + 0.4 * unigram
     + 0.2 if tool_success

Higher weight on the strongest lexical signal, partial credit for the weaker one, and a small additive bonus when the retrieval led to a successful tool call. Same inputs, calmer score, more correlated with downstream behavior.

The second half of v0.4.4 was windowing. The all-time retrieval-utility aggregate was being dragged down by data from before the v0.4.2 hook fix, when most outcomes were poisoned zeros. We narrowed the rolling aggregate to 14 days. Pre-fix legacy rows still exist in storage; they no longer get to vote. Quality signals are perishable. Old, broken data should not haunt the dashboard forever.

v0.4.5: count the tokens you actually used

orchestrator_metrics tracks per-turn cost: input tokens, output tokens, cache reads, cache writes. Every field was zero. The schema was right. The writer was running. Nobody was reading the transcript.

v0.4.5 added readTurnTokenUsage(), which takes a transcript_path, walks the assistant's turn record in the JSONL, and pulls actual token counts. After the patch, the orchestrator can finally answer “how much did this turn cost.” Before it, every cost-per-turn chart on the introspect view was a flat line at zero, and no alert would ever fire.

v0.5.0: roll it up, alert on the anomalies

With the input pipe wired (0.4.2), the process unit clean (0.4.3), the score honest (0.4.4), and the cost accounting real (0.4.5), v0.5.0 finally adds the layer this whole arc was for.

Daily rollups. A new orchestrator_metrics_daily table aggregates per-turn metrics into per-day rows. Trends become a SQL SELECT over fourteen rows, not a scan over thousands of turns.
Anomaly-only health injection. A detectAnomaliesroutine runs four absolute-threshold detectors against the rollups and only injects a health alert into the agent's context when a flag fires. No flags, no noise. The agent stops getting paged about itself unless something is actually off.
A “trends” introspect action. Surfaces the rolled-up history on demand without polluting the steady-state context.

The 0.5.0 commit is small. It only works because the five patches in front of it are honest.

The pattern

A learning system's hardest bug is the one that makes the system think it is learning when it isn't. Each fix in this arc was the same shape: the code looked fine, the writes ran, the tests passed, and the underlying signal was empty or poisoned. The only way to find them was to insist that the signal mean something downstream and follow the contradictions back up.

Six versions, six different ways the metrics were lying. The full trail (commits, diffs, test counts, the formula change, the reaper, the rollups) is on github.com/42U/kongcode. v0.5.0 is the first one we trust to grade itself.