A 90-day field report on signing AI-collaborative code

A field report from the only person who has used MurphySig daily — me. Honest numbers, including the unflattering ones.

Who I am

I’m Kev Murphy. I build across edtech (Dyslexia.ai), AI tooling (M1K3, an on-device LLM assistant), maps (Cartogram), meditation (Rubin), and a tacit-knowledge platform (Clair). Mostly Kotlin Multiplatform and Swift. I’ve shipped a lot of AI-collaborative code in the last twelve months, and I started losing track of what I wrote, what Claude wrote, what we wrote together, and how confident I was when I wrote it.

That’s the problem MurphySig exists to solve. Not the only solution — git blame, ADRs, and good commit messages cover most of the same ground. MurphySig adds one thing those don’t: an in-context, machine-readable promise to your future collaborators (human and AI) about how confident you were and what you didn’t know.

The convention

A MurphySig is a comment block at the top of any file:

Signed: Kev + claude-opus-4-7, 2026-04-23
Format: MurphySig v0.4 (https://murphysig.dev/spec)

Context: Hotfix 9.0.5 — deferred MapView.onDestroy() to next frame to
narrow the race window between Maps SDK lite mode's posted Runnables
and bitmap recycling. The crash is entirely inside Google's obfuscated
code with zero app code in the stack trace.

Confidence: 0.6 - narrows the race but doesn't eliminate it; budget
devices with slower SoCs may still hit the window.
Open: Should we pursue the snapshot-to-ImageView refactor for the
style picker?

That’s it. No tooling required. The CLI (bin/sig) is a 526-line bash script that helps you author signatures and run a gallery over signed work, but the convention is just structured comments. AI assistants read them; humans read them; nothing breaks if you ignore them.

The empirical case

Before going wider, I tested whether the convention actually changes AI behaviour. Three sub-benchmarks, 198 AI calls + 186 judge calls, run 2026-04-18–19. Full methodology and per-case data at /benchmark.

Honesty (the strongest finding): A 4-line “Never Fabricate Provenance” rule, dropped into context, takes the rate at which AIs invent author/date/model attribution on unsigned files from 11% to 0%. Honest handling (admitting “no signature exists” with Prior: Unknown) goes from 11% to 100%. This is the load-bearing leg of MurphySig. It is also, as far as I can tell, a result that doesn’t yet have a clean prior in the literature.

Tacit knowledge (a real but smaller effect): Briefing coverage on unfamiliar signed code goes from 0.65 to 0.77 (+0.12), universal across 5 cases, with hedging dropping by 0.4 and reference rate at 93%. Models read signatures and use what they find.

Update (2026-06-24): this now replicates across six model families (mean +0.11), but I ran the control I should have run first — a length/content-matched plain comment with the same facts. It captures 80–94% of the gain. The benefit is the information you write down, not the MurphySig structure (which adds only 6–20%, judge-dependent). See /benchmark. I’d rather correct this in public than let the prettier version stand.

In-context learning (a null result that survived contact): I had hypothesised that confidence direction (0.3 vs 0.9) would polarise AI review behaviour — making them more sceptical on low-confidence code, more trusting on high. It doesn’t. Five cases, no measurable effect. I removed the claim from the spec in v0.4 (“The Narrowing”). It hurt.

Two wins, one null, one design commitment (zero-friction adoption) that doesn’t need a benchmark. That’s the honest picture.

The 90-day field report

Ninety days of my own commit activity, 2026-01-23 to 2026-04-23, GitHub-verified:

Repo	MurphySig commits	Total commits	Adoption rate
m1k3 (on-device AI assistant)	49	161	30%
gemba (factory tacit-capture)	34	191	18%
dyslexia-ai server (Django backend)	28	234	12%
gemba (sister fork)	11	114	10%
Cartogram (Android maps)	3	39	8%
dyslexia-ai-ios (iOS app)	1	21	5%
Total	126+	760	~17%

Roughly 1.5 MurphySig-touching commits per day, sustained, across most active codebases. The signatures I’ve valued most in retrospect:

Cartogram hotfix 9.0.5 — Confidence: 0.6, narrows the race but doesn't eliminate it. Six months from now, when the crash returns, that 0.6 is the lighthouse that says “you didn’t fix this, you mitigated it.”
m1k3 ADR-0001 — own-inference-library decision. Confidence: 0.82, HIGH on the decision, MEDIUM on execution timeline. The ADR alone wouldn’t have caught the calibration; the sig forces you to split it.
Rubin BinauralBeatsGenerator — Confidence: 0.85, ISO 226:2003 approximation verified. The sig is shorter than the file’s own KDoc but says something the KDoc doesn’t: I am that confident, not just functional.

What is honestly missing

Zero adoption outside my own repos. GitHub code search returns 0 .murphysig files anywhere I don’t control. After 14 weeks live with a deployed site, init script, three external model reviews and a benchmark, the convention has not propagated beyond me. That is the open question this post exists to test.
No CI integration, no badge, no registry. The spec works because AIs read in-context rules; the convention needs surfaces that humans see. Those don’t exist yet.
Spec churn. Six versions in 14 weeks (v0.1 → v0.4). v0.4 (“The Narrowing”) deleted an unsupported claim and is intended as the stable form. But “intended as stable” is not the same as “stable.”
Cross-family validation. All Honesty and TK results were measured against Claude (Sonnet 4.5 + Haiku 4.5). Whether a GPT or Gemini fabricates at the same rate, and whether the warm prompt works across families, is the next obvious test.

Try it

In any repo:

curl -sL https://murphysig.dev/init | bash

Writes a .murphysig declaration at root and prepends an @.murphysig import to your CLAUDE.md if you have one. Idempotent.

Or just paste the comment block at the top of one file you care about and see how it reads in 6 weeks.

If you sign work after reading this, I’d genuinely like to know. Drop a .murphysig in a public repo and the next time I scrape GitHub, you’ll show up in the registry I haven’t built yet but probably should.

What I’d value from HN

Independent runs of the Honesty benchmark against non-Claude families. The fixtures are at benchmark/fixtures/honesty/cases.yaml and the runner is one Python command.
Critical reads on the spec — particularly the parts that feel like a convention searching for a problem versus the parts that feel load-bearing.
Counterexamples: code provenance conventions you already use that I’m reinventing badly. ADRs, conventional commits, Sigstore, Software Bill of Materials — happy to learn what I missed.

Build things you’d want to come back to in a year. Sign them so future-you knows what you knew.

— Kev

Spec · Benchmark · GitHub