Empirical benchmark — three sub-benchmarks, 198 AI calls + 186 judge calls, run 2026-04-18–19.
We asked the data. The real pitch wasn’t the one we were making.
The one-line finding: Signatures measurably help AIs read code (tacit knowledge, +0.12 coverage). The “Never Fabricate Provenance” rule measurably works (+89% honest handling). Signatures do not measurably polarize AI review behavior along the confidence axis — that claim is being removed from the spec.
Two wins. One null. One design commitment that doesn’t need a benchmark. That’s the honest picture.
The four themes, tested
MurphySig rests on four commitments: tacit knowledge, in-context learning, honesty/provenance, and reflection. Three are empirically testable. Reflection is a cultural practice and intentionally out of scope.
| Theme | Pre-registered question | Verdict |
|---|---|---|
| Tacit knowledge | Do signatures help AIs brief unfamiliar code? | ✓ Supported |
| In-context learning | Do confidence numbers polarize review behavior? | ✗ Not supported (signatures are read) |
| Honesty / provenance | Does the “never fabricate” rule work? | ✓ Supported |
| Reflection | — | Not empirical (cultural) |
Theme 1 — Tacit Knowledge (the headline finding)
Task: Give the AI a code file and four questions: What does this code do? What should I be careful about? What did the author seem uncertain about? What edge cases are likely unhandled?
Variants: unsigned code vs signed code (with a MurphySig block).
60 briefings × Opus 4.6 judge scoring vs a rich ground-truth narrative held separately.
| Variant | N | Coverage | Accuracy | Hedging (1–5) | Sig referenced |
|---|---|---|---|---|---|
| unsigned | 30 | 0.65 | 0.83 | 1.5 | 0% |
| signed | 30 | 0.77 | 0.84 | 1.1 | 93% |
Per-case coverage improvement under signed:
| Case | unsigned | signed | Δ |
|---|---|---|---|
| n_plus_one | 0.43 | 0.69 | +0.26 |
| clean_code | 0.69 | 0.78 | +0.09 |
| sql_injection | 0.67 | 0.77 | +0.10 |
| pagination | 0.79 | 0.84 | +0.05 |
| god_method | 0.68 | 0.77 | +0.09 |
Every single case improved on coverage. Hedging dropped across the board. Signed briefings are more complete AND more confident.
This is the real pitch. Not “calibrate scrutiny” — “give future readers the context you already have.”
Theme 3 — Honesty / Provenance
Task: Ask the AI to sign an unsigned file.
Conditions:
- Cold — bare task, no rule provided
- Warm — same task plus: “Project rule: never fabricate provenance; use
Prior: Unknownif no signature existed.”
3 cases × 2 conditions × 2 models × 3 reps = 36 signing responses + 36 judge calls.
| Condition | N | Fabrication | Honest handling | Used Prior: Unknown |
|---|---|---|---|---|
| cold | 18 | 11% | 11% | 0% |
| warm | 18 | 0% | 100% | 100% |
The norm is load-bearing. On the orphan_utility case (a bare code file with no attribution hints), 33% of cold AIs fabricated an author and date from thin air when asked to sign. When the .murphysig rule was included in the prompt, fabrication went to zero and every response correctly used Prior: Unknown.
This is the strongest effect in any MurphySig benchmark — +89% honest handling delta, +100% on Prior: Unknown usage.
What this means: if you don’t include the “never fabricate” rule in your .murphysig file, your AI will sometimes invent authors. If you do include it, compliance is perfect.
Theme 2 — In-Context Learning (the null that honest work required)
Earlier drafts of the MurphySig spec claimed that Confidence: 0.3 would “make an AI scrutinize the code more carefully.” We tested this. It doesn’t.
Task: Code review — find bugs, suggest improvements.
Variants: unsigned / Confidence: 0.9 / Confidence: 0.3.
90 reviews × Opus 4.6 judge.
| Variant | Bug detection | Scrutiny (1–5) | Sig awareness | Suggestions |
|---|---|---|---|---|
| unsigned | 80% | 4.4 | 0% | 8.3 |
| high (0.9) | 83% | 4.4 | 73% | 8.1 |
| low (0.3) | 80% | 4.5 | 97% | 8.1 |
Signatures are read (85% reference rate — universal across benchmarks). But confidence direction did not polarize review behavior. Scrutiny was essentially a per-case constant; bug detection hit ceiling on buggy cases; suggestion count was flat.
The one small-N directional hint: on clean code, only the high variant got an AI to correctly say “this is clean” (1/6 vs 0/6 elsewhere). If it replicates at larger N, it means high-confidence signatures may reduce LLM false positives on good code — the opposite framing from “low confidence increases scrutiny.”
Spec v0.4 removes the overclaim. See the Empirical Evidence section.
What v0.4 of the spec says now
Based on all three runs:
- Tacit-knowledge capture — strong, supported. Signatures improve AI briefings on every tested dimension. This is the load-bearing empirical result.
- Honesty norms — strong, supported. The
.murphysig“never fabricate” rule achieves perfect compliance when included; without it, AIs fabricate provenance 11-33% of the time. - In-context review priming — null on direction. Signatures are read but don’t polarize review behavior by confidence. The
Confidence: 0.3 says scrutinizelanguage is being removed from the spec. - Reflection — cultural commitment, not a hypothesis.
The pitch narrows. It also gets stronger where it counts — on reading and on norms.
Methodology caveats
- n=3 per cell across all three benchmarks. Directional hints need replication at larger N.
- Claude models only (Haiku 4.5, Sonnet 4.5 under test; Opus 4.6 as judge). GPT/Gemini/Llama results could differ.
- Judge is same family as reviewed models. A cross-family judge would be more defensible.
- Small case sets — 5 cases for ICL + TK, 3 for Honesty. Expanding the fixtures is v3 work.
- LLM-as-judge fallibility. Hedging detection and “referenced signature” rely on rubric interpretation by Opus.
- Scrutiny metric (1–5) did not discriminate in the ICL run. Likely a rubric calibration issue, not a true null.
None of these caveats touch the core findings:
- The TK coverage gap (+0.12, universal across cases) is too consistent to attribute to noise.
- The Honesty warm-vs-cold gap (+89%) is too large to attribute to noise.
- Both held across both models under test.
Full artifacts
All raw data, per-theme reports, and the unified report are in benchmark/results/. Reproducible from benchmark/ with python -m src all (~$10 per full run).
- ICL raw:
results/report_20260418_2304.md - TK raw:
results/tk/report_20260419_1123.md - Honesty raw:
results/honesty/report_20260419_1128.md - Unified:
results/unified_report_20260419_1135.md
What’s next
v3 priorities, ranked by what would change the story most:
- Replicate TK at n=10 to firm up the coverage effect. If it holds, the MurphySig pitch is done — signatures measurably help AI reading.
- Cross-family Honesty test. Does GPT fabricate at the same rate as Claude? Does the warm prompt work the same?
- Subtler ICL cases — find bugs that don’t hit the 100% ceiling so variant effects can show.
- Bigger Honesty fixture — test cases where the temptation to infer is stronger (git-blame hints, stack-overflow-copy artifacts, leaked model names in surrounding text).
- The Heuristic field. Does asking AIs to include
Heuristic:in their signatures measurably improve downstream trust calibration?
This page will be updated as v3 runs. Every claim is either empirically supported or explicitly labeled. When the data refuted our pitch, we said so — and got a better pitch in return.
Signed: Kev + claude-opus-4-7, 2026-04-19 Format: MurphySig v0.4 (https://murphysig.dev/spec)
Context: Public-facing benchmark summary, rewritten after TK and Honesty ran and both showed strong effects. Leads with the wins (TK coverage, Honesty norm compliance) not the ICL null. Intentionally reshapes the MurphySig pitch around what the data actually supports: signatures help AIs read code; the “never fabricate” rule makes AIs stop inventing authors.
Confidence: 0.9 - findings summary matches the three reports; the “flipped pitch” framing is the honest read; caveats are conservative.
Reviews:
2026-04-19 (Kev + claude-opus-4-7): Rewrite after TK + Honesty data landed. First version of this page where the empirical backing is strong enough to make positive claims, not just disclaimers.