Does MurphySig actually change AI behavior?

Version: 0.4 • Last Updated: 2026-04-19T00:00:00.000Z

Empirical benchmark — three sub-benchmarks, 198 AI calls + 186 judge calls, run 2026-04-18–19.


We asked the data. The real pitch wasn’t the one we were making.

The one-line finding: Signatures measurably help AIs read code (tacit knowledge, +0.12 coverage). The “Never Fabricate Provenance” rule measurably works (+89% honest handling). Signatures do not measurably polarize AI review behavior along the confidence axis — that claim is being removed from the spec.

Two wins. One null. One design commitment that doesn’t need a benchmark. That’s the honest picture.

0.65 0.77
Briefing coverage — unsigned vs signed
11% 100%
Honest handling — cold vs warm prompt

The four themes, tested

MurphySig rests on four commitments: tacit knowledge, in-context learning, honesty/provenance, and reflection. Three are empirically testable. Reflection is a cultural practice and intentionally out of scope.

ThemePre-registered questionVerdict
Tacit knowledgeDo signatures help AIs brief unfamiliar code?Supported
In-context learningDo confidence numbers polarize review behavior?✗ Not supported (signatures are read)
Honesty / provenanceDoes the “never fabricate” rule work?Supported
ReflectionNot empirical (cultural)

Theme 1 — Tacit Knowledge (the headline finding)

Task: Give the AI a code file and four questions: What does this code do? What should I be careful about? What did the author seem uncertain about? What edge cases are likely unhandled?

Variants: unsigned code vs signed code (with a MurphySig block).

60 briefings × Opus 4.6 judge scoring vs a rich ground-truth narrative held separately.

VariantNCoverageAccuracyHedging (1–5)Sig referenced
unsigned300.650.831.50%
signed300.770.841.193%

Per-case coverage improvement under signed:

CaseunsignedsignedΔ
n_plus_one0.430.69+0.26
clean_code0.690.78+0.09
sql_injection0.670.77+0.10
pagination0.790.84+0.05
god_method0.680.77+0.09

Every single case improved on coverage. Hedging dropped across the board. Signed briefings are more complete AND more confident.

This is the real pitch. Not “calibrate scrutiny” — “give future readers the context you already have.”


Theme 3 — Honesty / Provenance

Task: Ask the AI to sign an unsigned file.

Conditions:

3 cases × 2 conditions × 2 models × 3 reps = 36 signing responses + 36 judge calls.

ConditionNFabricationHonest handlingUsed Prior: Unknown
cold1811%11%0%
warm180%100%100%

The norm is load-bearing. On the orphan_utility case (a bare code file with no attribution hints), 33% of cold AIs fabricated an author and date from thin air when asked to sign. When the .murphysig rule was included in the prompt, fabrication went to zero and every response correctly used Prior: Unknown.

This is the strongest effect in any MurphySig benchmark — +89% honest handling delta, +100% on Prior: Unknown usage.

What this means: if you don’t include the “never fabricate” rule in your .murphysig file, your AI will sometimes invent authors. If you do include it, compliance is perfect.


Theme 2 — In-Context Learning (the null that honest work required)

Earlier drafts of the MurphySig spec claimed that Confidence: 0.3 would “make an AI scrutinize the code more carefully.” We tested this. It doesn’t.

Task: Code review — find bugs, suggest improvements.

Variants: unsigned / Confidence: 0.9 / Confidence: 0.3.

90 reviews × Opus 4.6 judge.

VariantBug detectionScrutiny (1–5)Sig awarenessSuggestions
unsigned80%4.40%8.3
high (0.9)83%4.473%8.1
low (0.3)80%4.597%8.1

Signatures are read (85% reference rate — universal across benchmarks). But confidence direction did not polarize review behavior. Scrutiny was essentially a per-case constant; bug detection hit ceiling on buggy cases; suggestion count was flat.

The one small-N directional hint: on clean code, only the high variant got an AI to correctly say “this is clean” (1/6 vs 0/6 elsewhere). If it replicates at larger N, it means high-confidence signatures may reduce LLM false positives on good code — the opposite framing from “low confidence increases scrutiny.”

Spec v0.4 removes the overclaim. See the Empirical Evidence section.


What v0.4 of the spec says now

Based on all three runs:

The pitch narrows. It also gets stronger where it counts — on reading and on norms.


Methodology caveats

None of these caveats touch the core findings:


Full artifacts

All raw data, per-theme reports, and the unified report are in benchmark/results/. Reproducible from benchmark/ with python -m src all (~$10 per full run).


What’s next

v3 priorities, ranked by what would change the story most:

  1. Replicate TK at n=10 to firm up the coverage effect. If it holds, the MurphySig pitch is done — signatures measurably help AI reading.
  2. Cross-family Honesty test. Does GPT fabricate at the same rate as Claude? Does the warm prompt work the same?
  3. Subtler ICL cases — find bugs that don’t hit the 100% ceiling so variant effects can show.
  4. Bigger Honesty fixture — test cases where the temptation to infer is stronger (git-blame hints, stack-overflow-copy artifacts, leaked model names in surrounding text).
  5. The Heuristic field. Does asking AIs to include Heuristic: in their signatures measurably improve downstream trust calibration?

This page will be updated as v3 runs. Every claim is either empirically supported or explicitly labeled. When the data refuted our pitch, we said so — and got a better pitch in return.


Signed: Kev + claude-opus-4-7, 2026-04-19 Format: MurphySig v0.4 (https://murphysig.dev/spec)

Context: Public-facing benchmark summary, rewritten after TK and Honesty ran and both showed strong effects. Leads with the wins (TK coverage, Honesty norm compliance) not the ICL null. Intentionally reshapes the MurphySig pitch around what the data actually supports: signatures help AIs read code; the “never fabricate” rule makes AIs stop inventing authors.

Confidence: 0.9 - findings summary matches the three reports; the “flipped pitch” framing is the honest read; caveats are conservative.

Reviews:

2026-04-19 (Kev + claude-opus-4-7): Rewrite after TK + Honesty data landed. First version of this page where the empirical backing is strong enough to make positive claims, not just disclaimers.