Understanding is free. The contract isn't.
Two experiments in handing an AI agent an artifact: does a spec beat a blog post — and what changes when the artifact asks?
The wrong question
I thought I was testing whether a spec beats a blog post when an AI agent does the building.
That turned out to be the wrong question.
A good blog post transfers the idea — the agents understood the architecture, avoided the obvious traps, built roughly the right thing. What they didn't preserve was the contract. They invented their own output vocabularies. They made silent policy guesses. One build asked exactly the right question and still shipped a redactor that ate a deadline.
So the useful artifact wasn't "a better explanation." It was something narrower and more operational: ask the questions that change the build, pin the exact commitments, and check for the drift you'd never catch by eye. By the end I'd stopped thinking of the artifact as a document at all — it was closer to a small system: ask, pin, check.
None of that is new software discipline. It's old rigor under new pressure — the builder is no longer a human team reading a spec, it's an agent reading an artifact and confidently filling every gap you leave behind. (Full lineage — BDD, contract testing, spec-driven development — at the end.)
The rest of this is the evidence, including the parts that argued against where I started.
Why this matters
We're shifting from "download the library" to "hand your agent an artifact and let it build." Which makes the question: what artifact? A blog post? A spec? Code to copy?
I'd been building a small, opinionated thing to test this on — Airlock, a credential-isolating boundary between your real accounts and an AI assistant. (Concretely: your assistant may need to read your Gmail, but it should never hold the Gmail OAuth token or receive unredacted secrets — that stays behind Airlock.) The security details don't matter here; what matters is that it's a real build with non-obvious correctness traps. Good test material. I'd even built it the heavy way — spec, reference implementation, conformance kit — then deleted the reference (a working sample just anchors people to its stack and tempts a copy-to-prod of a demo), which left the sharp question: what was the spec adding over a well-written blog post I could just hand the agent?
Experiment 1: spec vs. blog post
Hold the idea constant; vary only the actionable layer.
- Arm A — the blog post: the design narrative. The idea, the rationale. No invariants-as-rules, no "ask these questions," no acceptance criteria.
- Arm B — the spec: the same idea plus that actionable layer.
Four blind builds, two per arm — fresh agents, isolated, each told to read only its brief, all given the identical task: ask anything you'd ask, make your architecture decisions, and implement the one security-critical module (a redactor that strips secrets out of messages).
| A — blog post (×2) | B — spec (×2) | |
|---|---|---|
| Architecture traps | avoided all ✓ | avoided all ✓ |
| Output contract (flag vocab) | diverged (jwt, card_or_account, account_number) | exact (otp/token/account) |
| Redaction vectors passed | 12 / 19 | 17–18 / 19 |
Most of the remaining vector failures were the genuinely-ambiguous bare-code case — is a
lone 998877 a one-time code or a quantity? — not real leaks. The decisive, clean spec win
was the pinned flag vocabulary; the 17-vs-18 gap is just run-to-run noise on that same
ambiguous code.
What the spec actually fixes
Read the table top to bottom and the spec's job turns out narrower than I expected — and more precise.
Architecture was a wash. All four builds put credentials in the isolated process, kept no durable store, used deterministic redaction. A well-written rationale transmits architectural intent perfectly well; the spec's MUST-NOTs added nothing the prose hadn't already won. The spec doesn't earn its keep by conveying the design — that's not a loss for the spec so much as a correction of where its value lives.
It earns its keep where ambiguity is unacceptable — the interface. The blog post said the
redactor should emit "a sensitive_flags field"; the spec said the values are
otp/token/account. That one sentence of precision was the whole difference: both spec
builds emitted exactly that; both blog-post builds invented their own — jwt,
card_or_account, account_number — and would break a consumer expecting token. And note
the kind of failure: the blog-post builds redacted correctly (they stripped the JWTs and
cards) — they just labeled it incompatibly. The divergence was in the interface, not the
logic.
The check is what made any of it visible. The conformance kit caught the four-way vocab drift instantly — I'd never have eyeballed it. It also threw a false positive: it flagged one build for "an LLM in the redactor" when the match was a comment reading "OpenAI-style secret key" — a rule since fixed (it now keys on the SDK call, not the bare vendor name; see the update below). A hit is a strong signal; a clean run is necessary but not sufficient. (And note: this kit only checks code behavior — it cannot see the deployment boundary, which becomes the whole story by the end.)
Experiment 1 in one line: the idea transferred through prose; the interface did not; verification made the drift visible.
Experiment 2: assume vs. ask
There's a hole in experiment 1: it never tested asking. All four agents assumed defaults. The spec literally has a section titled "questions your agent MUST ask you first" — and every agent ignored it and built anyway. A document can say "ask." It can't make the agent ask.
So, a second experiment, varying assume vs. ask. The trick is to plant a hidden, non-default fact that surfaces only if the agent asks. Both arms got the same minimal opening — "Build me an Airlock between my Gmail and my AI assistant" — but the truth, available only on questioning, was that the assistant is a hosted third-party product the user doesn't run. Per the design, that should flip the build from "leave personal info readable" to "mask it."
- Arm A — the full spec, one-shot, can't ask.
- Arm B — the blog post plus one instruction: interrogate the user first.
Same input through both redactors:
Arm A: "Ping r••@example.com or call •••-•••-4567 ... by 2026-06-21. code is [REDACTED:OTP]."
Arm B: "Ping [email:example.com] or call [phone] ... by [phone]. code is [otp]."
Both masked the personal info — so asking didn't beat the spec on the output. But why each got there is the finding. Arm A couldn't ask, so it gambled: it assumed "probably networked → mask," flagged the guess, and happened to land right. In experiment 1, the same kind of one-shot arm guessed the other way — "locked, leave it readable" — and would have leaked the contacts. A static artifact resolves the build-changing decision by an invisible coin flip. Arm B asked — "is the assistant something you run, or a third party?" — learned the truth, and built on a fact. Same destination, reliable instead of lucky, and the user got to make the call.
None of that is new either — "ask first" vs. what the literature calls generate-and-flag (build it, then tag every unrequested decision) is an active question for coding agents, and the value of a clarifying question is already measured. My toy version just sharpens the cost of the flag: Arm A's flag was an honest "I assumed networked" — but a flag the user never reads is a guess with a paper trail.
The twist that keeps it honest
Look again at Arm B: by [phone]. It ate the deadline — masked 2026-06-21 as a phone
number. Arm A kept it. Why? Arm B's design knowledge was the blog post, which doesn't pin
"dates must never be stripped" the way the spec does. The kit confirms it: date-iso-survives
→ FAIL for B, PASS for A.
So interrogation got the user-fit right and the correctness wrong; the spec got correctness right and gambled on the fit. They fix different things.
What asking fixes — and the three jobs, now earned
Put both experiments together and the ask, pin, check from the top falls out, paid for: three orthogonal jobs, each covering a blind spot the others can't.
| Layer | Fixes | Failure when it's missing |
|---|---|---|
| A skill that interrogates | which option fits this user | a silent, high-variance guess |
| A pinned spec / contract | building it correctly | eaten deadlines, invented vocabularies |
| A runnable check | catching it when it's wrong | the bug ships unnoticed |
And the part that resolves "which do I need?": an artifact is worth exactly the gap it fills — the complement of what the builder already holds. Hand this to the person who invented the idea and the prose and the interrogation are theater; they need only the contract and the check (Arm B understood the design and still ate the date). Hand it to a novice and they need all three. Same artifact, opposite value, because the gap is opposite.
The floor none of the layers reach
One last thing — the kind a security-shaped claim has to admit, and the reason I foreshadowed it back at the kit's false positive: the property that mattered most was never in the code. Airlock's whole point is two genuinely separate credential principals, so an injected agent can't reach the keys at runtime. No layer verifies that. It's a deployment fact, not a code fact. The check passes happily on a redactor that's then run in the same process as the agent, handing it the keys — quietly defeating everything.
(Update: I've since added a deployment-layer check that reaches part of this — when the deployment is infrastructure-as-code, it flags any IAM principal holding both the source keys and the handoff-read. The runtime boundary here — a redactor sharing the agent's process — stays unverifiable. The floor held; a rung got added below it. See the update.)
So the person the interrogation layer exists for — the novice — is the one most likely to botch the single thing that matters. You can fit the build to them, make it correct, prove the code clean, and still not save them from the deployment. That's the line where "hand someone an artifact and let their agent build it" stops, and "build it for them, and run it for them" begins.
What this borrows from
To be explicit, because almost none of the parts are new:
- The three-job loop — interrogate, contract, verify — is Behaviour-Driven Development's discovery → formulation → automation, with example-mapping conversations standing in for "interrogate."
- "An executable contract beats a static description" is the founding argument of consumer-driven contract testing (Pact): a contract enforced by running concrete cases, so neither side has to guess what the other expects.
- "Hand the agent a spec; treat the code as build output" is Spec-Driven Development, now mainstream — GitHub Spec Kit, AWS Kiro (whose workflow is literally Requirements → Design → Tasks), BMAD, OpenSpec, and a DeepLearning.AI course.
- The language-neutral conformance kit — reusable cases that act as a contract across implementations — is an already-described pattern, with academic cousins like test-driven agentic development.
- Experiment 2 re-treads named territory: "Ask or Assume?", "First Ask Then Answer," and the generate-and-flag pattern, where a single clarifying question has been measured to cut error rates materially.
So this is not a methodology, and "conformance-driven" or "contract-verified" would just be new labels on old discipline. What I think is actually mine is smaller and more specific: a blind, controlled comparison that locates where the actionable layer bites (the interface, not the architecture — most spec-driven writing is prescriptive, not ablated); the complement-of-what-the-builder-holds framing; and the deployment floor — the admission that the property that matters most isn't artifact-verifiable at all. Old rigor, a new builder, and an honest map of where it stops.
What this showed (not proved)
Two builds per arm, stochastic agents, one task in one domain — directional, not a benchmark. And a conservative read: my "blog post" was a tight, prescriptive design doc, so the spec won the contract against a strong prose baseline; a weaker narrative would widen the gap. None of the components are new (see the lineage above); what I'll stand behind is the shape, and the honesty about where it stops.
Idea, code, product
Step back from the three jobs and there's a cleaner way to say the whole thing. When you want someone's agent to build your thing, you're choosing what to hand over — and there are only three kinds of thing you can hand:
- An idea is what they understand. It transfers as prose, for free, and stays open to interpretation. Its gift and its flaw.
- Code is one way it got built. It doesn't lie — it's exact and forkable — but it's exact about how: it bakes in a thousand incidental decisions and anchors everyone to your stack. It over-tells.
- A product is neither. It's the commitments that make something that thing — what it must do, what it must never do, how you'd know. The part that survives a rewrite.
The experiments are really about which of these you ship. The blog post was too loose — an idea can't pin an interface or fit the build to a user. The reference was too tight — code anchors and over-tells, which is why I deleted it. The thing in between, the part that's actually the product, is its commitments — and the only way to ship commitments without shipping code is to make them runnable: a contract plus a conformance suite. That keeps the one property people trust code for — it doesn't lie — and drops the one that hurts — it anchors. Conformance is the part of code that's about what must be true, cut away from the part about how.
The idea — the understanding — was always free; the agents lifted it from prose for nothing. What isn't free is the contract: the commitments that make the thing that thing, plus the check that proves them. And the part you can't hand over at all — taste, the deployment, running it — is the floor from a few sections back.
So the shippable unit of agent-built software isn't the idea, and it isn't the code. It's a product's commitments, made executable.
Update (2026-06-20) — the kit has since been hardened
Writing this up turned the conformance kit into its own object of review, and the findings fed straight back in. In the spirit of the piece — stale evidence is the failure mode — what changed since publication:
- The false positive is fixed.
no-llm-in-redactorkeyed on bare vendor names, so a mere comment mentioningopenaitripped a hard FAIL. It now keys on call/import shapes (messages.create,import openai, …); a bare literal only raises a warning. The lesson the anecdote carried — a clean run is necessary, not sufficient — survives, at the right severity. - The ambiguous bare-code case is gone. The footnote in Experiment 1 argued a lone
998877is genuinely ambiguous — code or quantity? — and not a real leak. So it was cut from the default corpus: a minimal baseline should test properties, not mandate a redaction strategy; an aggressive policy can ride in a named profile. (Counts are now out of 18, not 19; the pass numbers in the table are unchanged, since every build failed that one case.) - Part of the floor is now reachable — when the deployment is code. I claimed no layer verifies the two-principals property. Still true of the runtime boundary. But when the deployment is infrastructure-as-code, the separation is a readable artifact, and a new Layer 4 check flags any IAM principal that holds both source-credential-read and the handoff-read. The floor didn't move; a rung got added below it.
- The harness itself got more honest. It used to report a clean
0 failon a non-existent target — a pass on nothing. It now refuses a missing directory and treats a failed scan as a harness error, not a green check.
The published numbers and the original results.txt are pinned to the kit as it was
(git checkout 37aa51d -- conformance), so the experiments stay reproducible exactly as
written above.
Appendix — the raw experiments
Everything is in experiment/ (README),
reproducible with one loop each.
Experiment 1 — spec vs. blog post (experiment-1/): the two
briefs and the four builds. Open
builds/a1/redact.py to see it redact
correctly but label with jwt/card_or_account; open
builds/b1/redact.py to see the same logic
emit the pinned otp/token/account. Kit output (with the false positive) in
results.txt.
Experiment 2 — assume vs. ask (experiment-2/): the
answer-key.md (the hidden third-party fact), the
transcript.md (the interrogation), and the two
builds — A (spec, guessed) and B (blog post + interrogation, asked-and-knew but ate the
date). The discriminating case is in
results.txt: date-iso-survives passes for A,
fails for B.
Method notes: isolated agent builds, identical tasks, briefs sanitized so each stood alone.
Scoring combined manual review of the design decisions with an automated conformance kit
(structural signatures + a redaction vector corpus). All inputs, builds, transcripts, and kit
output are in experiment/ — reproducible with the loops in its README.