Retrospective: Shelfware Pattern in Waves 1–3
Date: 2026-04-15 Scope: PRD #380 Waves 1, 2, 3 delivery review Attendees (implicit): PM, Principal Architect, Backend SME, QA Engineer
The pattern
Across 3 waves of backend delivery (14 LLDs, 20 PRs), 6 distinct PRs shipped with interfaces defined and unit-tested but not wired into the production call path. Each was caught at architect review and required a re-review loop to fix.
Incidents
| # | Wave | LLD | PR | Symptom | Caught by |
|---|---|---|---|---|---|
| 1 | 2 | 6 Approval State Machine | #422 | NewPauserAdapter(nil, sessionManager) — APPROVED decisions never resumed the paused session | Architect re-read main.go |
| 2 | 2 | 7 Channel Adapters | #423 | CallbackHandler + AdminHandler both defined but never mounted on HTTP mux. Email links dead; emergency rotate dead | Architect grep for mount sites |
| 3 | 2 | 7 Channel Adapters | #423 | RotatorAudit hook unwired; Rotator singleton not enforced | Same review |
| 4 | 2 | 9 Policy Resolver | #427 | Resolver interface existed, zero call sites. ServiceConfig.PolicyResolver never populated. Service.Request never called it | Architect grep for call sites |
| 5 | 3 | 14 Approval Gate | #449 | Entire approval orchestrator unwired: 0 production callers of NewApprovalOrchestrator, EnqueueApproval, NewSubagentApprovalStore, ResolveSubagent, InvokeAfterApproval | Architect explicit grep |
| 6 | 3 | Real coordinator swap | #449 → #452 | InvokeAfterApproval wired to stubCoordinatorInvoker in production — approvals granted would fail loudly, not spawn children | Architect caught the stub was reachable in prod |
Cost
- 3 distinct PR re-review loops (each ~30 min architect time, ~45 min backend-sme time)
- Each incident adds one review cycle to the LLD
- Combined: ~4 hours of additional rework across the session
If any of these had shipped to prod, the user-visible impact would have been:
- Incident 1: silent approval success with stuck sessions (worst-case corruption of approval audit trail)
- Incident 2: email-based approval flow would silently never work
- Incident 3: audit gap on rotation events (SOC2 regression)
- Incident 4: tool calls that should require approval would silently execute
- Incident 5: the entire approval gate would be absent despite PRD sign-off claiming it
- Incident 6: approvals granted would error loudly but no child spawn (less bad because loud)
Incidents 1, 4, 5 were silent-failure class bugs. That's what makes this pattern alarming — happy-path unit tests pass, but the feature isn't reachable from production.
Root cause analysis
Why does this keep happening?
Root cause: the Go type system lets you define an interface + implementation + unit tests without any compile-time check that something CALLS the constructor with the new implementation. Unit tests happily exercise the component in isolation. go test ./... passes. go build ./... passes. go vet ./... passes. The production path uses the OLD interface (or a stub, or nil) and there is no signal.
Contributing factors:
-
Subagent scope discipline — Each LLD was deliberately scoped to "implement primitives, expose seam for next LLD". Backend-sme rightly stayed in scope. The "wiring" step was implicitly the LAST LLD's job, but each LLD's author thought their LLD was done at the seam boundary.
-
LLD documents often underspecify wiring — LLD 9's doc said "resolver wired into Service.Request". It did not say "and construct the resolver in cmd/agent-orchestrator/main.go at startup, pass tenant source, load seed file, pass to approval service config". Backend-sme read "wire the interface" and did just the interface side.
-
Architect's review checklist had a gap — reviews focused on correctness of each function, security of the design, test coverage. They did NOT include a mechanical "grep for production callers of newly-added exported entities". The pattern was invisible until architect did this grep manually, starting from Wave 2 LLD 9 review.
-
Unit tests exercise interfaces in isolation —
TestResolver_*,TestApprovalOrchestrator_*all pass because they construct the component directly. The failure mode (constructor never called in prod) is invisible to unit tests. -
Integration tests are often scoped too narrow — tests exercise the component + its immediate collaborators, stubbing far enough back that the "is it wired?" question isn't asked.
What didn't fix it on its own
- Review rigor increased after LLD 6 (first incident) — didn't prevent LLDs 7, 9, 14 from repeating the pattern
- Backend-sme self-reporting of deviations — they'd note "seam left for next LLD" without realizing that was the gap
- Founder-decision tracking — decisions were clear; the execution gap was downstream of decisions
Fixes
Fix 1 (gate): Architect checklist must include production-caller grep
Before any architect APPROVE on a PR that adds new exported entities, run:
# For every newly-added public entity (struct, function, interface, method)
grep -rn "NewThing\|ThingInterface\|ThingMethod" cmd/ internal/ --include="*.go" | grep -v _test.go
Each must have AT LEAST ONE production caller. If zero: REQUEST_CHANGES until wired.
This gate is non-negotiable. Add to .claude/agents/principal-architect.md.
Fix 2 (gate): Backend-sme must include grep verification in final report
Every PR report from backend-sme must include a grep verification section showing production call sites for newly-added exported entities. The report should make the zero-callers case impossible to overlook.
Pattern established in LLD 14 round 2: worked immediately. Backend-sme caught their own stub.
Add to .claude/agents/backend-sme.md.
Fix 3 (practice): LLD docs must specify wiring
LLD authors (architect) must include an explicit ## Wiring section in every LLD describing:
- Which production file(s) construct the new components (exact file path)
- Which existing code paths call the new component (caller site)
- What main.go startup changes are needed
- Any seed file / config updates
"Seam left for next LLD" is acceptable ONLY when the NEXT LLD is in the same wave AND has an explicit wiring requirement in its own Wiring section.
Add to architect's LLD template.
Fix 4 (process): Integration tests must exercise production wiring
At least one test per LLD must:
- Construct the system EXACTLY as main.go does (or via a helper that main.go also uses)
- Invoke the public entry point
- Verify behavior propagates end-to-end
"Unit test with all fakes" is not sufficient. The test must fail if wiring is missing.
Fix 5 (culture): "Shelfware" is a named bug class
When backend-sme or architect suspects a PR may have shelfware risk, they can use the term "shelfware check" and run the grep immediately. This normalizes the concern and makes the fix fast.
Not-fixes (considered and rejected)
- Automated dead-code lint in CI: Go's standard
unusedlint is too noisy for interface-based code; distinguishing "unused by design" (test doubles) from "dead shelfware" is non-trivial. Human grep is faster for now. - Require TDD: would slow velocity without addressing root cause (the component would still pass tests in isolation).
- Pair programming on every LLD: not practical with async agent workflow.
Follow-up actions
- ✅ Update
.claude/agents/principal-architect.mdwith Fix 1 checklist - ✅ Update
.claude/agents/backend-sme.mdwith Fix 2 reporting requirement - Architect updates LLD template with Fix 3 (Wiring section)
- Before Wave 4 kickoff, PM enforces these gates in every LLD review prompt
Incident cadence after fixes
Track: starting Wave 4, zero shelfware incidents expected. If any occur, log here and re-retro.
Appendix: commits fixing each incident
- Incident 1: #422 commit
c3564fc(wire PauserAdapter with real pause manager) - Incident 2: #423 commits
33bac15,56c2254(mount callback + admin handlers) - Incident 3: #423 commits
5af650c,cb84f95(audit sink + singleton lock) - Incident 4: #427 commits
37172c6,cb7fee7(wire resolver + ceiling) - Incident 5: #449 commits
d0fcc69,fd0654e,2152617,4678319,7fcdfed(5-commit wiring fix) - Incident 6: #452 commit
e2a3681(stub → real coordinator swap)