ADR-0007: Wave F ConflictNotifier — narrowed to rate-limit / idempotency residuals
Status: Accepted
Date: 2026-04-17
Decision: Close the scope gap in issue #588 by delivering ONLY the four admission-control rails that PR #565 did not ship (idempotency, cascade-replay, per-recipient rate-limit, email digest) plus retry/DLQ, metrics, and integration tests. Do not re-implement the 6-channel notifier core or the adapters; those landed clean in PR #565.
Related: PRD #549 §3.5, HLD v1.1 §3.5 / §3.5.5, LLD #560 §6, Issues #565 (closed), #588.
Context
Issue #588 was created from HLD v1.1 §3.5.5 after Wave F concern #4 was resolved. Its acceptance criteria span:
- 6-channel fan-out dispatcher (`internal/governance/conflicts/notifier.go`)
- Channel adapters (`adapters/ws.go`, `bell.go`, `audit.go`, `email.go`, `slack.go`)
- Idempotency: sha256(tenant|conflict|channel|recipient), Redis, 24h TTL
- Email digest: 15-min rolling window per recipient; P0/P1 bypass
- Rate limit: 20 notifs / 5 min / channel / recipient; overflow to digest
- Slack retry: 3 attempts, exp backoff (1s/4s/16s); DLQ
- Cascade replay: same `(conflict_hash, recipient)` within 1h suppressed
- Metric `notification_fanout_suppressed_total{reason}`
- Chaos test (100 conflicts on same target → <10 deliveries)
- Integration test (2-layer verdict disagreement → all 6 channels fire exactly once)
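The idempotency criterion above can be illustrated with a small key-derivation sketch. It is only an illustration of the sha256(tenant|conflict|channel|recipient) scheme: the function name `IdempotencyKey`, the `notif:idem:` key prefix, and the plain `|` join are assumptions, not details taken from the HLD or the shipped code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
	"time"
)

// IdempotencyTTL mirrors the 24h TTL named in the acceptance criteria.
const IdempotencyTTL = 24 * time.Hour

// IdempotencyKey hashes tenant|conflict|channel|recipient into a fixed-width
// hex digest. The "notif:idem:" prefix is hypothetical.
func IdempotencyKey(tenant, conflict, channel, recipient string) string {
	joined := strings.Join([]string{tenant, conflict, channel, recipient}, "|")
	sum := sha256.Sum256([]byte(joined))
	return "notif:idem:" + hex.EncodeToString(sum[:])
}

func main() {
	email := IdempotencyKey("t1", "c42", "email", "u7")
	slack := IdempotencyKey("t1", "c42", "slack", "u7")
	// Same conflict and recipient, different channel: the keys must differ
	// so each channel gets its own first delivery.
	fmt.Println(email != slack, IdempotencyTTL)
}
```

One caveat of a plain `|` join: if any field can itself contain `|`, two different tuples could hash to the same key, so a real implementation might length-prefix or escape the fields.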
PR #565 — merged 2026-04-15 as part of the Wave E bundle — already shipped:
- ✅ `internal/conflict/notifier.go` (6-channel dispatcher, debounce, failure isolation, panic isolation)
- ✅ `internal/conflict/adapters.go` (all 6 stub adapters: quad, bell, audit, inbox, email, webhook/Slack)
- ✅ `conflict_notifications` delivery ledger (migration 066)
- ✅ pg_notify LISTEN/NOTIFY wiring (triggers in migration 066)
Criteria 1 + 2 are already complete. The package name differs from the HLD (internal/conflict rather than internal/governance/conflicts), which the HLD v1.1 specifies as non-normative — the contract is the 6-channel dispatcher, not its package path.
The residual gap is criteria 3-10: the admission-control rails + retry + metrics + tests.
Decision
Option B (narrow to residuals). Ship four additions under the existing internal/conflict/ package:
- `guard.go`: `Guard` type composing four rails: idempotency (Redis SETNX, 24h), replay (Redis SETNX on conflict_hash, 1h), per-recipient rate-limit (in-memory token bucket via `golang.org/x/time/rate`, 20/5min), email digest (in-memory 15-min window).
- `retry.go`: `RetryDeliver(adapter, ...)` helper implementing the 1s/4s HLD backoff schedule; `ShouldRetry(channel)` classifier (only webhook + email retry).
- `metrics.go`: four Prometheus instruments (suppressed counter, delivered counter, failed counter with reason label, deliver-latency histogram).
- `guard_test.go` + `retry_test.go` + `notifier_guard_test.go`: cover every suppression reason, chaos (100 conflicts on same target), 2-layer disagreement integration, retry exhaustion, context cancellation.
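As a rough illustration of the retry helper described above, the following sketch shows one way a bounded backoff loop with context cancellation could look. The signature is an assumption (the ADR only names `RetryDeliver(adapter, ...)`), `ErrRetryExhausted` and `runDemo` are hypothetical, and the demo uses a millisecond-scaled stand-in for the 1s/4s waits so it finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrRetryExhausted is a hypothetical sentinel wrapped around the last error.
var ErrRetryExhausted = errors.New("retry_exhausted")

// ShouldRetry reports whether a channel is retried; per the ADR only
// webhook and email are.
func ShouldRetry(channel string) bool {
	return channel == "webhook" || channel == "email"
}

// RetryDeliver calls deliver once, then once more after each wait in backoff,
// for len(backoff)+1 attempts in total. It stops early if ctx is cancelled.
func RetryDeliver(ctx context.Context, backoff []time.Duration, deliver func(context.Context) error) error {
	for attempt := 0; ; attempt++ {
		err := deliver(ctx)
		if err == nil {
			return nil
		}
		if attempt >= len(backoff) {
			return fmt.Errorf("%w: %v", ErrRetryExhausted, err)
		}
		timer := time.NewTimer(backoff[attempt])
		select {
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err()
		case <-timer.C:
		}
	}
}

// runDemo simulates an adapter that fails `failures` times before succeeding.
// The waits are scaled down to milliseconds for the demo only.
func runDemo(failures int) (calls int, err error) {
	backoff := []time.Duration{1 * time.Millisecond, 4 * time.Millisecond}
	err = RetryDeliver(context.Background(), backoff, func(context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("transient failure")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := runDemo(2) // succeeds on the third attempt
	fmt.Println(calls, err)
	calls, err = runDemo(5) // gives up after three attempts
	fmt.Println(calls, err)
}
```

Wrapping the last error under a sentinel keeps the exhaustion case distinguishable from a context cancellation, which is useful if the caller records a `retry_exhausted` failure reason.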
Wire points:
- `notifier.go`: new `WithGuard(*Guard)` + `WithRetryPolicy(RetryPolicy)` options; dispatcher loop runs `guard.Admit` before calling the adapter, then `RetryDeliver` for webhook/email and direct `Deliver` for the others. Metrics fire on deliver / fail / suppress.
- `cmd/context-engine/main.go`: constructs `conflict.NewGuard(conflict.WithRedis(rdb), …)` and passes it into `NewNotifier`, reusing the existing `*redis.Client` (the `RedisClient` interface is a one-method subset of `*redis.Client`).
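The wiring above (an optional guard injected through a functional option, consulted before any adapter call) might be sketched as follows. All types here are simplified, hypothetical stand-ins: the real `Guard` is a concrete type, and the shape of `Admit`, the `Event` struct, and the `rate_limited` reason string are assumptions made for the example.

```go
package main

import "fmt"

// Event is a hypothetical, minimal notification event.
type Event struct {
	Channel   string
	Recipient string
}

// Guard is reduced to the single call the dispatcher needs in this sketch.
type Guard interface {
	Admit(ev Event) (ok bool, reason string)
}

// Notifier holds an optional guard; a nil guard admits everything.
type Notifier struct {
	guard Guard
}

// Option is a functional option applied by NewNotifier.
type Option func(*Notifier)

// WithGuard installs the admission guard.
func WithGuard(g Guard) Option {
	return func(n *Notifier) { n.guard = g }
}

// NewNotifier builds a Notifier and applies the options in order.
func NewNotifier(opts ...Option) *Notifier {
	n := &Notifier{}
	for _, opt := range opts {
		opt(n)
	}
	return n
}

// dispatch returns a label for the path taken instead of calling real adapters.
func (n *Notifier) dispatch(ev Event) string {
	if n.guard != nil {
		if ok, reason := n.guard.Admit(ev); !ok {
			return "suppressed:" + reason // a suppress metric would fire here
		}
	}
	if ev.Channel == "webhook" || ev.Channel == "email" {
		return "deliver-with-retry"
	}
	return "deliver-direct"
}

// denyRecipient is a toy guard that suppresses a single recipient.
type denyRecipient string

func (d denyRecipient) Admit(ev Event) (bool, string) {
	if ev.Recipient == string(d) {
		return false, "rate_limited"
	}
	return true, ""
}

func main() {
	n := NewNotifier(WithGuard(denyRecipient("u1")))
	fmt.Println(n.dispatch(Event{Channel: "email", Recipient: "u1"}))
	fmt.Println(n.dispatch(Event{Channel: "email", Recipient: "u2"}))
	fmt.Println(n.dispatch(Event{Channel: "bell", Recipient: "u2"}))
}
```

Keeping the guard optional means existing callers that construct the notifier without options keep their current behaviour.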
Deviations from the HLD AC list
| AC | Decision | Rationale |
|---|---|---|
| "P0/P1 bypass" for email digest | Implemented as ClearanceAtDecision >= 4 bypass. | policy_conflicts has no explicit priority column — clearance-at-decision is the closest available signal. Admin / P0 escalations always write at clearance ≥ 4 (see handler_perms). If a dedicated severity column lands later, the check is a one-line swap. |
| Overflow "routes to digest" on rate-limit | Overflow is dropped (not digested). | Routing rate-limited events into a digest queue requires a persistent outbox + a separate digest worker — out of scope for Wave F residuals. The rate-limit rail's sustain rate (20/5min) plus the email digest rail (15-min collapse) already cap the recipient's inbox well below the AC ceiling. A follow-up can add digest-as-overflow cheaply because the guard returns a typed reason. |
| DLQ on Slack retry exhaust | Implemented via conflict_notifications row with status='failed' + failure_reason='retry_exhausted: …'. | No separate DLQ table — the ledger IS the DLQ. Ops replays via SELECT … WHERE failure_reason LIKE 'retry_exhausted:%'. Keeps the schema surface minimal and composes with existing RLS. |
| Package path `internal/governance/conflicts/` | Kept as `internal/conflict/`. | PR #565 established the path; moving it would force a churn-heavy rename across six callers for zero behavioural gain. HLD path was non-normative. |
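To make the digest-bypass row concrete, here is a minimal sketch of a 15-minute per-recipient window with a clearance-based bypass. It assumes the window is anchored at the last immediately-sent email and that bypassing events do not reset it; the ADR does not specify either detail. `Digest`, `SendNow`, and `at` are hypothetical names, and the sketch is not goroutine-safe.

```go
package main

import (
	"fmt"
	"time"
)

const (
	digestWindow    = 15 * time.Minute
	bypassClearance = 4 // stands in for "P0/P1" per the table above
)

// Digest remembers when each recipient last received an immediate email.
type Digest struct {
	lastSent map[string]time.Time
}

// NewDigest returns an empty digest tracker.
func NewDigest() *Digest {
	return &Digest{lastSent: make(map[string]time.Time)}
}

// SendNow reports whether an email should go out immediately. Events at or
// above bypassClearance always send; others send at most once per window and
// are otherwise held for the digest.
func (d *Digest) SendNow(recipient string, clearance int, now time.Time) bool {
	if clearance >= bypassClearance {
		return true
	}
	if last, ok := d.lastSent[recipient]; ok && now.Sub(last) < digestWindow {
		return false
	}
	d.lastSent[recipient] = now
	return true
}

// at returns a fixed base time plus the given number of minutes, for demos.
func at(minutes int) time.Time {
	base := time.Date(2026, 4, 17, 9, 0, 0, 0, time.UTC)
	return base.Add(time.Duration(minutes) * time.Minute)
}

func main() {
	d := NewDigest()
	fmt.Println(d.SendNow("u1", 1, at(0)))  // first email in the window
	fmt.Println(d.SendNow("u1", 1, at(5)))  // inside the window, held
	fmt.Println(d.SendNow("u1", 4, at(6)))  // clearance >= 4 bypasses
	fmt.Println(d.SendNow("u1", 1, at(15))) // window has elapsed
}
```

Because the bypass is a single comparison, swapping it for a dedicated severity column later would touch only that one condition.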
Why not rebuild
Rebuilding the 6-channel dispatcher would:
- Duplicate the existing `Notifier` + `ChannelAdapter` machinery
- Force migration renames (`conflict_notifications` already exists as migration 066)
- Break ConflictService's existing `conflictNotifier.Enqueue(...)` calls in `service.go` and `merge_synthesis.go`
- Add zero capability beyond what PR #565 already delivers
Consequences
- The Guard runs strictly BEFORE the delivery row is inserted, so suppressed events do NOT generate a `conflict_notifications` row. This is intentional: the rails exist to prevent noise, not to record every would-be delivery.
- Pod-local rate-limit state means two pods can independently admit their burst budget. Practical ceiling stays bounded: `pod_count × 20 per 5min per channel per recipient`. With typical deployments (2-6 pods) this is 40-120 notifications / 5min, still inside the UX tolerance for a digest-bypassing conflict.
- Fail-open on Redis error: we tolerate a duplicate notification over a silently-dropped one. A sustained Redis outage spikes duplicate deliveries, surfaced via the absence of suppressed metrics (ops alert).
- Tests run under `-race -count=1` clean; the chaos test uses the fake Redis to keep CI fast.
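The fail-open behaviour listed above can be sketched against an in-memory fake, in the spirit of the fake Redis the tests use. The `SetNXer` interface below is a hypothetical simplification; it does not match the go-redis `SetNX` signature, and the fake ignores the TTL. `AdmitOnce`, `admit`, and `newFakeRedis` are illustrative names only.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// SetNXer is a hypothetical one-method store interface: it returns true when
// the key was newly set and false when it already existed.
type SetNXer interface {
	SetNX(ctx context.Context, key string, ttl time.Duration) (bool, error)
}

// fakeRedis is an in-memory stand-in that can simulate an outage.
type fakeRedis struct {
	seen map[string]bool
	down bool
}

func newFakeRedis() *fakeRedis {
	return &fakeRedis{seen: make(map[string]bool)}
}

// SetNX records the key on first sight; the TTL is ignored in this fake.
func (f *fakeRedis) SetNX(_ context.Context, key string, _ time.Duration) (bool, error) {
	if f.down {
		return false, errors.New("redis unavailable")
	}
	if f.seen[key] {
		return false, nil
	}
	f.seen[key] = true
	return true, nil
}

// AdmitOnce admits a notification the first time its key is seen. On a store
// error it fails open, preferring a possible duplicate over a silent drop.
func AdmitOnce(ctx context.Context, store SetNXer, key string, ttl time.Duration) bool {
	fresh, err := store.SetNX(ctx, key, ttl)
	if err != nil {
		return true // fail open
	}
	return fresh
}

// admit is a demo helper using a background context and the 24h TTL.
func admit(store SetNXer, key string) bool {
	return AdmitOnce(context.Background(), store, key, 24*time.Hour)
}

func main() {
	store := newFakeRedis()
	fmt.Println(admit(store, "k1")) // first delivery is admitted
	fmt.Println(admit(store, "k1")) // duplicate is suppressed
	store.down = true
	fmt.Println(admit(store, "k1")) // outage: admitted despite being a duplicate
}
```

During an outage every call takes the fail-open branch, so no suppressions are counted; that is consistent with the ADR's note that the outage surfaces as an absence of suppressed metrics.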
Related
- PR #565 (Wave E bundle — dispatcher + adapters + ledger)
- PR #561 / #567 (migration 060 org_units, migration 064 org_unit_tools)
- HLD v1.1 §3.5.5 concern #4 resolution