ADR-0008: Wave F MergeSynthesisWorker — narrowed to residuals pending PRD #558
Status: Accepted
Date: 2026-04-17
Decision: Narrow issue #589 to hardening the existing MergeSynthesisWorker (MaxAttempts enforcement + exponential backoff + golden tests for StubSynthesizer + worker observability). Leave StubSynthesizer as the plugged implementation. Defer real two-pass Sonnet→Opus synthesis to PRD #558 which owns the LLMGateway + llm_usage_events attribution surface.
Related: PRD #549, HLD #556 v1.1 §3.5.4 / §6.2.2, LLD #560 §9, PRD #558, issues #565 (Wave E shipped the stub), #589 (this).
Context
PR #565 (Wave E, merged 2026-04-10) shipped:
- `internal/ai/router`: `StubRouter` + `Router` interface, with `Resolve(functionKey, orgID) → *ResolvedModel`. The only place in the Wave E tree that names a model slug.
- `internal/conflict/merge_synthesis.go`: `MergeSynthesisWorker` pool (size=4 per pod, `FOR UPDATE SKIP LOCKED`, 30s invoke timeout, per-row `SET LOCAL app.org_id` under `SET LOCAL row_security = off` for RLS-aware cross-tenant sweep) + `StubSynthesizer` that returns a deterministic skeleton merge.
The architect raised residual gaps (#589) against the merged worker. Issue #589's original acceptance list includes real LLMGateway wiring + two-pass Sonnet→Opus + confidence-based escalation + per-call llm_usage_events. This work cannot land in Wave F because:
- LLMGateway does not exist. It is the central deliverable of PRD #558 (Platform AI Model Configuration). Stubbing a second LLM client outside that PRD would ship dead code + contradict the master PRD NFR ("NO HARDCODED MODEL SLUGS in business code").
- `llm_usage_events.agent_id` is `NOT NULL`. The schema for per-call attribution was landed for agent runtime use; the MergeSynthesisWorker is not an agent invocation, so writing rows would require a schema delta that belongs to PRD #558, not a Wave F hardening PR.
- Confidence-based escalation requires the real model client. A stub cannot emit `confidence_score`; gating on it would be fiction.
Founder approved Option B on 2026-04-17 (escalation: https://github.com/upsquad-ai/upsquad-core/issues/589#issuecomment-4266999198) — narrow #589 scope, ship only the hardening residuals, defer real synthesis to PRD #558.
Decision
Ship only these residuals under issue #589:
- `MaxAttempts` enforcement (default 3). Transient failures (router resolve, synthesizer error, `SetSuggestedMerge` write) increment an in-memory attempt counter keyed by conflict id. On reaching the cap, the row is stamped with a permanent `synthesis_max_attempts_exceeded` marker so the claim scan stops re-picking it.
- Exponential backoff with cap. The Nth retry waits `min(BaseBackoff × 2^(N-1), BaseBackoff × 10)`. Defaults: `BaseBackoff=1s`, so the schedule is `1s → 2s → 4s` over the three-attempt window (well inside the claim tick cadence so rows don't starve the sweeper).
- Golden-file tests for the deterministic `StubSynthesizer`. Five representative conflict shapes (org-over-member tool call, platform-over-org data export, org_unit-over-member knowledge query, member-wins wildcard target, platform-default-no-loser) locked into `internal/conflict/testdata/merge_golden/*.json`. An injectable `Clock` makes `generated_at` deterministic; `UPSQUAD_GOLDEN_UPDATE=1` regenerates the fixtures.
- Worker observability. Three cardinality-bounded Prometheus instruments, registered via `promauto` so the default `/metrics` scrape picks them up:
  - `synthesis_attempts_total{outcome}`: `outcome ∈ {success, transient_failure, permanent_failure, skipped}` (4 active series).
  - `synthesis_duration_seconds`: histogram, 1ms → ~65s buckets.
  - `synthesis_failures_total{reason}`: `reason ∈ {set_org_id, load_row, router_resolve, synthesizer, invalid_json, write_merge, max_attempts_exceeded, unknown}` (8 active series).

  Total package-level cardinality ≤ 12.
- Seam documentation. A prominent block comment at the top of `merge_synthesis.go` states: "PRD #558 will replace StubSynthesizer with real LLMGateway two-pass. Do not mutate this seam without coordinating with that PRD."
What this ADR explicitly does NOT deliver
- Real LLM call (Sonnet 4.6 default): deferred to PRD #558.
- Opus 4 escalation on `confidence_score < 0.7`: deferred to PRD #558.
- `llm_usage_events` per-call row: deferred to PRD #558.
- LLMGateway interface: owned by PRD #558.
- No new migrations. No changes to `internal/ai/router/`. No changes to `llm_usage_events` schema.
Handoff contract to PRD #558
When PRD #558 lands:
- The `MergeSynthesizer` interface is the drop-in seam. PRD #558 will ship a new implementation (e.g. `internal/ai/synthesis/LLMGatewayMergeSynthesizer`) that the `cmd/context-engine/main.go` wiring swaps in place of `conflict.NewStubSynthesizer()`.
- The golden fixtures describe the expected output shape. PRD #558's real LLM prompt MUST produce a compatible payload (field names + types) so downstream consumers (audit exports, client admin UI) don't branch on "stub vs real".
- The three worker metrics continue to apply; PRD #558 adds provider-specific sub-metrics alongside them rather than replacing them.
- `MaxAttempts` semantics stay identical: PRD #558's two-pass (Sonnet → Opus on confidence miss) counts as a single attempt from the worker's perspective because escalation is an internal concern of the synthesizer, not the retry accountant.
Consequences
Positive
- Wave F ships the narrow hardening without blocking on PRD #558 scope or timeline.
- No speculative code in the tree: every exported entity has a production caller (shelfware gate).
- Golden fixtures act as the shape contract for the PRD #558 LLM prompt engineer — less coordination ambiguity than prose.
- MaxAttempts + backoff eliminates the current risk of a poisoned row silently re-attempting forever.
Negative
- The StubSynthesizer remains the plugged implementation in production until PRD #558 ships. This is acceptable because policy_conflicts inbox consumers (admin UI, SIEM export) treat `suggested_merge_json` as advisory; the authoritative resolution happens in the `Acknowledge`/`Resolve` RPCs.
- An in-memory attempt tracker loses state on pod restart. Acceptable because (a) the row-in-DB is the queue; (b) `ClaimPendingSynthesis` re-picks eligible rows; (c) `MaxAttempts` is a last-resort cap, not a correctness property.
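The restart trade-off is easier to see with a minimal sketch of an in-memory tracker. The type and method names below are hypothetical, and representing the conflict id as a string is an assumption; the real tracker in `merge_synthesis.go` may differ.

```go
package main

import "sync"

// attemptTracker counts transient failures per conflict id in process memory.
// Its state disappears when the pod restarts.
type attemptTracker struct {
	mu       sync.Mutex
	max      int
	attempts map[string]int
}

func newAttemptTracker(max int) *attemptTracker {
	return &attemptTracker{max: max, attempts: make(map[string]int)}
}

// recordFailure increments the counter for conflictID and reports whether the
// cap has been reached. Once exhausted, the entry is dropped because the row
// would be stamped with a permanent marker and never claimed again.
func (t *attemptTracker) recordFailure(conflictID string) (n int, exhausted bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.attempts[conflictID]++
	n = t.attempts[conflictID]
	if n >= t.max {
		delete(t.attempts, conflictID)
		return n, true
	}
	return n, false
}
```

A restart is equivalent to constructing a fresh tracker: a row that had already failed twice would get up to three more tries, which is why the cap is described as a last resort and not a correctness property.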
Alternatives considered
- Ship everything in #589's original scope now. Rejected: requires LLMGateway + `llm_usage_events.agent_id` schema delta + stubbed Opus escalation. Scope belongs to PRD #558.
- Ship nothing; wait for PRD #558. Rejected: leaves the worker without retry caps (poisoned row risk) and without observability (operators can't see failure modes).
- Ship MaxAttempts as a DB column. Rejected: adds a migration for a last-resort cap; in-memory tracker is simpler and the row-in-DB is already the queue of record.
Evidence
- `internal/conflict/merge_synthesis.go`: seam comment + retry policy + attempt tracker.
- `internal/conflict/merge_synthesis_metrics.go`: the three instruments with bounded cardinality.
- `internal/conflict/merge_synthesis_golden_test.go` + `testdata/merge_golden/*.json`: 5 locked shapes.
- `internal/conflict/merge_synthesis_retry_test.go`: backoff + attempt tracker unit tests.
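Bounded cardinality depends on label values coming from a closed set. One way to enforce that, sketched here without the Prometheus client so the example stays standard-library only, is to fold any unexpected failure reason into `unknown` before it reaches the counter. The reason values are copied from this ADR; `allowedReasons` and `normalizeReason` are hypothetical names, and the real metrics file may enforce the bound differently.

```go
package main

// allowedReasons mirrors the reason label values listed in this ADR.
var allowedReasons = map[string]struct{}{
	"set_org_id":            {},
	"load_row":              {},
	"router_resolve":        {},
	"synthesizer":           {},
	"invalid_json":          {},
	"write_merge":           {},
	"max_attempts_exceeded": {},
	"unknown":               {},
}

// normalizeReason returns reason unchanged when it belongs to the closed set
// and "unknown" otherwise, so the label can never create a ninth series.
func normalizeReason(reason string) string {
	if _, ok := allowedReasons[reason]; ok {
		return reason
	}
	return "unknown"
}
```

The normalized value would then be passed as the `reason` label when incrementing `synthesis_failures_total`.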