LLD 19 — Right-to-Deletion (RTD) Engine (Wave 4B)
| Field | Value |
|---|---|
| Parent HLD | #456 agent-runtime-wave-4b-enterprise-compliance.md |
| PRD | #380 (P4.8.5) |
| Issue | #462 |
| PR | #477 (merged 2b5ca26) |
| Doc backfill | #478 |
| Milestone | 9 |
| Wave | 4B |
| Size | XL (~2600 LoC impl + tests) |
| Depends on | LLD 18 (#469) classregistry, Wave 1 audit hash-chain |
| Parallel with | LLD 20 retention (#463), LLD 21 SBOM+SIEM (#464) |
Founder decisions (binding, from #380 comment 4254408367):
- Q1 (audit + usage aggregates): REDACT identifiers, RETAIN aggregate rows. 7-year tax floor trumps GDPR delete.
- Q2 (dual-control mode): STRICT. A second platform-admin must attest with a one-time HMAC token inside a 72h window. Same-admin attempts are rejected at both the service layer and the DB.
- Q5 (engine replicas): SINGLE. A PG advisory lock (upsquad.compliance_engine) is held for the entire daemon lifetime. A rolling restart uses Recreate so the lock transfers cleanly.
This LLD doc was backfilled after PR #477 merged (#478). The shipped code is the authoritative spec; this document is a human-readable companion and auditor trail.
1. Scope
Tenant-initiated, legally-attested purge of personal / operational / secret data across the 55+ scopes registered by LLD 18. The flow is:
    RequestDeletion (RPC) ──► AWAITING_ATTESTATION
                                   │
                                   │ AttestDeletion (RPC, different admin, <72h)
                                   ▼
                                QUEUED
                                   │
                                   │ compliance-engine tick (singleton)
                                   ▼
                                RUNNING
                                   │
                                   ├─► PURGE (deleters)
                                   │     │
                                   │     ▼
                                   ├─► VERIFY (re-scan; loop to PURGE 3x)
                                   │     │
                                   │     ▼
                                   ├─► REDACT (audit + usage identifiers)
                                   │     │
                                   │     ▼
                                   └─► CERTIFY (HMAC sign + S3 upload)
                                         │
                                         ▼
                                     SUCCEEDED
Every phase writes a row to erasure_phase_log. A crashed daemon restarts with resumeInFlight which re-reads the log and continues from the last non-success entry. The 30-day GDPR SLA is enforced by a per-hour ticker that pages at T-24h and breach-marks at T-0.
2. Data Model (migration 051)
Five tables. RLS on the three tenant-visible ones; platform-scope on the two internal ones.
| Table | Class | RLS | Notes |
|---|---|---|---|
| erasure_requests | Personal | yes | State-machine header. erasure_status + erasure_phase enums. |
| erasure_phase_log | Audit | yes | Per-phase checkpoint. outcome IN (pending, success, failure, skipped). Cascades on request delete. |
| erasure_certificates | Audit | yes | Immutable. REVOKE DELETE FROM PUBLIC. One row per request (UNIQUE). |
| audit_redaction_chain | Audit | platform | Parallel hash-chain for REDACT step. Crosses tenant salt so it cannot be RLS-scoped per-tenant. REVOKE UPDATE, DELETE. |
| erasure_attestation_tokens | Secret | platform | One-time Q2 tokens. Platform-scoped because only platform-admin can attest. token_sha256_hex is a CHAR(64) SHA-256 hash; the raw token is never persisted. |
Platform-scoped tables (audit_redaction_chain, erasure_attestation_tokens) are on the LLD 18 allow-list so the coverage gate stays green.
Feature flag: compliance.rtd_enabled (default false). Wave 4B ships the machinery; per-tenant rollout is staged once the engine is observed in dev + staging.
2.1 Index strategy
- ix_erasure_requests_ready: partial on (created_at ASC) WHERE status='queued' AND attested_at IS NOT NULL; drives ClaimNextRequest under FOR UPDATE SKIP LOCKED.
- ix_erasure_requests_sla: partial on sla_deadline WHERE status NOT IN (succeeded,failed,cancelled); SLA ticker hot path.
- ix_erasure_requests_salt_expiry: partial on completed_at WHERE salt_zeroed_at IS NULL AND status='succeeded'; salt sweeper hot path.
3. State Machine
domain.Status enum:
    awaiting_attestation → queued → running → succeeded
            │                          │
            │                          └─► failed
            └─► cancelled
domain.Phase enum (runs only while status = running):
purge → verify → redact → certify
Phases execute in the order returned by domain.AllPhases() — the slice is load-bearing (test TestAllPhases_OrderIsLoadBearing pins this). The Runner iterates AllPhases() for every request and refuses to skip a phase; crash-recovery re-enters at the last non-success phase, not at the next phase.
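The resume rule above can be sketched as follows. This is a minimal illustration, not the shipped resumeInFlight: it assumes the latest outcome per phase has already been read from erasure_phase_log into a map, and ResumePhase and that map shape are hypothetical. It also assumes that any outcome other than success causes re-entry.

```go
package main

import "fmt"

// Phase mirrors the four runner phases named in this section.
type Phase string

const (
	PhasePurge   Phase = "purge"
	PhaseVerify  Phase = "verify"
	PhaseRedact  Phase = "redact"
	PhaseCertify Phase = "certify"
)

// AllPhases returns the phases in execution order; the order is load-bearing.
func AllPhases() []Phase {
	return []Phase{PhasePurge, PhaseVerify, PhaseRedact, PhaseCertify}
}

// ResumePhase (hypothetical helper) returns the first phase, in AllPhases
// order, whose latest logged outcome is not "success". The boolean is false
// when every phase has already succeeded.
func ResumePhase(latest map[Phase]string) (Phase, bool) {
	for _, p := range AllPhases() {
		if latest[p] != "success" {
			return p, true
		}
	}
	return "", false
}

func main() {
	// PURGE finished, VERIFY crashed mid-flight: re-enter VERIFY, not REDACT.
	logged := map[Phase]string{PhasePurge: "success", PhaseVerify: "failure"}
	p, ok := ResumePhase(logged)
	fmt.Println(p, ok) // verify true
}
```

Scanning in AllPhases() order is what makes the slice load-bearing: reordering it would change where a crashed request re-enters.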
3.1 Timing invariants (domain.types.go)
| Constant | Value | Meaning |
|---|---|---|
| SLATotal | 30d | GDPR purge SLA |
| SLAPagingOffset | 24h | First page fires at T-24h |
| AttestationWindow | 72h | Max gap RequestDeletion → AttestDeletion |
| PhasePurgeMaxTime | 20d | Soft bound on PURGE phase |
| PhaseVerifyMaxTime | 5d | Soft bound on VERIFY |
| PhaseRedactMaxTime | 3d | Soft bound on REDACT |
| PhaseCertifyMaxTime | 1d | Soft bound on CERTIFY |
| VerifyMaxRetries | 3 | VERIFY → PURGE loop cap |
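The constants in this table can be expressed as Go durations, which also makes the arithmetic visible: the four phase soft bounds sum to 29d, leaving one day of slack inside the 30d SLA. The sketch below is illustrative only. SLAState, AttestationOpen, and the epoch variable are hypothetical, and it assumes the SLA deadline is the submission time plus SLATotal.

```go
package main

import (
	"fmt"
	"time"
)

const (
	day = 24 * time.Hour

	SLATotal            = 30 * day
	SLAPagingOffset     = 24 * time.Hour
	AttestationWindow   = 72 * time.Hour
	PhasePurgeMaxTime   = 20 * day
	PhaseVerifyMaxTime  = 5 * day
	PhaseRedactMaxTime  = 3 * day
	PhaseCertifyMaxTime = 1 * day
)

// epoch is an arbitrary submission time used by the demo below.
var epoch = time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)

// SLAState (hypothetical) classifies a request as "ok", "paging" (inside the
// last 24h before the deadline) or "breach" (at or past the deadline).
// Assumption: deadline = submitted + SLATotal.
func SLAState(submitted, now time.Time) string {
	deadline := submitted.Add(SLATotal)
	switch {
	case !now.Before(deadline):
		return "breach"
	case !now.Before(deadline.Add(-SLAPagingOffset)):
		return "paging"
	default:
		return "ok"
	}
}

// AttestationOpen (hypothetical) reports whether an attestation arriving at
// now is still strictly inside the 72h window.
func AttestationOpen(requested, now time.Time) bool {
	return now.Sub(requested) < AttestationWindow
}

func main() {
	budget := PhasePurgeMaxTime + PhaseVerifyMaxTime + PhaseRedactMaxTime + PhaseCertifyMaxTime
	fmt.Println(budget.Hours() / 24)                      // 29
	fmt.Println(SLAState(epoch, epoch.Add(29*day+time.Hour))) // paging
	fmt.Println(AttestationOpen(epoch, epoch.Add(72*time.Hour))) // false
}
```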
4. Founder Q2 — Strict Dual-Control
Two-admin attestation is enforced via a one-time HMAC token with a 72h TTL.
4.1 Token lifecycle
- RequestDeletion generates a random token, hashes it with SHA-256 (HashAttestationToken), stores the hash in erasure_attestation_tokens, and returns the raw token ONCE to the requester. The raw token never touches disk.
- AttestDeletion hashes the submitted token and calls Store.ConsumeAttestationToken(hash), a single DELETE ... RETURNING that atomically removes the token and returns the original (request_id, requester_member_id). A duplicate call finds no row and returns ErrTokenInvalid.
- Service-layer checks run AFTER consumption:
  - consumedRequestID != submittedRequestID → ErrTokenInvalid + dual_control_denied{reason=token_wrong_request} metric.
  - requesterID == attestingMemberID → ErrSameAdmin + dual_control_denied{reason=same_admin} metric.
- Store.AttestRequest flips status awaiting_attestation → queued under optimistic concurrency (returns ErrAttestInvalid on any other status).
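The lifecycle above can be sketched with an in-memory stand-in for the token table. This is a hedged illustration, not the shipped code: memStore, tokenRow, NewToken, and Attest are hypothetical, the 32-byte token size is an assumption, and the map delete only imitates the atomicity that DELETE ... RETURNING provides in Postgres.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

var (
	ErrTokenInvalid = errors.New("attestation token invalid")
	ErrSameAdmin    = errors.New("attester must differ from requester")
)

// HashAttestationToken returns the hex SHA-256 of the raw token (64 chars).
func HashAttestationToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// NewToken returns a random hex token. The 32-byte size is an assumption.
func NewToken() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

type tokenRow struct{ RequestID, RequesterID string }

// memStore is a hypothetical stand-in for erasure_attestation_tokens.
type memStore struct {
	mu   sync.Mutex
	rows map[string]tokenRow // keyed by token hash; raw tokens are never stored
}

// Consume removes and returns the row in one step, like DELETE ... RETURNING.
func (s *memStore) Consume(hash string) (tokenRow, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	row, ok := s.rows[hash]
	if !ok {
		return tokenRow{}, ErrTokenInvalid
	}
	delete(s.rows, hash)
	return row, nil
}

// Attest consumes the token first, then applies the service-layer checks.
func Attest(s *memStore, raw, requestID, attesterID string) error {
	row, err := s.Consume(HashAttestationToken(raw))
	if err != nil {
		return err
	}
	if row.RequestID != requestID {
		return ErrTokenInvalid
	}
	if row.RequesterID == attesterID {
		return ErrSameAdmin
	}
	return nil
}

func main() {
	s := &memStore{rows: map[string]tokenRow{}}
	raw, err := NewToken()
	if err != nil {
		panic(err)
	}
	s.rows[HashAttestationToken(raw)] = tokenRow{RequestID: "req-1", RequesterID: "alice"}
	fmt.Println(Attest(s, raw, "req-1", "bob")) // <nil>
	fmt.Println(Attest(s, raw, "req-1", "bob")) // attestation token invalid
}
```

Because consumption precedes the checks, a failed same-admin attempt in this sketch also burns the token.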
4.2 Defense in depth
- Service layer rejects same-admin before flipping the request.
- Store layer: AttestRequest is a WHERE status='awaiting_attestation' AND requester_member_id != $attester predicate, so a race that made it past the service check still fails at the DB.
- Token rows are DELETE-once, so a replay of a captured token after consumption returns ErrTokenInvalid.
4.3 Token expiry
The salt-sweeper cron also calls Store.ExpireAttestationTokens: any token past expires_at is removed on the next sweeper tick (6h per §10.1). An expired but uncollected token is indistinguishable from a consumed one.
5. Founder Q1 — Aggregate-Retention Path
usage_records + llm_usage_events are ClassAudit. They MUST NOT be deleted (7-year tax retention). Instead, phase 3 REDACT:
- Invokes redactor.UsageRedactor (one per audit table), which UPDATEs every identifier column (user_id, agent_id, session_id, email, phone, ...) to HMAC-SHA256 pseudonyms keyed on the per-request 32-byte salt.
- Appends a row to audit_redaction_chain with prev_hash = previous chain head and row_hash = sha256(prev || newChainHead), where newChainHead is the redactor's running digest. Integrity verification walks this chain independently of tenant data.
- Salt is zeroed 30 days after completed_at by the daemon's salt-expiry sweeper. After zeroing, the HMAC mapping is mathematically irreversible: the pseudonyms remain valid for aggregate queries but the original identifiers cannot be recovered.
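The two primitives in the REDACT step can be sketched as below, assuming hex-encoded pseudonyms and a 32-zero-byte genesis hash (the genesis value is taken from the chain_test.go description in §13). Pseudonym and RowHash are hypothetical names, and the "running digest" here is a placeholder value.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Pseudonym (hypothetical) maps an identifier to HMAC-SHA256 keyed on the
// per-request salt. Hex encoding of the result is an assumption.
func Pseudonym(salt []byte, identifier string) string {
	mac := hmac.New(sha256.New, salt)
	mac.Write([]byte(identifier))
	return hex.EncodeToString(mac.Sum(nil))
}

// RowHash (hypothetical) computes sha256(prev || newChainHead).
func RowHash(prev, newChainHead []byte) []byte {
	h := sha256.New()
	h.Write(prev)
	h.Write(newChainHead)
	return h.Sum(nil)
}

func main() {
	salt := []byte("0123456789abcdef0123456789abcdef") // 32-byte demo salt

	// The same identifier always maps to the same pseudonym under one salt,
	// so aggregate GROUP BY queries keep working after redaction.
	fmt.Println(Pseudonym(salt, "user-42") == Pseudonym(salt, "user-42")) // true

	genesis := make([]byte, 32)                    // 32 zero bytes
	digest := sha256.Sum256([]byte("demo-batch-1")) // placeholder running digest
	head := RowHash(genesis, digest[:])
	fmt.Println(len(head)) // 32
}
```

Once the salt is zeroed, nobody can recompute Pseudonym for a candidate identifier, which is what prevents re-linking pseudonyms to people.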
5.1 Redactor contract
type RedactorImpl interface {
Redact(ctx context.Context, orgID string, salt []byte) (rowsRedacted int64, newChainHead []byte, err error)
}
Identifiers are enumerated via the same column-allow-list used by SIEM drop-sensitive transforms (LLD 21). The redactor never touches numeric / aggregate columns (tokens_in, tokens_out, cost_usd).
6. Founder Q5 — Single-Replica Singleton
internal/compliance/singleton.go — PG advisory lock keyed on FNV-1a("upsquad.compliance_engine"). Pattern copied from approval/scheduler_lease.go and subagent/coordinator/lease.go. Unique key per daemon so the three singletons (approval-scheduler, subagent-coordinator, compliance-engine) run independently on the same cluster.
- AcquireLease polls pg_try_advisory_lock($key) every 15s until acquired or ctx cancelled.
- The held *pgxpool.Conn is the lock owner: on crash or process exit, the session closes and PG releases the lock automatically (no stale state).
- Release runs pg_advisory_unlock + conn.Release() on shutdown.
- Recreate deployment strategy prevents rolling-update overlap: the new pod will block at AcquireLease until the old pod exits.
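The key derivation can be sketched with the standard library alone. This assumes the 64-bit FNV-1a variant (the text does not state the width) reinterpreted as a signed int64, since PostgreSQL advisory-lock functions take a bigint key. LeaseKey and the second lease name are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// LeaseKey (hypothetical) derives a signed 64-bit advisory-lock key from a
// lease name using FNV-1a. The uint64 hash is reinterpreted as int64 so it
// fits a PostgreSQL bigint parameter.
func LeaseKey(name string) int64 {
	h := fnv.New64a()
	h.Write([]byte(name)) // hash.Hash.Write never returns an error
	return int64(h.Sum64())
}

func main() {
	engine := LeaseKey("upsquad.compliance_engine")
	// "upsquad.example_other_daemon" is an illustrative name, not a real lease.
	other := LeaseKey("upsquad.example_other_daemon")

	fmt.Println(engine == LeaseKey("upsquad.compliance_engine")) // true
	fmt.Println(engine != other)                                 // true
}
```

The resulting value is what would be bound to $key in pg_try_advisory_lock($key). Deriving it from a name keeps the key stable across builds while giving each daemon its own lock.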
The retention sweeper (LLD 20) and SIEM worker (LLD 21, when embedded) share this same single-replica guarantee — they run as goroutines inside the daemon that already holds the lease.
7. HMAC Certificate Chain (Phase 4)
HMACCertifier.Mint produces:
- Canonical body: JSON payload with request_id, org_id, dry_run, reason, submitted_at, attested_at, started_at, certified_at, requester_member_id, attesting_member_id, sla_deadline, per-phase rows-affected summaries, cert_version = "1.0". Keys sorted inside buildPhasePayload; time.RFC3339Nano for every timestamp.
- Digest: sha256(body), stored as body_sha256_hex.
- Signature: HMAC-SHA256(body, releaseKey). The COMPLIANCE_RELEASE_KEY env var is a platform secret; key rotation via COMPLIANCE_RELEASE_KEY_ID (default release-2026-q2).
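A minimal sketch of the mint-and-verify flow follows. It is not HMACCertifier itself: Cert, Mint, and Verify are hypothetical, the payload carries only a few of the listed fields, and it relies on encoding/json emitting map keys in sorted order to get a deterministic body, which may differ from how buildPhasePayload canonicalises.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Cert (hypothetical) holds the three artefacts described above.
type Cert struct {
	Body          []byte
	BodySHA256Hex string
	SignatureHex  string
	KeyID         string
}

// Mint serialises the payload, digests it, and signs it with HMAC-SHA256.
func Mint(payload map[string]any, releaseKey []byte, keyID string) (Cert, error) {
	body, err := json.Marshal(payload) // map keys are written in sorted order
	if err != nil {
		return Cert{}, err
	}
	digest := sha256.Sum256(body)
	mac := hmac.New(sha256.New, releaseKey)
	mac.Write(body)
	return Cert{
		Body:          body,
		BodySHA256Hex: hex.EncodeToString(digest[:]),
		SignatureHex:  hex.EncodeToString(mac.Sum(nil)),
		KeyID:         keyID,
	}, nil
}

// Verify recomputes the HMAC over the stored body and compares in constant time.
func Verify(c Cert, releaseKey []byte) bool {
	want, err := hex.DecodeString(c.SignatureHex)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, releaseKey)
	mac.Write(c.Body)
	return hmac.Equal(mac.Sum(nil), want)
}

func main() {
	key := []byte("dev-only-demo-key") // stands in for COMPLIANCE_RELEASE_KEY
	payload := map[string]any{
		"request_id":   "req-1",
		"org_id":       "org-1",
		"dry_run":      false,
		"cert_version": "1.0",
	}
	cert, err := Mint(payload, key, "release-2026-q2")
	if err != nil {
		panic(err)
	}
	fmt.Println(Verify(cert, key))                  // true
	fmt.Println(Verify(cert, []byte("wrong-key"))) // false
}
```

An auditor holding the release key only needs the stored body and signature to repeat the Verify step independently.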
Upload path: CertStorage.Put writes the body to long-retention S3 with s3:ObjectLock COMPLIANCE mode + 10y retention (dev uses InMemoryCertStorage). The returned (bucket, key, etag, lockUntil) is stored in erasure_certificates for auditor verification.
HMAC is deliberate for Wave 4B — cosign-signed follow-up is tracked on the 30-day SBOM-signing milestone (LLD 21 §2.4).
8. Cancellation Semantics (LLD §3.6)
Per-state cancel rules enforced by Service.CancelRequest:
| Status | Phase | Cancellable |
|---|---|---|
| queued | - | yes |
| awaiting_attestation | - | yes |
| running | purge (0 scopes done) | yes |
| running | purge (any scope done) | no (ErrForbiddenCancel) |
| running | verify / redact / certify | no |
| succeeded / failed / cancelled | - | no |
Rationale: once any PURGE scope has completed, the purge is no longer reversible; continuing to CERTIFY produces a defensible audit trail. ErrForbiddenCancel maps to gRPC FailedPrecondition.
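The table reduces to a small predicate. The sketch below is illustrative: CanCancel is a hypothetical helper, statuses and phases are plain strings for brevity, and returning ErrForbiddenCancel for every "no" row is an assumption (the table names that error only for the purge row).

```go
package main

import (
	"errors"
	"fmt"
)

var ErrForbiddenCancel = errors.New("cancel forbidden in current state")

// CanCancel (hypothetical) returns nil when the request may be cancelled.
// purgeScopesDone is the number of PURGE scopes already completed.
func CanCancel(status, phase string, purgeScopesDone int) error {
	switch status {
	case "queued", "awaiting_attestation":
		return nil
	case "running":
		// Only a PURGE that has not completed any scope is still reversible.
		if phase == "purge" && purgeScopesDone == 0 {
			return nil
		}
	}
	return ErrForbiddenCancel
}

func main() {
	fmt.Println(CanCancel("running", "purge", 0))  // <nil>
	fmt.Println(CanCancel("running", "purge", 1))  // cancel forbidden in current state
	fmt.Println(CanCancel("succeeded", "", 0))     // cancel forbidden in current state
}
```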
9. ComplianceService RPCs
All RPCs registered on the agent-orchestrator gRPC port. The daemon (cmd/compliance-engine) does NOT serve RPCs — it only drives the runner + sweepers.
| RPC | Auth | Maps to |
|---|---|---|
| RequestDeletion | platform-admin | Service.RequestDeletion |
| AttestDeletion | platform-admin (different from requester) | Service.AttestDeletion |
| GetStatus | tenant-admin (same org) | Service.GetStatus |
| ListRequests | tenant-admin (same org) | Service.ListRequests |
| CancelRequest | platform-admin (same org) | Service.CancelRequest |
| GetScopeCoverage | tenant-admin (same org) | Service.GetScopeCoverage |
9.1 Error mapping (mapError)
| Service error | gRPC code |
|---|---|
| ErrClearance | PermissionDenied |
| ErrForbiddenCancel | FailedPrecondition |
| ErrSameAdmin | PermissionDenied |
| ErrTokenInvalid | Unauthenticated |
| store.ErrAttestInvalid | FailedPrecondition |
| store.ErrNotFound | NotFound |
| default | Internal |
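The mapping can be sketched without the gRPC dependency by returning the code names as strings. This is an illustration only: the sentinel errors are redeclared locally, matching with errors.Is (so wrapped errors still map correctly) is an assumption about the real mapError, and the nil case is added for completeness.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the service and store sentinel errors.
var (
	ErrClearance       = errors.New("insufficient clearance")
	ErrForbiddenCancel = errors.New("cancel forbidden")
	ErrSameAdmin       = errors.New("same admin")
	ErrTokenInvalid    = errors.New("token invalid")
	ErrAttestInvalid   = errors.New("attest invalid")
	ErrNotFound        = errors.New("not found")
)

// mapError returns the gRPC code name for a service error, following the table.
func mapError(err error) string {
	switch {
	case err == nil:
		return "OK"
	case errors.Is(err, ErrClearance), errors.Is(err, ErrSameAdmin):
		return "PermissionDenied"
	case errors.Is(err, ErrForbiddenCancel), errors.Is(err, ErrAttestInvalid):
		return "FailedPrecondition"
	case errors.Is(err, ErrTokenInvalid):
		return "Unauthenticated"
	case errors.Is(err, ErrNotFound):
		return "NotFound"
	default:
		return "Internal"
	}
}

func main() {
	wrapped := fmt.Errorf("attest: %w", ErrSameAdmin)
	fmt.Println(mapError(wrapped))                 // PermissionDenied
	fmt.Println(mapError(errors.New("db timeout"))) // Internal
}
```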
9.2 Cross-tenant guard
RequestDeletion and every read RPC check the request body's org_id against the JWT-derived org (checkOrgMatch) — mismatch returns PermissionDenied "cross-tenant RTD forbidden". JWT is the source of truth; the body field exists only for back-compat with pre-auth clients.
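The guard amounts to a strict comparison in which the JWT-derived org always wins. A minimal sketch, assuming a plain sentinel error in place of the gRPC status and a strict mismatch rule (how the real checkOrgMatch treats an empty body field is not stated here):

```go
package main

import (
	"errors"
	"fmt"
)

var ErrCrossTenant = errors.New("cross-tenant RTD forbidden")

// checkOrgMatch rejects any request whose body org_id differs from the org
// derived from the caller's JWT. The body value is never trusted on its own.
func checkOrgMatch(jwtOrgID, bodyOrgID string) error {
	if bodyOrgID != jwtOrgID {
		return ErrCrossTenant
	}
	return nil
}

func main() {
	fmt.Println(checkOrgMatch("org-1", "org-1")) // <nil>
	fmt.Println(checkOrgMatch("org-1", "org-2")) // cross-tenant RTD forbidden
}
```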
10. Wiring (per shelfware gate)
10.1 cmd/compliance-engine/main.go constructs
| Entity | Construction |
|---|---|
| compliance.AcquireLease | Singleton gate (founder Q5). Blocks until acquired. |
| compliance.NewMetrics | OTel bundle (compliance_rtd_*). |
| store.New(pool) | Tenant-aware pgxpool wrapper. |
| classregistry.WireDependenciesWithVerifiers(deps) | Replaces LLD 18 stubs with real deleters.NewPGDeleter / NewVaultDeleter / NewRedisDeleter / NewS3Deleter(NoopObjectStore) and redactor.NewAuditRedactor / NewUsageRedactor. |
| rtd.HMACCertifier | SecretKey = COMPLIANCE_RELEASE_KEY, Storage = InMemoryCertStorage (dev) / S3 (prod, LLD 21 follow-up). |
| rtd.NewRunner(cfg) | Poll loop on 15s tick. |
| runSaltSweeper | 6h tick: zeroes salts at T+30d, expires dangling attestation tokens. |
| runSLATicker | 1h tick: pages at T-24h, breach-marks at T-0. |
10.2 cmd/agent-orchestrator/main.go constructs
| Entity | Construction |
|---|---|
| compliance.NewMetrics | Reused OTel bundle. |
| compliancestore.New(pool) | Intake path only; daemon owns the runner. |
| compliance.NewService(cfg) | Intake service. |
| compliance.NewGRPCServer(svc, authLookup) | Registers six RTD RPCs. WithSIEMService(siemSvc) attaches LLD 21 RPCs when vault is wired. |
10.3 internal/runtime/server/grpc.go registers
ComplianceService: complianceGRPC on the runtime gRPC server — the same port that serves LifecycleService, ApprovalService, etc.
10.4 Non-shelfware invariants
Every exported constructor has ≥1 production call site:
| Export | Caller(s) |
|---|---|
| compliance.NewService | cmd/agent-orchestrator/main.go |
| compliance.NewGRPCServer | cmd/agent-orchestrator/main.go |
| compliance.AcquireLease | cmd/compliance-engine/main.go |
| rtd.NewRunner | cmd/compliance-engine/main.go |
| rtd.HMACCertifier | cmd/compliance-engine/main.go |
| rtd.NewInMemoryCertStorage | cmd/compliance-engine/main.go (dev path) |
| classregistry.WireDependenciesWithVerifiers | cmd/compliance-engine/main.go |
| deleters.NewPGDeleter / NewVaultDeleter / NewRedisDeleter / NewS3Deleter | cmd/compliance-engine/main.go (via factory) |
| redactor.NewAuditRedactor / NewUsageRedactor | cmd/compliance-engine/main.go (via factory) |
11. Metrics (compliance_rtd_*)
| Instrument | Type | Attributes |
|---|---|---|
| compliance_rtd_requests_total | Int64Counter | phase, outcome |
| compliance_rtd_phase_duration_seconds | Float64Histogram | phase |
| compliance_erasure_sla_breach_total | Int64Counter | kind (paging / breach) |
| compliance_rtd_sla_compliance_ratio | Float64Gauge | - |
| compliance_rtd_dual_control_denied_total | Int64Counter | reason (token_invalid / token_wrong_request / same_admin / wrong_status) |
| compliance_rtd_salt_zeroed_total | Int64Counter | - |
Histogram buckets: 1s, 5s, 30s, 2m, 10m, 1h, 6h, 1d, 7d, 30d — designed for the long-tail of the 30-day SLA.
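Since the histogram is recorded in seconds, the bucket list translates to explicit boundaries as sketched below. PhaseDurationBuckets is a hypothetical helper; how the boundaries are handed to the OTel SDK is left out.

```go
package main

import (
	"fmt"
	"time"
)

// PhaseDurationBuckets (hypothetical) returns the ten bucket boundaries
// listed above, converted to seconds.
func PhaseDurationBuckets() []float64 {
	durations := []time.Duration{
		time.Second, 5 * time.Second, 30 * time.Second,
		2 * time.Minute, 10 * time.Minute,
		time.Hour, 6 * time.Hour,
		24 * time.Hour, 7 * 24 * time.Hour, 30 * 24 * time.Hour,
	}
	out := make([]float64, len(durations))
	for i, d := range durations {
		out[i] = d.Seconds()
	}
	return out
}

func main() {
	for _, b := range PhaseDurationBuckets() {
		fmt.Printf("%.0f\n", b) // 1, 5, 30, 120, 600, 3600, 21600, 86400, 604800, 2592000
	}
}
```

The top boundary equals SLATotal in seconds (2,592,000), so a phase that lands in the overflow bucket has by definition outlived the SLA.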
12. Audit Events
Emitted via phases.AuditEmitter (nil-safe, best-effort). All events carry request_id + dry_run detail.
| Action | Emitted by | Notes |
|---|---|---|
| rtd_submitted | Service.RequestDeletion | Logged via slog, not emitted to chain by default. |
| rtd_attested | Service.AttestDeletion | slog only. |
| rtd_certified | phases.Certify | Includes cert_sha256. |
| rtd_succeeded | Runner (final) | Terminal success. |
| rtd_failed | Runner.failRequest | Includes reason. |
Wiring the emitter to the Wave 1 audit hash-chain is a 30-day follow-up; for Wave 4B the events flow via the structured logger only.
13. Tests
Package-level unit tests (run under -race):
- attestation_test.go: token entropy, hash determinism, plain-SHA256 form.
- singleton_test.go: lease key stability; no collision with approval / subagent leases.
- types_test.go: AllPhases() ordering; 72h attestation window; SLA total.
- service_test.go / configure_retention_test.go: dual-control branches, cancel gating, retention clamp.
- certifier_test.go: canonical-JSON determinism; HMAC signature matches independent verification.
- rtd/redactor/chain_test.go: SHA-256 chaining; 32-zero genesis.
- rtd/deleters/ident_test.go: ValidateIdent rejects SQL metacharacters + reserved words.
Integration suite (under integration build tag, real Postgres): test/integration/wave4/ — full runner loop, resume-in-flight after kill, dual-control happy path, cross-tenant guard.
14. Non-Goals (Wave 4B)
- S3 deleter: deleters.NewS3Deleter wraps NoopObjectStore for Wave 4B; real S3 client lands with LLD 21 object-store infra.
- Cosign certificate signing: HMAC is the Wave 4B ceiling; cosign/Sigstore keyless is the 30-day follow-up (same milestone as SBOM signing).
- Cross-region replication: certificates land in a single bucket; geo-replication is out of scope.
- Channel-adapter audit emission: AuditEmitter is stub-wired; full plumbing into the Wave 1 hash-chain is an LLD 21 parallel task.
- UI / console surface: the three portal pages (request, attest, status) are separate frontend issues; backend RPCs are stable.
- External TPI (third-party integrator) webhooks: planned for Wave 5 once the SIEM worker pattern (LLD 21) is in steady-state.