LLD 19 — Right-to-Deletion (RTD) Engine (Wave 4B)

| Field | Value |
|---|---|
| Parent HLD | #456 agent-runtime-wave-4b-enterprise-compliance.md |
| PRD | #380 (P4.8.5) |
| Issue | #462 |
| PR | #477 (merged 2b5ca26) |
| Doc backfill | #478 |
| Milestone | 9 |
| Wave | 4B |
| Size | XL (~2600 LoC impl + tests) |
| Depends on | LLD 18 (#469) classregistry, Wave 1 audit hash-chain |
| Parallel with | LLD 20 retention (#463), LLD 21 SBOM+SIEM (#464) |

Founder decisions (binding, from #380 comment 4254408367):

  • Q1 (Audit + usage aggregates): REDACT identifiers, RETAIN aggregate rows. 7-year tax floor trumps GDPR delete.
  • Q2 (Dual-control mode): STRICT. A second platform-admin must attest with a one-time HMAC token inside a 72h window. Same-admin attempts are rejected at both the service layer and the DB.
  • Q5 (Engine replicas): SINGLE. A PG advisory lock (upsquad.compliance_engine) is held for the entire daemon lifetime. The deployment uses the Recreate strategy instead of a rolling update, so the old pod exits before the new one starts and the lock transfers cleanly.

This LLD doc was backfilled after PR #477 merged (#478). The shipped code is the authoritative spec; this document is a human-readable companion and auditor trail.


1. Scope

Tenant-initiated, legally-attested purge of personal / operational / secret data across the 55+ scopes registered by LLD 18. The flow is:

RequestDeletion(RPC) ──► AWAITING_ATTESTATION
                               │ AttestDeletion(RPC, different admin, <72h)
                               ▼
                            QUEUED
                               │ compliance-engine tick (singleton)
                               ▼
                            RUNNING
                               ├─► PURGE (deleters)
                               │     │
                               │     ▼
                               ├─► VERIFY (re-scan; loop to PURGE 3x)
                               │     │
                               │     ▼
                               ├─► REDACT (audit + usage identifiers)
                               │     │
                               │     ▼
                               └─► CERTIFY (HMAC sign + S3 upload)
                                     │
                                     ▼
                                 SUCCEEDED

Every phase writes a row to erasure_phase_log. A crashed daemon restarts with resumeInFlight which re-reads the log and continues from the last non-success entry. The 30-day GDPR SLA is enforced by a per-hour ticker that pages at T-24h and breach-marks at T-0.


2. Data Model (migration 051)

Five tables. RLS on the three tenant-visible ones; platform-scope on the two internal ones.

| Table | Class | RLS | Notes |
|---|---|---|---|
| erasure_requests | Personal | yes | State-machine header. erasure_status + erasure_phase enums. |
| erasure_phase_log | Audit | yes | Per-phase checkpoint. outcome IN (pending, success, failure, skipped). Cascades on request delete. |
| erasure_certificates | Audit | yes | Immutable. REVOKE DELETE FROM PUBLIC. One row per request (UNIQUE). |
| audit_redaction_chain | Audit | platform | Parallel hash-chain for REDACT step. Crosses tenant salt so it cannot be RLS-scoped per-tenant. REVOKE UPDATE, DELETE. |
| erasure_attestation_tokens | Secret | platform | One-time Q2 tokens. Platform-scoped because only platform-admin can attest. token_sha256_hex is a CHAR(64) SHA-256 hash; the raw token is never persisted. |

Platform-scoped tables (audit_redaction_chain, erasure_attestation_tokens) are on the LLD 18 allow-list so the coverage gate stays green.

Feature flag: compliance.rtd_enabled (default false). Wave 4B ships the machinery; per-tenant rollout is staged once the engine is observed in dev + staging.

2.1 Index strategy

  • ix_erasure_requests_ready — partial on (created_at ASC) WHERE status='queued' AND attested_at IS NOT NULL — drives ClaimNextRequest under FOR UPDATE SKIP LOCKED.
  • ix_erasure_requests_sla — partial on sla_deadline WHERE status NOT IN (succeeded,failed,cancelled) — SLA ticker hot path.
  • ix_erasure_requests_salt_expiry — partial on completed_at WHERE salt_zeroed_at IS NULL AND status='succeeded' — salt sweeper hot path.

3. State Machine

domain.Status enum:

awaiting_attestation → queued → running → succeeded
│                               │
│                               └─► failed
└─► cancelled

domain.Phase enum (runs only while status = running):

purge → verify → redact → certify

Phases execute in the order returned by domain.AllPhases() — the slice is load-bearing (test TestAllPhases_OrderIsLoadBearing pins this). The Runner iterates AllPhases() for every request and refuses to skip a phase; crash-recovery re-enters at the last non-success phase, not at the next phase.
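
The resume rule can be sketched as a lookup over the latest logged outcome per phase. This is a minimal illustration, not the shipped resumeInFlight: ResumePhase, the map-based log view, and the string values of the enums are assumptions, and the sketch treats every outcome other than success as unfinished.

```go
package main

import "fmt"

// Phase and Outcome mirror the enums named in this document; the string
// values are assumptions made for illustration.
type Phase string
type Outcome string

const (
	PhasePurge   Phase = "purge"
	PhaseVerify  Phase = "verify"
	PhaseRedact  Phase = "redact"
	PhaseCertify Phase = "certify"

	OutcomePending Outcome = "pending"
	OutcomeSuccess Outcome = "success"
	OutcomeFailure Outcome = "failure"
)

// AllPhases returns the phases in execution order. The order is load-bearing.
func AllPhases() []Phase {
	return []Phase{PhasePurge, PhaseVerify, PhaseRedact, PhaseCertify}
}

// ResumePhase (hypothetical helper) returns the first phase, in AllPhases
// order, whose latest logged outcome is not success. ok is false when every
// phase has already succeeded and nothing is left to resume.
func ResumePhase(latest map[Phase]Outcome) (phase Phase, ok bool) {
	for _, p := range AllPhases() {
		if latest[p] != OutcomeSuccess {
			return p, true
		}
	}
	return "", false
}

func main() {
	// The daemon crashed while VERIFY was still pending.
	phaseLog := map[Phase]Outcome{
		PhasePurge:  OutcomeSuccess,
		PhaseVerify: OutcomePending,
	}
	p, ok := ResumePhase(phaseLog)
	fmt.Println(p, ok) // verify true
}
```

Because phases are never skipped, the first phase without a success entry is also the last non-success entry in the log, so the runner re-enters VERIFY here and does not jump ahead to REDACT.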

3.1 Timing invariants (domain.types.go)

| Constant | Value | Meaning |
|---|---|---|
| SLATotal | 30d | GDPR purge SLA |
| SLAPagingOffset | 24h | First page fires at T-24h |
| AttestationWindow | 72h | Max gap RequestDeletion → AttestDeletion |
| PhasePurgeMaxTime | 20d | Soft bound on PURGE phase |
| PhaseVerifyMaxTime | 5d | Soft bound on VERIFY |
| PhaseRedactMaxTime | 3d | Soft bound on REDACT |
| PhaseCertifyMaxTime | 1d | Soft bound on CERTIFY |
| VerifyMaxRetries | 3 | VERIFY → PURGE loop cap |
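
A sketch of how these constants relate, assuming they are plain time.Duration values. The constant names and values come from the table; SLAState and ClassifySLA are hypothetical names used only to illustrate the T-24h / T-0 thresholds.

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// Names and values copied from the timing table.
const (
	SLATotal            = 30 * day
	SLAPagingOffset     = 24 * time.Hour
	AttestationWindow   = 72 * time.Hour
	PhasePurgeMaxTime   = 20 * day
	PhaseVerifyMaxTime  = 5 * day
	PhaseRedactMaxTime  = 3 * day
	PhaseCertifyMaxTime = 1 * day
	VerifyMaxRetries    = 3
)

// SLAState is a hypothetical classification used by this sketch only.
type SLAState string

const (
	SLAOK       SLAState = "ok"
	SLAPaging   SLAState = "paging"
	SLABreached SLAState = "breach"
)

// ClassifySLA takes the time remaining until sla_deadline and reports
// whether the request is past T-0, inside the T-24h paging window, or fine.
func ClassifySLA(remaining time.Duration) SLAState {
	switch {
	case remaining <= 0:
		return SLABreached
	case remaining <= SLAPagingOffset:
		return SLAPaging
	default:
		return SLAOK
	}
}

func main() {
	// The four soft bounds sum to 29 days, one day short of the SLA.
	budget := PhasePurgeMaxTime + PhaseVerifyMaxTime + PhaseRedactMaxTime + PhaseCertifyMaxTime
	fmt.Println(budget, SLATotal-budget) // 696h0m0s 24h0m0s

	deadline := time.Date(2026, 5, 1, 0, 0, 0, 0, time.UTC)
	now := deadline.Add(-6 * time.Hour)
	fmt.Println(ClassifySLA(deadline.Sub(now))) // paging
}
```

The one day of headroom left by the per-phase bounds coincides with the paging offset, so a request that exhausts every soft bound reaches the paging window just as CERTIFY is due to finish.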

4. Founder Q2 — Strict Dual-Control

Two-admin attestation is enforced via a one-time HMAC token with a 72h TTL.

4.1 Token lifecycle

  1. RequestDeletion generates a random token, hashes it with SHA-256 (HashAttestationToken), stores the hash in erasure_attestation_tokens, returns the raw token ONCE to the requester. The raw token never touches disk.
  2. AttestDeletion hashes the submitted token and calls Store.ConsumeAttestationToken(hash) — a single DELETE ... RETURNING that atomically removes the token and returns the original (request_id, requester_member_id). A duplicate call finds no row and returns ErrTokenInvalid.
  3. The service layer runs its checks after the token has been consumed:
    • consumedRequestID != submittedRequestID → ErrTokenInvalid + dual_control_denied{reason=token_wrong_request} metric.
    • requesterID == attestingMemberID → ErrSameAdmin + dual_control_denied{reason=same_admin} metric.
  4. Store.AttestRequest flips status awaiting_attestation → queued under optimistic concurrency (returns ErrAttestInvalid on any other status).
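
Step 1 can be illustrated with the standard library alone. HashAttestationToken is the name used in this document, but the body shown here is an assumption; NewAttestationToken, the 32-byte token length, and the hex encoding are hypothetical choices for the sketch.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// NewAttestationToken (hypothetical name) returns 32 bytes from the OS
// CSPRNG, hex-encoded. The length is an assumption for this sketch.
func NewAttestationToken() (string, error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return "", err
	}
	return hex.EncodeToString(raw), nil
}

// HashAttestationToken returns the lowercase hex SHA-256 digest of the raw
// token. The result is always 64 characters, which fits a CHAR(64) column.
func HashAttestationToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

func main() {
	raw, err := NewAttestationToken()
	if err != nil {
		panic(err)
	}
	hash := HashAttestationToken(raw)

	// Only hash would be written to erasure_attestation_tokens;
	// raw is handed to the requester once and then discarded.
	fmt.Println(len(raw), len(hash)) // 64 64
}
```

An unsalted SHA-256 is adequate here because the input is a high-entropy random value, not a user-chosen secret.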

4.2 Defense in depth

  • Service layer rejects same-admin before flipping the request.
  • Store layer AttestRequest is a WHERE status='awaiting_attestation' AND requester_member_id != $attester predicate — a race that made it past the service check still fails at the DB.
  • Token rows are DELETE-once, so a replay of a captured token after consumption returns ErrTokenInvalid.
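
The consume-then-check ordering can be sketched with an in-memory stand-in for the token table. memTokenStore, tokenRow, and Attest are hypothetical; the real store issues a DELETE ... RETURNING against Postgres, and the real service also emits metrics and calls Store.AttestRequest.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	ErrTokenInvalid = errors.New("attestation token invalid")
	ErrSameAdmin    = errors.New("attester must differ from requester")
)

// tokenRow is what the consume step hands back to the service layer.
type tokenRow struct {
	RequestID         string
	RequesterMemberID string
}

// memTokenStore is an in-memory stand-in for erasure_attestation_tokens,
// keyed by token hash.
type memTokenStore struct {
	mu   sync.Mutex
	rows map[string]tokenRow
}

// Consume removes the row and returns it in one step, mimicking
// DELETE ... RETURNING. A second call for the same hash finds nothing.
func (s *memTokenStore) Consume(hash string) (tokenRow, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	row, ok := s.rows[hash]
	if !ok {
		return tokenRow{}, ErrTokenInvalid
	}
	delete(s.rows, hash)
	return row, nil
}

// Attest applies the service-layer checks after consumption.
func Attest(s *memTokenStore, hash, requestID, attesterID string) error {
	row, err := s.Consume(hash)
	if err != nil {
		return err
	}
	if row.RequestID != requestID {
		return ErrTokenInvalid
	}
	if row.RequesterMemberID == attesterID {
		return ErrSameAdmin
	}
	return nil // the real flow would now flip awaiting_attestation → queued
}

func main() {
	s := &memTokenStore{rows: map[string]tokenRow{
		"h1": {RequestID: "req-1", RequesterMemberID: "alice"},
	}}
	fmt.Println(Attest(s, "h1", "req-1", "bob")) // <nil>
	fmt.Println(Attest(s, "h1", "req-1", "bob")) // attestation token invalid
}
```

One consequence of checking after consumption is that a rejected same-admin attempt still burns the token in this sketch.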

4.3 Token expiry

The salt-sweeper cron also calls Store.ExpireAttestationTokens: any token past expires_at is removed on the sweeper's next tick (6h, per §10.1). An expired but uncollected token is indistinguishable from a consumed one.


5. Founder Q1 — Aggregate-Retention Path

usage_records + llm_usage_events are ClassAudit. They MUST NOT be deleted (7-year tax retention). Instead, phase 3 REDACT:

  1. Invokes redactor.UsageRedactor (one per audit table) which UPDATEs every identifier column (user_id, agent_id, session_id, email, phone, ...) to HMAC-SHA256 pseudonyms keyed on the per-request 32-byte salt.
  2. Appends a row to audit_redaction_chain with prev_hash = previous chain head, row_hash = sha256(prev || newChainHead) where newChainHead is the redactor's running digest. Integrity verification walks this chain independently of tenant data.
  3. Salt is zeroed 30 days after completed_at by the daemon's salt-expiry sweeper. After zeroing, the HMAC mapping is mathematically irreversible — the pseudonyms remain valid for aggregate queries but the original identifiers cannot be recovered.
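
Steps 1 and 2 reduce to two small primitives, sketched below with the standard library. Pseudonym and NextRowHash are hypothetical names, the hex encoding of the pseudonym is an assumption, and the demo salt and digest are placeholders.

```go
package main

import (
	"bytes"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Pseudonym returns hex(HMAC-SHA256(key=salt, msg=identifier)). The same
// identifier under the same salt always maps to the same pseudonym, which
// keeps aggregate GROUP BY queries meaningful after redaction.
func Pseudonym(salt []byte, identifier string) string {
	mac := hmac.New(sha256.New, salt)
	mac.Write([]byte(identifier))
	return hex.EncodeToString(mac.Sum(nil))
}

// NextRowHash computes row_hash = sha256(prev || newChainHead).
func NextRowHash(prev, newChainHead []byte) []byte {
	h := sha256.New()
	h.Write(prev)
	h.Write(newChainHead)
	return h.Sum(nil)
}

func main() {
	// Demo salt only; a real per-request salt comes from a CSPRNG.
	salt := bytes.Repeat([]byte{0x42}, 32)

	a := Pseudonym(salt, "user-123")
	b := Pseudonym(salt, "user-123")
	fmt.Println(a == b, len(a)) // true 64

	// The chain starts from 32 zero bytes (the genesis value).
	genesis := make([]byte, 32)
	head := sha256.Sum256([]byte("placeholder for the redactor's running digest"))
	row1 := NextRowHash(genesis, head[:])
	row2 := NextRowHash(row1, head[:])
	fmt.Println(len(row1), bytes.Equal(row1, row2)) // 32 false
}
```

Once the salt is zeroed, nobody can recompute Pseudonym for a candidate identifier, which is what breaks the link back to the original value.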

5.1 Redactor contract

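// RedactorImpl pseudonymizes the identifier columns of one audit table.
type RedactorImpl interface {
	// Redact rewrites identifiers for orgID using salt and returns the row
	// count plus the redactor's running digest (newChainHead).
	Redact(ctx context.Context, orgID string, salt []byte) (rowsRedacted int64, newChainHead []byte, err error)
}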

Identifiers are enumerated via the same column-allow-list used by SIEM drop-sensitive transforms (LLD 21). The redactor never touches numeric / aggregate columns (tokens_in, tokens_out, cost_usd).


6. Founder Q5 — Single-Replica Singleton

internal/compliance/singleton.go — PG advisory lock keyed on FNV-1a("upsquad.compliance_engine"). Pattern copied from approval/scheduler_lease.go and subagent/coordinator/lease.go. Unique key per daemon so the three singletons (approval-scheduler, subagent-coordinator, compliance-engine) run independently on the same cluster.

  • AcquireLease polls pg_try_advisory_lock($key) every 15s until acquired or ctx cancelled.
  • The held *pgxpool.Conn is the lock owner — on crash or process exit, the session closes and PG releases the lock automatically (no stale state).
  • Release runs pg_advisory_unlock + conn.Release() on shutdown.
  • Recreate deployment strategy prevents rolling-update overlap: the new pod will block at AcquireLease until the old pod exits.
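
The key derivation and the polling loop can be sketched without a database. LeaseKey assumes the 64-bit FNV-1a variant cast to a signed bigint, which is a guess at the shipped width. The tryLock callback is a hypothetical stand-in for running pg_try_advisory_lock on the dedicated connection, and the second lease name in main is invented for the demo.

```go
package main

import (
	"context"
	"fmt"
	"hash/fnv"
	"time"
)

// LeaseKey derives an advisory-lock key from a lease name using 64-bit
// FNV-1a. The conversion to int64 may wrap to a negative number, which is
// fine for a Postgres bigint key.
func LeaseKey(name string) int64 {
	h := fnv.New64a()
	h.Write([]byte(name))
	return int64(h.Sum64())
}

// AcquireLease calls tryLock immediately and then once per interval until it
// reports success, returns an error, or ctx is cancelled.
func AcquireLease(ctx context.Context, interval time.Duration, tryLock func() (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		ok, err := tryLock()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

func main() {
	a := LeaseKey("upsquad.compliance_engine")
	b := LeaseKey("upsquad.some_other_daemon") // hypothetical second lease name
	fmt.Println(a == LeaseKey("upsquad.compliance_engine"), a != b)

	// Simulate an old pod that releases the lock before the third poll.
	attempts := 0
	err := AcquireLease(context.Background(), 10*time.Millisecond, func() (bool, error) {
		attempts++
		return attempts == 3, nil
	})
	fmt.Println(err, attempts) // <nil> 3
}
```

In the real daemon the interval would be 15s and the callback would keep using the same connection, since the session that took the lock is the one that owns it.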

The retention sweeper (LLD 20) and SIEM worker (LLD 21, when embedded) share this same single-replica guarantee — they run as goroutines inside the daemon that already holds the lease.


7. HMAC Certificate Chain (Phase 4)

HMACCertifier.Mint produces:

  • Canonical body: JSON payload with request_id, org_id, dry_run, reason, submitted_at, attested_at, started_at, certified_at, requester_member_id, attesting_member_id, sla_deadline, per-phase rows-affected summaries, cert_version = "1.0". Keys sorted inside buildPhasePayload; time.RFC3339Nano for every timestamp.
  • Digest: sha256(body) — stored as body_sha256_hex.
  • Signature: HMAC-SHA256(body, releaseKey) — the COMPLIANCE_RELEASE_KEY env var is a platform secret; key rotation via COMPLIANCE_RELEASE_KEY_ID (default release-2026-q2).
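
A reduced sketch of the digest and signature steps. CanonicalBody, Sign, and Verify are hypothetical names, and the payload has only four of the listed fields. The sketch relies on encoding/json emitting map keys in sorted order, which may differ from how buildPhasePayload canonicalizes.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// CanonicalBody marshals the payload. encoding/json writes map keys in
// sorted order, so equal payloads always produce identical bytes.
func CanonicalBody(payload map[string]any) ([]byte, error) {
	return json.Marshal(payload)
}

// Sign returns the hex SHA-256 digest of body and the hex HMAC-SHA256
// signature of body under releaseKey.
func Sign(body, releaseKey []byte) (digestHex, sigHex string) {
	sum := sha256.Sum256(body)
	mac := hmac.New(sha256.New, releaseKey)
	mac.Write(body)
	return hex.EncodeToString(sum[:]), hex.EncodeToString(mac.Sum(nil))
}

// Verify recomputes the HMAC and compares it in constant time.
func Verify(body, releaseKey []byte, sigHex string) bool {
	want, err := hex.DecodeString(sigHex)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, releaseKey)
	mac.Write(body)
	return hmac.Equal(mac.Sum(nil), want)
}

func main() {
	body, err := CanonicalBody(map[string]any{
		"request_id":   "req-1",
		"org_id":       "org-9",
		"dry_run":      false,
		"cert_version": "1.0",
	})
	if err != nil {
		panic(err)
	}
	key := []byte("demo-release-key") // placeholder, not a real COMPLIANCE_RELEASE_KEY

	digest, sig := Sign(body, key)
	tampered := append([]byte{}, body...)
	tampered[len(tampered)-2] = 'X'

	fmt.Println(string(body))
	fmt.Println(len(digest), Verify(body, key, sig), Verify(tampered, key, sig)) // 64 true false
}
```

An auditor holding the release key can rerun Verify against the stored body. Anyone without the key can still check body_sha256_hex, but cannot validate the signature, which is the limitation the cosign follow-up addresses.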

Upload path: CertStorage.Put writes the body to long-retention S3 with s3:ObjectLock COMPLIANCE mode + 10y retention (dev uses InMemoryCertStorage). The returned (bucket, key, etag, lockUntil) is stored in erasure_certificates for auditor verification.

HMAC is deliberate for Wave 4B — cosign-signed follow-up is tracked on the 30-day SBOM-signing milestone (LLD 21 §2.4).


8. Cancellation Semantics (LLD §3.6)

Per-state cancel rules enforced by Service.CancelRequest:

| Status | Phase | Cancellable |
|---|---|---|
| queued | | yes |
| awaiting_attestation | | yes |
| running | purge (0 scopes done) | yes |
| running | purge (any scope done) | no (ErrForbiddenCancel) |
| running | verify / redact / certify | no |
| succeeded / failed / cancelled | | no |

Rationale: once any PURGE scope has completed, the purge is no longer reversible; continuing to CERTIFY produces a defensible audit trail. ErrForbiddenCancel maps to gRPC FailedPrecondition.
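
The table translates directly into a predicate. Cancellable is a hypothetical helper and uses plain strings for status and phase; the real Service.CancelRequest works on the domain enums and returns errors, not a bool.

```go
package main

import "fmt"

// Cancellable (hypothetical helper) encodes the cancellation table.
// purgeScopesDone is the number of PURGE scopes that have already completed.
func Cancellable(status, phase string, purgeScopesDone int) bool {
	switch status {
	case "queued", "awaiting_attestation":
		return true
	case "running":
		// Only the very start of PURGE is still reversible.
		return phase == "purge" && purgeScopesDone == 0
	default:
		// succeeded, failed, cancelled, or any unknown status.
		return false
	}
}

func main() {
	fmt.Println(Cancellable("queued", "", 0))        // true
	fmt.Println(Cancellable("running", "purge", 0))  // true
	fmt.Println(Cancellable("running", "purge", 1))  // false
	fmt.Println(Cancellable("running", "verify", 0)) // false
	fmt.Println(Cancellable("succeeded", "", 0))     // false
}
```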


9. ComplianceService RPCs

All RPCs are registered on the agent-orchestrator gRPC port. The daemon (cmd/compliance-engine) does NOT serve RPCs; it only drives the runner + sweepers.

| RPC | Auth | Maps to |
|---|---|---|
| RequestDeletion | platform-admin | Service.RequestDeletion |
| AttestDeletion | platform-admin (different from requester) | Service.AttestDeletion |
| GetStatus | tenant-admin (same org) | Service.GetStatus |
| ListRequests | tenant-admin (same org) | Service.ListRequests |
| CancelRequest | platform-admin (same org) | Service.CancelRequest |
| GetScopeCoverage | tenant-admin (same org) | Service.GetScopeCoverage |

9.1 Error mapping (mapError)

| Service error | gRPC code |
|---|---|
| ErrClearance | PermissionDenied |
| ErrForbiddenCancel | FailedPrecondition |
| ErrSameAdmin | PermissionDenied |
| ErrTokenInvalid | Unauthenticated |
| store.ErrAttestInvalid | FailedPrecondition |
| store.ErrNotFound | NotFound |
| default | Internal |
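
The mapping can be sketched as a switch over sentinel errors. To stay stdlib-only, Code is a local stand-in for the gRPC status code type, and the sentinel errors are redeclared here with made-up messages. Matching with errors.Is and mapping nil to OK are assumptions about the shipped mapError.

```go
package main

import (
	"errors"
	"fmt"
)

// Code is a local stand-in for the gRPC status code type.
type Code string

const (
	OK                 Code = "OK"
	PermissionDenied   Code = "PermissionDenied"
	FailedPrecondition Code = "FailedPrecondition"
	Unauthenticated    Code = "Unauthenticated"
	NotFound           Code = "NotFound"
	Internal           Code = "Internal"
)

// Sentinels redeclared for the sketch; the messages are placeholders.
var (
	ErrClearance       = errors.New("insufficient clearance")
	ErrForbiddenCancel = errors.New("cancel forbidden")
	ErrSameAdmin       = errors.New("same admin")
	ErrTokenInvalid    = errors.New("token invalid")
	ErrAttestInvalid   = errors.New("attest invalid") // store.ErrAttestInvalid
	ErrNotFound        = errors.New("not found")      // store.ErrNotFound
)

// mapError translates a service error into a status code. errors.Is lets
// wrapped errors match their sentinel.
func mapError(err error) Code {
	switch {
	case err == nil:
		return OK // assumption: the table does not list the nil case
	case errors.Is(err, ErrClearance), errors.Is(err, ErrSameAdmin):
		return PermissionDenied
	case errors.Is(err, ErrForbiddenCancel), errors.Is(err, ErrAttestInvalid):
		return FailedPrecondition
	case errors.Is(err, ErrTokenInvalid):
		return Unauthenticated
	case errors.Is(err, ErrNotFound):
		return NotFound
	default:
		return Internal
	}
}

func main() {
	wrapped := fmt.Errorf("attest: %w", ErrSameAdmin)
	fmt.Println(mapError(wrapped))            // PermissionDenied
	fmt.Println(mapError(ErrTokenInvalid))    // Unauthenticated
	fmt.Println(mapError(errors.New("boom"))) // Internal
}
```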

9.2 Cross-tenant guard

RequestDeletion and every read RPC check the request body's org_id against the JWT-derived org (checkOrgMatch) — mismatch returns PermissionDenied "cross-tenant RTD forbidden". JWT is the source of truth; the body field exists only for back-compat with pre-auth clients.


10. Wiring (per shelfware gate)

10.1 cmd/compliance-engine/main.go constructs

| Entity | Construction |
|---|---|
| compliance.AcquireLease | Singleton gate (founder Q5). Blocks until acquired. |
| compliance.NewMetrics | OTel bundle (compliance_rtd_*). |
| store.New(pool) | Tenant-aware pgxpool wrapper. |
| classregistry.WireDependenciesWithVerifiers(deps) | Replaces LLD 18 stubs with real deleters.NewPGDeleter / NewVaultDeleter / NewRedisDeleter / NewS3Deleter(NoopObjectStore) and redactor.NewAuditRedactor / NewUsageRedactor. |
| rtd.HMACCertifier | SecretKey = COMPLIANCE_RELEASE_KEY, Storage = InMemoryCertStorage (dev) / S3 (prod, LLD 21 follow-up). |
| rtd.NewRunner(cfg) | Poll loop on 15s tick. |
| runSaltSweeper | 6h tick: zeroes salts at T+30d, expires dangling attestation tokens. |
| runSLATicker | 1h tick: pages at T-24h, breach-marks at T-0. |

10.2 cmd/agent-orchestrator/main.go constructs

| Entity | Construction |
|---|---|
| compliance.NewMetrics | Reused OTel bundle. |
| compliancestore.New(pool) | Intake path only; daemon owns the runner. |
| compliance.NewService(cfg) | Intake service. |
| compliance.NewGRPCServer(svc, authLookup) | Registers six RTD RPCs. WithSIEMService(siemSvc) attaches LLD 21 RPCs when vault is wired. |

10.3 internal/runtime/server/grpc.go registers

ComplianceService: complianceGRPC on the runtime gRPC server — the same port that serves LifecycleService, ApprovalService, etc.

10.4 Non-shelfware invariants

Every exported constructor has ≥1 production call site:

| Export | Caller(s) |
|---|---|
| compliance.NewService | cmd/agent-orchestrator/main.go |
| compliance.NewGRPCServer | cmd/agent-orchestrator/main.go |
| compliance.AcquireLease | cmd/compliance-engine/main.go |
| rtd.NewRunner | cmd/compliance-engine/main.go |
| rtd.HMACCertifier | cmd/compliance-engine/main.go |
| rtd.NewInMemoryCertStorage | cmd/compliance-engine/main.go (dev path) |
| classregistry.WireDependenciesWithVerifiers | cmd/compliance-engine/main.go |
| deleters.NewPGDeleter / NewVaultDeleter / NewRedisDeleter / NewS3Deleter | cmd/compliance-engine/main.go (via factory) |
| redactor.NewAuditRedactor / NewUsageRedactor | cmd/compliance-engine/main.go (via factory) |

11. Metrics (compliance_rtd_*)

| Instrument | Type | Attributes |
|---|---|---|
| compliance_rtd_requests_total | Int64Counter | phase, outcome |
| compliance_rtd_phase_duration_seconds | Float64Histogram | phase |
| compliance_erasure_sla_breach_total | Int64Counter | kind (paging / breach) |
| compliance_rtd_sla_compliance_ratio | Float64Gauge | |
| compliance_rtd_dual_control_denied_total | Int64Counter | reason (token_invalid / token_wrong_request / same_admin / wrong_status) |
| compliance_rtd_salt_zeroed_total | Int64Counter | |

Histogram buckets: 1s, 5s, 30s, 2m, 10m, 1h, 6h, 1d, 7d, 30d — designed for the long-tail of the 30-day SLA.
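
Since the instrument is recorded in seconds, the bucket list above corresponds to the following boundaries. PhaseDurationBuckets is a hypothetical helper; how the shipped code passes the boundaries to the OTel SDK is not shown here.

```go
package main

import (
	"fmt"
	"time"
)

// PhaseDurationBuckets returns the histogram bucket upper bounds in seconds,
// in ascending order: 1s, 5s, 30s, 2m, 10m, 1h, 6h, 1d, 7d, 30d.
func PhaseDurationBuckets() []float64 {
	bounds := []time.Duration{
		time.Second, 5 * time.Second, 30 * time.Second,
		2 * time.Minute, 10 * time.Minute,
		time.Hour, 6 * time.Hour,
		24 * time.Hour, 7 * 24 * time.Hour, 30 * 24 * time.Hour,
	}
	out := make([]float64, len(bounds))
	for i, b := range bounds {
		out[i] = b.Seconds()
	}
	return out
}

func main() {
	for _, b := range PhaseDurationBuckets() {
		fmt.Printf("%.0f\n", b)
	}
}
```

The top boundary is 2,592,000 seconds, so a phase that runs right up to the 30-day SLA still falls in a finite bucket and only a breach lands in the overflow bucket.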


12. Audit Events

Emitted via phases.AuditEmitter (nil-safe, best-effort). All events carry request_id + dry_run detail.

| Action | Emitted by | Notes |
|---|---|---|
| rtd_submitted | Service.RequestDeletion | Logged via slog, not emitted to chain by default. |
| rtd_attested | Service.AttestDeletion | slog only. |
| rtd_certified | phases.Certify | Includes cert_sha256. |
| rtd_succeeded | Runner (final) | Terminal success. |
| rtd_failed | Runner.failRequest | Includes reason. |

Wiring the emitter to the Wave 1 audit hash-chain is a 30-day follow-up; for Wave 4B the events flow via the structured logger only.


13. Tests

Package-level unit tests (run under -race):

  • attestation_test.go: token entropy, hash determinism, plain-SHA256 form.
  • singleton_test.go: lease key stability; no collision with approval / subagent leases.
  • types_test.go: AllPhases() ordering; 72h attestation window; SLA total.
  • service_test.go / configure_retention_test.go: dual-control branches, cancel gating, retention clamp.
  • certifier_test.go: canonical-JSON determinism; HMAC signature matches independent verification.
  • rtd/redactor/chain_test.go: SHA-256 chaining; 32-zero genesis.
  • rtd/deleters/ident_test.go: ValidateIdent rejects SQL metacharacters + reserved words.

Integration suite (under integration build tag, real Postgres): test/integration/wave4/ — full runner loop, resume-in-flight after kill, dual-control happy path, cross-tenant guard.


14. Non-Goals (Wave 4B)

  1. S3 deleter: deleters.NewS3Deleter wraps NoopObjectStore for Wave 4B; the real S3 client lands with LLD 21 object-store infra.
  2. Cosign certificate signing: HMAC is the Wave 4B ceiling; cosign/Sigstore keyless is the 30-day follow-up (same milestone as SBOM signing).
  3. Cross-region replication: certificates land in a single bucket; geo-replication is out of scope.
  4. Channel-adapter audit emission: AuditEmitter is stub-wired; full plumbing into the Wave 1 hash-chain is an LLD 21 parallel task.
  5. UI / console surface: the three portal pages (request, attest, status) are separate frontend issues; backend RPCs are stable.
  6. External TPI (third-party integrator) webhooks: planned for Wave 5 once the SIEM worker pattern (LLD 21) is in steady-state.