LLD 20 — Retention Policy Engine (Wave 4B)
| Field | Value |
|---|---|
| Parent HLD | #456 agent-runtime-wave-4b-enterprise-compliance.md |
| PRD | #380 (P4.8.6) |
| Issue | #463 |
| PR | #482 (merged 7980449) |
| Doc backfill | #484 |
| Milestone | 9 |
| Wave | 4B |
| Size | L (~1400 LoC impl + tests) |
| Depends on | LLD 18 (#469) classregistry, LLD 19 (#477) RTD Engine (shares lease + redactors) |
| Parallel with | LLD 21 SBOM+SIEM (#464) |
Founder decisions (binding, from #380 comment 4254408367):
- Q1 (audit + usage aggregates): REDACT, not DELETE. The 7-year (2555-day) floor is enforced by the resolver clamp.
- Q5 (engine replicas): SINGLE. The retention sweeper shares the same upsquad.compliance_engine advisory lock held by the RTD runner (LLD 19 §6), so it never runs concurrently with the runner and never runs on more than one replica.
This LLD doc was backfilled after PR #482 merged (#484). The shipped code is the authoritative spec; this document is a human-readable companion and auditor trail.
1. Scope
Per-tenant per-scope retention TTL configuration with platform-enforced floor/ceiling clamping, plus a sweeper that walks the classregistry.Scopes() list every 4h and executes the resolved action (DELETE / REDACT / ARCHIVE_THEN_DELETE / SKIP) for every (org, scope) pair with data past its effective TTL.
Goals:
- Tenant choice, bounded. A tenant can TIGHTEN retention below the registry default but never below the platform floor. A tenant can LOOSEN retention but never above the platform ceiling.
- Audit floor preserved. ClassAudit scopes are never plainly deleted; they REDACT identifiers while retaining aggregate rows. The one exception is agent_audit_log, which routes through archive-then-delete with S3 object-lock COMPLIANCE when an archiver is wired; without an archiver it is skipped (never silently deleted).
- Single replica. The sweeper piggy-backs on LLD 19's advisory lease, so no new singleton primitive is needed.
- Back-pressure. A single sweep tick is bounded by MaxRuntime (default 3h) so the next 4h tick always has slack. Scopes deferred to the next tick increment the retention_sweep_deferred_total counter.
2. Clamp Model
Effective TTL precedence (high → low):
1. Tenant override (tenant_retention_config row)
2. Platform ceiling (platform_retention_floor.ceiling_days) — clamp DOWN if above
3. Platform floor (platform_retention_floor.floor_days) — clamp UP if below
4. Registry default (classregistry.Scope.RetentionDefaultDays)
Source-of-clamp is surfaced in GetRetentionConfigResponse.source so tenants see WHY their request was adjusted:
| retention.Source | Meaning |
|---|---|
| tenant | Tenant override accepted inside [floor, ceiling]. |
| ceiling | Tenant requested above ceiling; clamped DOWN. |
| floor | Tenant requested below floor; clamped UP. |
| default | No tenant override; registry default applies. |
ValidateTTL is the pre-write check called by ConfigureRetention and returns ErrBelowFloor / ErrAboveCeiling so the tenant gets a 400 INVALID_ARGUMENT instead of a silent clamp on write. The sweeper always trusts the resolver, never re-validates.
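The split between read-time clamping and write-time rejection can be sketched as follows. This is a minimal illustration, not the shipped code: the Bounds struct, the Effective signature, and the example ceiling value are assumptions, and the sketch assumes the registry default is returned unclamped when no override exists.

```go
package main

import (
	"errors"
	"fmt"
)

// Source mirrors the retention.Source values listed in the table above.
type Source string

const (
	SourceTenant  Source = "tenant"
	SourceCeiling Source = "ceiling"
	SourceFloor   Source = "floor"
	SourceDefault Source = "default"
)

var (
	ErrBelowFloor   = errors.New("ttl below platform floor")
	ErrAboveCeiling = errors.New("ttl above platform ceiling")
)

// Bounds is a hypothetical holder for one scope's limits.
// A nil CeilingDays means unlimited, like the nullable ceiling_days column.
type Bounds struct {
	FloorDays   int
	CeilingDays *int
	DefaultDays int
}

// Effective returns the TTL to act on and the reason it was chosen.
// A nil override means the tenant has no tenant_retention_config row.
func Effective(b Bounds, override *int) (int, Source) {
	if override == nil {
		return b.DefaultDays, SourceDefault
	}
	switch {
	case b.CeilingDays != nil && *override > *b.CeilingDays:
		return *b.CeilingDays, SourceCeiling
	case *override < b.FloorDays:
		return b.FloorDays, SourceFloor
	default:
		return *override, SourceTenant
	}
}

// ValidateTTL is the pre-write check: it rejects instead of clamping.
func ValidateTTL(b Bounds, ttlDays int) error {
	if ttlDays < b.FloorDays {
		return ErrBelowFloor
	}
	if b.CeilingDays != nil && ttlDays > *b.CeilingDays {
		return ErrAboveCeiling
	}
	return nil
}

func main() {
	ceiling := 3650 // hypothetical ceiling, for illustration only
	b := Bounds{FloorDays: 2555, CeilingDays: &ceiling, DefaultDays: 2555}
	req := 30
	ttl, src := Effective(b, &req)
	fmt.Println(ttl, src)            // 2555 floor
	fmt.Println(ValidateTTL(b, req)) // ttl below platform floor
}
```

Keeping both paths means a stale or hand-edited row is still clamped safely at read time, while a fresh write gets an explicit error.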
3. Data Model
3.1 platform_retention_floor (migration 050, LLD 18)
Platform-scoped (no RLS). Primary key on scope_name. ceiling_days nullable = unlimited.
CREATE TABLE platform_retention_floor (
    scope_name   TEXT PRIMARY KEY,
    floor_days   INT NOT NULL CHECK (floor_days >= 0),
    ceiling_days INT,
    rationale    TEXT NOT NULL,
    updated_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    CHECK (ceiling_days IS NULL OR ceiling_days >= floor_days)
);
Rows are SEEDED by Resolver.BootstrapPlatformFloors at compliance-engine startup from classregistry.Scopes(). The registry (Go code) is the single source of truth; the DB table is a mirror so the sweeper can read floors without importing classregistry. Idempotent UPSERT — safe to re-run on every boot.
3.2 tenant_retention_config (migration 052)
Tenant-scoped with RLS. UNIQUE(org_id, scope_name) — one row per tenant per scope.
CREATE TABLE tenant_retention_config (
    id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id         UUID NOT NULL REFERENCES organizations(id) ON DELETE CASCADE,
    scope_name     TEXT NOT NULL,
    ttl_days       INT NOT NULL CHECK (ttl_days > 0),
    archive_target TEXT, -- s3:bucket/prefix or NULL
    updated_by     UUID NOT NULL,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (org_id, scope_name)
);
Written only via ConfigureRetention RPC (platform-admin). No direct SQL path. RLS policy: org_id::text = current_setting('app.org_id', true).
3.3 retention_sweep_log (migration 052)
Tenant-scoped with RLS, append-only (REVOKE UPDATE, DELETE FROM PUBLIC). One row per (org_id, scope_name, sweep tick):
CREATE TABLE retention_sweep_log (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id        UUID NOT NULL,
    scope_name    TEXT NOT NULL,
    ttl_days      INT NOT NULL,
    source        TEXT NOT NULL CHECK (source IN ('tenant','floor','ceiling','default')),
    action        TEXT NOT NULL CHECK (action IN ('delete','redact','archive_then_delete','skip')),
    rows_affected BIGINT NOT NULL DEFAULT 0,
    archive_key   TEXT,
    outcome       TEXT NOT NULL CHECK (outcome IN ('success','failure','skipped')),
    error_msg     TEXT,
    started_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    completed_at  TIMESTAMPTZ
);
Feature flag: compliance.retention_sweeper_enabled (default false). Wave 4B ships the machinery; per-tenant rollout is staged once the RTD engine is observed in production.
4. Action Routing
resolver.resolveAction(scope, archiveTarget) is a pure function of classregistry.DataClass + scope.Name:
| Scope trait | Action | Rationale |
|---|---|---|
| ClassPlatform | ActionSkip | Platform data has no tenant retention. |
| ClassAudit AND scope.Name == agent_audit_log AND archiveTarget != "" | ActionArchiveThenDelete | HLD I5: audit archive-before-delete with S3 object-lock COMPLIANCE. |
| ClassAudit (anything else) | ActionRedact | Founder Q1: retain aggregates, scrub identifiers. |
| ClassPersonal / ClassOperational / ClassSecret + deleter wired | ActionDelete | Standard tenant data purge. |
| any class with no deleter | ActionSkip | Fail-safe. |
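The routing table reads naturally as a first-match switch. The sketch below assumes the rows are evaluated top to bottom; the Scope struct and its HasDeleter field are hypothetical stand-ins for classregistry.Scope, and the Action string values are taken from the retention_sweep_log CHECK constraint.

```go
package main

import "fmt"

type DataClass int

const (
	ClassPlatform DataClass = iota
	ClassAudit
	ClassPersonal
	ClassOperational
	ClassSecret
)

type Action string

const (
	ActionSkip              Action = "skip"
	ActionRedact            Action = "redact"
	ActionDelete            Action = "delete"
	ActionArchiveThenDelete Action = "archive_then_delete"
)

// Scope is a trimmed, hypothetical stand-in for classregistry.Scope.
type Scope struct {
	Name       string
	Class      DataClass
	HasDeleter bool // assumption: true when a deleter is wired for the scope
}

// resolveAction is pure: the same inputs always yield the same action.
func resolveAction(sc Scope, archiveTarget string) Action {
	switch {
	case sc.Class == ClassPlatform:
		return ActionSkip
	case sc.Class == ClassAudit && sc.Name == "agent_audit_log" && archiveTarget != "":
		return ActionArchiveThenDelete
	case sc.Class == ClassAudit:
		return ActionRedact // audit rows are never routed to plain delete
	case sc.HasDeleter:
		return ActionDelete
	default:
		return ActionSkip // fail-safe: no deleter wired
	}
}

func main() {
	audit := Scope{Name: "agent_audit_log", Class: ClassAudit}
	fmt.Println(resolveAction(audit, "s3:bucket/prefix")) // archive_then_delete
	fmt.Println(resolveAction(audit, ""))                 // redact
}
```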
4.1 Audit fail-safe
When ActionArchiveThenDelete is selected for agent_audit_log but no Archiver is wired (cmd/compliance-engine launched without COMPLIANCE_RETENTION_ARCHIVE_BUCKET), runArchiveThenDelete returns an error — the sweeper records the outcome as failure with error_msg explaining the skip. Audit rows are NEVER silently deleted.
4.2 Redactor routing
For ClassAudit scopes with ActionRedact, the sweeper calls RedactorFactory(scopeName) which returns the same redactor wired by LLD 19's classregistry.WireDependenciesWithVerifiers. The sweeper shares the wired scope slice with the RTD runner so:
- both agree on the registered redactor set,
- a scope's registered redactor is invoked the same way whether triggered by tenant RTD (LLD 19) or by TTL expiry (LLD 20),
- salt is generated fresh per sweep tick (crypto/rand, 32 bytes) and discarded after the call; unlike RTD, sweep redaction has no long-term salt retention.
retention.NewRedactorFactoryFromScopes(wiredScopes) adapts classregistry.Redactor closures onto the sweeper's Redactor interface (wire.go).
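A short-lived salt of this kind might be handled as in the sketch below. The Redactor interface shape, the RedactorFunc adapter, and the helper names are assumptions for illustration; zeroing the buffer is one possible reading of "discarded", not a confirmed detail of the shipped code.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

const saltLen = 32 // bytes, per the description above

// Redactor is a hypothetical shape for the sweeper's Redactor interface.
type Redactor interface {
	Redact(orgID string, salt []byte) (rowsAffected int64, err error)
}

// RedactorFunc adapts a plain function onto Redactor (illustrative helper).
type RedactorFunc func(orgID string, salt []byte) (int64, error)

func (f RedactorFunc) Redact(orgID string, salt []byte) (int64, error) {
	return f(orgID, salt)
}

// newSalt draws saltLen random bytes from the OS CSPRNG.
func newSalt() ([]byte, error) {
	salt := make([]byte, saltLen)
	if _, err := rand.Read(salt); err != nil {
		return nil, fmt.Errorf("generate sweep salt: %w", err)
	}
	return salt, nil
}

// redactWithFreshSalt generates a salt, passes it to the redactor, and
// zeroes the buffer afterwards so nothing is retained past the call.
func redactWithFreshSalt(r Redactor, orgID string) (int64, error) {
	salt, err := newSalt()
	if err != nil {
		return 0, err
	}
	defer func() {
		for i := range salt {
			salt[i] = 0
		}
	}()
	return r.Redact(orgID, salt)
}

func main() {
	r := RedactorFunc(func(orgID string, salt []byte) (int64, error) {
		fmt.Printf("redacting %s with %d-byte salt\n", orgID, len(salt))
		return 0, nil
	})
	if _, err := redactWithFreshSalt(r, "org-1"); err != nil {
		fmt.Println("error:", err)
	}
}
```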
5. Sweeper Loop
retention.Sweeper owns the tick loop:
DefaultTickInterval = 4h (COMPLIANCE_RETENTION_TICK)
DefaultMaxRuntime = 3h (COMPLIANCE_RETENTION_MAX_RUNTIME)
DefaultDeleteBatchLimit = 1000 rows / DELETE
DefaultOrgPageSize = 100 orgs / scope / tick
Flow per tick:
for sc in Resolver.Scopes():
    if ctx past deadline: defer remaining → IncSweepDeferred; return
    if sc.Kind != ScopeKindPG: skip        # Vault/Redis/S3 use native TTL
    if sc.Class == ClassPlatform: skip
    if sc.OrgIDColumn == "": skip          # allow-listed platform tables
    orgs := Store.ListOrgsWithData(sc.Table, sc.OrgIDColumn, OrgPageSize)
    for org in orgs:
        pol := Resolver.Effective(org, sc.Name)   # cached 5min
        cutoff := pol.Cutoff(now)
        switch pol.Action:
            Delete            → Store.DeleteRowsOlderThan(...)
            ArchiveThenDelete → Archiver.Archive(...) then DeleteRowsOlderThan(...)
            Redact            → redactor.Redact(org, salt)
            Skip              → no-op
        Store.InsertSweepOutcome(...)             # always, even on skip
5.1 Timestamp column resolution
timestampColumn(sc) picks the column used to measure row age:
| Scope | Column |
|---|---|
| agent_sessions | started_at |
| agent_checkpoints | checkpointed_at |
| anything else | created_at |
Per-scope overrides land here as new scopes ship specialised retention semantics.
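A lookup of this shape is typically a small switch with a default arm. The sketch below takes the scope name directly rather than a scope value, and the cutoff helper is an assumption about how pol.Cutoff(now) could be derived from a TTL in days.

```go
package main

import (
	"fmt"
	"time"
)

// timestampColumn returns the column used to measure row age for a scope.
// Simplification: the real function takes the scope, not just its name.
func timestampColumn(scopeName string) string {
	switch scopeName {
	case "agent_sessions":
		return "started_at"
	case "agent_checkpoints":
		return "checkpointed_at"
	default:
		return "created_at"
	}
}

// cutoff is a hypothetical helper: rows whose timestamp column is older
// than the returned instant are past their effective TTL.
func cutoff(now time.Time, ttlDays int) time.Time {
	return now.Add(-time.Duration(ttlDays) * 24 * time.Hour)
}

func main() {
	now := time.Date(2024, 1, 31, 0, 0, 0, 0, time.UTC)
	col := timestampColumn("agent_sessions")
	fmt.Printf("%s < %s\n", col, cutoff(now, 30).Format(time.RFC3339))
	// started_at < 2024-01-01T00:00:00Z
}
```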
5.2 Back-pressure
If MaxRuntime elapses mid-tick:
- Remaining scopes are deferred to the next tick (no drop, no retry loop).
- retention_sweep_deferred_total counter increments by the deferred count.
- The next 4h tick starts fresh: the sweeper picks up whatever scopes still have rows past their cutoff.
This is safe because the sweeper is idempotent: a scope's action is derived from the current policy + cutoff, not from a cursor.
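The deferral behaviour can be illustrated with a context deadline. This is a sketch under assumptions: sweepTick, its callback signature, and the demo helper are hypothetical, and the real sweeper also records metrics and per-org outcomes that are omitted here.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

const defaultMaxRuntime = 3 * time.Hour

// sweepTick walks scopes in order until ctx is done. It returns the scopes
// it processed and how many were deferred to the next tick. No cursor is
// kept: the next tick simply starts from the top again.
func sweepTick(ctx context.Context, scopes []string, sweep func(context.Context, string)) (done []string, deferred int) {
	for i, sc := range scopes {
		if ctx.Err() != nil {
			return done, len(scopes) - i
		}
		sweep(ctx, sc)
		done = append(done, sc)
	}
	return done, 0
}

// demo simulates MaxRuntime elapsing after the second scope by cancelling
// the context from inside the sweep callback.
func demo() (done []string, deferred int) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	calls := 0
	return sweepTick(ctx, []string{"a", "b", "c", "d"}, func(context.Context, string) {
		calls++
		if calls == 2 {
			cancel()
		}
	})
}

func main() {
	// A real tick would bound itself with the configured MaxRuntime.
	ctx, cancel := context.WithTimeout(context.Background(), defaultMaxRuntime)
	defer cancel()
	_, deferred := sweepTick(ctx, []string{"a", "b"}, func(context.Context, string) {})
	fmt.Println("deferred within budget:", deferred) // deferred within budget: 0

	done, deferred := demo()
	fmt.Println(done, deferred) // [a b] 2
}
```

The deferred count returned here is what would feed the retention_sweep_deferred_total counter.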
6. ComplianceService RPCs (3 new)
Added to the same service that hosts LLD 19's RTD intake + LLD 21's SIEM endpoints.
| RPC | Auth | Maps to | Notes |
|---|---|---|---|
| ConfigureRetention | platform-admin | Service.ConfigureRetention | Clamp-validated via ValidateTTL; invalidates resolver cache on success. |
| GetRetentionConfig | tenant-admin (same org) | Service.GetRetentionConfig | Returns effective_ttl_days + source + floor/ceiling/default for transparency. |
| ListRetentionOverrides | tenant-admin (same org) | Service.ListRetentionOverrides | Lists every override for the caller's tenant. |
6.1 Error mapping (mapRetentionError)
| Service error | gRPC code |
|---|---|
| ErrClearance | PermissionDenied |
| ErrRetentionUnwired | Unimplemented (LLD 20 not wired in this deployment) |
| retention.ErrBelowFloor | InvalidArgument |
| retention.ErrAboveCeiling | InvalidArgument |
| retention.ErrUnknownScope | NotFound |
| retention.ErrNotFound | NotFound |
| default | Internal |
6.2 Cross-tenant guard
Every RPC enforces checkOrgMatch(req.OrgId, jwtOrg) — mismatch returns PermissionDenied. JWT is source of truth; the body field exists for back-compat only.
7. Wiring (per shelfware gate)
7.1 cmd/compliance-engine/main.go constructs
| Entity | Construction |
|---|---|
| retention.NewStore(pool) | pgxpool-backed persistence. |
| retention.NewMetrics(meter) | OTel bundle (retention_*). |
| retention.NewResolver(cfg) | Scopes = classregistry.Scopes(), 5min cache. |
| resolver.BootstrapPlatformFloors(ctx) | Idempotent seed from registry. Logs floors_written. |
| retention.NewRedactorFactoryFromScopes(wiredScopes) | Bridge to LLD 19 redactors. |
| retention.NewS3ColdArchiver | Optional: only when COMPLIANCE_RETENTION_ARCHIVE_BUCKET is set. Uses NoopObjectWriter in dev; real S3 client wired via LLD 21 infra. |
| retention.NewSweeper(cfg) | Shares Store, Resolver, Metrics, Redactors, optional Archiver. |
| Goroutine sweeper.Run(ctx) | Started only when COMPLIANCE_SWEEPER_DISABLED != "true". |
7.2 cmd/agent-orchestrator/main.go constructs
The retention RPCs are served via compliance.NewGRPCServer — the same GRPCServer instance that hosts LLD 19's RTD intake. When the orchestrator's ServiceConfig includes RetentionStore / RetentionResolver / RetentionMetrics, the RPCs return real responses; when nil, they return Unimplemented ("retention engine not wired") — the same pattern LLD 21 uses for SIEM.
Wave 4B ships the RPCs wired on the daemon side only (the orchestrator delegates writes to the engine). Orchestrator-side wiring for the RPCs lands in a follow-up once the admin console picks them up.
7.3 Non-shelfware invariants
Every exported constructor has ≥1 production call site:
| Export | Caller |
|---|---|
| retention.NewStore | cmd/compliance-engine/main.go |
| retention.NewMetrics | cmd/compliance-engine/main.go |
| retention.NewResolver | cmd/compliance-engine/main.go |
| resolver.BootstrapPlatformFloors | cmd/compliance-engine/main.go |
| retention.NewRedactorFactoryFromScopes | cmd/compliance-engine/main.go |
| retention.NewS3ColdArchiver | cmd/compliance-engine/main.go (optional path) |
| retention.NewSweeper | cmd/compliance-engine/main.go |
| Sweeper.Run | cmd/compliance-engine/main.go (goroutine) |
| Service.ConfigureRetention / GetRetentionConfig / ListRetentionOverrides | internal/compliance/grpcserver.go |
8. Metrics (retention_*)
| Instrument | Type | Attributes |
|---|---|---|
| retention_sweep_total | Int64Counter | scope, outcome (success / failure / skipped) |
| retention_sweep_duration_seconds | Float64Histogram | scope |
| retention_rows_swept_total | Int64Counter | scope |
| retention_sweep_deferred_total | Int64Counter | (none) |
| retention_policy_validate_denied_total | Int64Counter | reason (below_floor / above_ceiling / unknown_scope / invalid) |
Histogram buckets: 0.1s, 0.5s, 1s, 5s, 15s, 30s, 1m, 5m, 15m, 1h — the long tail is bounded by MaxRuntime = 3h but single-scope sweeps rarely exceed minutes.
9. Audit Trail
retention_sweep_log is the per-tick audit. Every (org, scope) outcome lands a row — even skipped — so auditors can see which scopes were evaluated and why. archive_key carries the S3 pointer for ActionArchiveThenDelete rows.
No separate audit-hash-chain entry is emitted for the retention sweep — the log table itself is append-only (REVOKE UPDATE, DELETE) and tenant-scoped with RLS. The 30-day follow-up (LLD 21 SIEM) picks up these rows for export when SiemFilterClass = all.
10. Kill-switches
Precedence (highest first):
1. COMPLIANCE_SWEEPER_DISABLED=true env var → goroutine never starts.
2. compliance.retention_sweeper_enabled feature flag (default false) → gates loop entry.
3. COMPLIANCE_ENGINE_DISABLED=true env var → whole daemon exits 0 (the RTD runner + sweeper both stop).
Tests (configure_retention_test.go, resolver_test.go, sweeper_test.go) cover the kill-switch matrix explicitly.
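One way to picture the kill-switch matrix is a single predicate that reports whether the sweeper may run and, if not, which switch stopped it. The function name, its signature, and the returned reason strings are assumptions; the sketch checks the switches in the order listed above and treats only the literal value "true" as set.

```go
package main

import (
	"fmt"
	"os"
)

// sweeperShouldRun reports whether the sweep loop may execute. When it may
// not, reason names the first switch (in the documented order) that blocked it.
// getenv is injected so the check can be exercised without touching the
// process environment.
func sweeperShouldRun(getenv func(string) string, flagEnabled bool) (ok bool, reason string) {
	if getenv("COMPLIANCE_SWEEPER_DISABLED") == "true" {
		return false, "COMPLIANCE_SWEEPER_DISABLED"
	}
	if !flagEnabled {
		return false, "compliance.retention_sweeper_enabled"
	}
	if getenv("COMPLIANCE_ENGINE_DISABLED") == "true" {
		return false, "COMPLIANCE_ENGINE_DISABLED"
	}
	return true, ""
}

func main() {
	// The feature flag defaults to false, so a fresh deployment stays idle
	// unless the flag is turned on.
	ok, reason := sweeperShouldRun(os.Getenv, false)
	fmt.Println(ok, reason)
}
```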
11. Tests (95 compliance-package tests under -race)
Unit:
- types_test.go: Policy.Validate invariants, SweepOutcome round-trip.
- validate_test.go: clamp precedence (below-floor → up, above-ceiling → down, no override → default).
- resolver_test.go: cache TTL, Invalidate, Flush, BootstrapPlatformFloors idempotency, FloorForScope fallback.
- sweeper_test.go: per-scope action routing, MaxRuntime defer, fail-safe on missing archiver.
- archiver_test.go: S3ColdArchiver retention floor + object-lock headers.
- store_test.go: RLS enforcement, UPSERT semantics, append-only retention_sweep_log.
- configure_retention_test.go: clamp rejection maps to correct policy_validate_denied_total reason.
Integration (in test/integration/wave4/):
- BootstrapFloorsOnStartup: 59 scopes seeded idempotently.
- BillingFloorEnforced: tenant cannot configure below 7y on usage_records.
- RetentionSweeper_DeletesOldRows: operational scope PURGE happy path.
- RetentionSweeper_RedactsAudit: audit-class redaction end-to-end (founder Q1).
- ArchiveBeforeAuditSweep: agent_audit_log → archive → delete chain.
12. Non-Goals (Wave 4B)
- Non-PG retention: Vault / Redis / S3 scopes are skipped; those stores' native TTL / lifecycle policies handle expiry.
- Cold archive format specifics: S3ColdArchiver writes NDJSON today; Parquet / delta-lake formats are Wave 5 data-platform concerns.
- Tenant-level archive encryption: archive bucket uses default SSE-S3; tenant-KMS archive keys land with the BYOK milestone.
- Dashboards: retention metrics are published to OTel but tenant-facing dashboards are frontend issues.
- Override history: tenant_retention_config UPSERTs in place; historical override replay (who changed TTL when, from what value) is a compliance follow-up.
- Real-time purge: the sweeper is deliberately 4h-cadence; a "purge now" RPC for tenants would require the singleton gate to serialise with RTD and is out of scope.