
LLD: Wave 1 Item 4 — Audit Log Hash-Chaining (Clean-Break Epoch)

Field            Value
Parent HLD       #382 (docs/hld/agent-runtime-wave-1-agent-isolation.md §6)
PRD              #380
Tracker          #381
Milestone        9
Estimated size   XL
Author           Principal Technical Architect
Date             2026-04-13

1. Scope

Introduce a tamper-evident per-row hash chain on agent_audit_log, session-scoped, using a clean-break epoch strategy — no backfill. Existing rows remain at chain_epoch=0 with NULL hashes. From cutover time onward, every new row is chain_epoch >= 1 with populated prev_hash and row_hash. A nightly verifier CronJob validates chain integrity and emits a Prometheus gauge. An on-demand gRPC AuditService.VerifyChain RPC supports targeted verification.

Explicit non-goals for Wave 1: no Merkle root, no external anchoring, no HSM signing, no backfill of pre-epoch rows.

2. Schema / migration SQL

-- migrations/00NN_audit_log_hash_chain.up.sql

ALTER TABLE agent_audit_log
    ADD COLUMN chain_epoch INTEGER NOT NULL DEFAULT 0,
    ADD COLUMN prev_hash BYTEA,
    ADD COLUMN row_hash BYTEA;

-- Deterministic chain order index: supports the verifier's per-session walk
-- and the advisory-lock contention pattern.
CREATE INDEX idx_audit_log_session_chain
    ON agent_audit_log (session_id, chain_epoch, created_at, id)
    WHERE chain_epoch >= 1;

-- Partial CHECK: if chain_epoch >= 1, both hash columns must be populated.
ALTER TABLE agent_audit_log
    ADD CONSTRAINT audit_log_hash_populated_when_epoch_ge_1
    CHECK (chain_epoch = 0 OR (prev_hash IS NOT NULL AND row_hash IS NOT NULL));

-- Append-only semantics already in place (migration 017): REVOKE UPDATE, DELETE.
-- No additional grants needed.

-- Cutover pointer: when the code flips on, first row per session writes epoch=1.
-- Pointer stored as a single row in platform_feature_flags:
INSERT INTO platform_feature_flags (key, value, description, updated_by)
VALUES (
    'audit.chain_enabled',
    'false',
    'Enables hash-chaining on new audit_log rows. On flip to true, all subsequent rows get chain_epoch>=1.',
    '00000000-0000-0000-0000-000000000000'
) ON CONFLICT (key) DO NOTHING;

-- migrations/00NN_audit_log_hash_chain.down.sql
ALTER TABLE agent_audit_log
    DROP CONSTRAINT IF EXISTS audit_log_hash_populated_when_epoch_ge_1;
DROP INDEX IF EXISTS idx_audit_log_session_chain;
ALTER TABLE agent_audit_log
    DROP COLUMN IF EXISTS chain_epoch,
    DROP COLUMN IF EXISTS prev_hash,
    DROP COLUMN IF EXISTS row_hash;

No migration flips the epoch on existing rows; they stay at chain_epoch=0 permanently. The verifier treats chain_epoch=0 rows as out of scope, so tampering with those rows is undetectable by this mechanism (accepted trade-off).

3. Go interfaces

// internal/runtime/audit/hashchain.go (new)
package audit

import (
	"context"
	"crypto/sha256"
	"encoding/json"
	"time"

	"github.com/google/uuid"
)

// Hasher computes deterministic per-row audit hashes.
type Hasher interface {
	// ComputeRowHash computes SHA-256 over the canonical serialization of `row`
	// prepended with the previous row's hash (32 zero bytes at chain root).
	ComputeRowHash(prevHash []byte, row *Row) (rowHash []byte, err error)
}

// CanonicalBytes produces RFC 8785 JCS output for the audit row content fields
// that are included in the chain input. See "Canonical byte layout for
// ComputeRowHash" below for the exact field list.
func CanonicalBytes(row *Row) ([]byte, error)

// Row is the audit row content (subset of SQL columns that participate in the hash).
type Row struct {
	ID              uuid.UUID
	OrgID           uuid.UUID
	AgentID         uuid.UUID
	SessionID       uuid.UUID
	ActionType      string
	Decision        string
	InputHash       []byte // existing column, precomputed
	OutputHash      []byte
	Detail          json.RawMessage
	ProvenanceChain json.RawMessage
	DurationMs      int64
	TokenUsage      json.RawMessage
	CreatedAt       time.Time // must be UTC; serialized as RFC3339Nano
}

// ChainWriter is the write-side helper. InsertBatch in pgstore delegates here.
type ChainWriter interface {
	// AppendBatch inserts the provided rows as a chained batch for a single session.
	// Takes pg_advisory_xact_lock keyed on the FNV-64 hash of
	// "audit_chain:" + session_id (see "pg_advisory_xact_lock scope") to serialise
	// writers for the same session; different sessions parallelise.
	AppendBatch(ctx context.Context, tx Tx, sessionID uuid.UUID, rows []*Row) error
}

// Verifier walks a chain and reports integrity.
type Verifier interface {
	VerifySession(ctx context.Context, orgID, sessionID uuid.UUID) (*VerifyReport, error)
	VerifyOrg(ctx context.Context, orgID uuid.UUID, since, until time.Time) (*VerifyReport, error)
}

type VerifyReport struct {
	OK          bool
	TotalRows   int
	BrokenAtRow int       // 1-based index within the session, 0 when OK
	BrokenRowID uuid.UUID // zero uuid when OK
	SessionID   uuid.UUID
	VerifiedAt  time.Time
}

Canonical byte layout for ComputeRowHash

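input  = prev_hash (32 bytes; 32 zero bytes when first row of a session at epoch>=1)
         || JCS(row_as_json_object)
output = SHA256(input)

Where row_as_json_object is a JSON object with the keys below. JCS sorts object keys lexicographically during serialization, so the order in which they are listed here does not affect the hash: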

id, org_id, agent_id, session_id,
action_type, decision,
input_hash_hex, output_hash_hex,
detail, provenance_chain,
duration_ms, token_usage,
created_at (RFC3339Nano UTC)

The chain_epoch, prev_hash, and row_hash columns are NOT inputs to the hash (they are outputs / chain-metadata).
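
The byte layout above can be sketched in Go as follows. This is a minimal illustration rather than the package's implementation: it assumes the caller already holds the JCS bytes of the row (producing them is out of scope here), and the names computeRowHash, rootPrevHash, and hexHash are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
)

// hashSize is the SHA-256 digest length; the chain root uses this many zero bytes.
const hashSize = sha256.Size

// computeRowHash returns SHA256(prevHash || canonical).
// canonical is assumed to be the RFC 8785 (JCS) encoding of the row content.
func computeRowHash(prevHash, canonical []byte) ([]byte, error) {
	if len(prevHash) != hashSize {
		return nil, errors.New("prev_hash must be exactly 32 bytes")
	}
	h := sha256.New()
	h.Write(prevHash) // hash.Hash.Write never returns an error
	h.Write(canonical)
	return h.Sum(nil), nil
}

// rootPrevHash returns the 32 zero bytes used as prev_hash for the first row of a chain.
func rootPrevHash() []byte {
	return make([]byte, hashSize)
}

// hexHash is a hypothetical convenience wrapper that returns the hash as lowercase hex.
func hexHash(prevHash, canonical []byte) (string, error) {
	sum, err := computeRowHash(prevHash, canonical)
	if err != nil {
		return "", err
	}
	return hex.EncodeToString(sum), nil
}
```

Rejecting a prev_hash that is not exactly 32 bytes keeps a truncated or NULL value from silently producing a valid-looking hash.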

pg_advisory_xact_lock scope

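// Inside AppendBatch, before SELECT ... FOR UPDATE of the session's tail row.
// fnv64 is a package-local helper returning the 64-bit FNV hash of its argument.
lockKey := int64(fnv64("audit_chain:" + sessionID.String()))
if _, err := tx.Exec(ctx, "SELECT pg_advisory_xact_lock($1)", lockKey); err != nil {
	return err
}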

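FNV-64 is deterministic, so every writer derives the same key for a given session. The single-argument form of pg_advisory_xact_lock takes a bigint, hence the int64 conversion. A collision across sessions is acceptable: it is rare, and in the worst case two unrelated sessions serialise.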
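
One way the key derivation could look with the standard library's hash/fnv package is sketched below. The design only says "FNV-64"; the choice of the FNV-1a variant here is an assumption, and the names fnv64a and lockKeyFor are hypothetical.

```go
package main

import "hash/fnv"

// fnv64a returns the 64-bit FNV-1a hash of s.
// Assumption: FNV-1a; the design does not say which FNV-64 variant is used.
func fnv64a(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s)) // hash.Hash.Write never returns an error
	return h.Sum64()
}

// lockKeyFor derives the bigint advisory-lock key for a session.
// The uint64 hash is reinterpreted as int64, so roughly half of all keys are negative,
// which is valid for a PostgreSQL bigint.
func lockKeyFor(sessionID string) int64 {
	return int64(fnv64a("audit_chain:" + sessionID))
}
```

Whichever variant is chosen, it has to stay fixed across all writer instances, since two writers deriving different keys for the same session would not serialise.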

4. Redis key schema

None. Chain state lives entirely in PG; no cache. Redis is NOT in the critical path (audit chain must survive Redis outages).

5. Proto changes

// upsquad/audit/v1/audit.proto

syntax = "proto3";
package upsquad.audit.v1;

import "google/protobuf/timestamp.proto";

service AuditService {
  // Existing RPCs preserved.
  // NEW:
  rpc VerifyChain(VerifyChainRequest) returns (VerifyChainResponse);
}

message VerifyChainRequest {
  oneof scope {
    SessionScope session = 1;
    OrgDateScope org_range = 2;
  }
}

message SessionScope {
  string org_id = 1;
  string session_id = 2;
}

message OrgDateScope {
  string org_id = 1;
  google.protobuf.Timestamp since = 2;
  google.protobuf.Timestamp until = 3;
}

message VerifyChainResponse {
  bool ok = 1;
  int32 total_rows = 2;
  int32 broken_at_row = 3;     // 1-based, 0 if ok
  string broken_row_id = 4;    // uuid string, empty if ok
  string session_id = 5;       // for session scope; empty otherwise
  google.protobuf.Timestamp verified_at = 6;
  int32 sessions_verified = 7; // for org scope
  int32 sessions_broken = 8;
}

Authorisation: verifier RPC is restricted to clearance level COMPLIANCE_AUDITOR and above (enforced by existing gateway RBAC middleware).

6. Unit + integration test plan

Unit (internal/runtime/audit)

  • TestHasher_ComputeRowHash_DeterministicAcrossJSONReordering
  • TestHasher_ComputeRowHash_RootRowPrevHashZero32Bytes
  • TestHasher_ComputeRowHash_ChangeInDetail_ChangesHash
  • TestCanonicalBytes_JCSCompliance_TableDriven (against RFC 8785 test vectors)
  • TestChainWriter_AppendBatch_SingleRow_SetsPrevZeroAndRowHash
  • TestChainWriter_AppendBatch_MultipleRows_LinksPrevToPriorRowHash
  • TestChainWriter_AppendBatch_ConcurrentDifferentSessions_NoDeadlock
  • TestChainWriter_AppendBatch_ConcurrentSameSession_SerialisesCorrectly
  • TestChainWriter_Epoch0Rows_NotTouched
  • TestVerifier_VerifySession_AllOK_ReturnsOK
  • TestVerifier_VerifySession_TamperedDetail_ReportsCorrectBrokenAtRow
  • TestVerifier_VerifySession_TamperedRowHash_Detected
  • TestVerifier_VerifySession_DeletedMiddleRow_DetectedAsChainBreak
  • TestVerifier_VerifyOrg_MultipleSessions_AggregatesCorrectly

Integration (cmd/audit-verify + orchestrator)

  • TestCronVerifier_EmitsMetricGauge
  • TestVerifyChainRPC_AuthzBlocksBelowAuditor
  • TestVerifyChainRPC_SessionScope_OK
  • TestVerifyChainRPC_OrgScope_DateRange_CountsCorrectly
  • TestAuditWriter_EpochFlip_FirstRowAfterCutover_IsEpoch1

Load

  • TestChainWriter_10kRowsAcross100Sessions_NoDeadlock_ThroughputBaseline (regression gate on p99 insert latency)

7. Pen-test scenario

Attack: a DBA with direct psql access runs:

UPDATE agent_audit_log
SET decision='action_auto_executed'
WHERE id='<row_x>' AND chain_epoch=1;

(Assume they have somehow bypassed the REVOKE UPDATE — e.g., via superuser.)

Expected:

  • Row's row_hash (stored) no longer equals SHA256(prev_hash || JCS(content)) because decision is a hash input.
  • Next nightly verifier run for the affected session's org reports ok=false, broken_at_row=N, broken_row_id=<row_x>.
  • Prometheus gauge audit_chain_verifier_status{org_id, epoch} flips to 0.
  • Alert AuditChainBroken pages the on-call within 5 min of the nightly run.
  • VerifyChainResponse exposes the broken row to the compliance UI for forensic follow-up.
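
The detection described in the first two bullets can be sketched as a single in-order walk. This is an illustration under stated assumptions, not the verifier's actual code: storedRow, link, verifyChain, and buildChain are hypothetical names, and Canonical stands in for the JCS bytes the real verifier would recompute from the row's content columns.

```go
package main

import (
	"bytes"
	"crypto/sha256"
)

// storedRow is a hypothetical in-memory view of one chained row, in chain order.
type storedRow struct {
	ID        string
	PrevHash  []byte // as stored in prev_hash
	RowHash   []byte // as stored in row_hash
	Canonical []byte // JCS bytes recomputed from the row's current content
}

// link returns SHA256(prev || canonical).
func link(prev, canonical []byte) []byte {
	h := sha256.New()
	h.Write(prev)
	h.Write(canonical)
	return h.Sum(nil)
}

// verifyChain returns the 1-based index and ID of the first broken row,
// or (0, "") when the chain is intact.
func verifyChain(rows []storedRow) (brokenAt int, brokenID string) {
	expectedPrev := make([]byte, sha256.Size) // chain root: 32 zero bytes
	for i, r := range rows {
		// A deleted or reordered predecessor shows up as a prev_hash mismatch.
		if !bytes.Equal(r.PrevHash, expectedPrev) {
			return i + 1, r.ID
		}
		// Modified content shows up as a row_hash mismatch on the row itself.
		if !bytes.Equal(r.RowHash, link(r.PrevHash, r.Canonical)) {
			return i + 1, r.ID
		}
		expectedPrev = r.RowHash
	}
	return 0, ""
}

// buildChain is a demo helper that produces a valid chain with IDs row-a, row-b, ...
func buildChain(payloads ...string) []storedRow {
	prev := make([]byte, sha256.Size)
	rows := make([]storedRow, 0, len(payloads))
	for i, p := range payloads {
		sum := link(prev, []byte(p))
		rows = append(rows, storedRow{
			ID:        "row-" + string(rune('a'+i)),
			PrevHash:  prev,
			RowHash:   sum,
			Canonical: []byte(p),
		})
		prev = sum
	}
	return rows
}
```

In this sketch, changing the content of the second row makes verifyChain report index 2 with that row's ID, which mirrors the broken_at_row and broken_row_id fields in the expected outcome.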

8. Rollout plan

Feature flag

platform_feature_flags.audit.chain_enabled — boolean. Default false.

Phases

  • Phase 0 (migration): schema columns added, flag false. All new rows continue to be chain_epoch=0, prev_hash=NULL, row_hash=NULL. Verifier deploys in dry-run mode (logs-only).
  • Phase 1 (cutover): flag flipped to true. First insert per session (per service instance) at flag-on time starts a new chain: prev_hash = 32 zero bytes, row_hash = SHA256(32 zero bytes || JCS(row)), chain_epoch=1. All subsequent rows chain off the previous.
  • Phase 2 (verifier armed): nightly CronJob transitions from dry-run to alerting. Alert AuditChainBroken active.

Thresholds

No auto-flip. Cutover is a human-operator decision because the chain is append-only and cannot be "undone" — once epoch 1 rows exist, they must stay valid.

Rollback procedure

Cannot roll back past a row already written at chain_epoch=1 (by design — integrity guarantee would be void). The rollback is:

  1. Flip flag to false. New rows resume chain_epoch=0.
  2. Existing chain_epoch=1+ rows remain valid and continue to be verified.
  3. Operator documents in ADR that the chain is paused; re-enable bumps chain_epoch to 2+ so the paused-window gap is explicit.
  4. The verifier treats each chain_epoch value as an independent chain (prev_hash at first row of each epoch = 32 zero bytes).
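
Step 4 implies that the verifier partitions a session's rows by epoch before walking them. A minimal sketch of that partitioning, assuming rows arrive already sorted by (chain_epoch, created_at, id) as in the index from §2; epochRow and splitByEpoch are hypothetical names:

```go
package main

// epochRow is a hypothetical minimal view of a row; only the epoch matters here.
type epochRow struct {
	ID    string
	Epoch int
}

// splitByEpoch partitions a session's rows into one slice per chain_epoch value.
// Rows at epoch 0 are skipped because they are out of the verifier's scope.
// Each returned slice is then verified as an independent chain whose first row
// must carry 32 zero bytes as prev_hash.
func splitByEpoch(rows []epochRow) [][]epochRow {
	var chains [][]epochRow
	for _, r := range rows {
		if r.Epoch == 0 {
			continue
		}
		n := len(chains)
		if n == 0 || chains[n-1][0].Epoch != r.Epoch {
			chains = append(chains, []epochRow{r}) // a new epoch starts a new chain
			continue
		}
		chains[n-1] = append(chains[n-1], r)
	}
	return chains
}
```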

MTTR to stop writing new chained rows: < 30 s (flag propagation).

Kill-switch

AUDIT_CHAIN_DISABLED=true env var on the orchestrator — forces chain_epoch=0 at the write path regardless of flag. For platform-wide incident recovery only.
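
The precedence between the kill-switch and the feature flag could be expressed as below. This is a sketch under assumptions: the function names are hypothetical, and accepting any value that strconv.ParseBool reads as true (for example "1" or "TRUE") is a choice made here, not something the design specifies.

```go
package main

import (
	"os"
	"strconv"
)

// chainingActive reports whether new rows should be chained.
// The kill-switch wins over the feature flag whenever it parses as true.
// Assumption: an unset or unparsable kill-switch value means "not disabled".
func chainingActive(flagEnabled bool, killSwitchValue string) bool {
	disabled, err := strconv.ParseBool(killSwitchValue)
	if err == nil && disabled {
		return false
	}
	return flagEnabled
}

// chainingActiveFromEnv reads AUDIT_CHAIN_DISABLED from the process environment.
func chainingActiveFromEnv(flagEnabled bool) bool {
	return chainingActive(flagEnabled, os.Getenv("AUDIT_CHAIN_DISABLED"))
}
```

Keeping the environment lookup separate from the decision makes the precedence rule easy to unit-test without mutating process state.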

9. Verifier CronJob

# deploy/k8s/audit-verifier-cronjob.yaml (reference; actual manifest managed via Pulumi)
schedule: "0 2 * * *" # 02:00 UTC nightly
concurrencyPolicy: Forbid
image: upsquad/audit-verify:<sha>
args: ["--mode=scheduled", "--epoch-min=1"]
# Emits: audit_chain_verifier_status{org_id, epoch} gauge
# Writes a persistent run record to audit_chain_verifier_runs (optional follow-up table, not in Wave 1)

On-demand RPC AuditService.VerifyChain shares the same verifier code path; streams rows in 500-row pages to bound memory.

10. Known edge cases (will not fix in Wave 1)

  • Pre-epoch rows (chain_epoch=0) are NOT tamper-evident. A DBA can modify any row written before cutover and the verifier will not catch it. Accepted trade-off; documented in the ADR.
  • Out-of-order inserts within a session: prevented by the advisory lock + created_at ASC, id ASC ordering. If system clock drifts backwards, InsertBatch rejects rows with created_at < prev_row.created_at for the same session.
  • Row deletion: detected as a chain break (next row's prev_hash no longer matches the current preceding row's row_hash). But if the attacker deletes the last row of a session and no newer row is ever written, the break is invisible. Follow-up compliance item: per-session row-count attestation with periodic snapshots (Wave 2).
  • Attacker who compromises the write path AND rewrites forward: they can recompute the entire chain from the tamper point forward. The verifier cannot detect this. Mitigated by Merkle-root external anchoring — deferred to Wave 2.
  • Clock correctness: we rely on the DB server clock (CURRENT_TIMESTAMP) for created_at, not client clocks. NTP drift under 1 s is fine. Gross skew across replicas would break deterministic ordering; this is outside Wave 1's threat model, which assumes single-primary writes.
  • Schema evolution: adding new columns to agent_audit_log requires bumping chain_epoch (all future rows use the new canonical-bytes recipe). The canonical-bytes function is versioned alongside chain_epoch in code.
  • Large rows: detail JSONB is included in the hash. A 10 MB detail costs ~30 ms SHA-256. Tool outputs are already bounded by governance; no new cap introduced here.
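
The created_at rejection rule from the out-of-order-inserts bullet can be sketched as a pre-insert check. The names checkMonotonic, errNonMonotonic, and unixTimes are hypothetical; equal timestamps are allowed in this sketch on the assumption that the id column breaks ties, as the ordering created_at ASC, id ASC suggests.

```go
package main

import (
	"errors"
	"time"
)

var errNonMonotonic = errors.New("created_at precedes the previous row in this session")

// checkMonotonic rejects a batch if any timestamp is earlier than the one before it,
// starting from the created_at of the session's current tail row.
func checkMonotonic(tail time.Time, batch []time.Time) error {
	prev := tail
	for _, ts := range batch {
		if ts.Before(prev) {
			return errNonMonotonic
		}
		prev = ts
	}
	return nil
}

// unixTimes is a demo helper that builds UTC timestamps from Unix seconds.
func unixTimes(secs ...int64) []time.Time {
	out := make([]time.Time, 0, len(secs))
	for _, s := range secs {
		out = append(out, time.Unix(s, 0).UTC())
	}
	return out
}
```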

11. Estimated size

XL (3–4 weeks): schema migration + hashchain package + pgstore write-path integration + advisory lock + verifier binary + CronJob manifest + gRPC RPC + auth wiring + unit/integration/load tests + rollout runbook + ADR on pre-epoch non-coverage. Story points ≈ 13.