UpsQuad Security Incident Response Plan

Version: 1.0
Last Updated: 2026-04-12
Owner: Principal Technical Architect
Review Cadence: Quarterly (next review: 2026-07-12)
Compliance References: HIPAA 164.404, PCI-DSS 12.10, FedRAMP IR-6, GDPR Art. 33-34

1. Incident Classification

Every security event must be classified on intake. Classification determines response timeline, escalation path, and notification obligations.

P0 — Critical

Active or confirmed breach requiring immediate company-wide response.

Unauthorized access to customer data (any tenant's PostgreSQL rows, context vectors, or agent session logs)
Credential compromise: Clerk API keys, GCP service account keys, database credentials, or agent uq_key tokens
Active exploitation of a production vulnerability (RCE, SQL injection, privilege escalation)
Ransomware or destructive attack against GKE clusters or CloudSQL instances
Exfiltration of data confirmed via audit logs or network telemetry

Response SLA: Triage within 15 minutes. Containment within 1 hour.

P1 — High

Confirmed vulnerability or failed control that could lead to a breach if left unremediated.

Exploitable vulnerability in production gRPC services or API gateway (CVSS >= 9.0)
Row-level security (RLS) bypass — a tenant can query another tenant's data
Failed security control: WAF rules disabled, network policy deleted, TLS termination misconfigured
Suspicious activity pattern: bulk data access anomaly, credential stuffing against Clerk, unusual agent API call volume
Compromised CI/CD pipeline (GitHub Actions secrets exposed, ArgoCD admin access)

Response SLA: Triage within 1 hour. Containment within 4 hours.

P2 — Medium

Known risk requiring scheduled remediation.

Failed penetration test finding (external or internal)
Dependency vulnerability rated CRITICAL or HIGH by govulncheck or Trivy scan
Infrastructure misconfiguration: overly permissive IAM role, public GCS bucket, missing network policy
Expired or soon-to-expire TLS certificates
Security scanner findings from Grafana alerting rules

Response SLA: Triage within 24 hours. Remediation plan within 72 hours.

P3 — Low

Informational events requiring tracking but not urgent action.

Policy violation: developer accessed production database directly, secret committed to branch (not merged)
Informational security event: port scan detected, failed auth attempts below threshold
Dependency vulnerability rated MEDIUM or LOW
Documentation gap identified during audit

Response SLA: Tracked in issue backlog. Addressed within current sprint.
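
The clock-based SLAs above can be captured in a small lookup table so that intake tooling computes deadlines consistently. The following is a minimal sketch rather than existing UpsQuad tooling: the names `SLA`, `SLAS`, and `deadlines` are hypothetical, and it assumes both clocks start at detection time. P3 is left out because its SLA is sprint-based rather than clock-based.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SLA:
    triage: timedelta
    follow_up: timedelta  # containment (P0/P1) or remediation plan (P2)


# Values copied from the Response SLA lines in Section 1.
SLAS = {
    "P0": SLA(triage=timedelta(minutes=15), follow_up=timedelta(hours=1)),
    "P1": SLA(triage=timedelta(hours=1), follow_up=timedelta(hours=4)),
    "P2": SLA(triage=timedelta(hours=24), follow_up=timedelta(hours=72)),
}


def deadlines(severity: str, detected_at: datetime) -> dict:
    """Return triage and follow-up deadlines, both measured from detection."""
    sla = SLAS[severity]
    return {
        "triage_by": detected_at + sla.triage,
        "follow_up_by": detected_at + sla.follow_up,
    }


if __name__ == "__main__":
    detected = datetime(2026, 4, 12, 9, 0, tzinfo=timezone.utc)
    print(deadlines("P0", detected))
```

Passing timezone-aware UTC timestamps keeps the computed deadlines comparable with the UTC times used in the incident timeline.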

2. Incident Response Phases

Phase 1: Detection

Goal: Identify the security event as early as possible.

Detection Sources:

Source	What It Catches	Owner
Prometheus/Grafana alerts	Anomalous request rates, error spikes, resource exhaustion, RLS violation counters	DevOps / On-call
Audit log analysis (PostgreSQL `audit_log` table)	Unauthorized data access patterns, privilege escalation, bulk exports	Security Lead
Clerk webhook events	Suspicious auth patterns, impossible travel, credential stuffing	Backend team
Trivy container scanning (CI/CD)	Vulnerable base images, compromised dependencies	DevOps
`govulncheck` in GitHub Actions	Go dependency vulnerabilities	Backend team
GKE audit logs (Cloud Audit Logs)	Unauthorized kubectl access, namespace breakout, pod privilege escalation	DevOps
Customer reports (support channel)	Data they shouldn't see, unexpected agent behavior, auth failures	Support / PM
OpenTelemetry traces	Abnormal latency patterns indicating data exfiltration or crypto mining	DevOps

Detection Actions:

Alert fires or report received
On-call engineer acknowledges within 5 minutes (PagerDuty)
Preliminary assessment: is this a security event? If yes, create incident channel and proceed to Triage
Log the event in #security-incidents channel with timestamp, source, and initial assessment

Phase 2: Triage

Goal: Classify severity, determine scope, and identify affected tenants.

Triage Checklist:

Assign incident severity (P0/P1/P2/P3) using classification above
Create GitHub issue with label security-incident and severity label
Identify attack vector: which service, endpoint, or infrastructure component?
Determine blast radius:
- Which tenants are affected? Query audit_log filtered by timeframe and affected service
- Which data types are exposed? (agent sessions, context vectors, org config, credentials)
- Is the vulnerability actively being exploited or theoretical?
Check for lateral movement: are other services or namespaces compromised?
Assign Incident Commander (IC) per escalation matrix
Begin incident timeline log (every action timestamped)
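
The timeline log in the last checklist item can be as simple as an append-only list of timestamped entries. The sketch below is illustrative only; `IncidentTimeline` and its methods are hypothetical names, and a real log would more likely live in the incident channel or the GitHub issue.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IncidentTimeline:
    incident_id: str
    entries: list = field(default_factory=list)

    def record(self, actor: str, action: str, at: Optional[datetime] = None) -> None:
        """Append one action, stamped with the given time or the current UTC time."""
        stamp = at or datetime.now(timezone.utc)
        self.entries.append((stamp, actor, action))

    def render(self) -> str:
        """Format the log as one line per action, oldest first."""
        lines = []
        for stamp, actor, action in sorted(self.entries, key=lambda e: e[0]):
            lines.append(f"{stamp.isoformat()} [{actor}] {action}")
        return "\n".join(lines)
```

Entries are only ever appended, never edited, which keeps the log usable as input to the timeline walkthrough in the post-incident review.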

Tenant Scope Assessment Query (run against read replica):

-- $incident_start and $affected_service are placeholders; bind them as query parameters.
SELECT tenant_id, COUNT(*) AS affected_records
FROM audit_log
WHERE event_time BETWEEN $incident_start AND NOW()
  AND service = $affected_service
  AND (status_code >= 400 OR action IN ('data_export', 'bulk_read'))
GROUP BY tenant_id
ORDER BY affected_records DESC;
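
To see how the scope query behaves before running it against a replica, it can be rehearsed on toy data. The sketch below uses an in-memory SQLite database purely for illustration: the service names and rows are invented, the column types are assumptions, and an explicit end timestamp replaces `NOW()` because SQLite has no such function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_log (tenant_id TEXT, service TEXT, action TEXT, "
    "status_code INTEGER, event_time TEXT)"
)
conn.executemany(
    "INSERT INTO audit_log VALUES (?, ?, ?, ?, ?)",
    [
        ("t1", "context-api", "read", 200, "2026-04-12T09:10:00Z"),       # normal read
        ("t1", "context-api", "bulk_read", 200, "2026-04-12T09:11:00Z"),  # counted
        ("t2", "context-api", "read", 403, "2026-04-12T09:12:00Z"),       # counted
        ("t3", "billing-api", "data_export", 200, "2026-04-12T09:13:00Z"),  # other service
        ("t2", "context-api", "data_export", 200, "2026-04-11T23:00:00Z"),  # before window
    ],
)

# Same filter logic as the triage query, with bound parameters.
SCOPE_QUERY = """
SELECT tenant_id, COUNT(*) AS affected_records
FROM audit_log
WHERE event_time BETWEEN ? AND ?
  AND service = ?
  AND (status_code >= 400 OR action IN ('data_export', 'bulk_read'))
GROUP BY tenant_id
ORDER BY tenant_id
"""

rows = conn.execute(
    SCOPE_QUERY,
    ("2026-04-12T09:00:00Z", "2026-04-12T10:00:00Z", "context-api"),
).fetchall()
print(rows)
```

Binding the window and service as parameters, rather than interpolating them into the SQL string, avoids introducing an injection risk into the incident tooling itself.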

Phase 3: Containment

Goal: Stop the bleeding. Prevent further damage without destroying forensic evidence.

Immediate Containment Actions by Attack Type:

Scenario	Containment Action	Command / Procedure
Compromised API key	Rotate key in GCP Secret Manager, invalidate cached copies in Redis	`gcloud secrets versions add $SECRET --data-file=new_key && kubectl rollout restart deployment/$SERVICE -n $NS`
Compromised user account	Disable user in Clerk, revoke all sessions	Clerk Dashboard or `clerk.users.update(userId, { locked: true })`
RLS bypass	Enable emergency read-only mode on affected tables, add explicit `tenant_id` filter to application layer	Apply emergency migration, toggle feature flag `rls_enforcement=strict`
Namespace breakout	Apply deny-all network policy to affected namespace, cordon compromised nodes	`kubectl apply -f emergency-deny-all.yaml -n $NS && kubectl cordon $NODE`
Agent token compromise	Revoke `uq_key` for affected agents, pause agent execution	Update `agent_credentials` table `SET revoked_at = NOW()`, clear Redis session cache
Malicious dependency	Pin to last known good version, rebuild and redeploy	Update `go.mod`, run `go mod tidy`, trigger emergency ArgoCD sync
Data exfiltration in progress	Block source IP at Cloudflare/GKE ingress, rate-limit affected endpoint to zero	Cloudflare WAF rule or `kubectl apply` rate-limit policy

Forensic Preservation (before any cleanup):

Snapshot affected CloudSQL instance
Export GKE audit logs for the incident window to GCS
Preserve Prometheus metrics and Grafana dashboards for the period
Export OpenTelemetry traces for affected services
Do NOT delete or modify logs until post-incident review is complete

Phase 4: Eradication

Goal: Remove the root cause so the vulnerability cannot be re-exploited.

Eradication Steps:

Identify root cause (code bug, misconfiguration, compromised credential, supply chain)
Develop fix:
- Code fix: standard PR process with expedited review (architect review still required)
- Config fix: PR to Pulumi IaC or Kubernetes manifests
- Credential rotation: rotate ALL potentially affected credentials, not just confirmed ones
Deploy fix:
- P0/P1: Emergency deployment via ArgoCD with manual sync (bypass normal promotion cadence)
- P2/P3: Standard deployment pipeline
Verify fix:
- Reproduce the original attack vector and confirm it is blocked
- Run targeted security scan against the fix
- Confirm no regression in functionality

Phase 5: Recovery

Goal: Restore full service and verify data integrity.

Recovery Checklist:

Re-enable any services or endpoints disabled during containment
Remove emergency network policies and rate limits
Verify all tenants can access their data normally

Run data integrity checks:

-- Verify no cross-tenant data contamination
SELECT * FROM context_entries
WHERE tenant_id != (SELECT tenant_id FROM agents WHERE id = context_entries.agent_id);

Confirm monitoring is back to baseline (no elevated error rates, latency normal)
Verify all rotated credentials are propagated to all services
Run smoke tests against all gRPC service health endpoints
Confirm ArgoCD shows all applications in sync
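
The cross-tenant contamination check in this checklist can be rehearsed on toy data to confirm what a positive result looks like. This is a sketch under assumptions: the table shapes are guessed from the query, the rows are invented, and SQLite stands in for PostgreSQL. The query is written as a join, which returns the same rows as the correlated subquery as long as `agents.id` is unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE agents (id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL);
    CREATE TABLE context_entries (
        id INTEGER PRIMARY KEY, agent_id INTEGER NOT NULL, tenant_id TEXT NOT NULL
    );
    INSERT INTO agents VALUES (1, 'tenant-a'), (2, 'tenant-b');
    INSERT INTO context_entries VALUES
        (10, 1, 'tenant-a'),  -- consistent
        (11, 2, 'tenant-b'),  -- consistent
        (12, 2, 'tenant-a');  -- contaminated: agent 2 belongs to tenant-b
    """
)

CONTAMINATION_QUERY = """
SELECT c.id, c.tenant_id, a.tenant_id
FROM context_entries AS c
JOIN agents AS a ON a.id = c.agent_id
WHERE c.tenant_id != a.tenant_id
"""

contaminated = conn.execute(CONTAMINATION_QUERY).fetchall()
print(contaminated)
```

An empty result is the passing condition. Neither form of the query reports entries whose `agent_id` has no matching agent, so orphaned rows need a separate check.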

Phase 6: Post-Incident Review

Goal: Learn and prevent recurrence. Must happen within 48 hours of incident closure.

Post-Incident Review Meeting:

Attendees: Incident Commander, all responders, CTO, affected team leads
Agenda:
1. Timeline walkthrough (using incident log)
2. What went well in the response?
3. What could be improved?
4. Root cause analysis (use 5-whys method)
5. Action items with owners and deadlines

Post-Incident Deliverables:

Incident Report (GitHub issue with label post-incident):
- Timeline of events
- Root cause analysis
- Impact assessment (tenants affected, data exposed, duration)
- Remediation actions taken
- Preventive measures to implement
Action Items: Create GitHub issues for each preventive measure, linked to incident
Runbook Updates: Update relevant runbooks based on lessons learned
Monitoring Improvements: Add alerts for the detection gap that allowed this incident
Lessons entry: Update tasks/lessons.md with the pattern for agent team learning

3. Breach Notification Procedure

A "breach" is confirmed unauthorized access to, or exfiltration of, customer data. Not all security incidents are breaches — only those involving actual data exposure.

Breach Confirmation Criteria

A breach is confirmed when ALL of the following are true:

Unauthorized party accessed or exfiltrated data
The data includes personally identifiable information (PII) or customer business data
The access was not authorized by the data owner

Internal Notification

Who	When	How
CTO	Within 1 hour of breach confirmation	Phone call + Slack DM
Security Lead	Within 1 hour of breach confirmation	Phone call + Slack DM
CEO	Within 2 hours of breach confirmation	Phone call from CTO
Legal counsel	Within 4 hours of breach confirmation	Email from CTO with incident summary
All engineering	Within 8 hours of breach confirmation	`#security-incidents` Slack channel

Customer Notification

Regulation	Deadline	Authority
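GDPR (EU residents)	72 hours from awareness for the authority; without undue delay for individuals when the risk to them is high	Supervisory Authority + affected individuals
HIPAA (if health data)	Without unreasonable delay, no later than 60 calendar days from discovery	HHS OCR + affected individuals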
CCPA (CA residents)	Without unreasonable delay	CA Attorney General (if 500+ residents)
PCI-DSS (if card data)	Immediately upon discovery	Card brands + acquiring bank
General (no specific regulation)	72 hours (our policy)	Affected customers directly
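
When several regimes apply to one breach, the earliest deadline is the binding one. The sketch below computes it from the fixed windows in the table; the function names are hypothetical, and it assumes the clock starts at the moment of awareness. CCPA and PCI-DSS are left out because the table gives them no fixed window.

```python
from datetime import datetime, timedelta, timezone

# Fixed-window deadlines from the Customer Notification table.
NOTIFICATION_WINDOWS = {
    "GDPR": timedelta(hours=72),
    "HIPAA": timedelta(days=60),
    "General": timedelta(hours=72),
}


def notification_deadlines(aware_at: datetime, regimes: list) -> dict:
    """Map each applicable regime to its latest permitted notification time."""
    return {name: aware_at + NOTIFICATION_WINDOWS[name] for name in regimes}


def earliest_deadline(aware_at: datetime, regimes: list) -> datetime:
    """The binding deadline is the earliest one across all applicable regimes."""
    return min(notification_deadlines(aware_at, regimes).values())


if __name__ == "__main__":
    aware = datetime(2026, 4, 12, 9, 0, tzinfo=timezone.utc)
    print(earliest_deadline(aware, ["GDPR", "HIPAA"]))
```

These are outer limits, not targets: legal counsel decides which regimes actually apply to a given breach.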

Notification Template

Subject: Security Incident Notification — UpsQuad [Incident ID]

Dear [Customer Name],

We are writing to inform you of a security incident that may have affected your
data on the UpsQuad platform.

WHAT HAPPENED
On [date], we detected [brief description of the incident — e.g., "unauthorized
access to our database through a vulnerability in our API gateway"]. The incident
occurred between [start time] and [end time] UTC.

WHAT DATA WAS INVOLVED
Based on our investigation, the following data associated with your account may
have been accessed:
- [List specific data types: agent session logs, organization configuration,
  context entries, user email addresses, etc.]

WHAT WE ARE DOING
- [Specific remediation step 1: e.g., "We have rotated all affected credentials
  and deployed a fix to the vulnerability"]
- [Specific remediation step 2: e.g., "We have engaged a third-party security
  firm to conduct a full audit"]
- [Specific remediation step 3: e.g., "We are implementing additional monitoring
  to detect similar patterns"]

WHAT YOU SHOULD DO
- [Action 1: e.g., "Rotate any API keys you have configured in UpsQuad"]
- [Action 2: e.g., "Review your agent activity logs for unexpected actions"]
- [Action 3: e.g., "Enable MFA if not already active on your account"]

CONTACT
If you have questions, contact our security team at security@upsquad.ai or
your account representative. We will provide updates as our investigation
continues.

Reference: Incident [ID]

[Name]
[Title]
UpsQuad Security Team
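
Because the template relies on bracketed placeholders, a simple guard can catch any that were left unfilled before a notification goes out. The following is a minimal sketch with hypothetical helper names (`render`, `unfilled_placeholders`); it assumes placeholders are always written as `[text]` and never nested.

```python
import re

# Matches one bracketed placeholder, including ones that span several lines.
PLACEHOLDER = re.compile(r"\[[^\[\]]+\]")


def render(template: str, values: dict) -> str:
    """Replace each [key] with its value; unknown placeholders are left as-is."""
    return PLACEHOLDER.sub(
        lambda m: values.get(m.group(0)[1:-1], m.group(0)), template
    )


def unfilled_placeholders(message: str) -> list:
    """Return every bracketed placeholder still present in the message."""
    return PLACEHOLDER.findall(message)


if __name__ == "__main__":
    draft = render("Dear [Customer Name],\nReference: Incident [ID]",
                   {"Customer Name": "Acme"})
    print(unfilled_placeholders(draft))
```

A send step could refuse to proceed while `unfilled_placeholders` returns a non-empty list.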

Notification Delivery

Email to the organization's primary admin contact (from Clerk org metadata)
In-app notification banner on the client portal for affected tenants
Dedicated status page update at status.upsquad.ai
For P0 breaches affecting 100+ tenants: public blog post within 7 days

4. Escalation Matrix

Severity	First Responder	Escalation (15 min no response)	Executive (1 hr)	Communication Lead
P0	On-call engineer	Security Lead + CTO	CEO	CTO drafts external comms
P1	On-call engineer	Security Lead	CTO	Security Lead drafts internal comms
P2	Assigned engineer	Team Lead	—	Team Lead updates ticket
P3	Assigned engineer	—	—	Engineer updates ticket
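
The matrix above can be read as a lookup keyed on severity and on how long the page has gone unanswered. The sketch below encodes that reading; `who_to_page` is a hypothetical helper, and it assumes the "Executive (1 hr)" column, like the escalation column, is measured from the first unanswered page.

```python
from datetime import timedelta

# Rows copied from the escalation matrix; None marks a cell with no escalation.
ESCALATION = {
    "P0": ("On-call engineer", "Security Lead + CTO", "CEO"),
    "P1": ("On-call engineer", "Security Lead", "CTO"),
    "P2": ("Assigned engineer", "Team Lead", None),
    "P3": ("Assigned engineer", None, None),
}


def who_to_page(severity: str, unanswered_for: timedelta) -> str:
    """Return the role to contact given how long the incident has gone unanswered."""
    first, escalation, executive = ESCALATION[severity]
    if executive and unanswered_for >= timedelta(hours=1):
        return executive
    if escalation and unanswered_for >= timedelta(minutes=15):
        return escalation
    return first


if __name__ == "__main__":
    print(who_to_page("P0", timedelta(minutes=20)))
```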

On-Call Rotation

Primary on-call: rotates weekly across backend and DevOps engineers
Secondary on-call: Security Lead (always reachable for P0/P1)
Escalation tool: PagerDuty with 5-minute auto-escalation for unacknowledged P0/P1 alerts

Incident Commander Responsibilities

The Incident Commander (IC) is assigned during triage and owns the incident until closure:

Coordinates all response activities
Maintains the incident timeline log
Makes containment decisions (what to shut down, what to keep running)
Communicates status updates every 30 minutes during active P0/P1 incidents
Ensures post-incident review is scheduled and conducted
Signs off on incident closure

5. Security Runbooks

5.1 Credential Leak (API Key Exposed in Logs or Repository)

Detection: Secret scanning alert (GitHub Advanced Security), log monitoring regex match, manual discovery.

Immediate Actions (within 15 minutes):

Identify which credential was leaked: GCP service account key, Clerk API key, database password, agent uq_key, OpenAI/Anthropic API key, Redis password
Determine exposure scope: was it in a public repo, a log accessible to customers, an internal log, or a CI artifact?
Rotate the credential immediately:
- GCP SA key: gcloud iam service-accounts keys create new key, delete old key, update Secret Manager
- Clerk API key: regenerate in Clerk Dashboard, update CLERK_SECRET_KEY in Secret Manager
- Database password: ALTER USER ... PASSWORD '...' on CloudSQL, update Secret Manager, restart PgBouncer
- Agent uq_key: UPDATE agent_credentials SET revoked_at = NOW() WHERE key_hash = $hash, issue new key
- LLM API key (OpenAI/Anthropic/Google): revoke in provider dashboard, update Secret Manager

Trigger rolling restart of all services consuming the rotated credential:

kubectl rollout restart deployment -l uses-secret=$SECRET_NAME -n upsquad

Audit usage: check if the leaked credential was used between leak time and rotation
If the credential was used maliciously, escalate to P0

Prevention:

Pre-commit hook with gitleaks to block secrets in commits
Log sanitization middleware in all gRPC interceptors (redact patterns matching key formats)
Secret Manager with automatic rotation policies
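
The log sanitization item can be illustrated with a small redaction function. This is a sketch only: the plan does not specify key formats, so the patterns below (a `uq_` prefix for agent keys, an `sk_live_`/`sk_test_` prefix for secret keys, and `password=` pairs) are assumptions to be replaced with the real formats. The production middleware would run inside the gRPC interceptors rather than as a standalone function.

```python
import re

# Hypothetical token shapes, for illustration only.
TOKEN_PATTERNS = [
    re.compile(r"\buq_[A-Za-z0-9]{16,}\b"),                # assumed agent uq_key shape
    re.compile(r"\bsk_(?:live|test)_[A-Za-z0-9]{16,}\b"),  # assumed secret-key shape
]
# key=value pairs where the key is "password" (case-insensitive).
PASSWORD_PAIR = re.compile(r"(?i)(password\s*=\s*)\S+")


def redact(line: str) -> str:
    """Replace anything matching a known secret pattern with [REDACTED]."""
    for pattern in TOKEN_PATTERNS:
        line = pattern.sub("[REDACTED]", line)
    return PASSWORD_PAIR.sub(r"\1[REDACTED]", line)


if __name__ == "__main__":
    print(redact("auth uq_abcdefghijklmnop1234 password=hunter2"))
```

Pattern-based redaction only catches formats it knows about, so it complements secret scanning and rotation rather than replacing them.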

5.2 Data Exfiltration (Unauthorized Bulk Data Access)

Detection: Prometheus alert on context_reads_total exceeding per-tenant threshold, anomalous SELECT volume in PostgreSQL slow query log, unusual egress traffic in GKE network metrics.

Immediate Actions:

Identify the source: which tenant, user, agent, or service account is making the requests?
Block the source:
- If agent: UPDATE agents SET status = 'suspended' WHERE id = $agent_id; clear agent session from Redis
- If user: lock account in Clerk
- If service account: revoke GCP IAM binding

Apply emergency rate limit to the affected gRPC endpoint:

# Apply via kubectl to Envoy sidecar config
rate_limit:
  requests_per_unit: 1
  unit: MINUTE

Snapshot the database for forensic analysis

Quantify data exposure:

SELECT COUNT(*), array_agg(DISTINCT tenant_id)
FROM audit_log
WHERE actor_id = $suspect_id
  AND action IN ('read', 'list', 'export')
  AND event_time > $suspicious_start;

If cross-tenant data was accessed, escalate to P0 breach

Prevention:

Per-tenant query rate limits enforced at the gRPC middleware layer
Anomaly detection alerts on read volume (baseline + 3 standard deviations)
PostgreSQL RLS policies as defense-in-depth (even if application layer is bypassed)
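
The "baseline + 3 standard deviations" rule for read volume works out as follows. The sketch uses the sample standard deviation from Python's `statistics` module; the function names and the example counts are illustrative, and the choice of baseline window is left open by the plan.

```python
from statistics import mean, stdev


def read_volume_threshold(baseline_counts: list, sigmas: float = 3.0) -> float:
    """Alert threshold: baseline mean plus `sigmas` sample standard deviations."""
    return mean(baseline_counts) + sigmas * stdev(baseline_counts)


def is_anomalous(current: float, baseline_counts: list, sigmas: float = 3.0) -> bool:
    """True when the current read count exceeds the threshold."""
    return current > read_volume_threshold(baseline_counts, sigmas)


if __name__ == "__main__":
    baseline = [100, 110, 90, 105, 95]  # invented per-interval read counts
    print(round(read_volume_threshold(baseline), 2))
    print(is_anomalous(130, baseline))
```

`stdev` needs at least two baseline points, and a very quiet tenant will have a tight threshold, so a minimum absolute floor may be worth adding in practice.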

5.3 Unauthorized Access (Compromised User or Agent Account)

Detection: Clerk webhook for suspicious sign-in (new device, impossible travel), agent performing actions outside its governance policy, failed authorization attempts exceeding threshold.

Immediate Actions:

Lock the compromised account:
- User: clerk.users.update(userId, { locked: true }) — revokes all sessions
- Agent: set status = 'suspended' in database, remove from Redis active-agent set
Revoke all active sessions and tokens associated with the account

Review the account's recent activity:

SELECT action, resource, status_code, event_time
FROM audit_log
WHERE actor_id = $compromised_id
ORDER BY event_time DESC
LIMIT 1000;

Identify how the account was compromised:
- Credential stuffing? Check Clerk auth logs for brute-force patterns
- Session hijacking? Check for session token reuse from different IPs
- Phishing? Contact the account owner
- Agent jailbreak? Review agent session transcripts for prompt injection
If the compromised account accessed other tenants' data, escalate to P0 breach
Reset credentials and require MFA re-enrollment before re-enabling

Prevention:

Enforce MFA for all admin-level accounts via Clerk
Agent governance policies limit blast radius (agents cannot access data outside their scope)
Session tokens have short TTL (15 minutes) with refresh rotation
IP allowlisting for sensitive operations
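
The session hijacking check in the immediate actions (session token reuse from different IPs) amounts to grouping auth events by session and flagging any session seen from more than one address. The sketch below shows that grouping; the event shape `(session_id, ip)` is an assumption, and the function name is hypothetical.

```python
from collections import defaultdict


def sessions_with_multiple_ips(events) -> dict:
    """Return {session_id: sorted IPs} for sessions seen from more than one IP.

    `events` is an iterable of (session_id, ip) pairs.
    """
    ips_by_session = defaultdict(set)
    for session_id, ip in events:
        ips_by_session[session_id].add(ip)
    return {
        session: sorted(ips)
        for session, ips in ips_by_session.items()
        if len(ips) > 1
    }


if __name__ == "__main__":
    sample = [
        ("sess-1", "203.0.113.5"),
        ("sess-1", "198.51.100.7"),
        ("sess-2", "203.0.113.9"),
    ]
    print(sessions_with_multiple_ips(sample))
```

A hit is a lead rather than proof: mobile and VPN users change addresses legitimately, so flagged sessions still need a manual look.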

5.4 Supply Chain Attack (Compromised Dependency)

Detection: Trivy scan in CI flags a known-malicious package version, govulncheck detects a CVE in a direct dependency, security advisory from Go vulnerability database or npm advisory.

Immediate Actions:

Determine if the compromised version is deployed in production:

# For Go dependencies
kubectl exec -n upsquad deploy/$SERVICE -- go version -m /app | grep $PACKAGE
# For Node dependencies (client portal)
kubectl exec -n upsquad deploy/client-portal -- npm ls $PACKAGE

If deployed, assess impact:
- What does the compromised package do? (network access, file system, crypto)
- Was it a build-time only dependency or runtime?
- Were any malicious payloads executed? (check outbound network connections)

Pin to last known good version:

# Go
go get $PACKAGE@$SAFE_VERSION && go mod tidy
# Node
npm install $PACKAGE@$SAFE_VERSION --save-exact

Rebuild all container images from scratch (not from cache):
```
docker build --no-cache -t $IMAGE:$TAG .
```
Deploy clean images via emergency ArgoCD sync
Audit: check if the compromised dependency exfiltrated any secrets or data
If secrets were potentially exfiltrated, trigger credential rotation (see Runbook 5.1)

Prevention:

go.sum and package-lock.json checked into source control (integrity verification)
Trivy scanning in CI blocks merges with CRITICAL/HIGH vulnerabilities
Dependabot enabled for automated dependency updates
Minimal dependency policy: prefer standard library over third-party where feasible
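
Because `go.sum` is checked in, a quick first pass during a supply chain incident is to search it across repositories for the compromised module version. The sketch below parses `go.sum` content for that purpose; the helper names and the example module path are hypothetical.

```python
GO_MOD_SUFFIX = "/go.mod"


def modules_in_go_sum(go_sum_text: str) -> set:
    """Return the set of (module, version) pairs listed in go.sum content."""
    found = set()
    for line in go_sum_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        module, version, _hash = parts
        if version.endswith(GO_MOD_SUFFIX):
            version = version[: -len(GO_MOD_SUFFIX)]
        found.add((module, version))
    return found


def is_listed(go_sum_text: str, module: str, version: str) -> bool:
    """True when the given module version appears anywhere in go.sum."""
    return (module, version) in modules_in_go_sum(go_sum_text)


if __name__ == "__main__":
    sample = (
        "example.com/lib v1.4.2 h1:aaaa=\n"
        "example.com/lib v1.4.2/go.mod h1:bbbb=\n"
    )
    print(is_listed(sample, "example.com/lib", "v1.4.2"))
```

A match means "investigate", not "deployed": `go.sum` can list versions that are not part of the final build, so the `go version -m` check against the running binary remains the authoritative answer.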

5.5 DDoS / Service Disruption

Detection: Prometheus alert on request rate exceeding 10x baseline, GKE node pool hitting resource limits, Cloudflare DDoS detection triggers, customer reports of degraded performance.

Immediate Actions:

Confirm it is a DDoS and not a legitimate traffic spike (check whether specific tenants or all tenants are affected):
```
sum(rate(grpc_server_handled_total[1m])) by (grpc_service)
```
Enable Cloudflare "Under Attack" mode if traffic is external:
- Cloudflare Dashboard > Security > Under Attack Mode: ON
- This enables JavaScript challenges for all requests

If attack is targeting specific endpoints, apply targeted rate limiting:

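# Block specific IP ranges at Cloudflare.
# The body is double-quoted so the shell expands $ATTACKER_RANGE; the API expects an array of rules.
curl -X POST "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/firewall/rules" \
  -H "Authorization: Bearer $CF_TOKEN" \
  -H "Content-Type: application/json" \
  -d "[{\"filter\":{\"expression\":\"ip.src in {$ATTACKER_RANGE}\"},\"action\":\"block\"}]"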

Scale up the GKE node pool if legitimate traffic is being impacted (Standard-mode clusters only; Autopilot clusters provision nodes automatically):

gcloud container clusters resize $CLUSTER --node-pool $POOL --num-nodes $N --zone $ZONE

If specific gRPC services are overwhelmed, enable circuit breaker:
- Set Envoy outlier detection to eject unhealthy pods faster
- Reduce connection limits to shed excess load
Monitor recovery: watch error rate return to baseline before declaring all-clear
Post-incident: analyze attack patterns and add permanent WAF rules

Prevention:

Cloudflare in front of all public endpoints (DDoS mitigation built-in)
Per-tenant rate limiting at API gateway level (prevents single tenant from consuming all capacity)
GKE Autopilot auto-scales nodes, but set PodDisruptionBudgets to maintain availability
gRPC keepalive settings tuned to drop idle connections from botnets
Horizontal Pod Autoscaler with aggressive scale-up (30-second cooldown for P0 scenarios)
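
Per-tenant rate limiting is commonly implemented as one token bucket per tenant, so a noisy tenant exhausts only its own budget. The sketch below illustrates that technique and is not the gateway's actual implementation: the class names are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Refills at `rate_per_sec` up to `capacity`; each request spends one token."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerTenantLimiter:
    """Keeps an independent bucket per tenant ID."""

    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.buckets = {}

    def allow(self, tenant_id: str, now: float) -> bool:
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            bucket = self.buckets[tenant_id] = TokenBucket(self.rate, self.capacity, now)
        return bucket.allow(now)


if __name__ == "__main__":
    limiter = PerTenantLimiter(rate_per_sec=1, capacity=2)
    print([limiter.allow("tenant-a", 0.0) for _ in range(3)])
```

In a multi-replica gateway the bucket state would need to be shared (for example in Redis) for the limit to hold across pods.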

6. Communication Templates

Internal Status Update (every 30 minutes during active P0/P1)

INCIDENT UPDATE — [Incident ID] — [Timestamp UTC]
Status: [Investigating | Identified | Monitoring | Resolved]
Severity: [P0 | P1]
IC: [Name]

Current situation: [1-2 sentences]
Actions taken since last update: [Bullet list]
Next steps: [Bullet list]
ETA to resolution: [Estimate or "Unknown"]
Customer impact: [Description]

External Status Page Update

[Timestamp] — Investigating: We are investigating reports of [brief description].
Some customers may experience [impact]. We will provide an update within 30 minutes.

[Timestamp] — Identified: We have identified the cause and are implementing a fix.
[Service] functionality may be degraded. No data loss has occurred.

[Timestamp] — Resolved: The issue has been resolved. All services are operating normally.
A full post-incident report will be published within 48 hours.

7. Compliance Mapping

Requirement	This Plan Section	Evidence
HIPAA 164.404 (breach notification)	Section 3	Notification within 60 days, template provided
HIPAA 164.308(a)(6) (security incident procedures)	Sections 1-2	Classification and response phases documented
PCI-DSS 12.10 (incident response plan)	Sections 1-6	Full plan with runbooks and escalation
PCI-DSS 12.10.2 (annual testing)	Section 8	Annual tabletop exercise, quarterly plan review
FedRAMP IR-6 (incident reporting)	Sections 3-4	Escalation matrix and notification deadlines
GDPR Art. 33 (notification to authority)	Section 3	72-hour notification to supervisory authority
GDPR Art. 34 (notification to data subject)	Section 3	Customer notification template and procedure
SOC 2 CC7.3 (security incidents)	Sections 1-6	Detection, response, communication, and review

8. Plan Maintenance

Quarterly review: Security Lead reviews and updates this plan every quarter
After every P0/P1 incident: update runbooks with lessons learned within 1 week
Annual tabletop exercise: simulate a P0 breach scenario and walk through the full response
New hire onboarding: all engineers read this plan within their first week
Version control: this document is maintained in docs/security/incident-response-plan.md in the upsquad-core repository; all changes go through PR review
