UpsQuad Security Incident Response Plan
Version: 1.0
Last Updated: 2026-04-12
Owner: Principal Technical Architect
Review Cadence: Quarterly (next review: 2026-07-12)
Compliance References: HIPAA 164.404, PCI-DSS 12.10, FedRAMP IR-6, GDPR Art. 33-34
1. Incident Classification
Every security event must be classified on intake. Classification determines response timeline, escalation path, and notification obligations.
P0 — Critical
Active or confirmed breach requiring immediate company-wide response.
- Unauthorized access to customer data (any tenant's PostgreSQL rows, context vectors, or agent session logs)
- Credential compromise: Clerk API keys, GCP service account keys, database credentials, or agent uq_key tokens
- Active exploitation of a production vulnerability (RCE, SQL injection, privilege escalation)
- Ransomware or destructive attack against GKE clusters or CloudSQL instances
- Exfiltration of data confirmed via audit logs or network telemetry
Response SLA: Triage within 15 minutes. Containment within 1 hour.
P1 — High
Confirmed vulnerability or failed control that could lead to a breach if left unaddressed.
- Exploitable vulnerability in production gRPC services or API gateway (CVSS >= 9.0)
- Row-level security (RLS) bypass — a tenant can query another tenant's data
- Failed security control: WAF rules disabled, network policy deleted, TLS termination misconfigured
- Suspicious activity pattern: bulk data access anomaly, credential stuffing against Clerk, unusual agent API call volume
- Compromised CI/CD pipeline (GitHub Actions secrets exposed, ArgoCD admin access)
Response SLA: Triage within 1 hour. Containment within 4 hours.
P2 — Medium
Known risk requiring scheduled remediation.
- Failed penetration test finding (external or internal)
- Dependency vulnerability rated CRITICAL or HIGH by govulncheck or Trivy scan
- Infrastructure misconfiguration: overly permissive IAM role, public GCS bucket, missing network policy
- Expired or soon-to-expire TLS certificates
- Security scanner findings from Grafana alerting rules
Response SLA: Triage within 24 hours. Remediation plan within 72 hours.
P3 — Low
Informational events requiring tracking but not urgent action.
- Policy violation: developer accessed production database directly, secret committed to branch (not merged)
- Informational security event: port scan detected, failed auth attempts below threshold
- Dependency vulnerability rated MEDIUM or LOW
- Documentation gap identified during audit
Response SLA: Tracked in issue backlog. Addressed within current sprint.
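The response SLAs above can be encoded as a small lookup so that tooling computes due times consistently. The following is a minimal sketch, not an existing UpsQuad component: the names `Sla`, `SLAS`, and `deadlines` are hypothetical, and it assumes the SLA clock starts when the event is classified on intake.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class Sla:
    triage: Optional[timedelta]     # None: tracked in the backlog, no fixed timer
    follow_up: Optional[timedelta]  # containment (P0/P1) or remediation plan (P2)


# Values copied from the "Response SLA" lines in this section.
SLAS = {
    "P0": Sla(timedelta(minutes=15), timedelta(hours=1)),
    "P1": Sla(timedelta(hours=1), timedelta(hours=4)),
    "P2": Sla(timedelta(hours=24), timedelta(hours=72)),
    "P3": Sla(None, None),
}


def deadlines(severity: str, classified_at: datetime) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return (triage_due, follow_up_due); both are None for P3."""
    sla = SLAS[severity]
    triage_due = classified_at + sla.triage if sla.triage is not None else None
    follow_up_due = classified_at + sla.follow_up if sla.follow_up is not None else None
    return triage_due, follow_up_due


if __name__ == "__main__":
    intake = datetime(2026, 4, 12, 9, 0, tzinfo=timezone.utc)
    print(deadlines("P1", intake))
```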
2. Incident Response Phases
Phase 1: Detection
Goal: Identify the security event as early as possible.
Detection Sources:
| Source | What It Catches | Owner |
|---|---|---|
| Prometheus/Grafana alerts | Anomalous request rates, error spikes, resource exhaustion, RLS violation counters | DevOps / On-call |
| Audit log analysis (PostgreSQL audit_log table) | Unauthorized data access patterns, privilege escalation, bulk exports | Security Lead |
| Clerk webhook events | Suspicious auth patterns, impossible travel, credential stuffing | Backend team |
| Trivy container scanning (CI/CD) | Vulnerable base images, compromised dependencies | DevOps |
| govulncheck in GitHub Actions | Go dependency vulnerabilities | Backend team |
| GKE audit logs (Cloud Audit Logs) | Unauthorized kubectl access, namespace breakout, pod privilege escalation | DevOps |
| Customer reports (support channel) | Data they shouldn't see, unexpected agent behavior, auth failures | Support / PM |
| OpenTelemetry traces | Abnormal latency patterns indicating data exfiltration or crypto mining | DevOps |
Detection Actions:
- Alert fires or report received
- On-call engineer acknowledges within 5 minutes (PagerDuty)
- Preliminary assessment: is this a security event? If yes, create incident channel and proceed to Triage
- Log the event in the #security-incidents channel with timestamp, source, and initial assessment
Phase 2: Triage
Goal: Classify severity, determine scope, and identify affected tenants.
Triage Checklist:
- Assign incident severity (P0/P1/P2/P3) using classification above
- Create GitHub issue with label security-incident and severity label
- Identify attack vector: which service, endpoint, or infrastructure component?
- Determine blast radius:
- Which tenants are affected? Query audit_log filtered by timeframe and affected service
- Which data types are exposed? (agent sessions, context vectors, org config, credentials)
- Is the vulnerability actively being exploited or theoretical?
- Check for lateral movement: are other services or namespaces compromised?
- Assign Incident Commander (IC) per escalation matrix
- Begin incident timeline log (every action timestamped)
Tenant Scope Assessment Query (run against read replica):
SELECT tenant_id, COUNT(*) AS affected_records
FROM audit_log
WHERE event_time BETWEEN $incident_start AND NOW()
AND service = $affected_service
AND (status_code >= 400 OR action IN ('data_export', 'bulk_read'))
GROUP BY tenant_id;
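To show how the filters in the scope query interact, here is a small sketch that runs the same logic against toy data. It uses Python's built-in sqlite3 module purely as a stand-in for the PostgreSQL read replica. The table columns are inferred from the query, the service names and rows are invented for illustration, and named parameters replace the $incident_start and $affected_service placeholders.

```python
import sqlite3

SCOPE_QUERY = """
SELECT tenant_id, COUNT(*) AS affected_records
FROM audit_log
WHERE event_time BETWEEN :incident_start AND :now
  AND service = :service
  AND (status_code >= 400 OR action IN ('data_export', 'bulk_read'))
GROUP BY tenant_id
ORDER BY tenant_id
"""


def affected_tenants(conn, incident_start, now, service):
    """Return {tenant_id: suspicious_record_count} for the incident window."""
    params = {"incident_start": incident_start, "now": now, "service": service}
    return {tenant: count for tenant, count in conn.execute(SCOPE_QUERY, params)}


# Toy data; ISO-8601 strings sort chronologically, so BETWEEN works on text here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_log (tenant_id TEXT, event_time TEXT, service TEXT,"
    " status_code INTEGER, action TEXT)"
)
conn.executemany(
    "INSERT INTO audit_log VALUES (?, ?, ?, ?, ?)",
    [
        ("t1", "2026-04-12T10:05:00", "context-svc", 200, "bulk_read"),    # counted
        ("t1", "2026-04-12T10:06:00", "context-svc", 403, "read"),         # counted
        ("t2", "2026-04-12T10:07:00", "context-svc", 200, "read"),         # normal read
        ("t3", "2026-04-12T09:00:00", "context-svc", 500, "read"),         # before window
        ("t4", "2026-04-12T10:08:00", "billing-svc", 200, "data_export"),  # other service
    ],
)
scope = affected_tenants(conn, "2026-04-12T10:00:00", "2026-04-12T11:00:00", "context-svc")
print(scope)
```

Passing the window and service as bound parameters, rather than interpolating them into the SQL string, keeps the triage query itself safe from injection.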
Phase 3: Containment
Goal: Stop the bleeding. Prevent further damage without destroying forensic evidence.
Immediate Containment Actions by Attack Type:
| Scenario | Containment Action | Command / Procedure |
|---|---|---|
| Compromised API key | Rotate key in GCP Secret Manager, invalidate cached copies in Redis | gcloud secrets versions add $SECRET --data-file=new_key && kubectl rollout restart deployment/$SERVICE -n $NS |
| Compromised user account | Disable user in Clerk, revoke all sessions | Clerk Dashboard or clerk.users.update(userId, { locked: true }) |
| RLS bypass | Enable emergency read-only mode on affected tables, add explicit tenant_id filter to application layer | Apply emergency migration, toggle feature flag rls_enforcement=strict |
| Namespace breakout | Apply deny-all network policy to affected namespace, cordon compromised nodes | kubectl apply -f emergency-deny-all.yaml -n $NS && kubectl cordon $NODE |
| Agent token compromise | Revoke uq_key for affected agents, pause agent execution | Update agent_credentials table SET revoked_at = NOW(), clear Redis session cache |
| Malicious dependency | Pin to last known good version, rebuild and redeploy | Update go.mod, run go mod tidy, trigger emergency ArgoCD sync |
| Data exfiltration in progress | Block source IP at Cloudflare/GKE ingress, rate-limit affected endpoint to zero | Cloudflare WAF rule or kubectl apply rate-limit policy |
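The namespace breakout row refers to emergency-deny-all.yaml without showing its contents. As a hedged sketch of what such a manifest could contain, the snippet below builds a standard Kubernetes deny-all NetworkPolicy and prints it as JSON, which kubectl apply -f also accepts. The policy name and the helper `deny_all_policy` are assumptions; the real file in the repository may differ.

```python
import json


def deny_all_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that blocks all ingress and egress in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "emergency-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            # Both policy types are listed with no allow rules,
            # so all inbound and outbound traffic is denied.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


if __name__ == "__main__":
    print(json.dumps(deny_all_policy("upsquad"), indent=2))
```

Denying egress also blocks DNS lookups from the affected pods, which is usually what containment wants but will break any in-pod forensic tooling that needs the network.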
Forensic Preservation (before any cleanup):
- Snapshot affected CloudSQL instance
- Export GKE audit logs for the incident window to GCS
- Preserve Prometheus metrics and Grafana dashboards for the period
- Export OpenTelemetry traces for affected services
- Do NOT delete or modify logs until post-incident review is complete
Phase 4: Eradication
Goal: Remove the root cause so the vulnerability cannot be re-exploited.
Eradication Steps:
- Identify root cause (code bug, misconfiguration, compromised credential, supply chain)
- Develop fix:
- Code fix: standard PR process with expedited review (architect review still required)
- Config fix: PR to Pulumi IaC or Kubernetes manifests
- Credential rotation: rotate ALL potentially affected credentials, not just confirmed ones
- Deploy fix:
- P0/P1: Emergency deployment via ArgoCD with manual sync (bypass normal promotion cadence)
- P2/P3: Standard deployment pipeline
- Verify fix:
- Reproduce the original attack vector and confirm it is blocked
- Run targeted security scan against the fix
- Confirm no regression in functionality
Phase 5: Recovery
Goal: Restore full service and verify data integrity.
Recovery Checklist:
- Re-enable any services or endpoints disabled during containment
- Remove emergency network policies and rate limits
- Verify all tenants can access their data normally
- Run data integrity checks:
-- Verify no cross-tenant data contamination
SELECT * FROM context_entries
WHERE tenant_id != (SELECT tenant_id FROM agents WHERE id = context_entries.agent_id);
- Confirm monitoring is back to baseline (no elevated error rates, latency normal)
- Verify all rotated credentials are propagated to all services
- Run smoke tests against all gRPC service health endpoints
- Confirm ArgoCD shows all applications in sync
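The cross-tenant integrity check in the recovery checklist can be rehearsed on toy data before it is run against production. The sketch below uses the built-in sqlite3 module as a stand-in for PostgreSQL and rewrites the check as a JOIN, which is equivalent as long as agents.id is unique. The id column on context_entries and the sample rows are assumptions made for illustration.

```python
import sqlite3

CONTAMINATION_CHECK = """
SELECT c.id, c.tenant_id AS entry_tenant, a.tenant_id AS agent_tenant
FROM context_entries AS c
JOIN agents AS a ON a.id = c.agent_id
WHERE c.tenant_id <> a.tenant_id
ORDER BY c.id
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agents (id TEXT PRIMARY KEY, tenant_id TEXT)")
conn.execute("CREATE TABLE context_entries (id INTEGER, tenant_id TEXT, agent_id TEXT)")
conn.executemany("INSERT INTO agents VALUES (?, ?)", [("a1", "t1"), ("a2", "t2")])
conn.executemany(
    "INSERT INTO context_entries VALUES (?, ?, ?)",
    [
        (1, "t1", "a1"),  # consistent: entry and agent belong to t1
        (2, "t1", "a2"),  # contaminated: entry says t1, agent belongs to t2
        (3, "t2", "a2"),  # consistent
    ],
)

mismatches = conn.execute(CONTAMINATION_CHECK).fetchall()
print(mismatches)  # an empty list means no cross-tenant contamination was found
```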
Phase 6: Post-Incident Review
Goal: Learn and prevent recurrence. Must happen within 48 hours of incident closure.
Post-Incident Review Meeting:
- Attendees: Incident Commander, all responders, CTO, affected team leads
- Agenda:
- Timeline walkthrough (using incident log)
- What went well in the response?
- What could be improved?
- Root cause analysis (use 5-whys method)
- Action items with owners and deadlines
Post-Incident Deliverables:
- Incident Report (GitHub issue with label post-incident):
- Timeline of events
- Root cause analysis
- Impact assessment (tenants affected, data exposed, duration)
- Remediation actions taken
- Preventive measures to implement
- Action Items: Create GitHub issues for each preventive measure, linked to incident
- Runbook Updates: Update relevant runbooks based on lessons learned
- Monitoring Improvements: Add alerts for the detection gap that allowed this incident
- Lessons entry: Update tasks/lessons.md with the pattern for agent team learning
3. Breach Notification Procedure
A "breach" is confirmed unauthorized access to, or exfiltration of, customer data. Not all security incidents are breaches — only those involving actual data exposure.
Breach Confirmation Criteria
A breach is confirmed when ALL of the following are true:
- Unauthorized party accessed or exfiltrated data
- The data includes personally identifiable information (PII) or customer business data
- The access was not authorized by the data owner
Internal Notification
| Who | When | How |
|---|---|---|
| CTO | Within 1 hour of breach confirmation | Phone call + Slack DM |
| Security Lead | Within 1 hour of breach confirmation | Phone call + Slack DM |
| CEO | Within 2 hours of breach confirmation | Phone call from CTO |
| Legal counsel | Within 4 hours of breach confirmation | Email from CTO with incident summary |
| All engineering | Within 8 hours of breach confirmation | #security-incidents Slack channel |
Customer Notification
| Regulation | Deadline | Authority |
|---|---|---|
| GDPR (EU residents) | 72 hours from awareness | Supervisory Authority + affected individuals |
| HIPAA (if health data) | Without unreasonable delay, no later than 60 calendar days from discovery | HHS OCR + affected individuals |
| CCPA (CA residents) | Without unreasonable delay | CA Attorney General (if 500+ residents) |
| PCI-DSS (if card data) | Immediately upon discovery | Card brands + acquiring bank |
| General (no specific regulation) | 72 hours (our policy) | Affected customers directly |
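Because several regimes can apply to one breach, it helps to compute every deadline from a single awareness timestamp and work toward the earliest. This is a minimal sketch under stated assumptions: the names `WINDOWS`, `notification_deadlines`, and `earliest_deadline` are hypothetical, every clock is taken to start at the moment of awareness, and regimes without a fixed number of hours are returned as None rather than given an invented limit.

```python
from datetime import datetime, timedelta, timezone

# Windows taken from the table above.
# None: no fixed window ("without unreasonable delay").
# timedelta(0): notify immediately upon discovery.
WINDOWS = {
    "GDPR": timedelta(hours=72),
    "HIPAA": timedelta(days=60),
    "CCPA": None,
    "PCI-DSS": timedelta(0),
    "General": timedelta(hours=72),
}


def notification_deadlines(aware_at, regulations):
    """Map each applicable regulation to its latest notification time (or None)."""
    result = {}
    for name in regulations:
        window = WINDOWS[name]
        result[name] = None if window is None else aware_at + window
    return result


def earliest_deadline(deadlines):
    """Return the soonest fixed deadline, or None if none of them is fixed."""
    fixed = [due for due in deadlines.values() if due is not None]
    return min(fixed) if fixed else None


if __name__ == "__main__":
    aware = datetime(2026, 4, 12, 9, 0, tzinfo=timezone.utc)
    due = notification_deadlines(aware, ["GDPR", "HIPAA", "CCPA"])
    print(due, earliest_deadline(due))
```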
Notification Template
Subject: Security Incident Notification — UpsQuad [Incident ID]
Dear [Customer Name],
We are writing to inform you of a security incident that may have affected your
data on the UpsQuad platform.
WHAT HAPPENED
On [date], we detected [brief description of the incident — e.g., "unauthorized
access to our database through a vulnerability in our API gateway"]. The incident
occurred between [start time] and [end time] UTC.
WHAT DATA WAS INVOLVED
Based on our investigation, the following data associated with your account may
have been accessed:
- [List specific data types: agent session logs, organization configuration,
context entries, user email addresses, etc.]
WHAT WE ARE DOING
- [Specific remediation step 1: e.g., "We have rotated all affected credentials
and deployed a fix to the vulnerability"]
- [Specific remediation step 2: e.g., "We have engaged a third-party security
firm to conduct a full audit"]
- [Specific remediation step 3: e.g., "We are implementing additional monitoring
to detect similar patterns"]
WHAT YOU SHOULD DO
- [Action 1: e.g., "Rotate any API keys you have configured in UpsQuad"]
- [Action 2: e.g., "Review your agent activity logs for unexpected actions"]
- [Action 3: e.g., "Enable MFA if not already active on your account"]
CONTACT
If you have questions, contact our security team at security@upsquad.ai or
your account representative. We will provide updates as our investigation
continues.
Reference: Incident [ID]
[Name]
[Title]
UpsQuad Security Team
Notification Delivery
- Email to the organization's primary admin contact (from Clerk org metadata)
- In-app notification banner on the client portal for affected tenants
- Dedicated status page update at status.upsquad.ai
- For P0 breaches affecting 100+ tenants: public blog post within 7 days
4. Escalation Matrix
| Severity | First Responder | Escalation (15 min no response) | Executive (1 hr) | Communication Lead |
|---|---|---|---|---|
| P0 | On-call engineer | Security Lead + CTO | CEO | CTO drafts external comms |
| P1 | On-call engineer | Security Lead | CTO | Security Lead drafts internal comms |
| P2 | Assigned engineer | Team Lead | — | Team Lead updates ticket |
| P3 | Assigned engineer | — | — | Engineer updates ticket |
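The matrix can be read as a function of severity, elapsed time, and whether the first responder has acknowledged. The sketch below encodes one such reading. It assumes the escalation column applies only while the page is unacknowledged and that the executive column applies once the incident has been open for an hour; the function name `who_to_notify` is hypothetical, and PagerDuty escalation policies would normally implement this logic.

```python
# Rows copied from the escalation matrix above.
ESCALATION = {
    "P0": {"first": ["On-call engineer"], "escalate": ["Security Lead", "CTO"], "executive": ["CEO"]},
    "P1": {"first": ["On-call engineer"], "escalate": ["Security Lead"], "executive": ["CTO"]},
    "P2": {"first": ["Assigned engineer"], "escalate": ["Team Lead"], "executive": []},
    "P3": {"first": ["Assigned engineer"], "escalate": [], "executive": []},
}


def who_to_notify(severity, minutes_elapsed, acknowledged):
    """Return the roles that should have been notified by now, in order."""
    row = ESCALATION[severity]
    notify = list(row["first"])
    # Escalate only while nobody has acknowledged the page.
    if not acknowledged and minutes_elapsed >= 15:
        notify += row["escalate"]
    # Assumption: executives are looped in after one hour regardless of ack.
    if minutes_elapsed >= 60:
        notify += row["executive"]
    return notify


if __name__ == "__main__":
    print(who_to_notify("P0", 20, acknowledged=False))
```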
On-Call Rotation
- Primary on-call: rotates weekly across backend and DevOps engineers
- Secondary on-call: Security Lead (always reachable for P0/P1)
- Escalation tool: PagerDuty with 5-minute auto-escalation for unacknowledged P0/P1 alerts
Incident Commander Responsibilities
The Incident Commander (IC) is assigned during triage and owns the incident until closure:
- Coordinates all response activities
- Maintains the incident timeline log
- Makes containment decisions (what to shut down, what to keep running)
- Communicates status updates every 30 minutes during active P0/P1 incidents
- Ensures post-incident review is scheduled and conducted
- Signs off on incident closure
5. Security Runbooks
5.1 Credential Leak (API Key Exposed in Logs or Repository)
Detection: Secret scanning alert (GitHub Advanced Security), log monitoring regex match, manual discovery.
Immediate Actions (within 15 minutes):
- Identify which credential was leaked: GCP service account key, Clerk API key, database password, agent uq_key, OpenAI/Anthropic API key, Redis password
- Determine exposure scope: was it in a public repo, a log accessible to customers, an internal log, or a CI artifact?
- Rotate the credential immediately:
- GCP SA key: create a new key with gcloud iam service-accounts keys create, delete the old key, update Secret Manager
- Clerk API key: regenerate in Clerk Dashboard, update CLERK_SECRET_KEY in Secret Manager
- Database password: run ALTER USER ... PASSWORD '...' on CloudSQL, update Secret Manager, restart PgBouncer
- Agent uq_key: run UPDATE agent_credentials SET revoked_at = NOW() WHERE key_hash = $hash, then issue a new key
- LLM API key (OpenAI/Anthropic/Google): revoke in provider dashboard, update Secret Manager
- Trigger rolling restart of all services consuming the rotated credential:
kubectl rollout restart deployment -l uses-secret=$SECRET_NAME -n upsquad
- Audit usage: check if the leaked credential was used between leak time and rotation
- If the credential was used maliciously, escalate to P0
Prevention:
- Pre-commit hook with gitleaks to block secrets in commits
- Log sanitization middleware in all gRPC interceptors (redact patterns matching key formats)
- Secret Manager with automatic rotation policies
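The log sanitization item above comes down to a list of pattern and replacement pairs applied before a message is written. The following sketch illustrates the idea in Python (the production interceptors would live in the gRPC services themselves). The key formats are assumptions: the plan does not define the uq_key format, so the `uq_` prefix and minimum length are hypothetical, and the `sk-` rule only approximates common LLM provider key prefixes.

```python
import re

# (pattern, replacement) pairs; formats are illustrative assumptions.
_RULES = [
    (re.compile(r"\buq_[A-Za-z0-9]{16,}\b"), "[REDACTED_UQ_KEY]"),
    (re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"), "[REDACTED_LLM_KEY]"),
    # Keep the field name, drop the value.
    (re.compile(r"(?i)\b(password|secret|token)=\S+"), r"\1=[REDACTED]"),
]


def redact(message: str) -> str:
    """Apply every redaction rule to a log message and return the result."""
    for pattern, replacement in _RULES:
        message = pattern.sub(replacement, message)
    return message


if __name__ == "__main__":
    print(redact("auth ok key=uq_0123456789abcdef0123 user=42"))
```

Regex redaction is a backstop, not a guarantee: a key in an unexpected format passes through, so structured logging that never serializes credential fields remains the stronger control.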
5.2 Data Exfiltration (Unauthorized Bulk Data Access)
Detection: Prometheus alert on context_reads_total exceeding per-tenant threshold, anomalous SELECT volume in PostgreSQL slow query log, unusual egress traffic in GKE network metrics.
Immediate Actions:
- Identify the source: which tenant, user, agent, or service account is making the requests?
- Block the source:
- If agent: run UPDATE agents SET status = 'suspended' WHERE id = $agent_id, then clear the agent session from Redis
- If user: lock account in Clerk
- If service account: revoke GCP IAM binding
- Apply emergency rate limit to the affected gRPC endpoint:
# Apply via kubectl to Envoy sidecar config
rate_limit:
  requests_per_unit: 1
  unit: MINUTE
- Snapshot the database for forensic analysis
- Quantify data exposure:
SELECT COUNT(*), array_agg(DISTINCT tenant_id)
FROM audit_log
WHERE actor_id = $suspect_id
AND action IN ('read', 'list', 'export')
AND event_time > $suspicious_start;
- If cross-tenant data was accessed, escalate to P0 breach
Prevention:
- Per-tenant query rate limits enforced at the gRPC middleware layer
- Anomaly detection alerts on read volume (baseline + 3 standard deviations)
- PostgreSQL RLS policies as defense-in-depth (even if application layer is bypassed)
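The "baseline + 3 standard deviations" rule for read volume can be stated precisely in a few lines. This is a sketch of the arithmetic only, with hypothetical function names; in production the same threshold would more likely be a Prometheus recording rule. It uses the population standard deviation over a window of recent per-interval read counts, which is one reasonable choice among several.

```python
import statistics


def read_volume_threshold(history, sigmas=3.0):
    """Baseline mean plus `sigmas` standard deviations of recent read counts."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return mean + sigmas * stdev


def is_anomalous(current, history, sigmas=3.0):
    """True when the current read count exceeds the computed threshold."""
    return current > read_volume_threshold(history, sigmas)


if __name__ == "__main__":
    recent = [100, 110, 90, 105, 95]  # reads per interval for one tenant
    print(read_volume_threshold(recent), is_anomalous(150, recent))
```

With the sample history the mean is 100 and the standard deviation is about 7.07, so the threshold lands near 121 reads per interval.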
5.3 Unauthorized Access (Compromised User or Agent Account)
Detection: Clerk webhook for suspicious sign-in (new device, impossible travel), agent performing actions outside its governance policy, failed authorization attempts exceeding threshold.
Immediate Actions:
- Lock the compromised account:
- User: clerk.users.update(userId, { locked: true }) (revokes all sessions)
- Agent: set status = 'suspended' in database, remove from Redis active-agent set
- Revoke all active sessions and tokens associated with the account
- Review the account's recent activity:
SELECT action, resource, status_code, event_time
FROM audit_log
WHERE actor_id = $compromised_id
ORDER BY event_time DESC
LIMIT 1000;
- Identify how the account was compromised:
- Credential stuffing? Check Clerk auth logs for brute-force patterns
- Session hijacking? Check for session token reuse from different IPs
- Phishing? Contact the account owner
- Agent jailbreak? Review agent session transcripts for prompt injection
- If the compromised account accessed other tenants' data, escalate to P0 breach
- Reset credentials and require MFA re-enrollment before re-enabling
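The session hijacking check listed above (session token reuse from different IPs) reduces to grouping events by session and counting distinct source addresses. A minimal sketch follows, assuming the auth log can be exported as (session_id, source_ip) pairs. That export shape and the function name are assumptions, and the IPs are documentation-range examples.

```python
from collections import defaultdict


def sessions_with_multiple_ips(events):
    """Return {session_id: sorted IPs} for sessions seen from more than one IP."""
    ips_by_session = defaultdict(set)
    for session_id, source_ip in events:
        ips_by_session[session_id].add(source_ip)
    return {
        session_id: sorted(ips)
        for session_id, ips in ips_by_session.items()
        if len(ips) > 1
    }


if __name__ == "__main__":
    sample = [
        ("s1", "203.0.113.5"),
        ("s1", "203.0.113.5"),
        ("s2", "198.51.100.7"),
        ("s2", "192.0.2.44"),
    ]
    print(sessions_with_multiple_ips(sample))
```

A hit is a lead rather than proof: mobile clients and VPN users change IPs legitimately, so flagged sessions still need a look at timing and geography.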
Prevention:
- Enforce MFA for all admin-level accounts via Clerk
- Agent governance policies limit blast radius (agents cannot access data outside their scope)
- Session tokens have short TTL (15 minutes) with refresh rotation
- IP allowlisting for sensitive operations
5.4 Supply Chain Attack (Compromised Dependency)
Detection: Trivy scan in CI flags a known-malicious package version, govulncheck detects a CVE in a direct dependency, security advisory from Go vulnerability database or npm advisory.
Immediate Actions:
- Determine if the compromised version is deployed in production:
# For Go dependencies
kubectl exec -n upsquad deploy/$SERVICE -- go version -m /app | grep $PACKAGE
# For Node dependencies (client portal)
kubectl exec -n upsquad deploy/client-portal -- npm ls $PACKAGE
- If deployed, assess impact:
- What does the compromised package do? (network access, file system, crypto)
- Was it a build-time only dependency or runtime?
- Were any malicious payloads executed? (check outbound network connections)
- Pin to last known good version:
# Go
go get $PACKAGE@$SAFE_VERSION && go mod tidy
# Node
npm install $PACKAGE@$SAFE_VERSION --save-exact
- Rebuild all container images from scratch (not from cache):
docker build --no-cache -t $IMAGE:$TAG .
- Deploy clean images via emergency ArgoCD sync
- Audit: check if the compromised dependency exfiltrated any secrets or data
- If secrets were potentially exfiltrated, trigger credential rotation (see Runbook 5.1)
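When many services must be checked, the grep over `go version -m` output in the first step can be replaced with a small parser that compares embedded module versions against the advisory. The sketch below assumes the usual output layout, in which dependency lines begin with the word dep followed by module path, version, and hash. The module paths, versions, and function names are invented for illustration.

```python
def parse_go_build_deps(output: str) -> dict:
    """Extract {module_path: version} from `go version -m <binary>` output."""
    deps = {}
    for line in output.splitlines():
        fields = line.split()
        # Dependency lines look like: dep <module path> <version> <hash>
        if len(fields) >= 3 and fields[0] == "dep":
            deps[fields[1]] = fields[2]
    return deps


def is_compromised(output: str, package: str, bad_versions: set) -> bool:
    """True if the binary embeds `package` at one of the known-bad versions."""
    return parse_go_build_deps(output).get(package) in bad_versions


SAMPLE_OUTPUT = """/app: go1.22.1
\tpath\texample.com/upsquad/svc
\tmod\texample.com/upsquad/svc\t(devel)
\tdep\tgithub.com/example/lib\tv1.4.2\th1:abc=
\tdep\tgoogle.golang.org/grpc\tv1.62.0\th1:def=
"""

if __name__ == "__main__":
    print(is_compromised(SAMPLE_OUTPUT, "github.com/example/lib", {"v1.4.2"}))
```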
Prevention:
- go.sum and package-lock.json checked into source control (integrity verification)
- Trivy scanning in CI blocks merges with CRITICAL/HIGH vulnerabilities
- Dependabot enabled for automated dependency updates
- Minimal dependency policy: prefer standard library over third-party where feasible
5.5 DDoS / Service Disruption
Detection: Prometheus alert on request rate exceeding 10x baseline, GKE node pool hitting resource limits, Cloudflare DDoS detection triggers, customer reports of degraded performance.
Immediate Actions:
- Confirm it is a DDoS and not a legitimate traffic spike (check whether specific tenants or all tenants are affected):
sum(rate(grpc_server_handled_total[1m])) by (grpc_service)
- Enable Cloudflare "Under Attack" mode if traffic is external:
- Cloudflare Dashboard > Security > Under Attack Mode: ON
- This enables JavaScript challenges for all requests
- If attack is targeting specific endpoints, apply targeted rate limiting:
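# Block specific IP ranges at Cloudflare
# The payload is double-quoted so the shell expands $ATTACKER_RANGE
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE/firewall/rules" \
  -H "Authorization: Bearer $CF_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"filter\":{\"expression\":\"ip.src in {$ATTACKER_RANGE}\"},\"action\":\"block\"}"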
- Scale up GKE node pool if legitimate traffic is being impacted:
gcloud container clusters resize $CLUSTER --node-pool $POOL --num-nodes $N --zone $ZONE
- If specific gRPC services are overwhelmed, enable circuit breaker:
- Set Envoy outlier detection to eject unhealthy pods faster
- Reduce connection limits to shed excess load
- Monitor recovery: watch error rate return to baseline before declaring all-clear
- Post-incident: analyze attack patterns and add permanent WAF rules
Prevention:
- Cloudflare in front of all public endpoints (DDoS mitigation built-in)
- Per-tenant rate limiting at API gateway level (prevents single tenant from consuming all capacity)
- GKE Autopilot auto-scales nodes, but set PodDisruptionBudgets to maintain availability
- gRPC keepalive settings tuned to drop idle connections from botnets
- Horizontal Pod Autoscaler with aggressive scale-up (30-second cooldown for P0 scenarios)
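Per-tenant rate limiting, listed above as a prevention measure, is commonly built on a token bucket per tenant. The sketch below shows that technique in isolation. It is not the gateway's actual implementation: the class names, the rate and burst parameters, and the injectable clock (used to make the behavior testable) are all assumptions.

```python
import time


class TokenBucket:
    """Allows `burst` requests at once, refilled at `rate_per_sec` tokens/second."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerTenantLimiter:
    """Keeps one bucket per tenant so a noisy tenant cannot starve the others."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self._rate, self._burst, self._clock = rate_per_sec, burst, clock
        self._buckets = {}

    def allow(self, tenant_id: str) -> bool:
        if tenant_id not in self._buckets:
            self._buckets[tenant_id] = TokenBucket(self._rate, self._burst, self._clock)
        return self._buckets[tenant_id].allow()


if __name__ == "__main__":
    limiter = PerTenantLimiter(rate_per_sec=5, burst=10)
    print([limiter.allow("tenant-a") for _ in range(12)])
```

This in-process version suits a single gateway instance; with several replicas the bucket state would need to be shared, for example in the Redis deployment the plan already mentions.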
6. Communication Templates
Internal Status Update (every 30 minutes during active P0/P1)
INCIDENT UPDATE — [Incident ID] — [Timestamp UTC]
Status: [Investigating | Identified | Monitoring | Resolved]
Severity: [P0 | P1]
IC: [Name]
Current situation: [1-2 sentences]
Actions taken since last update: [Bullet list]
Next steps: [Bullet list]
ETA to resolution: [Estimate or "Unknown"]
Customer impact: [Description]
External Status Page Update
[Timestamp] — Investigating: We are investigating reports of [brief description].
Some customers may experience [impact]. We will provide an update within 30 minutes.
[Timestamp] — Identified: We have identified the cause and are implementing a fix.
[Service] functionality may be degraded. No data loss has occurred.
[Timestamp] — Resolved: The issue has been resolved. All services are operating normally.
A full post-incident report will be published within 48 hours.
7. Compliance Mapping
| Requirement | This Plan Section | Evidence |
|---|---|---|
| HIPAA 164.404 (breach notification) | Section 3 | Notification within 60 days, template provided |
| HIPAA 164.308(a)(6) (security incident procedures) | Sections 1-2 | Classification and response phases documented |
| PCI-DSS 12.10 (incident response plan) | Sections 1-6 | Full plan with runbooks and escalation |
| PCI-DSS 12.10.2 (annual testing) | Section 2 (Phase 6), Section 8 | Annual tabletop exercise, quarterly review cadence, post-incident review |
| FedRAMP IR-6 (incident reporting) | Sections 3-4 | Escalation matrix and notification deadlines |
| GDPR Art. 33 (notification to authority) | Section 3 | 72-hour notification to supervisory authority |
| GDPR Art. 34 (notification to data subject) | Section 3 | Customer notification template and procedure |
| SOC 2 CC7.3 (security incidents) | Sections 1-6 | Detection, response, communication, and review |
8. Plan Maintenance
- Quarterly review: Security Lead reviews and updates this plan every quarter
- After every P0/P1 incident: update runbooks with lessons learned within 1 week
- Annual tabletop exercise: simulate a P0 breach scenario and walk through the full response
- New hire onboarding: all engineers read this plan within their first week
- Version control: this document is maintained in docs/security/incident-response-plan.md in the upsquad-core repository; all changes go through PR review