UpsQuad Security Incident Response Plan

Version: 1.0
Last Updated: 2026-04-12
Owner: Principal Technical Architect
Review Cadence: Quarterly (next review: 2026-07-12)
Compliance References: HIPAA 164.404, PCI-DSS 12.10, FedRAMP IR-6, GDPR Art. 33-34


1. Incident Classification

Every security event must be classified on intake. Classification determines response timeline, escalation path, and notification obligations.

P0 — Critical

Active or confirmed breach requiring immediate company-wide response.

  • Unauthorized access to customer data (any tenant's PostgreSQL rows, context vectors, or agent session logs)
  • Credential compromise: Clerk API keys, GCP service account keys, database credentials, or agent uq_key tokens
  • Active exploitation of a production vulnerability (RCE, SQL injection, privilege escalation)
  • Ransomware or destructive attack against GKE clusters or CloudSQL instances
  • Exfiltration of data confirmed via audit logs or network telemetry

Response SLA: Triage within 15 minutes. Containment within 1 hour.

P1 — High

Confirmed vulnerability or failed control that could lead to a breach if exploited.

  • Exploitable vulnerability in production gRPC services or API gateway (CVSS >= 9.0)
  • Row-level security (RLS) bypass — a tenant can query another tenant's data
  • Failed security control: WAF rules disabled, network policy deleted, TLS termination misconfigured
  • Suspicious activity pattern: bulk data access anomaly, credential stuffing against Clerk, unusual agent API call volume
  • Compromised CI/CD pipeline (GitHub Actions secrets exposed, ArgoCD admin access)

Response SLA: Triage within 1 hour. Containment within 4 hours.

P2 — Medium

Known risk requiring scheduled remediation.

  • Failed penetration test finding (external or internal)
  • Dependency vulnerability rated CRITICAL or HIGH by govulncheck or Trivy scan
  • Infrastructure misconfiguration: overly permissive IAM role, public GCS bucket, missing network policy
  • Expired or soon-to-expire TLS certificates
  • Security scanner findings from Grafana alerting rules

Response SLA: Triage within 24 hours. Remediation plan within 72 hours.

P3 — Low

Informational events requiring tracking but not urgent action.

  • Policy violation: developer accessed production database directly, secret committed to branch (not merged)
  • Informational security event: port scan detected, failed auth attempts below threshold
  • Dependency vulnerability rated MEDIUM or LOW
  • Documentation gap identified during audit

Response SLA: Tracked in issue backlog. Addressed within current sprint.
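
The SLA lines above can be turned into absolute deadlines at intake so the incident channel shows a concrete clock. The following is a minimal sketch, not part of the plan's tooling: the names `SLA` and `sla_deadlines` are hypothetical, and `follow_up` is a label chosen here to cover both "containment" (P0/P1) and "remediation plan" (P2).

```python
from datetime import datetime, timedelta, timezone

# Windows copied from the Response SLA lines of each severity level.
# P3 has no clock-based SLA (it is tracked in the sprint backlog), so it maps to None.
SLA = {
    "P0": {"triage": timedelta(minutes=15), "follow_up": timedelta(hours=1)},
    "P1": {"triage": timedelta(hours=1), "follow_up": timedelta(hours=4)},
    "P2": {"triage": timedelta(hours=24), "follow_up": timedelta(hours=72)},
    "P3": None,
}


def sla_deadlines(severity, detected_at):
    """Return absolute triage and follow-up deadlines, or None when no clock applies."""
    windows = SLA[severity]
    if windows is None:
        return None
    return {name: detected_at + delta for name, delta in windows.items()}


detected = datetime(2026, 4, 12, 9, 0, tzinfo=timezone.utc)
p0 = sla_deadlines("P0", detected)
```

Passing a timezone-aware timestamp keeps the deadlines unambiguous when responders are in different regions.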


2. Incident Response Phases

Phase 1: Detection

Goal: Identify the security event as early as possible.

Detection Sources:

| Source | What It Catches | Owner |
|---|---|---|
| Prometheus/Grafana alerts | Anomalous request rates, error spikes, resource exhaustion, RLS violation counters | DevOps / On-call |
| Audit log analysis (PostgreSQL audit_log table) | Unauthorized data access patterns, privilege escalation, bulk exports | Security Lead |
| Clerk webhook events | Suspicious auth patterns, impossible travel, credential stuffing | Backend team |
| Trivy container scanning (CI/CD) | Vulnerable base images, compromised dependencies | DevOps |
| govulncheck in GitHub Actions | Go dependency vulnerabilities | Backend team |
| GKE audit logs (Cloud Audit Logs) | Unauthorized kubectl access, namespace breakout, pod privilege escalation | DevOps |
| Customer reports (support channel) | Data they shouldn't see, unexpected agent behavior, auth failures | Support / PM |
| OpenTelemetry traces | Abnormal latency patterns indicating data exfiltration or crypto mining | DevOps |

Detection Actions:

  1. Alert fires or report received
  2. On-call engineer acknowledges within 5 minutes (PagerDuty)
  3. Preliminary assessment: is this a security event? If yes, create incident channel and proceed to Triage
  4. Log the event in #security-incidents channel with timestamp, source, and initial assessment

Phase 2: Triage

Goal: Classify severity, determine scope, and identify affected tenants.

Triage Checklist:

  • Assign incident severity (P0/P1/P2/P3) using classification above
  • Create GitHub issue with label security-incident and severity label
  • Identify attack vector: which service, endpoint, or infrastructure component?
  • Determine blast radius:
    • Which tenants are affected? Query audit_log filtered by timeframe and affected service
    • Which data types are exposed? (agent sessions, context vectors, org config, credentials)
    • Is the vulnerability actively being exploited or theoretical?
  • Check for lateral movement: are other services or namespaces compromised?
  • Assign Incident Commander (IC) per escalation matrix
  • Begin incident timeline log (every action timestamped)

Tenant Scope Assessment Query (run against read replica):

SELECT tenant_id, COUNT(*) AS affected_records
FROM audit_log
WHERE event_time BETWEEN $incident_start AND NOW()
AND service = $affected_service
AND (status_code >= 400 OR action IN ('data_export', 'bulk_read'))
GROUP BY tenant_id;
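
When the scope query above is run from a script, binding the incident window and service name as parameters avoids string interpolation under pressure. The sketch below is illustrative only: it uses an in-memory SQLite table as a stand-in for the PostgreSQL read replica, the column types and the service names `context-svc` and `auth-svc` are assumptions, and `tenant_scope` is a hypothetical helper.

```python
import sqlite3

# Same filter as the triage query, with named bind parameters.
SCOPE_QUERY = """
SELECT tenant_id, COUNT(*) AS affected_records
FROM audit_log
WHERE event_time BETWEEN :incident_start AND :now
  AND service = :service
  AND (status_code >= 400 OR action IN ('data_export', 'bulk_read'))
GROUP BY tenant_id
"""


def tenant_scope(conn, incident_start, now, service):
    """Return {tenant_id: affected_records} for the incident window."""
    params = {"incident_start": incident_start, "now": now, "service": service}
    return dict(conn.execute(SCOPE_QUERY, params).fetchall())


# Stand-in data; ISO-8601 UTC strings sort chronologically, so BETWEEN works on text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_log (tenant_id TEXT, service TEXT, action TEXT, "
    "status_code INTEGER, event_time TEXT)"
)
conn.executemany(
    "INSERT INTO audit_log VALUES (?, ?, ?, ?, ?)",
    [
        ("t1", "context-svc", "bulk_read", 200, "2026-04-12T10:05:00Z"),
        ("t1", "context-svc", "read", 403, "2026-04-12T10:06:00Z"),
        ("t2", "context-svc", "read", 200, "2026-04-12T10:07:00Z"),  # normal read, ignored
        ("t3", "context-svc", "data_export", 200, "2026-04-12T09:00:00Z"),  # before window
        ("t2", "auth-svc", "read", 500, "2026-04-12T10:08:00Z"),  # different service
    ],
)
scope = tenant_scope(conn, "2026-04-12T10:00:00Z", "2026-04-12T11:00:00Z", "context-svc")
```

The result maps each affected tenant to its record count, which can be pasted straight into the incident timeline.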

Phase 3: Containment

Goal: Stop the bleeding. Prevent further damage without destroying forensic evidence.

Immediate Containment Actions by Attack Type:

| Scenario | Containment Action | Command / Procedure |
|---|---|---|
| Compromised API key | Rotate key in GCP Secret Manager, invalidate cached copies in Redis | gcloud secrets versions add $SECRET --data-file=new_key && kubectl rollout restart deployment/$SERVICE -n $NS |
| Compromised user account | Disable user in Clerk, revoke all sessions | Clerk Dashboard or clerk.users.update(userId, { locked: true }) |
| RLS bypass | Enable emergency read-only mode on affected tables, add explicit tenant_id filter to application layer | Apply emergency migration, toggle feature flag rls_enforcement=strict |
| Namespace breakout | Apply deny-all network policy to affected namespace, cordon compromised nodes | kubectl apply -f emergency-deny-all.yaml -n $NS && kubectl cordon $NODE |
| Agent token compromise | Revoke uq_key for affected agents, pause agent execution | Update agent_credentials table SET revoked_at = NOW(), clear Redis session cache |
| Malicious dependency | Pin to last known good version, rebuild and redeploy | Update go.mod, run go mod tidy, trigger emergency ArgoCD sync |
| Data exfiltration in progress | Block source IP at Cloudflare/GKE ingress, rate-limit affected endpoint to zero | Cloudflare WAF rule or kubectl apply rate-limit policy |

Forensic Preservation (before any cleanup):

  1. Snapshot affected CloudSQL instance
  2. Export GKE audit logs for the incident window to GCS
  3. Preserve Prometheus metrics and Grafana dashboards for the period
  4. Export OpenTelemetry traces for affected services
  5. Do NOT delete or modify logs until post-incident review is complete

Phase 4: Eradication

Goal: Remove the root cause so the vulnerability cannot be re-exploited.

Eradication Steps:

  1. Identify root cause (code bug, misconfiguration, compromised credential, supply chain)
  2. Develop fix:
    • Code fix: standard PR process with expedited review (architect review still required)
    • Config fix: PR to Pulumi IaC or Kubernetes manifests
    • Credential rotation: rotate ALL potentially affected credentials, not just confirmed ones
  3. Deploy fix:
    • P0/P1: Emergency deployment via ArgoCD with manual sync (bypass normal promotion cadence)
    • P2/P3: Standard deployment pipeline
  4. Verify fix:
    • Reproduce the original attack vector and confirm it is blocked
    • Run targeted security scan against the fix
    • Confirm no regression in functionality

Phase 5: Recovery

Goal: Restore full service and verify data integrity.

Recovery Checklist:

  • Re-enable any services or endpoints disabled during containment
  • Remove emergency network policies and rate limits
  • Verify all tenants can access their data normally
  • Run data integrity checks:
    -- Verify no cross-tenant data contamination
    SELECT * FROM context_entries
    WHERE tenant_id != (SELECT tenant_id FROM agents WHERE id = context_entries.agent_id);
  • Confirm monitoring is back to baseline (no elevated error rates, latency normal)
  • Verify all rotated credentials are propagated to all services
  • Run smoke tests against all gRPC service health endpoints
  • Confirm ArgoCD shows all applications in sync

Phase 6: Post-Incident Review

Goal: Learn and prevent recurrence. Must happen within 48 hours of incident closure.

Post-Incident Review Meeting:

  • Attendees: Incident Commander, all responders, CTO, affected team leads
  • Agenda:
    1. Timeline walkthrough (using incident log)
    2. What went well in the response?
    3. What could be improved?
    4. Root cause analysis (use 5-whys method)
    5. Action items with owners and deadlines

Post-Incident Deliverables:

  1. Incident Report (GitHub issue with label post-incident):
    • Timeline of events
    • Root cause analysis
    • Impact assessment (tenants affected, data exposed, duration)
    • Remediation actions taken
    • Preventive measures to implement
  2. Action Items: Create GitHub issues for each preventive measure, linked to incident
  3. Runbook Updates: Update relevant runbooks based on lessons learned
  4. Monitoring Improvements: Add alerts for the detection gap that allowed this incident
  5. Lessons entry: Update tasks/lessons.md with the pattern for agent team learning

3. Breach Notification Procedure

A "breach" is confirmed unauthorized access to, or exfiltration of, customer data. Not all security incidents are breaches — only those involving actual data exposure.

Breach Confirmation Criteria

A breach is confirmed when ALL of the following are true:

  1. Unauthorized party accessed or exfiltrated data
  2. The data includes personally identifiable information (PII) or customer business data
  3. The access was not authorized by the data owner

Internal Notification

| Who | When | How |
|---|---|---|
| CTO | Within 1 hour of breach confirmation | Phone call + Slack DM |
| Security Lead | Within 1 hour of breach confirmation | Phone call + Slack DM |
| CEO | Within 2 hours of breach confirmation | Phone call from CTO |
| Legal counsel | Within 4 hours of breach confirmation | Email from CTO with incident summary |
| All engineering | Within 8 hours of breach confirmation | #security-incidents Slack channel |

Customer Notification

| Regulation | Deadline | Authority |
|---|---|---|
| GDPR (EU residents) | 72 hours from awareness | Supervisory Authority + affected individuals |
| HIPAA (if health data) | 60 calendar days | HHS OCR + affected individuals |
| CCPA (CA residents) | Without unreasonable delay | CA Attorney General (if 500+ residents) |
| PCI-DSS (if card data) | Immediately upon discovery | Card brands + acquiring bank |
| General (no specific regulation) | 72 hours (our policy) | Affected customers directly |
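
The deadlines in the table above can be computed from the moment of breach awareness so the Incident Commander sees which clock expires first. This is a rough sketch under stated assumptions: the function names are hypothetical, and treating "without unreasonable delay" (CCPA) and "immediately upon discovery" (PCI-DSS) as due at the awareness time is a conservative simplification made here, not a legal interpretation.

```python
from datetime import datetime, timedelta, timezone

# Fixed-length windows from the Customer Notification table.
FIXED_WINDOWS = {
    "GDPR": timedelta(hours=72),
    "HIPAA": timedelta(days=60),
    "GENERAL": timedelta(hours=72),
}
# Regimes without a fixed window; assumed here to be due immediately.
IMMEDIATE = {"CCPA", "PCI-DSS"}


def notification_deadlines(aware_at, regimes):
    """Map each applicable regime to its latest permissible notification time."""
    deadlines = {}
    for regime in regimes:
        if regime in IMMEDIATE:
            deadlines[regime] = aware_at
        else:
            deadlines[regime] = aware_at + FIXED_WINDOWS[regime]
    return deadlines


def earliest_deadline(aware_at, regimes):
    """Return the tightest deadline across all applicable regimes."""
    return min(notification_deadlines(aware_at, regimes).values())


aware = datetime(2026, 4, 12, 9, 0, tzinfo=timezone.utc)
due = notification_deadlines(aware, ["GDPR", "HIPAA"])
```

Legal counsel still decides which regimes apply; the sketch only does the date arithmetic once that decision is made.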

Notification Template

Subject: Security Incident Notification — UpsQuad [Incident ID]

Dear [Customer Name],

We are writing to inform you of a security incident that may have affected your
data on the UpsQuad platform.

WHAT HAPPENED
On [date], we detected [brief description of the incident — e.g., "unauthorized
access to our database through a vulnerability in our API gateway"]. The incident
occurred between [start time] and [end time] UTC.

WHAT DATA WAS INVOLVED
Based on our investigation, the following data associated with your account may
have been accessed:
- [List specific data types: agent session logs, organization configuration,
context entries, user email addresses, etc.]

WHAT WE ARE DOING
- [Specific remediation step 1: e.g., "We have rotated all affected credentials
and deployed a fix to the vulnerability"]
- [Specific remediation step 2: e.g., "We have engaged a third-party security
firm to conduct a full audit"]
- [Specific remediation step 3: e.g., "We are implementing additional monitoring
to detect similar patterns"]

WHAT YOU SHOULD DO
- [Action 1: e.g., "Rotate any API keys you have configured in UpsQuad"]
- [Action 2: e.g., "Review your agent activity logs for unexpected actions"]
- [Action 3: e.g., "Enable MFA if not already active on your account"]

CONTACT
If you have questions, contact our security team at security@upsquad.ai or
your account representative. We will provide updates as our investigation
continues.

Reference: Incident [ID]

[Name]
[Title]
UpsQuad Security Team

Notification Delivery

  1. Email to the organization's primary admin contact (from Clerk org metadata)
  2. In-app notification banner on the client portal for affected tenants
  3. Dedicated status page update at status.upsquad.ai
  4. For P0 breaches affecting 100+ tenants: public blog post within 7 days

4. Escalation Matrix

| Severity | First Responder | Escalation (15 min no response) | Executive (1 hr) | Communication Lead |
|---|---|---|---|---|
| P0 | On-call engineer | Security Lead + CTO | CEO | CTO drafts external comms |
| P1 | On-call engineer | Security Lead | CTO | Security Lead drafts internal comms |
| P2 | Assigned engineer | Team Lead | N/A | Team Lead updates ticket |
| P3 | Assigned engineer | N/A | N/A | Engineer updates ticket |
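
The matrix above can be read as a lookup keyed by severity and elapsed time without a response. The sketch below assumes that empty cells mean "no further escalation" for P2 and P3; `ESCALATION` and `who_to_page` are hypothetical names, and in practice PagerDuty escalation policies would carry this logic.

```python
# (first responder, escalation after 15 min, executive after 1 hr); None = no entry.
ESCALATION = {
    "P0": ("On-call engineer", "Security Lead + CTO", "CEO"),
    "P1": ("On-call engineer", "Security Lead", "CTO"),
    "P2": ("Assigned engineer", "Team Lead", None),
    "P3": ("Assigned engineer", None, None),
}


def who_to_page(severity, minutes_without_response):
    """Return everyone who should have been paged by this point, in order."""
    first, escalation, executive = ESCALATION[severity]
    chain = [first]
    if minutes_without_response >= 15 and escalation:
        chain.append(escalation)
    if minutes_without_response >= 60 and executive:
        chain.append(executive)
    return chain


p0_after_20_min = who_to_page("P0", 20)
```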

On-Call Rotation

  • Primary on-call: rotates weekly across backend and DevOps engineers
  • Secondary on-call: Security Lead (always reachable for P0/P1)
  • Escalation tool: PagerDuty with 5-minute auto-escalation for unacknowledged P0/P1 alerts

Incident Commander Responsibilities

The Incident Commander (IC) is assigned during triage and owns the incident until closure:

  • Coordinates all response activities
  • Maintains the incident timeline log
  • Makes containment decisions (what to shut down, what to keep running)
  • Communicates status updates every 30 minutes during active P0/P1 incidents
  • Ensures post-incident review is scheduled and conducted
  • Signs off on incident closure

5. Security Runbooks

5.1 Credential Leak (API Key Exposed in Logs or Repository)

Detection: Secret scanning alert (GitHub Advanced Security), log monitoring regex match, manual discovery.

Immediate Actions (within 15 minutes):

  1. Identify which credential was leaked: GCP service account key, Clerk API key, database password, agent uq_key, OpenAI/Anthropic API key, Redis password
  2. Determine exposure scope: was it in a public repo, a log accessible to customers, an internal log, or a CI artifact?
  3. Rotate the credential immediately:
    • GCP SA key: gcloud iam service-accounts keys create new key, delete old key, update Secret Manager
    • Clerk API key: regenerate in Clerk Dashboard, update CLERK_SECRET_KEY in Secret Manager
    • Database password: ALTER USER ... PASSWORD '...' on CloudSQL, update Secret Manager, restart PgBouncer
    • Agent uq_key: UPDATE agent_credentials SET revoked_at = NOW() WHERE key_hash = $hash, issue new key
    • LLM API key (OpenAI/Anthropic/Google): revoke in provider dashboard, update Secret Manager
  4. Trigger rolling restart of all services consuming the rotated credential:
    kubectl rollout restart deployment -l uses-secret=$SECRET_NAME -n upsquad
  5. Audit usage: check if the leaked credential was used between leak time and rotation
  6. If the credential was used maliciously, escalate to P0

Prevention:

  • Pre-commit hook with gitleaks to block secrets in commits
  • Log sanitization middleware in all gRPC interceptors (redact patterns matching key formats)
  • Secret Manager with automatic rotation policies
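
The log sanitization bullet above amounts to pattern-based redaction before a line is written. A minimal sketch follows; the key formats in `TOKEN_PATTERNS` are assumptions made for illustration (the plan does not specify the real `uq_key` or provider key shapes), and `redact` is a hypothetical helper rather than the actual interceptor code.

```python
import re

# Assumed token shapes, for illustration only. Replace with the real key formats.
TOKEN_PATTERNS = [
    re.compile(r"\buq_key_[A-Za-z0-9]{16,}\b"),            # assumed agent token shape
    re.compile(r"\bsk_(?:live|test)_[A-Za-z0-9]{16,}\b"),  # assumed secret-key shape
]
# key=value pairs whose key name suggests a secret.
KEY_VALUE = re.compile(r"(?i)\b(password|secret|token)=(\S+)")


def redact(line):
    """Replace anything that looks like a credential with a fixed marker."""
    for pattern in TOKEN_PATTERNS:
        line = pattern.sub("[REDACTED]", line)
    return KEY_VALUE.sub(lambda m: f"{m.group(1)}=[REDACTED]", line)


clean = redact("auth ok key=uq_key_ABCDEF0123456789abcd user=7")
```

Redacting to a fixed marker (rather than a partial prefix) avoids leaking enough of the key to aid guessing, at the cost of making it harder to tell which key appeared.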

5.2 Data Exfiltration (Unauthorized Bulk Data Access)

Detection: Prometheus alert on context_reads_total exceeding per-tenant threshold, anomalous SELECT volume in PostgreSQL slow query log, unusual egress traffic in GKE network metrics.

Immediate Actions:

  1. Identify the source: which tenant, user, agent, or service account is making the requests?
  2. Block the source:
    • If agent: UPDATE agents SET status = 'suspended' WHERE id = $agent_id; clear agent session from Redis
    • If user: lock account in Clerk
    • If service account: revoke GCP IAM binding
  3. Apply emergency rate limit to the affected gRPC endpoint:
    # Apply via kubectl to Envoy sidecar config
    rate_limit:
      requests_per_unit: 1
      unit: MINUTE
  4. Snapshot the database for forensic analysis
  5. Quantify data exposure:
    SELECT COUNT(*), array_agg(DISTINCT tenant_id)
    FROM audit_log
    WHERE actor_id = $suspect_id
    AND action IN ('read', 'list', 'export')
    AND event_time > $suspicious_start;
  6. If cross-tenant data was accessed, escalate to P0 breach

Prevention:

  • Per-tenant query rate limits enforced at the gRPC middleware layer
  • Anomaly detection alerts on read volume (baseline + 3 standard deviations)
  • PostgreSQL RLS policies as defense-in-depth (even if application layer is bypassed)
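
The "baseline + 3 standard deviations" rule above reduces to a short calculation. The sketch below assumes the baseline is a list of per-interval read counts for one tenant and uses the sample standard deviation; the function names are hypothetical, and the production alert would normally be expressed as a Prometheus rule instead.

```python
from statistics import mean, stdev


def read_volume_threshold(baseline_counts, sigmas=3.0):
    """Alert threshold: baseline mean plus N sample standard deviations."""
    return mean(baseline_counts) + sigmas * stdev(baseline_counts)


def is_anomalous(current_count, baseline_counts, sigmas=3.0):
    """True when the current interval's reads exceed the threshold."""
    return current_count > read_volume_threshold(baseline_counts, sigmas)


baseline = [100, 110, 90, 105, 95]  # reads per interval during normal operation
threshold = read_volume_threshold(baseline)
```

A baseline with very low variance produces a tight threshold and noisy alerts, so a minimum absolute floor may be worth adding in practice.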

5.3 Unauthorized Access (Compromised User or Agent Account)

Detection: Clerk webhook for suspicious sign-in (new device, impossible travel), agent performing actions outside its governance policy, failed authorization attempts exceeding threshold.

Immediate Actions:

  1. Lock the compromised account:
    • User: clerk.users.update(userId, { locked: true }) — revokes all sessions
    • Agent: set status = 'suspended' in database, remove from Redis active-agent set
  2. Revoke all active sessions and tokens associated with the account
  3. Review the account's recent activity:
    SELECT action, resource, status_code, event_time
    FROM audit_log
    WHERE actor_id = $compromised_id
    ORDER BY event_time DESC
    LIMIT 1000;
  4. Identify how the account was compromised:
    • Credential stuffing? Check Clerk auth logs for brute-force patterns
    • Session hijacking? Check for session token reuse from different IPs
    • Phishing? Contact the account owner
    • Agent jailbreak? Review agent session transcripts for prompt injection
  5. If the compromised account accessed other tenants' data, escalate to P0 breach
  6. Reset credentials and require MFA re-enrollment before re-enabling
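
The session-hijacking check in step 4 (session token reuse from different IPs) can be sketched as a grouping over auth events. The event shape used here (`session_id` and `ip` keys) is an assumption about how the logs might be exported, and `sessions_seen_from_multiple_ips` is a hypothetical helper.

```python
from collections import defaultdict


def sessions_seen_from_multiple_ips(events):
    """Return {session_id: sorted IPs} for sessions used from more than one IP."""
    ips_by_session = defaultdict(set)
    for event in events:
        ips_by_session[event["session_id"]].add(event["ip"])
    return {
        session_id: sorted(ips)
        for session_id, ips in ips_by_session.items()
        if len(ips) > 1
    }


events = [
    {"session_id": "s1", "ip": "203.0.113.5"},
    {"session_id": "s1", "ip": "198.51.100.7"},  # same session, different network
    {"session_id": "s2", "ip": "203.0.113.9"},
    {"session_id": "s2", "ip": "203.0.113.9"},
]
suspicious = sessions_seen_from_multiple_ips(events)
```

Mobile clients and VPN users legitimately change IPs, so a hit here is a lead for review rather than proof of hijacking.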

Prevention:

  • Enforce MFA for all admin-level accounts via Clerk
  • Agent governance policies limit blast radius (agents cannot access data outside their scope)
  • Session tokens have short TTL (15 minutes) with refresh rotation
  • IP allowlisting for sensitive operations

5.4 Supply Chain Attack (Compromised Dependency)

Detection: Trivy scan in CI flags a known-malicious package version, govulncheck detects a CVE in a direct dependency, security advisory from Go vulnerability database or npm advisory.

Immediate Actions:

  1. Determine if the compromised version is deployed in production:
    # For Go dependencies (requires the go toolchain to be present in the container image)
    kubectl exec -n upsquad deploy/$SERVICE -- go version -m /app | grep $PACKAGE
    # For Node dependencies (client portal)
    kubectl exec -n upsquad deploy/client-portal -- npm ls $PACKAGE
  2. If deployed, assess impact:
    • What does the compromised package do? (network access, file system, crypto)
    • Was it a build-time only dependency or runtime?
    • Were any malicious payloads executed? (check outbound network connections)
  3. Pin to last known good version:
    # Go
    go get $PACKAGE@$SAFE_VERSION && go mod tidy
    # Node
    npm install $PACKAGE@$SAFE_VERSION --save-exact
  4. Rebuild all container images from scratch (not from cache):
    docker build --no-cache -t $IMAGE:$TAG .
  5. Deploy clean images via emergency ArgoCD sync
  6. Audit: check if the compromised dependency exfiltrated any secrets or data
  7. If secrets were potentially exfiltrated, trigger credential rotation (see Runbook 5.1)

Prevention:

  • go.sum and package-lock.json checked into source control (integrity verification)
  • Trivy scanning in CI blocks merges with CRITICAL/HIGH vulnerabilities
  • Dependabot enabled for automated dependency updates
  • Minimal dependency policy: prefer standard library over third-party where feasible

5.5 DDoS / Service Disruption

Detection: Prometheus alert on request rate exceeding 10x baseline, GKE node pool hitting resource limits, Cloudflare DDoS detection triggers, customer reports of degraded performance.

Immediate Actions:

  1. Confirm it is a DDoS and not a legitimate traffic spike (check whether specific tenants or all tenants are affected):
    sum(rate(grpc_server_handled_total[1m])) by (grpc_service)
  2. Enable Cloudflare "Under Attack" mode if traffic is external:
    • Cloudflare Dashboard > Security > Under Attack Mode: ON
    • This enables JavaScript challenges for all requests
  3. If attack is targeting specific endpoints, apply targeted rate limiting:
    # Block specific IP ranges at Cloudflare
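    # Double quotes around the payload so the shell expands $ATTACKER_RANGE
    curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE/firewall/rules" \
    -H "Authorization: Bearer $CF_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"filter\":{\"expression\":\"ip.src in {$ATTACKER_RANGE}\"},\"action\":\"block\"}"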
  4. Scale up GKE node pool if legitimate traffic is being impacted:
    gcloud container clusters resize $CLUSTER --node-pool $POOL --num-nodes $N --zone $ZONE
  5. If specific gRPC services are overwhelmed, enable circuit breaker:
    • Set Envoy outlier detection to eject unhealthy pods faster
    • Reduce connection limits to shed excess load
  6. Monitor recovery: watch error rate return to baseline before declaring all-clear
  7. Post-incident: analyze attack patterns and add permanent WAF rules

Prevention:

  • Cloudflare in front of all public endpoints (DDoS mitigation built-in)
  • Per-tenant rate limiting at API gateway level (prevents single tenant from consuming all capacity)
  • GKE Autopilot auto-scales nodes, but set PodDisruptionBudgets to maintain availability
  • gRPC keepalive settings tuned to drop idle connections from botnets
  • Horizontal Pod Autoscaler with aggressive scale-up (30-second cooldown for P0 scenarios)
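
Per-tenant rate limiting, as listed above, is commonly built on a token bucket per tenant so that one tenant exhausting its budget does not affect others. The following is a simplified single-process sketch, not the gateway's actual implementation: the class names are hypothetical, and the caller passes the clock value explicitly so the behavior is easy to test.

```python
class TokenBucket:
    """Allows `burst` requests at once, refilled at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.updated_at = None

    def allow(self, now):
        # Refill according to elapsed time, capped at the bucket capacity.
        if self.updated_at is not None:
            elapsed = now - self.updated_at
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerTenantLimiter:
    """Keeps an independent bucket for every tenant ID."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.buckets = {}

    def allow(self, tenant_id, now):
        if tenant_id not in self.buckets:
            self.buckets[tenant_id] = TokenBucket(self.rate, self.burst)
        return self.buckets[tenant_id].allow(now)


limiter = PerTenantLimiter(rate_per_sec=1, burst=2)
results = [limiter.allow("tenant-a", now=0.0) for _ in range(3)]
```

In a multi-replica gateway the bucket state would need to live in shared storage such as Redis; this sketch keeps it in memory only.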

6. Communication Templates

Internal Status Update (every 30 minutes during active P0/P1)

INCIDENT UPDATE — [Incident ID] — [Timestamp UTC]
Status: [Investigating | Identified | Monitoring | Resolved]
Severity: [P0 | P1]
IC: [Name]

Current situation: [1-2 sentences]
Actions taken since last update: [Bullet list]
Next steps: [Bullet list]
ETA to resolution: [Estimate or "Unknown"]
Customer impact: [Description]

External Status Page Update

[Timestamp] — Investigating: We are investigating reports of [brief description].
Some customers may experience [impact]. We will provide an update within 30 minutes.

[Timestamp] — Identified: We have identified the cause and are implementing a fix.
[Service] functionality may be degraded. No data loss has occurred.

[Timestamp] — Resolved: The issue has been resolved. All services are operating normally.
A full post-incident report will be published within 48 hours.

7. Compliance Mapping

| Requirement | This Plan Section | Evidence |
|---|---|---|
| HIPAA 164.404 (breach notification) | Section 3 | Notification within 60 days, template provided |
| HIPAA 164.308(a)(6) (security incident procedures) | Sections 1-2 | Classification and response phases documented |
| PCI-DSS 12.10 (incident response plan) | Sections 1-6 | Full plan with runbooks and escalation |
| PCI-DSS 12.10.2 (annual testing) | Section 2, Phase 6 | Quarterly review cadence, post-incident review |
| FedRAMP IR-6 (incident reporting) | Sections 3-4 | Escalation matrix and notification deadlines |
| GDPR Art. 33 (notification to authority) | Section 3 | 72-hour notification to supervisory authority |
| GDPR Art. 34 (notification to data subject) | Section 3 | Customer notification template and procedure |
| SOC 2 CC7.3 (security incidents) | Sections 1-6 | Detection, response, communication, and review |

8. Plan Maintenance

  • Quarterly review: Security Lead reviews and updates this plan every quarter
  • After every P0/P1 incident: update runbooks with lessons learned within 1 week
  • Annual tabletop exercise: simulate a P0 breach scenario and walk through the full response
  • New hire onboarding: all engineers read this plan within their first week
  • Version control: this document is maintained in docs/security/incident-response-plan.md in the upsquad-core repository; all changes go through PR review