6 min read·Lesson 8 of 10

Incident Response in the Cloud

From alert to containment to recovery. Cloud-specific evidence preservation, automated response, and the playbooks that work under pressure.

Cloud incident response shares the structure of any IR — detect, contain, eradicate, recover, learn — but the speed and tooling differ. Containment that took hours on-prem can be one API call. Evidence is a snapshot away. The challenge is being prepared and disciplined enough to use the speed correctly.

Before the Incident: Preparation

  • Runbooks for the top expected scenarios: leaked access key, public bucket, compromised EC2, container escape, DDoS, data exfiltration.
  • RACI — incident commander, security lead, comms lead, executive sponsor, legal counsel.
  • Communication channels — incident Slack/Teams channel template, status page, customer comms templates.
  • Authority — who can authorise rotating keys, isolating prod, calling in external IR.
  • External relationships — IR retainer (Mandiant, CrowdStrike, Kroll), cyber insurance, regulators' contact details.
  • Tabletop exercises quarterly; purple-team exercises annually.

The Five Phases

1. Detect

Alert fires (covered in the previous lesson). Acknowledge, declare an incident, open the channel.

2. Triage

  • Confirm: is this a real incident or a noisy alert?
  • Assess scope: which accounts, regions, identities, data?
  • Set severity: customer impact, data sensitivity, blast radius.
  • Notify stakeholders per severity.
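
The severity step is easier to apply consistently under pressure when the rule is written down. The sketch below is illustrative only: the 1 to 3 rating scale, the `SEV1` to `SEV3` labels, and the thresholds are assumptions, not an industry standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageInput:
    # Each factor is rated 1 (low) to 3 (high); the scale is an assumption.
    customer_impact: int
    data_sensitivity: int
    blast_radius: int


def severity(t: TriageInput) -> str:
    """Map the three triage factors to a severity label (hypothetical thresholds)."""
    factors = (t.customer_impact, t.data_sensitivity, t.blast_radius)
    if any(f not in (1, 2, 3) for f in factors):
        raise ValueError("each factor must be 1, 2, or 3")
    # Any single high factor is enough to declare the top severity.
    if max(factors) == 3:
        return "SEV1"
    # Two or more medium factors escalate to the middle tier.
    if sum(1 for f in factors if f == 2) >= 2:
        return "SEV2"
    return "SEV3"
```

Taking the maximum rather than an average means one severe factor, such as highly sensitive data, cannot be diluted by two mild ones.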

3. Contain

Stop the bleeding. Cloud-native tools make this very fast — but be careful: containment can also tip off the attacker. For confirmed compromise, contain decisively.

| Indicator | Containment action |
|---|---|
| Leaked IAM access key | Disable the key; invalidate active STS sessions (attach an explicit-deny inline policy, for example one conditioned on aws:TokenIssueTime); rotate |
| Compromised EC2 / VM | Snapshot disks; replace SG with isolation SG (no ingress, no egress); detach role; image for forensic analysis |
| Compromised container | Cordon node; drain pods; capture pod and node memory if possible; rotate service account tokens |
| Public bucket | Apply Block Public Access + bucket policy deny; preserve access logs |
| Compromised user | Revoke sessions in IdP; rotate password; require MFA reset; review recent activity |
| DDoS | Scale up, enable Shield Advanced / Cloud Armor adaptive protection, block source ASNs/regions |
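
As an illustration of the compromised EC2 row, the sketch below snapshots the attached volumes first and then swaps in the isolation security group. It assumes `ec2` behaves like `boto3.client("ec2")` and that `isolation_sg_id` names a security group you have already created with no ingress and no egress rules. The function name is hypothetical.

```python
def isolate_instance(ec2, instance_id: str, isolation_sg_id: str) -> list[str]:
    """Snapshot every attached EBS volume, then swap in the isolation SG.

    `ec2` is expected to behave like boto3.client("ec2"). Returns the IDs
    of the snapshots that were started.
    """
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]

    snapshot_ids = []
    # Evidence first: start a snapshot of each EBS volume before changing anything.
    for mapping in instance.get("BlockDeviceMappings", []):
        snap = ec2.create_snapshot(
            VolumeId=mapping["Ebs"]["VolumeId"],
            Description=f"IR evidence: {instance_id} {mapping['DeviceName']}",
        )
        snapshot_ids.append(snap["SnapshotId"])

    # Containment: setting Groups replaces all previously attached security groups.
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[isolation_sg_id])
    return snapshot_ids
```

Passing the client in as a parameter keeps the function testable with a stub and lets the same code run under a dedicated incident-response role.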

4. Eradicate & Recover

  • Identify the initial vector and remove it (delete the web shell, patch the vulnerability, fix the misconfiguration).
  • Rebuild from clean images / Terraform — do not "clean" a compromised host in place.
  • Rotate all credentials potentially exposed (within blast radius). Be generous.
  • Restore from a backup confirmed to predate the compromise.
  • Phased restoration: bring services back behind extra monitoring; full restore only when clean.
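
Choosing a backup that predates the compromise comes down to a timestamp comparison. The sketch below assumes you already have a mapping of backup IDs to creation times and an estimate of when the compromise began. Both inputs and the function name are hypothetical.

```python
from datetime import datetime
from typing import Optional


def latest_clean_backup(
    backups: dict[str, datetime], compromise_start: datetime
) -> Optional[str]:
    """Return the ID of the newest backup taken strictly before `compromise_start`.

    `backups` maps a backup ID to its creation time. Returns None when no
    backup predates the compromise.
    """
    clean = {bid: ts for bid, ts in backups.items() if ts < compromise_start}
    if not clean:
        return None
    # The most recent clean backup loses the least legitimate data.
    return max(clean, key=clean.get)
```

The start time is itself an estimate, so it is safer to pass the earliest indicator found so far and to re-run the selection if the investigation pushes that time back.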

5. Lessons Learned

Postmortem within days. Action items with owners and deadlines. Structural fixes (better IAM, better detection rule, missing guardrail) over tactical ones.

Evidence Preservation

Before remediation, copy what you will need later:

  • EBS / managed-disk snapshots of affected VMs.
  • Volumes restored from those snapshots, attached read-only to a forensic instance for triage.
  • Memory captures where possible (LiME, AWS SSM document, Azure Run Command).
  • CloudTrail / Activity Log / Audit Logs for the time window — copied to a separate forensic account.
  • Container images, Kubernetes audit logs, pod logs.
  • Network logs (VPC Flow, NSG Flow).

The forensic account / project should be tightly scoped, with object-locked storage and a different IAM blast radius from the rest of the org, so that an attacker with admin rights in your production accounts still cannot reach or delete the forensic copies.
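
Copies are more useful later if you can show they were not altered. One common way to do that, added here as an illustration rather than something the lesson prescribes, is to record a SHA-256 hash of each collected file in a manifest stored alongside the evidence. The directory layout, field names, and function names below are assumptions.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large disk images do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(evidence_dir: Path, incident_id: str) -> dict:
    """Record relative path, size, and SHA-256 of every file under `evidence_dir`."""
    files = sorted(p for p in evidence_dir.rglob("*") if p.is_file())
    return {
        "incident_id": incident_id,
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "files": [
            {
                "path": p.relative_to(evidence_dir).as_posix(),
                "bytes": p.stat().st_size,
                "sha256": sha256_file(p),
            }
            for p in files
        ],
    }
```

Writing the manifest to the object-locked bucket together with the evidence means anyone can verify the hashes later without having to trust whoever collected the files.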

Automating Response

Common reactions can be safely automated:

  • Public S3 bucket detected → automatically apply Block Public Access (Config rule remediation, EventBridge → Lambda).
  • IAM user accumulating dangerous permissions → revert to baseline policy.
  • EC2 instance flagged by GuardDuty as crypto-mining → isolate via SG swap, snapshot, page on-call.
  • Compromised credential alert → invalidate sessions, force re-auth.

Wire EventBridge, Sentinel Logic Apps, or Eventarc into your workflow tools. Keep humans in the loop for high-impact actions; automate the clearly safe ones.
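
A minimal sketch of that split between automated and approved actions follows. It assumes `s3` behaves like `boto3.client("s3")`, that `page_oncall` is any callable that notifies a human, and that incoming events have already been mapped to a simple `{"action": ..., "bucket": ...}` shape. The action names and that event shape are hypothetical, not a real EventBridge payload.

```python
def block_public_access(s3, bucket: str) -> None:
    """Turn on all four Block Public Access settings for one bucket."""
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )


# Actions treated as safe to run without approval (an assumption for this sketch).
AUTO_ACTIONS = {"block_public_access": block_public_access}


def handle_event(event: dict, s3, page_oncall) -> str:
    """Remediate pre-approved actions; escalate everything else to a human."""
    action = event["action"]
    remediate = AUTO_ACTIONS.get(action)
    if remediate is None:
        # High-impact or unknown action: do nothing automatically.
        page_oncall(f"Approval needed for {action!r} on {event.get('bucket')}")
        return "escalated"
    remediate(s3, event["bucket"])
    return "remediated"
```

Because the allow-list is explicit, any new or unrecognised action is escalated by default instead of being run automatically.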

Comms During an Incident

  • One incident channel for engineering coordination; one for stakeholder updates.
  • Pre-written customer notification templates so legal/PR are not blocking under pressure.
  • Status page updates every 30–60 minutes, even with "still investigating".
  • Notify regulators where required (GDPR: 72h, many financial regulators, sectoral rules).
  • Document decisions and timestamps. The postmortem and (potentially) regulatory filing depend on this record.
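
The 72-hour GDPR window is simple arithmetic once the time of awareness is recorded, which is one reason the timestamp log matters. The sketch below assumes the clock starts when the organisation became aware of the breach. The function names are hypothetical, and other regulators' deadlines would need their own constants.

```python
from datetime import datetime, timedelta

GDPR_NOTIFICATION_WINDOW = timedelta(hours=72)


def regulator_deadline(aware_at: datetime) -> datetime:
    """Latest time to notify the supervisory authority under the 72-hour rule."""
    if aware_at.tzinfo is None:
        raise ValueError("use timezone-aware timestamps in the incident log")
    return aware_at + GDPR_NOTIFICATION_WINDOW


def hours_remaining(aware_at: datetime, now: datetime) -> float:
    """Hours left before the deadline; negative once it has passed."""
    return (regulator_deadline(aware_at) - now).total_seconds() / 3600
```

Rejecting naive timestamps avoids an off-by-several-hours error when responders in different time zones write to the same log.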

Cloud-Specific Pitfalls

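  • Not all sessions invalidate: disabling an access key stops new calls signed with that key, but temporary STS credentials that were already issued stay valid until they expire. Attach an explicit deny (aws iam put-user-policy for a user, aws iam put-role-policy for a role), or revoke sessions in the IdP.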
  • Cross-account assume-role chains — investigate every account the compromised principal could reach.
  • Logs in the compromised account: attackers often stop logging or delete CloudTrail trails, so logs must live in a separate, write-protected account.
  • Cost spikes as a signal — sudden EC2 / GPU spend often means crypto-mining; monitor for anomalous billing.
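  • Encryption keys: rotate KMS keys if compromised. Rotation only changes the key used for new encryption; older key versions still decrypt existing ciphertext, so re-encrypt sensitive data rather than relying on key-version rollover alone.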
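
The session pitfall is usually handled with a time-conditioned deny: a policy that rejects every request made with temporary credentials issued before a cutoff. The sketch below builds such a policy using the `aws:TokenIssueTime` condition key and attaches it inline to a role. It assumes `iam` behaves like `boto3.client("iam")`. The policy name and function names are hypothetical.

```python
import json
from datetime import datetime, timezone


def revoke_sessions_policy(issued_before: datetime) -> str:
    """Build a policy that denies every action to sessions issued before a cutoff."""
    if issued_before.tzinfo is None:
        raise ValueError("cutoff must be timezone-aware")
    cutoff = issued_before.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                # Only temporary credentials carry a token issue time.
                "Condition": {"DateLessThan": {"aws:TokenIssueTime": cutoff}},
            }
        ],
    }
    return json.dumps(policy)


def revoke_role_sessions(iam, role_name: str, issued_before: datetime) -> None:
    """Attach the deny policy inline to a role."""
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="RevokeOlderSessions",  # hypothetical policy name
        PolicyDocument=revoke_sessions_policy(issued_before),
    )
```

Sessions issued after the cutoff are unaffected, so this only helps once the attacker's way of obtaining new sessions (the leaked key, the trust policy, the IdP account) has also been closed.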

Mock Runbook: Leaked Access Key

  1. Confirm: is the alert real? Check CloudTrail for the user's recent activity.
  2. Disable the access key (aws iam update-access-key --status Inactive).
  3. Attach explicit-deny inline policy to the user/role to invalidate STS sessions.
  4. Snapshot all CloudTrail events for the user in the past 30 days into the forensic account.
  5. Review iam:Create*, s3:Get*, ec2:RunInstances for sketchy activity.
  6. Rotate any secrets the user could read; data itself cannot be rotated, so record what was readable in the postmortem.
  7. Replace the user with a role + federation; remove the static key permanently.
  8. Search the codebase, registries, and Slack for the key fingerprint to find the leak source.
  9. Postmortem; add detection for the leak vector.
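
Steps 2 and 5 of the runbook can be sketched in a few lines. The example assumes `iam` behaves like `boto3.client("iam")` and that `records` is a list of CloudTrail event records already loaded as dictionaries. The function names and the pattern table are hypothetical. Note that `s3:Get*` object reads only appear if S3 data events were being logged.

```python
from fnmatch import fnmatchcase

# (eventSource, eventName pattern) pairs taken from step 5 of the runbook.
REVIEW_PATTERNS = [
    ("iam.amazonaws.com", "Create*"),
    ("s3.amazonaws.com", "Get*"),
    ("ec2.amazonaws.com", "RunInstances"),
]


def disable_key(iam, user_name: str, access_key_id: str) -> None:
    """Step 2: mark the leaked key Inactive."""
    iam.update_access_key(
        UserName=user_name, AccessKeyId=access_key_id, Status="Inactive"
    )


def events_to_review(records: list[dict], access_key_id: str) -> list[dict]:
    """Step 5: keep records made with the leaked key that match a review pattern."""
    flagged = []
    for rec in records:
        if rec.get("userIdentity", {}).get("accessKeyId") != access_key_id:
            continue
        if any(
            rec.get("eventSource") == source
            and fnmatchcase(rec.get("eventName", ""), pattern)
            for source, pattern in REVIEW_PATTERNS
        ):
            flagged.append(rec)
    return flagged
```

Filtering on the access key ID rather than the user name keeps activity from the user's other, uncompromised keys out of the review set.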

The Truth About Real Incidents

Under stress, even strong teams make mistakes. The teams that handle incidents well are the teams that practiced. Run a tabletop next month. Pick a scenario, walk through who decides what, see where the gaps are. The gaps you find in practice are the ones you will not have to discover at 2 a.m. during a real incident.

Key Takeaways

  • Plan response before incidents: roles, comms, decision authority, evidence handling.
  • Containment in cloud is fast — disable creds, snapshot then isolate, rotate keys, revoke sessions.
  • Preserve evidence by snapshotting and copying logs to an immutable account before remediating.
  • Automate common runbooks with EventBridge, Logic Apps, Cloud Functions — humans approve, robots act.
  • Run tabletop and purple-team exercises so the first real incident is not also the first practice run.
