Cloud incident response shares the structure of any IR — detect, contain, eradicate, recover, learn — but the speed and tooling differ. Containment that took hours on-prem can be one API call. Evidence is a snapshot away. The challenge is being prepared and disciplined enough to use the speed correctly.
Before the Incident: Preparation
- Runbooks for the top expected scenarios: leaked access key, public bucket, compromised EC2, container escape, DDoS, data exfiltration.
- RACI — incident commander, security lead, comms lead, executive sponsor, legal counsel.
- Communication channels — incident Slack/Teams channel template, status page, customer comms templates.
- Authority — who can authorise rotating keys, isolating prod, calling in external IR.
- External relationships — IR retainer (Mandiant, CrowdStrike, Kroll), cyber insurance, regulators' contact details.
- Tabletop exercises quarterly; purple-team exercises annually.
The Five Phases
1. Detect
Alert fires (covered in the previous lesson). Acknowledge, declare an incident, open the channel.
2. Triage
- Confirm: is this a real incident or a noisy alert?
- Assess scope: which accounts, regions, identities, data?
- Set severity: customer impact, data sensitivity, blast radius.
- Notify stakeholders per severity.
3. Contain
Stop the bleeding. Cloud-native tools make this very fast — but be careful: containment can also tip off the attacker. For confirmed compromise, contain decisively.
| Indicator | Containment action |
|---|---|
| Leaked IAM access key | Disable the key; invalidate active STS sessions (attach an explicit-deny policy to the user or role, for example one conditioned on `aws:TokenIssueTime`); rotate |
| Compromised EC2 / VM | Snapshot disks; replace SG with isolation SG (no ingress, no egress); detach role; image for forensic analysis |
| Compromised container | Cordon node; drain pods; capture pod and node memory if possible; rotate service account tokens |
| Public bucket | Apply Block Public Access + bucket policy deny; preserve access logs |
| Compromised user | Revoke sessions in IdP; rotate password; require MFA reset; review recent activity |
| DDoS | Scale up, enable Shield Advanced / Cloud Armor adaptive protection, block source ASNs/regions |
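The session-invalidation step in the first table row can be sketched as follows. This is a minimal illustration, assuming a boto3-style IAM client is passed in; the policy name `RevokeOlderSessions` and both function names are hypothetical. It builds an explicit deny that applies only to temporary credentials issued before a cutoff time, using the `aws:TokenIssueTime` condition key.

```python
import json
from datetime import datetime, timezone


def build_revoke_sessions_policy(cutoff: datetime) -> str:
    """Return policy JSON denying all actions for sessions issued before `cutoff`."""
    if cutoff.tzinfo is None:
        raise ValueError("cutoff must be timezone-aware")
    issued_before = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                # aws:TokenIssueTime is only present on temporary credentials,
                # so long-term keys are not matched by this statement.
                "Condition": {"DateLessThan": {"aws:TokenIssueTime": issued_before}},
            }
        ],
    }
    return json.dumps(policy)


def revoke_role_sessions(iam_client, role_name: str, cutoff: datetime) -> None:
    """Attach the deny as an inline policy on the role (policy name is hypothetical)."""
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName="RevokeOlderSessions",
        PolicyDocument=build_revoke_sessions_policy(cutoff),
    )
```

Sessions issued after the cutoff keep working, so legitimate workloads recover once they re-authenticate. The same document could be attached to a user with `put_user_policy` instead.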
4. Eradicate & Recover
- Identify the initial vector and remove it (delete the web shell, patch the vulnerability, fix the misconfiguration).
- Rebuild from clean images / Terraform — do not "clean" a compromised host in place.
- Rotate all credentials potentially exposed (within blast radius). Be generous.
- Restore from a backup confirmed to predate the compromise.
- Phased restoration: bring services back behind extra monitoring; full restore only when clean.
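Choosing a backup that predates the compromise can be sketched as below. The record shape (`id`, `created_at`) and the 24-hour safety margin are assumptions for illustration only. The earliest compromise time should come from the investigation, and the selected backup still needs to be verified before restoring.

```python
from datetime import datetime, timedelta, timezone


def latest_clean_backup(backups, earliest_compromise, margin=timedelta(hours=24)):
    """Return the newest backup created before (earliest_compromise - margin), or None."""
    cutoff = earliest_compromise - margin
    candidates = [b for b in backups if b["created_at"] < cutoff]
    return max(candidates, key=lambda b: b["created_at"], default=None)


# Hypothetical backup catalogue for illustration.
backups = [
    {"id": "bk-1", "created_at": datetime(2024, 3, 1, tzinfo=timezone.utc)},
    {"id": "bk-2", "created_at": datetime(2024, 3, 8, tzinfo=timezone.utc)},
    {"id": "bk-3", "created_at": datetime(2024, 3, 15, tzinfo=timezone.utc)},
]
compromise = datetime(2024, 3, 8, 12, 0, tzinfo=timezone.utc)
chosen = latest_clean_backup(backups, compromise)
```

With the default margin, `bk-2` is rejected because it was taken less than 24 hours before the first known attacker activity.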
5. Lessons Learned
Postmortem within days. Action items with owners and deadlines. Structural fixes (better IAM, better detection rule, missing guardrail) over tactical ones.
Evidence Preservation
Before remediation, copy what you will need later:
- EBS / managed-disk snapshots of affected VMs.
- EBS / disk volumes attached read-only to a forensic instance for triage.
- Memory captures where possible (LiME, AWS SSM document, Azure Run Command).
- CloudTrail / Activity Log / Audit Logs for the time window — copied to a separate forensic account.
- Container images, Kubernetes audit logs, pod logs.
- Network logs (VPC Flow, NSG Flow).
The forensic account / project should be tightly scoped, with object-locked storage and a different IAM blast radius from the rest of the org. Even if the attacker has admin in your prod org, they cannot reach the forensic copies.
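The disk-snapshot step could be scripted roughly as follows. This is a sketch assuming a boto3-style EC2 client is passed in. The tag keys (`ir-case`, `source-instance`, `captured-at`) and the function name are hypothetical, and copying the resulting snapshots into the forensic account is left out.

```python
from datetime import datetime, timezone


def snapshot_instance_volumes(ec2_client, instance_id: str, case_id: str) -> list[str]:
    """Snapshot every EBS volume attached to the instance; return the snapshot IDs."""
    captured_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    response = ec2_client.describe_instances(InstanceIds=[instance_id])
    snapshot_ids = []
    for reservation in response["Reservations"]:
        for instance in reservation["Instances"]:
            for mapping in instance.get("BlockDeviceMappings", []):
                ebs = mapping.get("Ebs")
                if not ebs:
                    continue  # skip non-EBS mappings
                snapshot = ec2_client.create_snapshot(
                    VolumeId=ebs["VolumeId"],
                    Description=f"IR {case_id}: {instance_id} {mapping['DeviceName']}",
                    TagSpecifications=[
                        {
                            "ResourceType": "snapshot",
                            # Tag keys are illustrative, not a standard.
                            "Tags": [
                                {"Key": "ir-case", "Value": case_id},
                                {"Key": "source-instance", "Value": instance_id},
                                {"Key": "captured-at", "Value": captured_at},
                            ],
                        }
                    ],
                )
                snapshot_ids.append(snapshot["SnapshotId"])
    return snapshot_ids
```

Tagging at creation time ties each snapshot to the case from the start, which helps with the decision-and-timestamp record needed later.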
Automating Response
Common reactions can be safely automated:
- Public S3 bucket detected → automatically apply Block Public Access (Config rule remediation, EventBridge → Lambda).
- IAM user accumulating dangerous permissions → revert to baseline policy.
- EC2 instance flagged by GuardDuty as crypto-mining → isolate via SG swap, snapshot, page on-call.
- Compromised credential alert → invalidate sessions, force re-auth.
Wire EventBridge / Sentinel Logic Apps / Eventarc → workflow tools. Keep humans in the loop for high-impact actions; automate clearly-safe ones.
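A minimal sketch of the "automate the clearly safe, escalate the rest" split, assuming a boto3-style S3 client and a paging callback are injected. The normalised finding shape (`action`, `bucket`) and the `AUTO_APPROVED` allow-list are hypothetical. Real EventBridge events would need to be mapped into that shape first.

```python
# Actions considered safe to run without a human; everything else is escalated.
AUTO_APPROVED = {"block_public_access"}


def block_public_access(s3_client, bucket: str) -> None:
    """Turn on all four S3 Block Public Access settings for one bucket."""
    s3_client.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )


def handle_finding(finding: dict, s3_client, page_oncall) -> str:
    """Remediate allow-listed findings automatically; page a human for the rest."""
    action = finding.get("action")
    if action in AUTO_APPROVED and action == "block_public_access":
        block_public_access(s3_client, finding["bucket"])
        return "remediated"
    page_oncall(finding)
    return "escalated"
```

Keeping the allow-list explicit means a new automated action has to be added deliberately rather than slipping in through a default branch.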
Comms During an Incident
- One incident channel for engineering coordination; one for stakeholder updates.
- Pre-written customer notification templates so legal/PR are not blocking under pressure.
- Status page updates every 30–60 minutes, even with "still investigating".
- Notify regulators where required (GDPR: within 72 hours of becoming aware of a personal-data breach; many financial regulators and sectoral rules set their own deadlines).
- Document decisions and timestamps. The postmortem and (potentially) regulatory filing depend on this record.
Cloud-Specific Pitfalls
- Not all sessions invalidate: disabling an access key blocks new requests signed with that key, but STS sessions already issued from it live until expiry. Use `aws iam put-user-policy` with an explicit deny, or revoke sessions in the IdP.
- Cross-account assume-role chains: investigate every account the compromised principal could reach.
- Logs in the compromised account: attackers often try to stop or delete CloudTrail trails. Logs must live in a separate, write-protected account.
- Cost spikes as a signal: sudden EC2 / GPU spend often means crypto-mining; monitor for anomalous billing.
- Encryption keys: rotate KMS keys if compromised; re-encrypt or rely on key-version rollover.
Mock Runbook: Leaked Access Key
- Confirm: is the alert real? Check CloudTrail for the user's recent activity.
- Disable the access key (`aws iam update-access-key --status Inactive`).
- Attach an explicit-deny inline policy to the user/role to invalidate STS sessions.
- Snapshot all CloudTrail events for the user in the past 30 days into the forensic account.
- Review `iam:Create*`, `s3:Get*`, `ec2:RunInstances` for suspicious activity.
- Rotate any secrets the user could read (or note the exposure in the postmortem).
- Replace the user with a role + federation; remove the static key permanently.
- Search the codebase, registries, and Slack for the key fingerprint to find the leak source.
- Postmortem; add detection for the leak vector.
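The review step in this runbook can be sketched as a filter over CloudTrail records that have already been exported as dictionaries. It assumes each record carries the standard `eventSource` and `eventName` fields, and that the service prefix can be taken from the first label of `eventSource`. That holds for IAM, S3, and EC2 but not for every AWS service. `s3:Get*` calls such as `GetObject` are data events and only appear if data-event logging was enabled.

```python
from fnmatch import fnmatchcase

# Patterns taken from the runbook step above.
SUSPICIOUS_PATTERNS = ("iam:Create*", "s3:Get*", "ec2:RunInstances")


def action_name(event: dict) -> str:
    """Map a CloudTrail record to 'service:Action', e.g. 'iam:CreateUser'."""
    service = event["eventSource"].split(".", 1)[0]
    return f"{service}:{event['eventName']}"


def flag_suspicious(events, patterns=SUSPICIOUS_PATTERNS):
    """Return the records whose action matches any of the wildcard patterns."""
    return [
        event
        for event in events
        if any(fnmatchcase(action_name(event), pattern) for pattern in patterns)
    ]


# Hypothetical records for illustration.
events = [
    {"eventSource": "iam.amazonaws.com", "eventName": "CreateAccessKey"},
    {"eventSource": "iam.amazonaws.com", "eventName": "ListUsers"},
    {"eventSource": "ec2.amazonaws.com", "eventName": "RunInstances"},
    {"eventSource": "s3.amazonaws.com", "eventName": "GetObject"},
]
flagged = [action_name(e) for e in flag_suspicious(events)]
```

A match only marks a record for human review. Whether `iam:CreateAccessKey` is malicious depends on who called it and when.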
The Truth About Real Incidents
Under stress, even strong teams make mistakes. The teams that handle incidents well are the teams that practiced. Run a tabletop next month. Pick a scenario, walk through who decides what, see where the gaps are. The gaps you find in practice are the ones you do not find at 2am during a real incident.