Security Incident Response and Digital Forensics: What to Do When You Are Breached
An SRE outage is a system failing you. A breach is an adversary attacking you. This is the engineer's field guide to the NIST incident response lifecycle, preserving evidence, and the forensic sources that tell you what actually happened.
At 02:14 a GuardDuty finding fires: an IAM role you have never seen calling AssumeRole from an IP in a country you do not operate in. Your first instinct, honed by years of on-call, is to fix it: kill the session, rotate the key, restart the service, go back to bed. That instinct is exactly wrong. This is not a service that broke. This is a person who got in, and the actions you take in the next ten minutes either preserve the truth or destroy it.
Who this is for
Engineers and SREs who own production systems and might be first on the scene of a breach. You do not need to be a forensic examiner or a CISO. You need to know what an incident is, how to contain it without burning the evidence, and what to hand to the people who will reconstruct what happened. If you have read [incident management and on-call](/blog/incident-management-and-oncall), this is its security-shaped sibling.
SRE incident response asks 'how do we restore service?'. Security incident response asks a harder set of questions first: 'is the attacker still here? what did they touch? what did they take? and can I prove it?'. Restoring service too fast can re-expose you, tip off the adversary, or wipe the very logs that answer those questions.
The mental model: you are a first responder at a crime scene
An incident is a confirmed adverse event with impact; an event is anything observable. Most events are noise. The job is to decide which events are incidents, and to not contaminate the scene while you do.
| At a crime scene | In a breach |
| --- | --- |
| Secure the scene; keep people out so they do not trample evidence | Restrict access to affected systems; freeze deploys and routine admin activity |
| Do not move the body before the photographer arrives | Do not reboot, terminate, or re-image a host before you capture memory and disk |
| Bag and tag each item with who touched it and when | Snapshot volumes, hash the images, record a chain of custody |
| Detectives reconstruct the timeline from physical traces | Forensic analysts reconstruct the attack from logs, memory, and disk artifacts |
| A first responder is not the prosecutor | You contain and preserve; specialists (or future-you) do deep analysis |
A breach is a crime scene. The same instincts that make you a good engineer (fix it fast, clean it up) make you a bad first responder. Slow down and preserve.
Hold this frame for the whole article. Every contested call (isolate or watch? reboot or snapshot? notify or wait?) gets easier when you ask 'what would a careful first responder do?'.
The IR lifecycle (NIST 800-61) is a loop, not a line
NIST splits incident handling into phases. The trap is reading them as a one-way checklist. In a real breach you bounce between detect/analyze and contain/eradicate repeatedly as you discover more, and every incident feeds prepare for the next one. Draw it as a loop.
The NIST incident response lifecycle as a closed loop. Detection and analysis interleave with containment and eradication; lessons learned flow back into preparation.
1. Prepare. Before anything happens: centralized logs you cannot tamper with, a documented IR plan, contact lists, pre-staged forensic tooling, and rehearsed tabletops. The work you skip here is the work you regret at 02:14.
2. Detect & Analyze. Triage the event. Is it real? What is the scope and blast radius? Establish a timeline and an initial set of indicators of compromise. Declare an incident and assign an incident commander.
3. Contain. Stop the spread. Short-term containment buys time (isolate a host); long-term containment is a sustainable holding pattern while you prepare to eradicate. This is where the isolate-vs-observe trade-off lives.
4. Eradicate. Remove the attacker's access entirely: revoke credentials, kill persistence (cron jobs, new IAM users, backdoored AMIs), patch the entry vector. Half-eradication means they come back.
5. Recover. Rebuild from known-good, restore service, and watch closely: keep heightened monitoring on the affected systems for days, because attackers often return to test whether you actually closed the door.
6. Lessons Learned. Within a week or two, run a blameless post-mortem: what let them in, what slowed detection, what was missing in prepare. Feed every gap back into the loop.
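One way to make the "loop, not a line" point concrete is to write the phases down as a small state machine. The sketch below is an illustration only: the phase names follow the list above, but the exact set of allowed transitions (in particular the backward edges into Detect & Analyze) is an assumption of this example, not something NIST 800-61 prescribes.

```python
from enum import Enum


class Phase(Enum):
    PREPARE = "prepare"
    DETECT_ANALYZE = "detect_analyze"
    CONTAIN = "contain"
    ERADICATE = "eradicate"
    RECOVER = "recover"
    LESSONS_LEARNED = "lessons_learned"


# Assumed transition map. The backward edges are the point: new findings
# during containment, eradication, or recovery send you back to analysis,
# and lessons learned closes the loop into preparation.
ALLOWED = {
    Phase.PREPARE: {Phase.DETECT_ANALYZE},
    Phase.DETECT_ANALYZE: {Phase.CONTAIN},
    Phase.CONTAIN: {Phase.DETECT_ANALYZE, Phase.ERADICATE},
    Phase.ERADICATE: {Phase.DETECT_ANALYZE, Phase.RECOVER},
    Phase.RECOVER: {Phase.DETECT_ANALYZE, Phase.LESSONS_LEARNED},
    Phase.LESSONS_LEARNED: {Phase.PREPARE},
}


def advance(current: Phase, target: Phase) -> Phase:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Moving from `Phase.CONTAIN` back to `Phase.DETECT_ANALYZE` is a normal transition here, while jumping from `Phase.PREPARE` straight to `Phase.RECOVER` is rejected.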
Containment is a trade-off: isolate now vs watch the attacker
The most contested decision in a breach is *when* to pull the plug. Isolate immediately and you stop damage, but you tip off the attacker, who may have other footholds you have not found, and they may detonate (wipe, ransom, exfil-dump) on the way out. Observe quietly and you map the full intrusion, but every minute the attacker is live is more risk. There is no universally right answer; there is a right answer for *this* incident.
| Approach | Best when | Risk |
| --- | --- | --- |
| Isolate immediately (cut network, disable creds) | Active destruction, ransomware, or sensitive data being exfiltrated right now | Tips off the attacker; you may miss other footholds; possible loss of volatile evidence if done carelessly |
| Observe & monitor (let them run, watch closely) | Sophisticated actor, you have strong monitoring, and immediate damage is low | Attacker continues to act; legal/ethical exposure grows; requires real isolation capability if it turns |
| Segment / sandbox (move to a controlled, watched network) | You want intel but cannot accept live exposure | Operationally hard to do quickly; attacker may detect the move |
| Power off (last resort) | Imminent catastrophic damage and nothing else will stop it | Wipes volatile memory (running malware, decrypted secrets, live sockets) |
Containment options and their trade-offs. The right choice depends on data sensitivity, attacker sophistication, and your ability to monitor safely.
Make it a decision, not an accident
Decide isolate-vs-observe explicitly, name who made the call, and write it in the incident timeline. The default reflex (kill it now) is often right for ransomware and often wrong for a slow, credentialed intruder. Do not let the choice happen by reflex.
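A lightweight way to force the decision to be explicit is to refuse to record it without a named owner and a rationale. This is a minimal sketch: the `ContainmentDecision` class, its field names, and the four approach labels are hypothetical names chosen for this example, mirroring the options in the table above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical labels for the four approaches in the containment table.
APPROACHES = {"isolate", "observe", "segment", "power_off"}


@dataclass(frozen=True)
class ContainmentDecision:
    approach: str
    decided_by: str
    rationale: str
    decided_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self):
        if self.approach not in APPROACHES:
            raise ValueError(f"unknown approach: {self.approach!r}")
        if not self.decided_by.strip() or not self.rationale.strip():
            raise ValueError("a decision needs a named owner and a rationale")

    def timeline_entry(self) -> str:
        """One line suitable for pasting into the incident timeline."""
        return (
            f"{self.decided_at.isoformat()} containment={self.approach} "
            f"by={self.decided_by}: {self.rationale}"
        )
```

For example, `ContainmentDecision("observe", "IC on duty", "slow credentialed intruder, no active exfil")` produces a timestamped timeline line, while an empty `decided_by` raises before anything is recorded.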
Forensic evidence: what each source actually reveals
Reconstructing an attack is correlation across sources, because no single log tells the whole story. CloudTrail shows you the API calls but not the packets; flow logs show the connections but not the payload; EDR shows process behavior on the host; memory shows what disk never will. Know what each gives you before you need it.
| Source | Reveals | Volatility |
| --- | --- | --- |
| CloudTrail / audit logs | Who called which API, when, and from where: AssumeRole, key creation, S3 access, IAM changes | |
Forensic sources and what each reveals. Volatility matters: capture the most ephemeral evidence first.
Indicators of compromise (IOCs) are the breadcrumbs you pivot on across all of these: an attacker IP, a malicious file hash, a rogue IAM principal, a suspicious domain, a user-agent string. Find one IOC in CloudTrail, then go hunt it in flow logs and EDR to expand the picture.
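The pivot itself is mechanical: take one indicator, search every source for it, and merge the hits into a single timeline. The sketch below assumes each source has already been exported and normalized into a list of dictionaries with a `time` field in ISO-8601 UTC (so string order equals chronological order); the source names and field names are hypothetical, and real CloudTrail, flow-log, and EDR exports will need their own parsing first.

```python
def pivot(ioc: str, sources: dict[str, list[dict]]) -> list[dict]:
    """Return every record, from any source, with a field equal to the IOC, oldest first."""
    hits = []
    for source_name, records in sources.items():
        for record in records:
            if any(str(value) == ioc for value in record.values()):
                # Tag each hit with where it came from so the merged
                # timeline stays attributable to a source.
                hits.append({"source": source_name, **record})
    return sorted(hits, key=lambda r: r["time"])


# Made-up sample data for illustration.
sources = {
    "cloudtrail": [
        {"time": "2026-06-05T02:14:07Z", "src_ip": "203.0.113.66", "event": "AssumeRole"},
        {"time": "2026-06-05T01:00:00Z", "src_ip": "198.51.100.7", "event": "ListBuckets"},
    ],
    "flow_logs": [
        {"time": "2026-06-05T02:13:55Z", "src_ip": "203.0.113.66", "dst_port": 443},
    ],
}

for hit in pivot("203.0.113.66", sources):
    print(hit["time"], hit["source"])
```

Exact-match comparison is deliberately strict; hunting a domain or user-agent substring would need a different matcher.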
Hunting an IOC and preserving evidence (hands on)
Suppose your IOC is a suspicious external IP that performed an AssumeRole. Start in CloudTrail (queried via Athena) to find every API call from that source and which roles it touched.
athena_assumerole_hunt.sql
```sql
-- Find all AssumeRole and credentialed calls from a suspect IP
-- (CloudTrail logs queried via Athena over the partitioned S3 table)
SELECT
  eventtime,
  useridentity.arn AS principal,
  eventname,
  sourceipaddress,
  requestparameters,
  errorcode
FROM cloudtrail_logs
WHERE sourceipaddress = '203.0.113.66'
  AND eventtime >= '2026-06-05T00:00:00Z'
ORDER BY eventtime ASC;

-- Then pivot: which roles were assumed, and what did the
-- resulting temporary sessions do afterward?
-- ('deploy-bot' stands in for a role name surfaced by the first query.)
SELECT
  useridentity.arn AS assumed_role,
  eventname,
  count(*) AS calls
FROM cloudtrail_logs
WHERE useridentity.type = 'AssumedRole'
  AND useridentity.sessioncontext.sessionissuer.username = 'deploy-bot'
  AND eventtime >= '2026-06-05T00:00:00Z'
GROUP BY 1, 2
ORDER BY calls DESC;
```
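If you are working from exported CloudTrail records rather than Athena, the same pivot can be sketched in a few lines. The idea is to collect the temporary access key IDs issued by AssumeRole calls from the suspect IP, then follow those keys wherever they were used, including from other IPs that a source-IP filter alone would miss. The field names (`eventName`, `sourceIPAddress`, `responseElements.credentials.accessKeyId`, `userIdentity.accessKeyId`) follow the CloudTrail record format as I understand it; treat them as an assumption and check them against your own logs. The function names are hypothetical.

```python
def sessions_from_ip(records: list[dict], suspect_ip: str) -> set[str]:
    """Access key IDs of temporary credentials issued by AssumeRole calls from suspect_ip."""
    keys = set()
    for r in records:
        if r.get("eventName") != "AssumeRole" or r.get("sourceIPAddress") != suspect_ip:
            continue
        # Failed calls carry no credentials (responseElements may be null).
        creds = (r.get("responseElements") or {}).get("credentials") or {}
        if "accessKeyId" in creds:
            keys.add(creds["accessKeyId"])
    return keys


def activity_for_sessions(records: list[dict], key_ids: set[str]) -> list[dict]:
    """Every event made with one of the given temporary keys, oldest first."""
    hits = [
        r for r in records
        if (r.get("userIdentity") or {}).get("accessKeyId") in key_ids
    ]
    return sorted(hits, key=lambda r: r["eventTime"])
```

Feeding the output of `sessions_from_ip(records, "203.0.113.66")` into `activity_for_sessions` gives an ordered list of what the stolen sessions actually did.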
Once you have identified a compromised EC2 instance, your goal is to capture evidence before touching the host. The cardinal rule: snapshot, then hash, then analyze a *copy*, never the original. Snapshotting an EBS volume is read-only and does not disturb the running instance.
preserve_evidence.sh
```bash
#!/usr/bin/env bash
set -euo pipefail

INSTANCE_ID="i-0abc123compromised"
CASE="IR-2026-0605"

# 1. Isolate the network WITHOUT terminating.
#    Swap to a quarantine SG that denies all egress but keeps
#    the instance running so memory stays intact.
aws ec2 modify-instance-attribute \
  --instance-id "$INSTANCE_ID" \
  --groups sg-0quarantinedenyall

# 2. Find the attached volumes.
VOLS=$(aws ec2 describe-volumes \
  --filters "Name=attachment.instance-id,Values=$INSTANCE_ID" \
  --query 'Volumes[].VolumeId' --output text)

# 3. Read-only snapshot of each volume (does NOT modify the source).
#    $VOLS is left unquoted on purpose so it splits into one ID per loop.
for VOL in $VOLS; do
  aws ec2 create-snapshot \
    --volume-id "$VOL" \
    --description "$CASE forensic image of $VOL" \
    --tag-specifications \
    "ResourceType=snapshot,Tags=[{Key=case,Value=$CASE},{Key=evidence,Value=true}]"
done

# 4. When you later restore a snapshot to a forensic volume,
#    hash it immediately to anchor the chain of custody.
#    sha256sum /dev/xvdf > ${CASE}_xvdf.sha256
echo "Snapshots created. Capture memory via SSM/EDR before any reboot."
```
Do not destroy the evidence
Rebooting, terminating, or re-imaging a compromised host wipes volatile memory (running malware, decrypted secrets, live sockets) that may be the only proof of what happened. Never 'clean up' before capture. The order is: capture memory, snapshot disk read-only, hash the image, record who did each step, *then* eradicate. A breach you cannot reconstruct is a breach you cannot prove you contained.
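The ordering above is easy to state and easy to skip under pressure, so it can help to encode it as a guard that refuses out-of-order steps. This is a sketch of the idea only: the step names and the `EvidenceChecklist` class are hypothetical, and nothing here performs the capture itself; it only records that a named person completed each step in sequence.

```python
# Required order, most volatile evidence first (step names are illustrative).
CAPTURE_ORDER = ["capture_memory", "snapshot_disk", "hash_image", "eradicate"]


class EvidenceChecklist:
    def __init__(self):
        self.done: list[tuple[str, str]] = []  # (step, who)

    def next_step(self):
        """The next required step, or None once everything is complete."""
        if len(self.done) < len(CAPTURE_ORDER):
            return CAPTURE_ORDER[len(self.done)]
        return None

    def complete(self, step: str, who: str) -> None:
        """Record a step, refusing anything that is not the next required one."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"refusing {step!r}: next required step is {expected!r}")
        self.done.append((step, who))
```

Calling `complete("eradicate", ...)` on a fresh checklist raises, which is exactly the reflex the callout warns against.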
Chain of custody and breach notification
Chain of custody is the documented, unbroken record of who handled each piece of evidence, when, and what they did to it. It is what makes evidence trustworthy to your own analysts, to leadership, to regulators, and possibly to a court. Hash every image at capture (and re-verify the hash later to prove it was not altered), log every access, and store originals write-once. If you cannot say who touched the snapshot and when, its evidentiary value collapses.
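Hash-at-capture and re-verify-later can be sketched with nothing but the standard library. The function names and the JSON-lines log layout below are assumptions made for this example, and a local file opened in append mode is only a stand-in: it is not write-once, so a real custody log belongs in storage that enforces immutability.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large disk images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_entry(evidence_path: str, handler: str, action: str, when=None) -> dict:
    """Build one custody record: what, who, what they did, when, and the hash at that moment."""
    when = when or datetime.now(timezone.utc)
    return {
        "evidence": evidence_path,
        "sha256": sha256_of(evidence_path),
        "handler": handler,
        "action": action,
        "at": when.isoformat(),
    }


def append_entry(log_path: str, entry: dict) -> None:
    """Append the record as one JSON line (stand-in for write-once storage)."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry, sort_keys=True) + "\n")


def verify(evidence_path: str, log_path: str) -> bool:
    """True if the file's current hash matches the first hash recorded for it."""
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            entry = json.loads(line)
            if entry["evidence"] == evidence_path:
                return entry["sha256"] == sha256_of(evidence_path)
    raise LookupError(f"no custody entry for {evidence_path}")
```

Comparing against the first recorded hash, rather than the latest, means a later tampered entry cannot quietly redefine what "unaltered" means.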
Breach notification is a legal obligation with hard clocks, and engineers are usually the ones who surface the facts those clocks depend on. You do not have to memorize the law, but you must know the timelines exist so you escalate fast enough.
| Regime | Trigger | Clock |
| --- | --- | --- |
| GDPR | Personal data breach with risk to individuals | Notify supervisory authority within 72 hours of awareness |
| US state laws / SEC | PII exposure; material cyber incident (public companies) | State laws: 'without unreasonable delay'; SEC: Form 8-K within 4 business days of determining the incident is material |
| PCI-DSS | Cardholder data compromise | Notify card brands / acquirer immediately per contract |
| Contractual (B2B) | Customer data affected | Often 24–72 hours per the DPA / contract |
Representative breach notification timelines. Always confirm specifics with legal/compliance, scope and thresholds vary.
Escalate before you are sure
The 72-hour clock often starts at 'awareness', not at 'fully confirmed'. Looping in legal and your incident commander early is never the wrong call: you can stand down a false alarm, but you cannot rewind a missed deadline.
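Putting the deadlines on the incident timeline the moment you become aware makes the clocks hard to forget. The sketch below only does the date arithmetic: the function names are hypothetical, the business-day version skips weekends but knows nothing about public holidays, and which moment starts each clock (awareness, or a materiality determination) is a question for legal, not for this code.

```python
from datetime import datetime, timedelta, timezone


def hours_deadline(start: datetime, hours: int = 72) -> datetime:
    """Deadline a fixed number of hours after the clock starts."""
    return start + timedelta(hours=hours)


def business_days_deadline(start: datetime, days: int = 4) -> datetime:
    """Advance by N weekdays, skipping Saturdays and Sundays (holidays NOT handled)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


aware_at = datetime(2026, 6, 5, 2, 14, tzinfo=timezone.utc)  # a Friday
print("72-hour deadline:      ", hours_deadline(aware_at).isoformat())
print("4-business-day deadline:", business_days_deadline(aware_at).isoformat())
```

With awareness on a Friday at 02:14 UTC, the 72-hour deadline lands on Monday at 02:14 UTC, most of it spent over a weekend, which is one more reason to escalate before you are sure.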
Blameless post-mortems and tabletop exercises
After recovery, run a blameless post-mortem: the same discipline as an SRE outage review, with a security lens. The point is never 'who clicked the link'. Humans will always click links; the system should survive that. Ask instead: why did one phished credential grant this much access? Why did detection take six hours? Why were the logs we needed not retained? Blame hides the systemic gaps; safety surfaces them.
Tabletop exercises are how you find those gaps *before* a real breach. Gather the team, narrate a scenario ('an engineer's laptop is compromised and an AWS access key is now active from Eastern Europe'), and walk through your response out loud. You will discover the runbook is stale, nobody knows who can revoke prod keys at 3am, and your CloudTrail logs roll off after 14 days. Far cheaper to learn that in a conference room than during the real thing.
- Run tabletops at least quarterly, rotating scenarios: ransomware, insider, cloud key leak, supply-chain compromise.
- Include non-engineers (legal, comms, leadership), because notification and messaging are part of the response.
- End every tabletop with concrete action items and owners, then feed them into prepare.
Common mistakes that cost you the case
- Rebooting or terminating the host to 'clean it up' before capturing memory and disk: you delete the evidence you most need.
- Logging into the compromised box with admin tools and poking around, which overwrites timestamps, drops new artifacts, and contaminates the scene.
- Containing too early on a sophisticated actor and tipping them off before you have mapped their other footholds, or containing too late on ransomware.
- Treating it like an SRE outage: rushing to restore service re-exposes the same vulnerability and wipes the trail.
- No centralized, tamper-resistant logs: attackers delete local logs, so if CloudTrail/flow logs are not shipped to a locked, separate account, you are blind.
- Skipping chain of custody: un-hashed, undocumented evidence may be worthless when it matters most.
- Forgetting the notification clock until day four, turning a contained breach into a compliance failure.
- A blameful post-mortem that scapegoats the person who clicked, so nobody reports the next incident and the real gaps stay open.
Takeaways and where to go next
The whole article in eight lines
- A breach is an adversary, not a bug: preserve before you fix.
- An incident is a confirmed adverse event; an event is just observable noise.
- NIST IR is a loop: prepare, detect/analyze, contain, eradicate, recover, lessons.
- Isolate-vs-observe is a deliberate trade-off: decide it explicitly and write down who chose.
- Capture most-volatile first: memory, then read-only disk snapshot, then hash.
- Never reboot/terminate/re-image before capture: you destroy the evidence.
- Chain of custody + hashes make evidence trustworthy; notification clocks start at awareness.
- Blameless post-mortems and quarterly tabletops turn one breach into a stronger next prepare.
Incident response is downstream of preparation. The single highest-leverage investment is making sure you *can* answer 'what happened?' when the time comes, which is a logging and detection problem long before it is a forensics problem.
Practice the mechanics on the Linux lab and networking lab so log-hunting and host triage are reflexes, not first-time fumbles.