Security Incident Response and Digital Forensics: What to Do When You Are Breached

When the alert is an attacker, not a bug

At 02:14 a GuardDuty finding fires: an IAM role you have never seen calling AssumeRole from an IP in a country you do not operate in. Your first instinct, honed by years of on-call, is to fix it: kill the session, rotate the key, restart the service, go back to bed. That instinct is exactly wrong. This is not a service that broke. This is a person who got in, and the actions you take in the next ten minutes either preserve the truth or destroy it.

Who this is for

Engineers and SREs who own production systems and might be first on the scene of a breach. You do not need to be a forensic examiner or a CISO. You need to know what an incident is, how to contain it without burning the evidence, and what to hand to the people who will reconstruct what happened. If you have read [incident management and on-call](/blog/incident-management-and-oncall), this is its security-shaped sibling.

SRE incident response asks 'how do we restore service?'. Security incident response asks a harder set of questions first: 'is the attacker still here? what did they touch? what did they take? and can I prove it?'. Restoring service too fast can re-expose you, tip off the adversary, or wipe the very logs that answer those questions.

The mental model: you are a first responder at a crime scene

An incident is a confirmed adverse event with impact; an event is anything observable. Most events are noise. The job is to decide which events are incidents, and to not contaminate the scene while you do.

Crime scene	Incident response
Secure the scene; keep people out so they do not trample evidence	Restrict access to affected systems; freeze deploys and routine admin activity
Do not move the body before the photographer arrives	Do not reboot, terminate, or re-image a host before you capture memory and disk
Bag and tag each item with who touched it and when	Snapshot volumes, hash the images, record a chain of custody
Detectives reconstruct the timeline from physical traces	Forensic analysts reconstruct the attack from logs, memory, and disk artifacts
A first responder is not the prosecutor	You contain and preserve; specialists (or future-you) do deep analysis

A breach is a crime scene. The same instincts that make you a good engineer (fix it fast, clean it up) make you a bad first responder. Slow down and preserve.

Hold this frame for the whole article. Every controversial call (isolate or watch? reboot or snapshot? notify or wait?) gets easier when you ask 'what would a careful first responder do?'.

The IR lifecycle (NIST 800-61) is a loop, not a line

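NIST SP 800-61 splits incident handling into phases. Revision 2 names four (preparation; detection and analysis; containment, eradication, and recovery; post-incident activity); the six steps below split the third into its parts. The trap is reading them as a one-way checklist. In a real breach you bounce between detection/analysis and containment/eradication repeatedly as you discover more, and every incident feeds preparation for the next one. Draw it as a loop.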

The NIST incident response lifecycle as a closed loop. Detection and analysis interleave with containment and eradication; lessons learned flow back into preparation.

1. Prepare
Before anything happens: centralized logs you cannot tamper with, a documented IR plan, contact lists, pre-staged forensic tooling, and rehearsed tabletops. The work you skip here is the work you regret at 02:14.
2. Detect & Analyze
Triage the event. Is it real? What is the scope and blast radius? Establish a timeline and an initial set of indicators of compromise. Declare an incident and assign an incident commander.
3. Contain
Stop the spread. Short-term containment buys time (isolate a host); long-term containment is a sustainable holding pattern while you prepare to eradicate. This is where the isolate-vs-observe trade-off lives.
4. Eradicate
Remove the attacker's access entirely: revoke credentials, kill persistence (cron jobs, new IAM users, backdoored AMIs), patch the entry vector. Half-eradication means they come back.
5. Recover
Rebuild from known-good, restore service, and watch closely: keep heightened monitoring on the affected systems for days, because attackers often return to test whether you actually closed the door.
6. Lessons Learned
Within a week or two, run a blameless post-mortem. What let them in, what slowed detection, what was missing in Prepare? Feed every gap back into the loop.

Containment is a trade-off: isolate now vs watch the attacker

The most contested decision in a breach is *when* to pull the plug. Isolate immediately and you stop damage, but you tip off the attacker, who may have other footholds you have not found, and they may detonate (wipe, ransom, exfil-dump) on the way out. Observe quietly and you map the full intrusion, but every minute the attacker is live is more risk. There is no universally right answer; there is a right answer for *this* incident.

Approach	Best when	Risk
Isolate immediately (cut network, disable creds)	Active destruction, ransomware, or sensitive data being exfiltrated right now	Tips off the attacker; you may miss other footholds; possible loss of volatile evidence if done carelessly
Observe & monitor (let them run, watch closely)	Sophisticated actor, you have strong monitoring, and immediate damage is low	Attacker continues to act; legal/ethical exposure grows; requires real isolation capability if it turns
Segment / sandbox (move to a controlled, watched network)	You want intel but cannot accept live exposure	Operationally hard to do quickly; attacker may detect the move
Power off (last resort)	Imminent catastrophic damage and nothing else will stop it	Destroys memory-resident evidence (running malware, keys, network state)

Containment options and their trade-offs. The right choice depends on data sensitivity, attacker sophistication, and your ability to monitor safely.

Make it a decision, not an accident

Decide isolate-vs-observe explicitly, name who made the call, and write it in the incident timeline. The default reflex (kill it now) is often right for ransomware and often wrong for a slow, credentialed intruder. Do not let the choice happen by reflex.
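One way to make the decision explicit is to refuse to record it unless it names one of the four approaches in the table above, plus a decider and a rationale. This is a minimal sketch, not a standard tool: the function name `record_containment_decision`, the tab-separated timeline file, and the four keywords are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# record_containment_decision TIMELINE CHOICE DECIDER RATIONALE
# Appends one tab-separated line to the incident timeline:
#   UTC timestamp, the literal tag CONTAINMENT, choice, decider, rationale.
# CHOICE must be one of the four approaches from the containment table
# (the keywords themselves are hypothetical shorthand).
record_containment_decision() {
  local timeline="$1" choice="$2" decider="$3" rationale="$4"
  case "$choice" in
    isolate|observe|segment|poweroff) ;;
    *) echo "unknown containment choice: $choice" >&2; return 1 ;;
  esac
  printf '%s\tCONTAINMENT\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$choice" "$decider" "$rationale" \
    >> "$timeline"
}
```

Called as `record_containment_decision timeline.tsv observe "alice (IC)" "credentialed intruder, no active destruction"`, it leaves a timestamped line that answers 'who decided, and why' when the post-mortem asks.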

Forensic evidence: what each source actually reveals

Reconstructing an attack is correlation across sources, because no single log tells the whole story. CloudTrail shows you the API calls but not the packets; flow logs show the connections but not the payload; EDR shows process behavior on the host; memory shows what disk never will. Know what each gives you before you need it.

Source	Reveals	Volatility
CloudTrail / audit logs	Who called which API, when, and from where: AssumeRole, key creation, S3 access, IAM changes	Durable (if shipped to a locked bucket)
VPC flow logs	Network connections (source/dest IP, port, bytes), which expose beaconing, exfil volume, and lateral movement	Durable
EDR / endpoint telemetry	Process trees, file writes, command lines, parent-child spawns on the host	Medium (buffers can roll over)
Memory capture (RAM)	Running malware, decrypted keys, injected code, live network sockets, processes with no disk file	Highly volatile (gone on reboot/terminate)
Disk image / volume snapshot	Files, persistence mechanisms, logs, deleted-but-recoverable artifacts	Durable once snapshotted read-only

Forensic sources and what each reveals. Volatility matters: capture the most ephemeral evidence first.

Indicators of compromise (IOCs) are the breadcrumbs you pivot on across all of these: an attacker IP, a malicious file hash, a rogue IAM principal, a suspicious domain, a user-agent string. Find one IOC in CloudTrail, then go hunt it in flow logs and EDR to expand the picture.
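A minimal sketch of that pivot for flow logs, assuming you have already downloaded the records as plain text in the default (version 2) flow log format. The function name `flow_summary` and the sample addresses are illustrative, not part of any AWS tooling, and a custom log format would need different column numbers.

```shell
#!/usr/bin/env bash
set -euo pipefail

# flow_summary IOC_IP FILE
# For every flow record that has IOC_IP as source or destination, sum the
# bytes per (srcaddr, dstaddr, dstport, action) and print the largest first.
# Assumes the default space-separated VPC flow log fields:
#   1 version  2 account-id  3 interface-id  4 srcaddr  5 dstaddr
#   6 srcport  7 dstport     8 protocol      9 packets  10 bytes
#   11 start   12 end        13 action       14 log-status
flow_summary() {
  local ioc="$1" file="$2"
  awk -v ioc="$ioc" '
    $4 == ioc || $5 == ioc {
      key = $4 " " $5 " " $7 " " $13
      bytes[key] += $10
    }
    END {
      for (k in bytes) print bytes[k], k
    }
  ' "$file" | sort -rn
}
```

A large byte count toward the IOC on one port suggests exfiltration volume worth chasing; REJECT rows show what the attacker tried and failed to reach.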

Hunting an IOC and preserving evidence (hands on)

Suppose your IOC is a suspicious external IP that performed an AssumeRole. Start in CloudTrail (queried via Athena) to find every API call from that source and which roles it touched.

athena_assumerole_hunt.sql

sql

-- Find all AssumeRole and credentialed calls from a suspect IP
-- (CloudTrail logs queried via Athena over the partitioned S3 table)
SELECT
  eventtime,
  useridentity.arn       AS principal,
  eventname,
  sourceipaddress,
  requestparameters,
  errorcode
FROM cloudtrail_logs
WHERE sourceipaddress = '203.0.113.66'
  AND eventtime >= '2026-06-05T00:00:00Z'
ORDER BY eventtime ASC;

-- Then pivot: which roles were assumed, and what did the
-- resulting temporary sessions do afterward?
SELECT
  useridentity.arn AS assumed_role,
  eventname,
  count(*)         AS calls
FROM cloudtrail_logs
WHERE useridentity.type = 'AssumedRole'
  AND useridentity.sessioncontext.sessionissuer.username = 'deploy-bot'
  AND eventtime >= '2026-06-05T00:00:00Z'
GROUP BY 1, 2
ORDER BY calls DESC;

Once you have identified a compromised EC2 instance, your goal is to capture evidence before you change anything on the host. For disk evidence the cardinal rule is: snapshot, then hash, then analyze a *copy*, never the original. Creating an EBS snapshot does not write to the volume and does not require stopping the instance.

preserve_evidence.sh

bash

#!/usr/bin/env bash
set -euo pipefail

INSTANCE_ID="i-0abc123compromised"
CASE="IR-2026-0605"

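# 1. Tag and isolate the network WITHOUT terminating.
#    Swap to a quarantine SG with no rules (security groups have no
#    explicit deny; an empty SG blocks all new traffic) so the instance
#    keeps running and memory stays intact. Caveats: connections that
#    are already tracked may survive the swap, and a fully closed SG
#    also cuts off SSM, so capture memory first or leave that path open.
aws ec2 create-tags \
  --resources "$INSTANCE_ID" \
  --tags "Key=case,Value=$CASE" "Key=status,Value=quarantined"
aws ec2 modify-instance-attribute \
  --instance-id "$INSTANCE_ID" \
  --groups sg-0quarantinedenyall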
# 2. Find the attached volumes.
VOLS=$(aws ec2 describe-volumes \
  --filters "Name=attachment.instance-id,Values=$INSTANCE_ID" \
  --query 'Volumes[].VolumeId' --output text)

# 3. Read-only snapshot of each volume (does NOT modify the source).
for VOL in $VOLS; do
  aws ec2 create-snapshot \
    --volume-id "$VOL" \
    --description "$CASE forensic image of $VOL" \
    --tag-specifications \
      "ResourceType=snapshot,Tags=[{Key=case,Value=$CASE},{Key=evidence,Value=true}]"
done

# 4. When you later restore a snapshot to a forensic volume,
#    hash it immediately to anchor the chain of custody.
#    (The device name is an example; Nitro instances expose NVMe names.)
#    sha256sum /dev/xvdf > "${CASE}_xvdf.sha256"
echo "Snapshots started for $CASE. Capture memory via SSM/EDR before any reboot."

Do not destroy the evidence

Rebooting, terminating, or re-imaging a compromised host wipes volatile memory (running malware, decrypted secrets, live sockets) that may be the only proof of what happened. Never 'clean up' before capture. The order is: capture memory, snapshot disk read-only, hash the image, record who did each step, *then* eradicate. A breach you cannot reconstruct is a breach you cannot prove you contained.

Chain of custody and breach notification

Chain of custody is the documented, unbroken record of who handled each piece of evidence, when, and what they did to it. It is what makes evidence trustworthy to your own analysts, to leadership, to regulators, and possibly to a court. Hash every image at capture (and re-verify the hash later to prove it was not altered), log every access, and store originals write-once. If you cannot say who touched the snapshot and when, its evidentiary value collapses.
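The hash-at-capture and re-verify-later steps can be sketched with two small helpers. This is an illustration under stated assumptions, not a forensic standard: the names `custody_record` and `custody_verify`, the tab-separated log layout, and the use of a local file as the log are all hypothetical, and a real log belongs on write-once storage.

```shell
#!/usr/bin/env bash
set -euo pipefail

# custody_record CASE EVIDENCE HANDLER ACTION LOG
# Appends one tab-separated line to LOG:
#   UTC timestamp, case id, handler, action, SHA-256 of EVIDENCE, path.
custody_record() {
  local case_id="$1" evidence="$2" handler="$3" action="$4" log="$5"
  local digest
  digest=$(sha256sum "$evidence" | awk '{print $1}')
  printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$case_id" "$handler" "$action" \
    "$digest" "$evidence" >> "$log"
}

# custody_verify EVIDENCE LOG
# Succeeds only if the current SHA-256 of EVIDENCE equals the first hash
# recorded for that path, i.e. the hash taken at capture.
custody_verify() {
  local evidence="$1" log="$2"
  local recorded current
  recorded=$(awk -F'\t' -v p="$evidence" '$6 == p { print $5; exit }' "$log")
  current=$(sha256sum "$evidence" | awk '{print $1}')
  [ -n "$recorded" ] && [ "$recorded" = "$current" ]
}
```

Record once at capture (`custody_record IR-2026-0605 image.raw alice captured custody.tsv`), then run `custody_verify image.raw custody.tsv` before each analysis session; a non-zero exit means the image no longer matches what was captured.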

Breach notification is a legal obligation with hard clocks, and engineers are usually the ones who surface the facts those clocks depend on. You do not have to memorize the law, but you must know the timelines exist so you escalate fast enough.

Regime	Trigger	Clock
GDPR	Personal data breach with risk to individuals	Notify supervisory authority within 72 hours of awareness
US state laws / SEC	PII exposure; material cyber incident (public companies)	State laws: 'without unreasonable delay'; SEC: Form 8-K within 4 business days of determining the incident is material
PCI-DSS	Cardholder data compromise	Notify card brands / acquirer immediately per contract
Contractual (B2B)	Customer data affected	Often 24–72 hours per the DPA / contract

Representative breach notification timelines. Always confirm specifics with legal/compliance; scope and thresholds vary.

Escalate before you are sure

The 72-hour clock often starts at 'awareness', not at 'fully confirmed'. Looping in legal and your incident commander early is never the wrong call: you can stand down a false alarm, but you cannot rewind a missed deadline.
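To keep the clock visible in the incident channel, it helps to compute the deadline the moment awareness is logged. A minimal sketch, assuming GNU `date` (the `-d` flag behaves differently on BSD/macOS); the function name `notify_deadline` is hypothetical, and the result is a reminder, not legal advice, since which clock applies is for legal to decide.

```shell
#!/usr/bin/env bash
set -euo pipefail

# notify_deadline AWARENESS_UTC [HOURS]
# Prints the UTC deadline HOURS after the awareness timestamp.
# HOURS defaults to 72 (the GDPR window in the table above).
# Requires GNU date for the -d option.
notify_deadline() {
  local aware="$1" hours="${2:-72}"
  local epoch
  epoch=$(date -u -d "$aware" +%s)
  date -u -d "@$((epoch + hours * 3600))" +%Y-%m-%dT%H:%M:%SZ
}
```

For the 02:14 alert in this article, `notify_deadline 2026-06-05T02:14:00Z` prints `2026-06-08T02:14:00Z`; pass a second argument such as `24` for a tighter contractual window.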

Blameless post-mortems and tabletop exercises

After recovery, run a blameless post-mortem: the same discipline as an SRE outage review, with a security lens. The point is never 'who clicked the link'. Humans will always click links; the system should survive that. Ask instead: why did one phished credential grant this much access? Why did detection take six hours? Why were the logs we needed not retained? Blame hides the systemic gaps; safety surfaces them.

Tabletop exercises are how you find those gaps *before* a real breach. Gather the team, narrate a scenario ('an engineer's laptop is compromised and an AWS access key is now active from Eastern Europe'), and walk through your response out loud. You will discover the runbook is stale, nobody knows who can revoke prod keys at 3am, and your CloudTrail logs roll off after 14 days. Far cheaper to learn that in a conference room than during the real thing.

Run tabletops at least quarterly, rotating scenarios: ransomware, insider, cloud key leak, supply-chain compromise.
Include non-engineers (legal, comms, leadership), because notification and messaging are part of the response.
End every tabletop with concrete action items and owners, then feed them into prepare.

Common mistakes that cost you the case

Rebooting or terminating the host to 'clean it up' before capturing memory and disk: you delete the evidence you most need.
Logging into the compromised box with admin tools and poking around, which overwrites timestamps, drops new artifacts, and contaminates the scene.
Containing too early on a sophisticated actor and tipping them off before you have mapped their other footholds, or containing too late on ransomware.
Treating it like an SRE outage: rushing to restore service re-exposes the same vulnerability and wipes the trail.
No centralized, tamper-resistant logs: attackers delete local logs, so if CloudTrail/flow logs are not shipped to a locked, separate account, you are blind.
Skipping chain of custody: un-hashed, undocumented evidence may be worthless when it matters most.
Forgetting the notification clock until day four, turning a contained breach into a compliance failure.
A blameful post-mortem that scapegoats the person who clicked, so nobody reports the next incident and the real gaps stay open.

Takeaways and where to go next

The whole article in nine lines

A breach is an adversary, not a bug: preserve before you fix.
An incident is a confirmed adverse event; an event is anything observable, and most events are noise.
NIST IR is a loop: prepare, detect/analyze, contain, eradicate, recover, lessons.
Isolate-vs-observe is a deliberate trade-off: decide it explicitly and write down who chose.
Correlate across sources: CloudTrail (API), flow logs (network), EDR (host), memory (volatile).
Capture most-volatile first: memory, then read-only disk snapshot, then hash.
Never reboot/terminate/re-image before capture: you destroy the evidence.
Chain of custody + hashes make evidence trustworthy; notification clocks start at awareness.
Blameless post-mortems and quarterly tabletops turn one breach into a stronger next prepare.

Incident response is downstream of preparation. The single highest-leverage investment is making sure you *can* answer 'what happened?' when the time comes, which is a logging and detection problem long before it is a forensics problem.

Build the visibility that makes forensics possible: security logging and monitoring.
Reduce the surface attackers exploit in the first place: threat modeling.
Borrow the operational muscle (commander, comms, timelines) from incident management and on-call.
Practice the mechanics on the Linux lab and networking lab so log-hunting and host triage are reflexes, not first-time fumbles.
