SRE · 16 min read · Jun 2026

Chaos Engineering in Practice

Resilience you have not tested is just a hope. Chaos engineering deliberately injects failure (kill a node, add latency, drop a region) to prove your system survives before your users find out it does not.

SRE · Chaos Engineering · Resilience · Testing

Sri Balaji

Founder · TheSimplifiedTech

Your failover has never actually failed over

You built it right. Two availability zones. A load balancer with health checks. Retries with backoff. A runbook for the database failover. Every box on the resilience checklist is ticked, and the architecture diagram looks bulletproof on the wall.

Then at 02:14 on a Tuesday a single zone goes dark, and you discover the truth: the failover script references an IAM role that was renamed six months ago, the retry storm took down the dependency that was still healthy, and the "redundant" replica was silently three hours behind. None of it was tested under real failure, so none of it worked.

Chaos engineering is the discipline of finding those gaps on purpose, during business hours, with a coffee in hand and a kill switch ready, instead of at 02:14 with a pager screaming. You do not wait for the outage. You schedule it, scope it, and learn from it.

Who this is for

Engineers and SREs who own a production service and have the basics in place (health checks, redundancy, monitoring) but have never deliberately broken anything to see if that redundancy is real. If you can read a dashboard and roll back a deploy, you are ready to run your first experiment.

The principle: you cannot prove resilience by reading code

Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production.
Principles of Chaos Engineering

The key word is experiment. Chaos engineering is not random breakage and it is not "let's see what happens." It is the scientific method applied to distributed systems: you state what you believe, you introduce a controlled disturbance, and you measure whether reality matches your belief. When it does, your confidence is now earned rather than assumed. When it does not, you have found a bug before it found your customers.

| Fire drill | Chaos experiment |
| --- | --- |
| You do not wait for a real fire to learn the exits are locked. | You do not wait for a real outage to learn the failover is broken. |
| You schedule the drill, warn the building, and time the evacuation. | You schedule the experiment, notify on-call, and measure recovery time. |
| A drill that finds a jammed door is a success, not a failure. | An experiment that surfaces a broken retry is a win: you fix it calmly. |
| You start with one floor before testing the whole campus. | You start with one pod before failing an entire region. |

A chaos experiment is a fire drill for your software.

The fire-drill framing matters because it changes the emotional stakes. A fire drill is not a crisis; it is a calm, planned event with a defined start, a defined end, and a clear way to stop early. That is exactly the posture a chaos experiment demands.

The chaos experiment loop

Every good experiment follows the same loop. You define what "healthy" looks like, form a falsifiable hypothesis, inject a small failure, measure against your steady state, and then branch: if the system held, you bank the confidence and widen the blast radius; if it broke, you stop, fix, and re-run. The loop never really ends; it just covers more ground each time.

[Figure: Define steady state (p99, error rate, throughput) → Form hypothesis ("Killing one node holds SLO") → Inject failure (small blast radius) → Measure (compare to steady state). Hypothesis held: confidence gained, then expand scope (bigger blast radius) for the next run. Hypothesis broke: fix the weakness, then re-run.]

The chaos loop: hypothesize, inject a small failure, measure, then either expand scope or fix and retry.

  1. Define steady state

    Pick metrics that describe a healthy system from the user's point of view: p99 latency under 300ms, error rate under 0.1%, checkout throughput steady. This is your baseline, the control you compare every experiment against.

  2. Form a hypothesis

    State it as a falsifiable claim: "If we terminate one of the three API pods, p99 stays under 300ms and error rate stays under 0.1% because the load balancer reroutes within seconds."

  3. Inject failure with a small blast radius

    Kill exactly one pod, not the deployment. The smallest action that can test the hypothesis is the right one.

  4. Measure against steady state

    Watch the same dashboards. Did the metrics stay inside their bounds? Quantify the deviation rather than eyeballing it.

  5. Branch on the result

    Held: confidence gained, widen the blast radius next time. Broke: stop, fix the weakness, and re-run the identical experiment to confirm the fix.
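
The measure-and-branch steps can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed implementation: the `SteadyState` and `Observation` classes and the `evaluate` function are hypothetical names, and the bounds are simply the ones from the example hypothesis above (p99 under 300ms, error rate under 0.1%).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Upper bounds that define 'healthy' (taken from the example hypothesis)."""
    p99_ms_max: float = 300.0
    error_rate_max: float = 0.001  # 0.1%


@dataclass(frozen=True)
class Observation:
    """Metrics measured while the fault was active (hypothetical shape)."""
    p99_ms: float
    error_rate: float


def evaluate(steady: SteadyState, seen: Observation) -> dict:
    """Compare an observation to the steady state and quantify each deviation."""
    # Positive headroom means the metric stayed under its bound.
    headroom = {
        "p99_ms": steady.p99_ms_max - seen.p99_ms,
        "error_rate": steady.error_rate_max - seen.error_rate,
    }
    held = all(margin > 0 for margin in headroom.values())
    next_step = "widen blast radius" if held else "stop, fix, re-run identical experiment"
    return {"held": held, "headroom": headroom, "next_step": next_step}


if __name__ == "__main__":
    result = evaluate(SteadyState(), Observation(p99_ms=240.0, error_rate=0.0004))
    print(result["held"], result["next_step"])
```

Returning the headroom alongside the pass/fail flag is the point of "quantify the deviation": a run that held with 5ms to spare tells you something different from one that held with 200ms to spare.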

Blast radius: the dial that keeps chaos safe

The single concept that separates engineering from recklessness is the blast radius: the maximum amount of harm an experiment can do if your hypothesis is wrong. You always start with the smallest radius that can still falsify the hypothesis, then turn the dial up only after a clean result.

Concretely, the dial goes: one request, then one pod, then one node, then one availability zone, then one region. It also covers traffic: shadow traffic first, then 1% of real users, then 10%, then everyone. Each notch up should be a deliberate decision made after the previous notch held, never a default.
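
One way to make "never a default" concrete is to encode the dial as an ordered ladder and only allow a move of one notch after a clean result. The sketch below assumes that policy; `SCOPE_LADDER`, `TRAFFIC_LADDER`, and `next_notch` are illustrative names, not part of any chaos tool.

```python
from typing import List, Optional

# The ladders from the text, smallest blast radius first.
SCOPE_LADDER = ["request", "pod", "node", "availability-zone", "region"]
TRAFFIC_LADDER = ["shadow", "1%", "10%", "100%"]


def next_notch(ladder: List[str], current: str, last_run_held: bool) -> Optional[str]:
    """Return the notch the next experiment may use, or None at the top of the ladder.

    The dial moves up by one step, and only after the current notch held.
    """
    index = ladder.index(current)  # raises ValueError for an unknown notch
    if not last_run_held:
        return current  # fix the weakness and repeat at the same radius
    if index + 1 == len(ladder):
        return None  # nothing bigger left to test
    return ladder[index + 1]


if __name__ == "__main__":
    print(next_notch(SCOPE_LADDER, "pod", last_run_held=True))
    print(next_notch(SCOPE_LADDER, "pod", last_run_held=False))
```

A failed run returns the same notch on purpose: the identical experiment has to pass before the radius grows.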

| Experiment | What it tests | Blast radius |
| --- | --- | --- |
| Kill an instance / pod | Does the scheduler reschedule and the LB reroute without dropping requests? | Small: one replica; capacity briefly reduced |
| Add network latency | Do timeouts, retries, and circuit breakers behave under a slow dependency? | Small–medium: one service hop or one path |
| Exhaust CPU / memory | Do resource limits, autoscaling, and OOM handling kick in correctly? | Small: one node or pod under pressure |
| Fail a dependency | Does graceful degradation engage when the cache or a downstream API is down? | Medium: every caller of that dependency |
| Drop a dependency's responses | Do callers fall back to defaults instead of cascading the failure? | Medium: fan-out depends on the dependency |
| Availability-zone outage | Does multi-AZ redundancy actually carry full load on the survivors? | Large: one zone's share of capacity (a third with three AZs); user-visible if under-provisioned |
| Region outage | Does cross-region failover and data replication work end to end? | Very large: run in staging or with heavy guardrails first |

Common experiments mapped to what they prove and how much they can hurt.

Read these top to bottom

The table is roughly ordered by increasing blast radius. A healthy program works down it over months: you earn the right to test a region outage by first proving you survive a single pod.

Run your first game day safely

A game day is a scheduled, time-boxed session where a team runs one or more chaos experiments together, watching the system react in real time. It is the on-ramp: low-stakes, collaborative, and explicitly a learning exercise rather than a test anyone can fail. Here is how to run your first one without scaring anyone, including yourself.

  1. Pick one small hypothesis

    Choose a single experiment with a small blast radius: "killing one API pod keeps us inside SLO" is perfect. One hypothesis, one variable.

  2. Run it in staging first

    Prove the tooling, the kill switch, and the dashboards all work where mistakes are free. Only graduate to production once the staging run is boring.

  3. Confirm your steady-state dashboard is live

    You must be able to see p99, error rate, and throughput in real time before you inject anything. If you cannot measure it, you cannot experiment on it.

  4. Write the abort criteria and the kill switch first

    Decide in advance: "if error rate exceeds 1% for 30 seconds, we abort." Make stopping a single command everyone in the room knows.

  5. Announce the window and assign roles

    Notify on-call and stakeholders. Name a coordinator (drives the experiment), an observer (watches metrics), and a scribe (records the timeline).

  6. Inject, observe, and narrate out loud

    Run the experiment and talk through what the dashboards show as it happens. Surprises are the whole point, so note every one.

  7. Stop, then run a blameless retro

    End the experiment, restore normal state, and debrief: what did we expect, what actually happened, what surprised us, what do we fix? Capture follow-up actions with owners.
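
The abort criterion from step 4 ("if error rate exceeds 1% for 30 seconds, we abort") is precise enough to write as code, which removes any debate in the room about whether a spike counts. The function below is a sketch under assumptions: `should_abort` is a hypothetical helper, and the samples are assumed to arrive as time-ordered `(timestamp_seconds, error_rate)` pairs from whatever metrics source you use.

```python
def should_abort(samples, threshold=0.01, sustain_seconds=30.0):
    """Return True once the error rate has stayed above threshold for sustain_seconds.

    samples: iterable of (timestamp_seconds, error_rate) pairs in time order.
    """
    breach_started = None
    for timestamp, error_rate in samples:
        if error_rate > threshold:
            if breach_started is None:
                breach_started = timestamp  # first sample of this breach
            if timestamp - breach_started >= sustain_seconds:
                return True
        else:
            breach_started = None  # a healthy sample resets the clock
    return False


if __name__ == "__main__":
    sustained = [(0, 0.02), (10, 0.03), (20, 0.02), (30, 0.05)]
    blip = [(0, 0.02), (10, 0.005), (20, 0.02), (30, 0.02), (40, 0.02)]
    print(should_abort(sustained), should_abort(blip))
```

Resetting on a single healthy sample is a deliberate choice here: it tolerates short blips but still trips on a sustained breach. A stricter team might abort on the first sample over the threshold instead.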

Never your first time in prod

Your first-ever game day should be in staging. Production game days are for teams that already have working tooling, a proven kill switch, and dashboards they trust. Earn production access by being boring in staging first.

An experiment as code

Real programs codify experiments so they are repeatable, reviewable in pull requests, and runnable in CI. Below is a LitmusChaos experiment that kills a single pod in a Kubernetes deployment: a textbook small-blast-radius start. Note the explicit duration and the single-pod target. The spec itself encodes the safety constraints.

pod-delete-experiment.yaml
yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-pod-delete
  namespace: shop
spec:
  appinfo:
    appns: shop
    applabel: "app=checkout-api"   # target only this deployment
    appkind: deployment
  # kill switch: flip to "stop" to abort immediately
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"           # blast radius in time: 60s only
            - name: PODS_AFFECTED_PERC
              value: "33"           # one of three replicas, never all
            - name: FORCE
              value: "false"        # graceful termination, like a real eviction
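        probe:
          # steady-state check: the probe fails (and aborts the run) when p99 latency breaches the SLO
          - name: checkout-latency-slo
            type: promProbe
            mode: Continuous
            promProbe/inputs:
              # ASSUMED endpoint and query: substitute your own Prometheus URL and metric
              endpoint: "http://prometheus.monitoring.svc:9090"
              query: "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{app='checkout-api'}[1m])) by (le))"
              comparator:
                criteria: "<"
                value: "0.3"        # seconds, i.e. the 300ms bound from the hypothesis
            runProperties:
              probeTimeout: 5
              interval: 5
              stopOnFailure: true   # auto-abort if the SLO breaks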

The same idea in AWS Fault Injection Service looks like a stop condition wired to a CloudWatch alarm: if the alarm fires, AWS halts the experiment for you. That alarm is your kill switch, enforced by the platform rather than by a human watching a dashboard.

aws-fis-stop-condition.sh
bash
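# Create the alarm that becomes your automated kill switch.
# ALB publishes 5xx counts rather than a rate; a true error-rate alarm needs metric math.
# The LoadBalancer dimension value is a placeholder. Tune the threshold to your traffic.
aws cloudwatch put-metric-alarm \
  --alarm-name checkout-error-rate-high \
  --namespace AWS/ApplicationELB --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/checkout/0123456789abcdef \
  --statistic Sum --period 60 --evaluation-periods 1 \
  --threshold 1 --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching

# FIS experiment template: stop the instant the alarm fires.
# The account ID and role name are placeholders; FIS needs a roleArn it can assume.
cat <<'JSON' > fis-template.json
{
  "description": "Stop 20% of checkout instances for five minutes",
  "roleArn": "arn:aws:iam::111122223333:role/fis-experiment-role",
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:eu-west-1:111122223333:alarm:checkout-error-rate-high" }
  ],
  "targets": {
    "checkoutInstances": {
      "resourceType": "aws:ec2:instance",
      "selectionMode": "PERCENT(20)",
      "resourceTags": { "app": "checkout" }
    }
  },
  "actions": {
    "stop-instances": { "actionId": "aws:ec2:stop-instances",
      "parameters": { "startInstancesAfterDuration": "PT5M" },
      "targets": { "Instances": "checkoutInstances" } }
  }
}
JSON
aws fis create-experiment-template --cli-input-json file://fis-template.json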
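
Because the template is plain JSON, a pull-request check can refuse any experiment that ships without a kill switch. The following is a rough sketch of such a lint, not an AWS-provided tool: `lint_fis_template` is a hypothetical function, and it only inspects the `stopConditions` and `targets` fields of a template file like the one above.

```python
from typing import List


def lint_fis_template(template: dict) -> List[str]:
    """Return a list of safety problems; an empty list means the template passes."""
    problems = []

    # Require at least one stop condition backed by a CloudWatch alarm.
    stop_conditions = template.get("stopConditions", [])
    has_alarm = any(
        cond.get("source") == "aws:cloudwatch:alarm" and cond.get("value")
        for cond in stop_conditions
    )
    if not has_alarm:
        problems.append("no CloudWatch alarm stop condition (no automated kill switch)")

    # Flag targets that select every matching resource (or omit the mode entirely).
    for name, target in template.get("targets", {}).items():
        if target.get("selectionMode", "ALL") == "ALL":
            problems.append(f"target '{name}' selects ALL matching resources")

    return problems


if __name__ == "__main__":
    risky = {
        "stopConditions": [{"source": "none"}],
        "targets": {"everything": {"selectionMode": "ALL"}},
    }
    for problem in lint_fis_template(risky):
        print(problem)
```

In CI you would load the file with `json.load` and fail the build when the returned list is non-empty.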

Pick a tool that matches your stack

**Chaos Monkey** (Netflix) randomly kills instances and pioneered the field. **AWS FIS** is the managed, IAM-governed option for AWS with built-in stop conditions. **LitmusChaos** is a CNCF-hosted, Kubernetes-native choice. **Gremlin** is a commercial SaaS with a polished UI and a hardened kill switch. Start with whichever speaks your platform's language.

Common mistakes that turn an experiment into an incident

  1. No hypothesis, just breaking things. Without a steady-state baseline and a falsifiable claim, you cannot tell a pass from a fail. You are not experimenting; you are gambling. Always write the hypothesis down before you inject.
  2. Production with no kill switch. If you cannot stop the experiment in one command, do not start it. Wire an automated stop condition (a CloudWatch alarm, a Litmus probe) and rehearse the manual abort before you touch prod.
  3. Too-big blast radius too soon. Failing an entire region on day one is not bold; it is an outage you scheduled. Start with one pod, prove it, then turn the dial. The radius should grow with your confidence, not your ego.
  4. No steady-state dashboard. If you cannot see p99 and error rate in real time, you are flying blind: you will not notice the breach until users do. Stand up the dashboard before the experiment, not after.
  5. Running it once and calling it solved. A fix is unproven until the identical experiment passes. Re-run after every fix, and schedule recurring game days so resilience does not rot as the system changes.
  6. Surprising on-call. An unannounced experiment that pages a sleeping engineer destroys trust in the whole practice. Announce the window, name the roles, and make it a calm, shared event.
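
Several of these mistakes can be caught mechanically before anything is injected. The sketch below turns the list into a pre-flight gate; `ExperimentPlan` and `preflight` are hypothetical names, and the fields are one illustrative way to record a plan, not a standard schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExperimentPlan:
    """Hypothetical plan record; field names are illustrative."""
    hypothesis: str = ""
    kill_switch_rehearsed: bool = False
    dashboard_live: bool = False
    on_call_notified: bool = False
    blast_radius: str = "pod"
    previous_radius_held: bool = False


def preflight(plan: ExperimentPlan) -> List[str]:
    """Return the reasons this experiment must not start yet (empty list means go)."""
    blockers = []
    if not plan.hypothesis.strip():
        blockers.append("write the hypothesis down before injecting")
    if not plan.kill_switch_rehearsed:
        blockers.append("rehearse the kill switch")
    if not plan.dashboard_live:
        blockers.append("stand up the steady-state dashboard")
    if not plan.on_call_notified:
        blockers.append("announce the window to on-call")
    # Anything bigger than a single pod needs a clean result at the smaller radius.
    if plan.blast_radius != "pod" and not plan.previous_radius_held:
        blockers.append("prove the smaller blast radius first")
    return blockers


if __name__ == "__main__":
    plan = ExperimentPlan(hypothesis="Killing one API pod keeps p99 under 300ms")
    for blocker in preflight(plan):
        print("BLOCKED:", blocker)
```

The fifth mistake, running an experiment once and calling it solved, is about scheduling rather than a single run, so it is left out of this gate.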

Takeaways

The whole article in seven lines

  • Resilience you have not tested is a hope, not a property; chaos engineering turns the hope into evidence.
  • It is the scientific method for systems: define steady state, hypothesize, inject, measure, learn.
  • Blast radius is the safety dial: start at one pod and turn it up only after a clean result.
  • A game day is a fire drill: scheduled, time-boxed, blameless, and stoppable at any moment.
  • Codify experiments (Litmus, AWS FIS) so they are repeatable, reviewable, and CI-runnable.
  • Every experiment needs a kill switch; automated stop conditions beat a human watching a graph.
  • A broken hypothesis is the goal, not the failure: you found the bug calmly, on your schedule.

Where to go next

Chaos engineering only pays off on a system that was designed to fail gracefully in the first place. Pair this with the design-side companions: Reliability & Resilience: Designing for Failure for the patterns you will be testing, and Graceful Degradation & Load Shedding for what "the system held" should actually look like under stress.

  • Get hands-on with the platform you will inject into: the Kubernetes lab lets you kill and reschedule pods exactly like a pod-delete experiment.
  • Practice latency and dependency faults at the mesh layer in the Istio service mesh lab; fault injection is a first-class Istio feature.
  • Put it all in context on the SRE career path, where resilience, observability, and incident response come together.
  • Then schedule a real game day. Pick one small hypothesis, run it in staging, and turn the dial only after it holds.
