
Interactive Explainer

Circuit Breakers, Retries & Timeouts

How Istio implements the three reliability primitives -- and the retry storm that turned a partial failure into a total outage.

Relevant for:Mid-levelSeniorStaff
Why this matters at your level
Mid-level

Configure retries, timeouts, and circuit breakers correctly for a service. Know the perTryTimeout < timeout rule. Know which HTTP methods are safe to retry.

Senior

Debug retry storms. Calculate amplification factors for multi-service topologies. Design retry budgets as a platform policy. Know response_flags in Istio metrics for diagnosing circuit breaker state.

Staff

Define retry and timeout policies as platform defaults in MeshConfig. Design the resilience architecture for a complex service graph. Own the amplification analysis for retry storms across the fleet.
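The mesh-wide default mentioned above can be sketched in MeshConfig. This is a hedged example: recent Istio releases (1.12+) expose a defaultHttpRetryPolicy field that replaces the built-in default retry policy, but verify the field against your version's MeshConfig reference before relying on it.

```yaml
# Hedged sketch: fleet-wide retry defaults via MeshConfig.
# Check that your Istio version supports meshConfig.defaultHttpRetryPolicy.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultHttpRetryPolicy:
      attempts: 1                              # Platform default: at most 1 retry
      perTryTimeout: 2s
      retryOn: connect-failure,refused-stream  # Never retry application 5xx by default
```

Individual VirtualServices can still override this, so the platform default is a floor, not a ceiling.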

~4 min read
Incident timeline: Data Plane Cascade -- Retry Storm -- 1000x Request Amplification

T+0   [BREAKING]  Recommendation service degrades -- 30% of requests at 5s latency
T+2m  [WARNING]   10 upstream services each retrying 3x -- 30x amplification at recommendation
T+3m  [CRITICAL]  Recommendation OOM-killed -- all callers now getting 100% 503s
T+4m  [CRITICAL]  Retry storm spreads to neighbors -- 6 services degraded
T+15m [RESOLVED]  Retries disabled -- recommendation recovers -- storm subsides


The question this raises

How do you configure retries, timeouts, and circuit breakers to improve resilience without creating amplification cascades that turn partial failures into total outages?

Test your assumption first

You configure a VirtualService with timeout: 5s, retries.attempts: 3, retries.perTryTimeout: 5s. Users report requests always fail after exactly 5 seconds with no retries observed. What is wrong?

Lesson outline

The three reliability primitives and how they interact

Retries, timeouts, and circuit breakers must be configured together -- each alone is dangerous

Retries without timeouts: a request that hangs forever is retried 3 times, each hanging forever -- 3x connection exhaustion. Timeouts without retries: transient errors immediately fail the user. Circuit breakers without retries: the circuit trips and users get 503 until the breaker resets. All three together: retries handle transient errors, timeouts bound retry cost, circuit breakers prevent the retry storm.
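A minimal sketch of all three primitives wired together, using a hypothetical reviews service (fully commented versions of each resource for recommendation-svc appear later in this lesson):

```yaml
# Illustrative sketch only -- service name "reviews" is hypothetical.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts: [reviews]
  http:
  - route:
    - destination: {host: reviews}
    timeout: 10s               # Primitive 1: bound the entire operation
    retries:                   # Primitive 2: absorb transient errors
      attempts: 2
      perTryTimeout: 3s
      retryOn: connect-failure,gateway-error,503
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    outlierDetection:          # Primitive 3: stop retrying into a failing endpoint
      consecutive5xxErrors: 5
      baseEjectionTime: 30s
```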


Request arrives at Envoy (source pod)
     |
     v
[1] Timeout check: does the route have a timeout?
    timeout: 10s -> entire request including all retries must complete in 10s
     |
     v
[2] Forward to upstream endpoint
     |
     +-- Success (2xx) -> return to caller
     |
     +-- Failure (503, connect-failure, 5xx):
              |
              v
         [3] Should retry?
             retries.attempts: 3
             retries.retryOn: 5xx,gateway-error,connect-failure
             retries.perTryTimeout: 3s
              |
              +-- Yes: pick a different healthy endpoint, retry
              |        (if all attempts used: return last error)
              |
              +-- Circuit breaker check (outlierDetection):
                       Has this endpoint had 5 consecutive 5xx?
                       Yes -> eject endpoint from pool
                       Envoy picks a different endpoint for retries
                       (passive circuit breaking -- per endpoint, not per service)

Active circuit breaking (connection pool overflow):
  If http1MaxPendingRequests exceeded -> 503 immediately (no retry)
  This is the "fail fast" behavior -- prevents queue buildup

Retry storm prevention

How this concept changes your thinking

Situation: Retries at 10 callers, each with attempts: 3
Before: "Amplification = callers x retries = 10 x 3 = 30x load on the downstream"
After: "Add a retry budget: at most 20% of requests may be retries -- above that, fail fast. Add exponential backoff with jitter to spread retries over time."

Situation: Timeout configuration for retries
Before: "perTryTimeout: 5s, attempts: 3 -- total possible wait: 15s before the request fails to the user"
After: "Set the global timeout to 10s, with perTryTimeout: 3s and attempts: 3. The global timeout caps the total: no retry can start after 10s, regardless of the attempt count."
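Istio's DestinationRule does not expose the retry budget mentioned above; Envoy does, via the cluster's circuit-breaker thresholds. A hedged EnvoyFilter sketch -- the retry_budget field names follow Envoy's CircuitBreakers API, the service name is illustrative, and you should verify the patch against your Envoy and Istio versions:

```yaml
# Hedged sketch: patch Envoy's retry budget onto one cluster.
# Not a first-class Istio API -- validate in a non-production mesh first.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: recommendation-retry-budget
spec:
  configPatches:
  - applyTo: CLUSTER
    match:
      cluster:
        service: recommendation-svc.production.svc.cluster.local
    patch:
      operation: MERGE
      value:
        circuit_breakers:
          thresholds:
          - retry_budget:
              budget_percent: {value: 20.0}  # Retries capped at 20% of active requests
              min_retry_concurrency: 3       # Still allow a few retries at low traffic
```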

Safe retry configuration principles

  • Only retry idempotent operations — GET, HEAD are safe to retry. POST without idempotency key is not -- retrying creates duplicate orders.
  • Set perTryTimeout shorter than global timeout — If perTryTimeout >= timeout, the first attempt already uses the full budget -- retries never fire.
  • retryOn: 5xx is too broad — Set retryOn to specific conditions: connect-failure,gateway-error,503. Application-level 500s should not be retried.
  • Add jitter to retry backoff — Multiple callers retrying simultaneously after the same failure = thundering herd. Jitter spreads retries across time.
  • Limit retry amplification — Use a Linkerd-style retry budget or cap attempts at 2. 3 retries x 10 callers = 30x -- at 100 RPS, that's 3,000 RPS hitting the already-degraded service.
retry-timeout-config.yaml

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: recommendation-svc
spec:
  hosts:
  - recommendation-svc
  http:
  - route:
    - destination:
        host: recommendation-svc
    timeout: 10s          # Global: entire operation must complete in 10s
    retries:
      attempts: 2         # Max 2 retries (not 3!) -- limit amplification
      perTryTimeout: 3s   # Each attempt must succeed in 3s (< timeout / attempts)
      retryOn: connect-failure,gateway-error,503  # Specific conditions only
      # NOT 5xx -- application 500s should not be retried
      # NOT reset -- retrying on connection reset can cause duplicate writes
circuit-breaker-config.yaml

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: recommendation-svc
spec:
  host: recommendation-svc
  trafficPolicy:
    # Active circuit breaking (connection pool overflow -> 503)
    connectionPool:
      http:
        http1MaxPendingRequests: 100  # Queue limit before instant 503
        http2MaxRequests: 500
    # Passive circuit breaking (outlier detection)
    outlierDetection:
      consecutive5xxErrors: 5       # Eject after 5 consecutive 5xx
      consecutiveGatewayErrors: 3   # Eject after 3 gateway errors (faster)
      interval: 10s                 # Check frequency
      baseEjectionTime: 30s         # Minimum ejection duration
      maxEjectionPercent: 50        # Never eject more than half the pool
kubectl

# Check retry configuration applied to a service
istioctl proxy-config routes my-pod.production \
  --name 8080 --output json | grep -A20 retryPolicy

# Check which endpoints are ejected (circuit breaker state)
istioctl proxy-config endpoint my-pod.production \
  --cluster "outbound|8080||recommendation-svc.production.svc.cluster.local"

# Monitor retry and circuit-breaker activity via Prometheus metrics
# istio_requests_total{response_flags="UO"}  = upstream overflow (active circuit breaker)
# istio_requests_total{response_flags="URX"} = upstream retry limit exceeded

# Generate load to test the circuit breaker
kubectl run fortio -it --image=fortio/fortio -- \
  load -c 10 -qps 50 -n 1000 \
  http://recommendation-svc/recommend

What breaks in production

Blast radius of resilience misconfiguration

  • Retry storm — N callers x M retries = NxM amplification -- turns partial degradation into total collapse
  • perTryTimeout >= timeout — Retries never fire -- first attempt uses entire time budget, subsequent attempts impossible
  • Retrying non-idempotent operations — POST payment endpoint retried on 503 -- duplicate charges, double orders
  • Circuit breaker without retry — Breaker trips on transient error -- users get 503 until baseEjectionTime passes (30s default)
  • retryOn: reset on writes — Network reset during a write operation -- retry resends the write -- data corruption or duplicate records

perTryTimeout >= timeout -- retries can never fire

Bug
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
spec:
  http:
  - route:
    - destination:
        host: payment-svc
    timeout: 5s         # Global: 5 seconds total
    retries:
      attempts: 3
      perTryTimeout: 5s  # WRONG: each attempt gets 5s
      # First attempt takes 5s -> timeout fires -> no time for retry
      # attempts: 3 setting is completely ignored
      retryOn: 5xx
Fix
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
spec:
  http:
  - route:
    - destination:
        host: payment-svc
    timeout: 10s        # Global timeout: 10s for everything
    retries:
      attempts: 2        # Up to 2 retries after the initial attempt
      perTryTimeout: 3s  # Each attempt: 3s max
      # Worst case: 3s (initial) + 3s + 3s (2 retries) = 9s
      # Still under the 10s global timeout, leaving a buffer
      # for backoff jitter and network overhead
      retryOn: connect-failure,gateway-error,503

perTryTimeout must leave room for every attempt inside the global timeout. Since attempts counts retries, a request can make up to attempts + 1 tries in total, so a good rule is perTryTimeout <= timeout / (attempts + 1). If the global timeout fires first, Envoy abandons any in-flight retry and returns the error.

Decision guide: retry configuration

Is the operation idempotent (safe to repeat without side effects)?
  Yes -> Retries are safe. Set attempts: 2, perTryTimeout < timeout / 3, retryOn: connect-failure,503.
  No  -> Is the operation a write (POST, PUT, DELETE)?
         Yes -> No retries unless you have idempotency keys. Use the circuit breaker alone
                (outlierDetection) -- fail fast on write failures rather than retry.
         No  -> Evaluate the risk -- some read-like POST endpoints are idempotent.
                Add retries only if the failure mode is confirmed transient.
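One way to encode this decision in a single VirtualService is per-method routing: HTTPMatchRequest can match on the HTTP method, and routes are evaluated in order, so idempotent GETs get a retry policy while writes fall through to a retry-free route. The service name orders-svc is hypothetical:

```yaml
# Sketch: retries for idempotent GETs only; writes get no retries.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders-svc
spec:
  hosts:
  - orders-svc
  http:
  - match:
    - method: {exact: GET}     # Only idempotent reads take this route
    route:
    - destination: {host: orders-svc}
    timeout: 10s
    retries:
      attempts: 2
      perTryTimeout: 3s
      retryOn: connect-failure,503
  - route:                     # POST/PUT/DELETE fall through to here
    - destination: {host: orders-svc}
    timeout: 10s
    retries:
      attempts: 0              # Explicitly disable the mesh's default retries
```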

Resilience patterns compared

Pattern                    | What it does                        | Failure mode prevented                        | Failure mode introduced
Timeout                    | Caps maximum request duration       | Slow upstream hanging all connections         | User sees failure instead of slow response
Retry                      | Repeats failed requests             | Transient errors (pod restart, network blip)  | Amplification cascade if downstream is overloaded
Passive circuit breaker    | Ejects repeatedly failing endpoints | Continuing to send to unhealthy pods          | Healthy endpoints may be overloaded during ejection
Active circuit breaker     | Connection pool overflow -> 503     | Queue buildup from slow upstream              | Legitimate requests fast-fail during overload
Bulkhead (connection pool) | Limits connections per upstream     | One slow service consuming all connections    | Low limits cause 503s under normal load

Exam Answer vs. Production Reality


Retry configuration

📖 What the exam expects

Istio retries are configured via VirtualService with attempts, perTryTimeout, and retryOn fields. Retries improve resilience by hiding transient failures from users.

Toggle between what certifications teach and what production actually requires

How this might come up in interviews

Asked in SRE and reliability-engineering interviews. "How do retries improve resilience?" is the lead-in; "What is the risk of retries?" reveals depth. The retry storm scenario is a favourite in senior interviews.

Common questions:

  • How would you configure retries in Istio to avoid a retry storm?
  • What is the difference between passive and active circuit breaking?
  • Why should you not retry POST requests by default?
  • How does perTryTimeout relate to the global timeout?
  • How do you identify retry storms in Istio metrics?

Strong answer: Immediately mentions retry storm risk, knows amplification = callers x retries, knows retryOn should be specific not 5xx, has debugged circuit breaker behavior via Envoy metrics.

Red flags: Thinks more retries = better reliability, does not know about retry amplification, confuses passive and active circuit breaking.

Related concepts

Explore topics that connect to this one.

  • DestinationRules & Load Balancing
  • Fault Injection & Resilience Testing
  • High availability (HA) & resilience

Suggested next

Often learned after this topic.

Fault Injection & Resilience Testing
