
Interactive Explainer

Circuit Breakers, Retries & Timeouts

How Istio implements the three reliability primitives -- and the retry storm that turned a partial failure into a total outage.

Relevant for:Mid-levelSeniorStaff
Why this matters at your level
Mid-level

Configure retries, timeouts, and circuit breakers correctly for a service. Know the perTryTimeout < timeout rule. Know which HTTP methods are safe to retry.

Senior

Debug retry storms. Calculate amplification factors for multi-service topologies. Design retry budgets as a platform policy. Know response_flags in Istio metrics for diagnosing circuit breaker state.

Staff

Define retry and timeout policies as platform defaults in MeshConfig. Design the resilience architecture for a complex service graph. Own the amplification analysis for retry storms across the fleet.
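The mesh-wide default mentioned above can be sketched in MeshConfig. This is a hedged example: recent Istio releases (1.12+) expose a defaultHttpRetryPolicy field that replaces the built-in default retry policy, but verify the field against your version's MeshConfig reference before relying on it.

```yaml
# Hedged sketch: fleet-wide retry defaults via MeshConfig.
# Check that your Istio version supports meshConfig.defaultHttpRetryPolicy.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultHttpRetryPolicy:
      attempts: 1                              # Platform default: at most 1 retry
      perTryTimeout: 2s
      retryOn: connect-failure,refused-stream  # Never retry application 5xx by default
```

Individual VirtualServices can still override this, so the platform default is a floor, not a ceiling.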

~4 min read
Incident timeline: Data Plane Cascade -- Retry Storm -- 1000x Request Amplification

T+0   [BREAKING]  Recommendation service degrades -- 30% of requests at 5s latency
T+2m  [WARNING]   10 upstream services each retrying 3x -- 30x amplification at recommendation
T+3m  [CRITICAL]  Recommendation OOM-killed -- all callers now getting 100% 503s
T+4m  [CRITICAL]  Retry storm spreads to neighbors -- 6 services degraded
T+15m [RESOLVED]  Retries disabled -- recommendation recovers -- storm subsides


The question this raises

How do you configure retries, timeouts, and circuit breakers to improve resilience without creating amplification cascades that turn partial failures into total outages?

Test your assumption first

You configure a VirtualService with timeout: 5s, retries.attempts: 3, retries.perTryTimeout: 5s. Users report requests always fail after exactly 5 seconds with no retries observed. What is wrong?

Lesson outline

The three reliability primitives and how they interact

Retries, timeouts, and circuit breakers must be configured together -- each alone is dangerous

Retries without timeouts: a request that hangs forever is retried 3 times, each hanging forever -- 3x connection exhaustion. Timeouts without retries: transient errors immediately fail the user. Circuit breakers without retries: the circuit trips and users get 503 until the breaker resets. All three together: retries handle transient errors, timeouts bound retry cost, circuit breakers prevent the retry storm.
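A minimal sketch of all three primitives wired together, using a hypothetical reviews service (fully commented versions of each resource for recommendation-svc appear later in this lesson):

```yaml
# Illustrative sketch only -- service name "reviews" is hypothetical.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts: [reviews]
  http:
  - route:
    - destination: {host: reviews}
    timeout: 10s               # Primitive 1: bound the entire operation
    retries:                   # Primitive 2: absorb transient errors
      attempts: 2
      perTryTimeout: 3s
      retryOn: connect-failure,gateway-error,503
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    outlierDetection:          # Primitive 3: stop retrying into a failing endpoint
      consecutive5xxErrors: 5
      baseEjectionTime: 30s
```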


Request arrives at Envoy (source pod)
     |
     v
[1] Timeout check: does the route have a timeout?
    timeout: 10s -> entire request including all retries must complete in 10s
     |
     v
[2] Forward to upstream endpoint
     |
     +-- Success (2xx) -> return to caller
     |
     +-- Failure (503, connect-failure, 5xx):
              |
              v
         [3] Should retry?
             retries.attempts: 3
             retries.retryOn: 5xx,gateway-error,connect-failure
             retries.perTryTimeout: 3s
              |
              +-- Yes: pick a different healthy endpoint, retry
              |        (if all attempts used: return last error)
              |
              +-- Circuit breaker check (outlierDetection):
                       Has this endpoint had 5 consecutive 5xx?
                       Yes -> eject endpoint from pool
                       Envoy picks a different endpoint for retries
                       (passive circuit breaking -- per endpoint, not per service)

Active circuit breaking (connection pool overflow):
  If http1MaxPendingRequests exceeded -> 503 immediately (no retry)
  This is the "fail fast" behavior -- prevents queue buildup

Retry storm prevention

How this concept changes your thinking

Situation: Retries at 10 callers, each with attempts: 3
Before: "Amplification = callers x retries = 10 x 3 = 30x load on the downstream"
After: "Add a retry budget: at most 20% of requests may be retries -- above that, fail fast. Add exponential backoff with jitter to spread retries over time."

Situation: Timeout configuration for retries
Before: "perTryTimeout: 5s, attempts: 3 -- total possible wait: 15s before the request fails to the user"
After: "Set the global timeout to 10s, with perTryTimeout: 3s and attempts: 3. The global timeout caps the total: no retry can start after 10s, regardless of the attempt count."
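Istio's DestinationRule does not expose the retry budget mentioned above; Envoy does, via the cluster's circuit-breaker thresholds. A hedged EnvoyFilter sketch -- the retry_budget field names follow Envoy's CircuitBreakers API, the service name is illustrative, and you should verify the patch against your Envoy and Istio versions:

```yaml
# Hedged sketch: patch Envoy's retry budget onto one cluster.
# Not a first-class Istio API -- validate in a non-production mesh first.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: recommendation-retry-budget
spec:
  configPatches:
  - applyTo: CLUSTER
    match:
      cluster:
        service: recommendation-svc.production.svc.cluster.local
    patch:
      operation: MERGE
      value:
        circuit_breakers:
          thresholds:
          - retry_budget:
              budget_percent: {value: 20.0}  # Retries capped at 20% of active requests
              min_retry_concurrency: 3       # Still allow a few retries at low traffic
```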

Safe retry configuration principles

  • Only retry idempotent operations — GET, HEAD are safe to retry. POST without idempotency key is not -- retrying creates duplicate orders.
  • Set perTryTimeout shorter than global timeout — If perTryTimeout >= timeout, the first attempt already uses the full budget -- retries never fire.
  • retryOn: 5xx is too broad — Set retryOn to specific conditions: connect-failure,gateway-error,503. Application-level 500s should not be retried.
  • Add jitter to retry backoff — Multiple callers retrying simultaneously after the same failure = thundering herd. Jitter spreads retries across time.
  • Limit retry amplification — Use a Linkerd-style retry budget or cap attempts at 2. 3 retries x 10 callers = 30x -- at 100 RPS, that's 3,000 RPS hitting the already-degraded service.
retry-timeout-config.yaml

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: recommendation-svc
spec:
  hosts:
  - recommendation-svc
  http:
  - route:
    - destination:
        host: recommendation-svc
    timeout: 10s          # Global: entire operation must complete in 10s
    retries:
      attempts: 2         # Max 2 retries (not 3!) -- limit amplification
      perTryTimeout: 3s   # Each attempt must succeed in 3s (< timeout / attempts)
      retryOn: connect-failure,gateway-error,503  # Specific conditions only
      # NOT 5xx -- application 500s should not be retried
      # NOT reset -- retrying on connection reset can cause duplicate writes
circuit-breaker-config.yaml

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: recommendation-svc
spec:
  host: recommendation-svc
  trafficPolicy:
    # Active circuit breaking (connection pool overflow -> 503)
    connectionPool:
      http:
        http1MaxPendingRequests: 100  # Queue limit before instant 503
        http2MaxRequests: 500
    # Passive circuit breaking (outlier detection)
    outlierDetection:
      consecutive5xxErrors: 5       # Eject after 5 consecutive 5xx
      consecutiveGatewayErrors: 3   # Eject after 3 gateway errors (faster)
      interval: 10s                 # Check frequency
      baseEjectionTime: 30s         # Minimum ejection duration
      maxEjectionPercent: 50        # Never eject more than half the pool
kubectl

# Check retry configuration applied to a service
istioctl proxy-config routes my-pod.production \
  --name 8080 --output json | grep -A20 retryPolicy

# Check which endpoints are ejected (circuit breaker state)
istioctl proxy-config endpoint my-pod.production \
  --cluster "outbound|8080||recommendation-svc.production.svc.cluster.local"

# Monitor retry and circuit-breaker activity via Prometheus metrics
# istio_requests_total{response_flags="UO"}  = upstream overflow (active circuit breaker)
# istio_requests_total{response_flags="URX"} = upstream retry limit exceeded

# Generate load to test the circuit breaker
kubectl run fortio -it --image=fortio/fortio -- \
  load -c 10 -qps 50 -n 1000 \
  http://recommendation-svc/recommend

What breaks in production

Blast radius of resilience misconfiguration

  • Retry storm — N callers x M retries = NxM amplification -- turns partial degradation into total collapse
  • perTryTimeout >= timeout — Retries never fire -- first attempt uses entire time budget, subsequent attempts impossible
  • Retrying non-idempotent operations — POST payment endpoint retried on 503 -- duplicate charges, double orders
  • Circuit breaker without retry — Breaker trips on transient error -- users get 503 until baseEjectionTime passes (30s default)
  • retryOn: reset on writes — Network reset during a write operation -- retry resends the write -- data corruption or duplicate records

perTryTimeout >= timeout -- retries can never fire

Bug
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
spec:
  http:
  - route:
    - destination:
        host: payment-svc
    timeout: 5s         # Global: 5 seconds total
    retries:
      attempts: 3
      perTryTimeout: 5s  # WRONG: each attempt gets 5s
      # First attempt takes 5s -> timeout fires -> no time for retry
      # attempts: 3 setting is completely ignored
      retryOn: 5xx
Fix
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
spec:
  http:
  - route:
    - destination:
        host: payment-svc
    timeout: 10s        # Global timeout: 10s for everything
    retries:
      attempts: 2        # Up to 2 retries after the initial attempt
      perTryTimeout: 3s  # Each attempt: 3s max
      # Worst case: 3s (initial) + 3s + 3s (2 retries) = 9s
      # Still under the 10s global timeout, leaving a buffer
      # for backoff jitter and network overhead
      retryOn: connect-failure,gateway-error,503

perTryTimeout must leave room for every attempt inside the global timeout. Since attempts counts retries, a request can make up to attempts + 1 tries in total, so a good rule is perTryTimeout <= timeout / (attempts + 1). If the global timeout fires first, Envoy abandons any in-flight retry and returns the error.

Decision guide: retry configuration

Is the operation idempotent (safe to repeat without side effects)?
  Yes -> Retries are safe. Set attempts: 2, perTryTimeout < timeout / 3, retryOn: connect-failure,503.
  No  -> Is the operation a write (POST, PUT, DELETE)?
         Yes -> No retries unless you have idempotency keys. Use the circuit breaker alone
                (outlierDetection) -- fail fast on write failures rather than retry.
         No  -> Evaluate the risk -- some read-like POST endpoints are idempotent.
                Add retries only if the failure mode is confirmed transient.
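One way to encode this decision in a single VirtualService is per-method routing: HTTPMatchRequest can match on the HTTP method, and routes are evaluated in order, so idempotent GETs get a retry policy while writes fall through to a retry-free route. The service name orders-svc is hypothetical:

```yaml
# Sketch: retries for idempotent GETs only; writes get no retries.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders-svc
spec:
  hosts:
  - orders-svc
  http:
  - match:
    - method: {exact: GET}     # Only idempotent reads take this route
    route:
    - destination: {host: orders-svc}
    timeout: 10s
    retries:
      attempts: 2
      perTryTimeout: 3s
      retryOn: connect-failure,503
  - route:                     # POST/PUT/DELETE fall through to here
    - destination: {host: orders-svc}
    timeout: 10s
    retries:
      attempts: 0              # Explicitly disable the mesh's default retries
```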

Resilience patterns compared

Pattern                    | What it does                        | Failure mode prevented                        | Failure mode introduced
Timeout                    | Caps maximum request duration       | Slow upstream hanging all connections         | User sees failure instead of slow response
Retry                      | Repeats failed requests             | Transient errors (pod restart, network blip)  | Amplification cascade if downstream is overloaded
Passive circuit breaker    | Ejects repeatedly failing endpoints | Continuing to send to unhealthy pods          | Healthy endpoints may be overloaded during ejection
Active circuit breaker     | Connection pool overflow -> 503     | Queue buildup from slow upstream              | Legitimate requests fast-fail during overload
Bulkhead (connection pool) | Limits connections per upstream     | One slow service consuming all connections    | Low limits cause 503s under normal load

Exam Answer vs. Production Reality


Retry configuration

📖 What the exam expects

Istio retries are configured via VirtualService with attempts, perTryTimeout, and retryOn fields. Retries improve resilience by hiding transient failures from users.

Toggle between what certifications teach and what production actually requires

How this might come up in interviews

Asked in SRE and reliability-engineering interviews. "How do retries improve resilience?" is the lead-in; "What is the risk of retries?" reveals depth. The retry storm scenario is a favourite in senior interviews.

Common questions:

  • How would you configure retries in Istio to avoid a retry storm?
  • What is the difference between passive and active circuit breaking?
  • Why should you not retry POST requests by default?
  • How does perTryTimeout relate to the global timeout?
  • How do you identify retry storms in Istio metrics?

Strong answer: Immediately mentions retry storm risk, knows amplification = callers x retries, knows retryOn should be specific not 5xx, has debugged circuit breaker behavior via Envoy metrics.

Red flags: Thinks more retries = better reliability, does not know about retry amplification, confuses passive and active circuit breaking.

Related concepts

Explore topics that connect to this one.

  • DestinationRules & Load Balancing
  • Fault Injection & Resilience Testing
  • High availability (HA) & resilience

Suggested next

Often learned after this topic.

Fault Injection & Resilience Testing
