How to safely test a new service version under real production load without affecting users -- and how mirroring caught a 10x latency regression.
- Set up traffic mirroring as part of the pre-canary validation process.
- Know the write isolation requirement.
- Build Grafana dashboards to compare primary vs shadow latency side by side.
- Define the mirror -> canary -> promote pipeline as the deployment standard.
- Automate shadow database provisioning.
- Know response diffing tools (Diffy) for behavioral correctness testing.
- Traffic mirroring configured: 10% of search queries mirrored to the Rust service
- Rust service P99 latency: 1.2s (expected < 100ms)
- WARNING: Root cause: N+1 database queries on the cache-miss path -- not covered by benchmarks
- CRITICAL: N+1 query fixed -- mirror latency drops to 80ms
- Rust service promoted to 100% -- users never saw the slow version
The question this raises
How does traffic mirroring work, what does the mirrored service actually receive, and what is the difference between mirroring and shadow testing at the infrastructure level?
You have a new service version ready. Benchmarks show it is 3x faster. Before running a canary, what step would give you the highest confidence the new version performs correctly under real production load with zero user risk?
Lesson outline
Mirrored traffic is fire-and-forget -- responses are discarded
When Envoy mirrors a request, it sends a copy to the shadow service. The response from the shadow service is completely ignored -- the user always receives the response from the primary service. This makes mirroring completely safe for end users: the shadow service can be slow, return errors, or crash without any impact on production traffic.
```
Request arrives at Envoy (source pod)
  |
  +---> Primary: search-svc-v1 (user response)
  |        |
  |        v
  |     Response returned to user   <-- user always gets THIS response
  |
  +---> Mirror: search-svc-v2 (async, fire-and-forget)
           |
           |  Request sent asynchronously
           v
        search-svc-v2 processes the request
           |
           v
        Response DISCARDED by Envoy
        (user never sees this response)
        But: Prometheus metrics still recorded!
        And: access logs still generated!
```
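The control flow in the diagram can be simulated in a few lines. This is a minimal sketch of the fire-and-forget pattern, not Envoy's implementation: the service names and latencies are stand-ins, and the `-shadow` Host rewrite mimics what Envoy does to mirrored requests.

```python
import asyncio

async def call_primary(request: dict) -> dict:
    # Stand-in for the real upstream call to search-svc-v1.
    await asyncio.sleep(0.01)
    return {"status": 200, "body": f"v1 result for {request['query']}"}

async def call_shadow(request: dict) -> None:
    # Stand-in for search-svc-v2. Even if this is slow or raises,
    # the user-facing path below never notices.
    await asyncio.sleep(0.05)  # pretend it is slow relative to the primary
    # A response would be produced here -- and then thrown away.

async def handle(request: dict) -> dict:
    # Copy the request and rewrite the Host header so the shadow
    # can recognize mirrored traffic (Envoy appends "-shadow").
    mirrored = dict(request)
    mirrored["host"] = request["host"] + "-shadow"

    # Fire-and-forget: schedule the mirror, do NOT await it.
    # No back-pressure -- a slow shadow cannot delay the user.
    asyncio.ensure_future(call_shadow(mirrored))

    # The user's response comes only from the primary.
    return await call_primary(request)

async def main() -> dict:
    resp = await handle({"host": "search-svc.production.svc.cluster.local",
                         "query": "rust"})
    await asyncio.sleep(0.1)  # only so the demo's shadow task finishes before exit
    return resp

resp = asyncio.run(main())
print(resp["body"])
```

Note that `handle` returns as soon as the primary responds; the shadow call is still in flight, which is exactly why the shadow can be 10x slower without users noticing.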
What the shadow service sees:
- Real production request headers and body
- Modified Host header: search-svc.production.svc.cluster.local-shadow
(allows the shadow service to identify mirrored traffic)
- All original query parameters and payload
What the shadow service does NOT see:
- The user -- responses are never returned
- Back-pressure -- if shadow is slow, Envoy does not wait
Setting up a safe traffic mirror
1. Deploy the shadow service -- ensure it has its own database/state (must not share write state with primary)
2. Configure VirtualService with `mirror` and `mirrorPercentage`
3. Verify shadow service receives traffic: check its Prometheus metrics or access logs
4. Compare P50/P99 latency between primary and shadow in Grafana
5. Check error rates: shadow service should have similar 2xx rate if implementation is correct
6. Look for behavioral differences: same input, different output = regression
7. Remove mirroring once shadow service is validated -- no need to run mirrors indefinitely
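The behavioral check in the checklist (same input, different output = regression) is what tools like Diffy automate. A minimal sketch of the idea, with hypothetical field names and a hand-rolled `diff_responses` helper -- real diffing tools also classify noisy fields statistically:

```python
def diff_responses(primary: dict, shadow: dict,
                   ignore=frozenset({"timestamp", "request_id"})) -> list:
    """Return field-level differences, skipping known-volatile keys
    (timestamps, request IDs) that legitimately differ per call."""
    diffs = []
    for key in sorted(set(primary) | set(shadow)):
        if key in ignore:
            continue
        p, s = primary.get(key), shadow.get(key)
        if p != s:
            diffs.append((key, p, s))
    return diffs

# The same query replayed against v1 (primary) and v2 (shadow):
v1 = {"results": ["a", "b"], "total": 2, "timestamp": 1700000000}
v2 = {"results": ["a", "b"], "total": 3, "timestamp": 1700000042}

diffs = diff_responses(v1, v2)
print(diffs)  # [('total', 2, 3)] -> a behavioral regression worth investigating
```

An empty diff list over many mirrored requests is strong evidence the rewrite is behaviorally equivalent; a persistent diff on one field is a regression the shadow caught for free.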
Shadow services MUST use separate write backends
If you mirror write traffic (POST, PUT) to a shadow service that shares the production database, you will get duplicate writes. Always provision a separate database (or use read replicas) for shadow services. Mirror read traffic (GET) first -- much safer. Only mirror writes when you have explicitly verified isolation.
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: search-svc
  namespace: production
spec:
  hosts:
  - search-svc
  http:
  - route:
    - destination:
        host: search-svc      # primary: user response comes from here
        subset: v1
      weight: 100             # 100% of user responses from v1
    mirror:
      host: search-svc        # shadow target
      subset: v2              # the new version being tested
    mirrorPercentage:
      value: 10.0             # mirror 10% of requests to v2
                              # (responses always come from v1)
```
```bash
# Verify mirroring is active
kubectl get virtualservice search-svc -n production \
  -o jsonpath='{.spec.http[0].mirror}'

# Check shadow service is receiving traffic
kubectl top pod -l version=v2 -n production
# CPU/memory should show activity if mirroring is working

# Compare latency: primary vs shadow via Prometheus
# Primary: histogram_quantile(0.99, rate(istio_request_duration_milliseconds_bucket{destination_workload="search-svc-v1"}[5m]))
# Shadow:  histogram_quantile(0.99, rate(istio_request_duration_milliseconds_bucket{destination_workload="search-svc-v2"}[5m]))

# Check that the shadow Host header indicates mirrored traffic
# Shadow service logs will show: Host: search-svc.production.svc.cluster.local-shadow

# Remove mirroring after validation
kubectl patch virtualservice search-svc -n production \
  --type='json' \
  -p='[{"op":"remove","path":"/spec/http/0/mirror"},{"op":"remove","path":"/spec/http/0/mirrorPercentage"}]'
```
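Once the two PromQL queries above are wired into a dashboard or script, the pass/fail decision can be a simple gate. A sketch with illustrative thresholds (the 1.5x ratio, the 100ms budget, and the 120ms primary P99 are all assumptions, not values from the incident):

```python
def latency_gate(primary_p99_ms: float, shadow_p99_ms: float,
                 max_ratio: float = 1.5, max_abs_ms: float = 100.0):
    """Decide whether the shadow's P99 is acceptable relative to the primary.

    Passes if the shadow stays within `max_ratio` of the primary OR
    under an absolute budget; both thresholds are illustrative and
    should be tuned per service.
    """
    ratio = shadow_p99_ms / primary_p99_ms
    ok = ratio <= max_ratio or shadow_p99_ms <= max_abs_ms
    return ok, ratio

# Values shaped like the incident: assume a ~120ms primary P99,
# and the shadow's observed 1.2s P99.
ok, ratio = latency_gate(120.0, 1200.0)
print(ok, round(ratio, 1))  # False 10.0 -- the 10x regression the mirror caught
```

Running this gate continuously against mirrored traffic is what turns "benchmarks said 3x faster" into "production traffic says 10x slower" before any user is exposed.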
Blast radius of mirroring misconfiguration
Wrong: mirroring writes to a shadow that shares the production database

```yaml
# Shadow service deployment -- shares production DB
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-svc-v2
spec:
  template:
    spec:
      containers:
      - name: payment-svc
        env:
        - name: DATABASE_URL
          value: "postgres://prod-db:5432/payments"  # SHARED with v1!
# VirtualService mirrors POST /payments to v2
# Result: each production payment generates TWO database writes
# Duplicate orders, double charges, data corruption
```

Right: shadow service with a separate, isolated database

```yaml
# Shadow service deployment -- separate shadow database
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-svc-v2
spec:
  template:
    spec:
      containers:
      - name: payment-svc
        env:
        - name: DATABASE_URL
          value: "postgres://shadow-db:5432/payments"  # ISOLATED shadow DB
# Mirror only GET endpoints first:
# - Mirror reads to validate correctness and latency
# - Only mirror writes after explicitly verifying isolation
# - Or: instrument v2 to discard writes when the Host header has the '-shadow' suffix
```

The shadow service must have a completely separate write backend. The safest approach is to mirror only read operations (GET requests) first. If you must mirror writes, instrument the shadow service to check the Host header for the `-shadow` suffix and skip database writes -- treat mirrored requests as read-only validation exercises.
| Strategy | User impact | Validates | When to use | Rollback |
|---|---|---|---|---|
| Traffic mirroring | Zero -- responses discarded | Performance, errors, behavior | Pre-canary validation, rewrite testing | Instant -- remove mirror config |
| Canary (1-10%) | Small % of users see new version | User impact + performance | Incremental rollout after mirror validation | Seconds -- set weight to 0 |
| Blue-green | Instant cutover (or instant rollback) | Full prod validation | Major version changes, requires 2x resources | Seconds -- switch VS route |
| A/B testing | Specific user segment | UX, conversion, behavior | Feature experiments, user research | Remove header match rule |
Traffic mirroring
📖 What the exam expects
Traffic mirroring (or shadowing) sends a copy of production traffic to a new service version. The shadow service's response is discarded -- users always receive the primary service's response. This allows testing under real production load with zero user impact.
Asked in senior SRE and platform interviews as a follow-on to canary questions. "Is there a safer way to test than a canary?" leads here.
Common questions:
Strong answer: Mentions N+1 / cache miss scenarios that benchmarks miss, knows the write isolation requirement, uses mirror before canary as standard practice.
Red flags: Thinks mirror responses affect users, does not know about write isolation, has not thought about benchmark vs production traffic differences.