How to safely test a new service version under real production load without affecting users -- and how mirroring caught a 10x latency regression.
- Set up traffic mirroring as part of the pre-canary validation process.
- Know the write isolation requirement.
- Build Grafana dashboards to compare primary vs shadow latency side by side.
- Define the mirror -> canary -> promote pipeline as the deployment standard.
- Automate shadow database provisioning.
- Know response diffing tools (Diffy) for behavioral correctness testing.
- Traffic mirroring configured: 10% of search queries mirrored to the Rust service
- Rust service P99 latency: 1.2s (expected < 100ms)
- WARNING: Root cause: N+1 database queries on the cache-miss path -- not covered by benchmarks
- CRITICAL: N+1 query fixed -- mirror latency drops to 80ms
- Rust service promoted to 100% -- users never saw the slow version
The question this raises
How does traffic mirroring work, what does the mirrored service actually receive, and what is the difference between mirroring and shadow testing at the infrastructure level?
You have a new service version ready. Benchmarks show it is 3x faster. Before running a canary, what step would give you the highest confidence the new version performs correctly under real production load with zero user risk?
Lesson outline
Mirrored traffic is fire-and-forget -- responses are discarded
When Envoy mirrors a request, it sends a copy to the shadow service. The response from the shadow service is completely ignored -- the user always receives the response from the primary service. This makes mirroring completely safe for end users: the shadow service can be slow, return errors, or crash without any impact on production traffic.
```
Request arrives at Envoy (source pod)
  |
  +---> Primary: search-svc-v1 (user response)
  |        |
  |        v
  |     Response returned to user   <-- user always gets THIS response
  |
  +---> Mirror: search-svc-v2 (async, fire-and-forget)
           |
           |  Request sent asynchronously
           v
        search-svc-v2 processes the request
           |
           v
        Response DISCARDED by Envoy
        (user never sees this response)
        But: Prometheus metrics still recorded!
        And: access logs still generated!
```
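The control flow in the diagram can be simulated in a few lines. This is a minimal sketch of the fire-and-forget pattern, not Envoy's implementation: the service names and latencies are stand-ins, and the `-shadow` Host rewrite mimics what Envoy does to mirrored requests.

```python
import asyncio

async def call_primary(request: dict) -> dict:
    # Stand-in for the real upstream call to search-svc-v1.
    await asyncio.sleep(0.01)
    return {"status": 200, "body": f"v1 result for {request['query']}"}

async def call_shadow(request: dict) -> None:
    # Stand-in for search-svc-v2. Even if this is slow or raises,
    # the user-facing path below never notices.
    await asyncio.sleep(0.05)  # pretend it is slow relative to the primary
    # A response would be produced here -- and then thrown away.

async def handle(request: dict) -> dict:
    # Copy the request and rewrite the Host header so the shadow
    # can recognize mirrored traffic (Envoy appends "-shadow").
    mirrored = dict(request)
    mirrored["host"] = request["host"] + "-shadow"

    # Fire-and-forget: schedule the mirror, do NOT await it.
    # No back-pressure -- a slow shadow cannot delay the user.
    asyncio.ensure_future(call_shadow(mirrored))

    # The user's response comes only from the primary.
    return await call_primary(request)

async def main() -> dict:
    resp = await handle({"host": "search-svc.production.svc.cluster.local",
                         "query": "rust"})
    await asyncio.sleep(0.1)  # only so the demo's shadow task finishes before exit
    return resp

resp = asyncio.run(main())
print(resp["body"])
```

Note that `handle` returns as soon as the primary responds; the shadow call is still in flight, which is exactly why the shadow can be 10x slower without users noticing.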
What the shadow service sees:
- Real production request headers and body
- Modified Host header: search-svc.production.svc.cluster.local-shadow
(allows the shadow service to identify mirrored traffic)
- All original query parameters and payload
What the shadow service does NOT see:
- The user -- responses are never returned
- Back-pressure -- if shadow is slow, Envoy does not wait
Setting up a safe traffic mirror
1. Deploy the shadow service -- ensure it has its own database/state (must not share write state with primary)
2. Configure VirtualService with `mirror` and `mirrorPercentage`
3. Verify shadow service receives traffic: check its Prometheus metrics or access logs
4. Compare P50/P99 latency between primary and shadow in Grafana
5. Check error rates: shadow service should have similar 2xx rate if implementation is correct
6. Look for behavioral differences: same input, different output = regression
7. Remove mirroring once shadow service is validated -- no need to run mirrors indefinitely
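The behavioral check in the checklist (same input, different output = regression) is what tools like Diffy automate. A minimal sketch of the idea, with hypothetical field names and a hand-rolled `diff_responses` helper -- real diffing tools also classify noisy fields statistically:

```python
def diff_responses(primary: dict, shadow: dict,
                   ignore=frozenset({"timestamp", "request_id"})) -> list:
    """Return field-level differences, skipping known-volatile keys
    (timestamps, request IDs) that legitimately differ per call."""
    diffs = []
    for key in sorted(set(primary) | set(shadow)):
        if key in ignore:
            continue
        p, s = primary.get(key), shadow.get(key)
        if p != s:
            diffs.append((key, p, s))
    return diffs

# The same query replayed against v1 (primary) and v2 (shadow):
v1 = {"results": ["a", "b"], "total": 2, "timestamp": 1700000000}
v2 = {"results": ["a", "b"], "total": 3, "timestamp": 1700000042}

diffs = diff_responses(v1, v2)
print(diffs)  # [('total', 2, 3)] -> a behavioral regression worth investigating
```

An empty diff list over many mirrored requests is strong evidence the rewrite is behaviorally equivalent; a persistent diff on one field is a regression the shadow caught for free.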
Shadow services MUST use separate write backends
If you mirror write traffic (POST, PUT) to a shadow service that shares the production database, you will get duplicate writes. Always provision a separate database (or use read replicas) for shadow services. Mirror read traffic (GET) first -- much safer. Only mirror writes when you have explicitly verified isolation.
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: search-svc
  namespace: production
spec:
  hosts:
  - search-svc
  http:
  - route:
    - destination:
        host: search-svc      # primary: user response comes from here
        subset: v1
      weight: 100             # 100% of user responses from v1
    mirror:
      host: search-svc        # shadow target
      subset: v2              # the new version being tested
    mirrorPercentage:
      value: 10.0             # mirror 10% of requests to v2
                              # (responses always come from v1)
```
```bash
# Verify mirroring is active
kubectl get virtualservice search-svc -n production \
  -o jsonpath='{.spec.http[0].mirror}'

# Check shadow service is receiving traffic
kubectl top pod -l version=v2 -n production
# CPU/memory should show activity if mirroring is working

# Compare latency: primary vs shadow via Prometheus
# Primary: histogram_quantile(0.99, rate(istio_request_duration_milliseconds_bucket{destination_workload="search-svc-v1"}[5m]))
# Shadow:  histogram_quantile(0.99, rate(istio_request_duration_milliseconds_bucket{destination_workload="search-svc-v2"}[5m]))

# Check that the shadow Host header indicates mirrored traffic
# Shadow service logs will show: Host: search-svc.production.svc.cluster.local-shadow

# Remove mirroring after validation
kubectl patch virtualservice search-svc -n production \
  --type='json' \
  -p='[{"op":"remove","path":"/spec/http/0/mirror"},{"op":"remove","path":"/spec/http/0/mirrorPercentage"}]'
```
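Once the two PromQL queries above are wired into a dashboard or script, the pass/fail decision can be a simple gate. A sketch with illustrative thresholds (the 1.5x ratio, the 100ms budget, and the 120ms primary P99 are all assumptions, not values from the incident):

```python
def latency_gate(primary_p99_ms: float, shadow_p99_ms: float,
                 max_ratio: float = 1.5, max_abs_ms: float = 100.0):
    """Decide whether the shadow's P99 is acceptable relative to the primary.

    Passes if the shadow stays within `max_ratio` of the primary OR
    under an absolute budget; both thresholds are illustrative and
    should be tuned per service.
    """
    ratio = shadow_p99_ms / primary_p99_ms
    ok = ratio <= max_ratio or shadow_p99_ms <= max_abs_ms
    return ok, ratio

# Values shaped like the incident: assume a ~120ms primary P99,
# and the shadow's observed 1.2s P99.
ok, ratio = latency_gate(120.0, 1200.0)
print(ok, round(ratio, 1))  # False 10.0 -- the 10x regression the mirror caught
```

Running this gate continuously against mirrored traffic is what turns "benchmarks said 3x faster" into "production traffic says 10x slower" before any user is exposed.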
Blast radius of mirroring misconfiguration
Wrong: mirroring writes to a shadow that shares the production database

```yaml
# Shadow service deployment -- shares production DB
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-svc-v2
spec:
  template:
    spec:
      containers:
      - name: payment-svc
        env:
        - name: DATABASE_URL
          value: "postgres://prod-db:5432/payments"  # SHARED with v1!
# VirtualService mirrors POST /payments to v2
# Result: each production payment generates TWO database writes
# Duplicate orders, double charges, data corruption
```

Right: shadow service with a separate, isolated database

```yaml
# Shadow service deployment -- separate shadow database
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-svc-v2
spec:
  template:
    spec:
      containers:
      - name: payment-svc
        env:
        - name: DATABASE_URL
          value: "postgres://shadow-db:5432/payments"  # ISOLATED shadow DB
# Mirror only GET endpoints first:
# - Mirror reads to validate correctness and latency
# - Only mirror writes after explicitly verifying isolation
# - Or: instrument v2 to discard writes when the Host header has the '-shadow' suffix
```

The shadow service must have a completely separate write backend. The safest approach is to mirror only read operations (GET requests) first. If you must mirror writes, instrument the shadow service to check the Host header for the `-shadow` suffix and skip database writes -- treat mirrored requests as read-only validation exercises.
| Strategy | User impact | Validates | When to use | Rollback |
|---|---|---|---|---|
| Traffic mirroring | Zero -- responses discarded | Performance, errors, behavior | Pre-canary validation, rewrite testing | Instant -- remove mirror config |
| Canary (1-10%) | Small % of users see new version | User impact + performance | Incremental rollout after mirror validation | Seconds -- set weight to 0 |
| Blue-green | Instant cutover (or instant rollback) | Full prod validation | Major version changes, requires 2x resources | Seconds -- switch VS route |
| A/B testing | Specific user segment | UX, conversion, behavior | Feature experiments, user research | Remove header match rule |
Traffic mirroring
📖 What the exam expects
Traffic mirroring (or shadowing) sends a copy of production traffic to a new service version. The shadow service's response is discarded -- users always receive the primary service's response. This allows testing under real production load with zero user impact.
Asked in senior SRE and platform interviews as a follow-on to canary questions. "Is there a safer way to test than a canary?" leads here.
Common questions:
Strong answer: Mentions N+1 / cache miss scenarios that benchmarks miss, knows the write isolation requirement, uses mirror before canary as standard practice.
Red flags: Thinks mirror responses affect users, does not know about write isolation, has not thought about benchmark vs production traffic differences.