50 Scenarios Live

Think like an architect.

Production constraints. Hard trade-offs. Expert-graded decisions. Build the judgment that separates senior engineers from architects.

Kubernetes
🏗️

Design a Multi-Tenant Kubernetes Cluster

Consolidate 50 teams onto shared clusters while maintaining isolation, fairness, and compliance.

30m · 4 · Architecture Decision
Kubernetes
🚨

Kubernetes Pod Won't Start: Debug the CrashLoopBackOff

Production pods are in CrashLoopBackOff after a deploy. Diagnose from logs, events, and metrics, then fix it.

15m · 3 · Incident Response
Cloud Infrastructure
⚖️

EKS vs AKS vs GKE for a Healthcare Platform

Your healthcare startup needs managed Kubernetes with HIPAA compliance. Compare the three major providers and recommend one.

25m · 4 · Trade-off Analysis
CI/CD
🏗️

Design a GitOps Deployment Pipeline

Migrate 15 microservices from Jenkins to GitOps with automated promotion across dev, staging, and production.

20m · 3 · Architecture Decision
Security
🔍

Implement Zero-Trust Security for Microservices

Design a zero-trust architecture for a healthcare fintech with 30 microservices, HIPAA compliance, and multi-cloud requirements.

30m · 4 · Design Review
FinOps
💰

Reduce AWS Bill by 40%

Cut a $45K/month AWS bill by 40% in 90 days for a Series B startup without impacting production reliability.

20m · 3 · Cost Optimization
Platform Engineering
🚀

Migrate a Monolith to Microservices

Decompose an 8-year-old e-commerce monolith into microservices while serving 2M daily active users with zero downtime.

45m · 5 · Migration Planning
Observability
🏗️

Design an Observability Stack for Distributed Systems

Unify metrics, logs, and traces for 50 microservices to cut P1 triage time from 45 minutes to under 10.

20m · 3 · Architecture Decision
Cloud Infrastructure
🔍

Identify and Eliminate Single Points of Failure

Review a financial platform architecture, find every SPOF, and design a remediation roadmap for 99.99% availability.

25m · 4 · Design Review
Data & Storage
⚖️

Choose the Right Database for a Social Platform

Select the optimal database stack for a developer social platform with profiles, social graph, activity feeds, and messaging.

15m · 3 · Trade-off Analysis
Cloud Infrastructure
🏗️

Design a Serverless Runtime with Firecracker MicroVMs

Build a multi-tenant serverless platform using Firecracker microVMs that handles 10K concurrent functions with sub-200ms cold starts.

30m · 4 · Architecture Decision
Kubernetes
🏗️

Design a Kubernetes Autoscaling Strategy

Choose between HPA, KEDA, and VPA for a 30-service SaaS platform where some services need event-driven scaling and others need metric-based scaling.

20m · 3 · Architecture Decision
CI/CD
🏗️

Choose a Deployment Strategy for Weekly Releases

Pick the right deployment strategy for an e-commerce platform processing $2M/day that is moving from monthly to weekly releases.

18m · 3 · Architecture Decision
Cloud Infrastructure
🏗️

Design a Disaster Recovery Strategy for a FinTech Platform

Design a DR architecture for a payment processing platform with RTO 15 minutes and RPO 1 hour, currently running in a single AWS region with no DR.

25m · 3 · Architecture Decision
Platform Engineering
🏗️

Design a Terraform State and Module Strategy for 5 Teams

Fix a Terraform setup where 5 teams share state files, causing corruption and conflicts. Design remote state, workspaces, and a module registry.

20m · 3 · Architecture Decision
Observability
🏗️

Design SLOs and Error Budgets for a B2B SaaS Platform

A payments company has no SLOs. Define meaningful SLIs, set error budgets, and decide what happens when the budget is exhausted.

18m · 3 · Architecture Decision
CI/CD
💰

Cut CI Pipeline from 47 Minutes to Under 10

A monorepo CI pipeline takes 47 minutes end-to-end. Engineers have stopped running tests locally and now batch changes to reduce CI waits. Fix it.

16m · 3 · Cost Optimization
Data & Storage
🏗️

Streaming vs Batch: Redesign a Broken Analytics Pipeline

An e-commerce analytics pipeline runs nightly batch jobs that take 14 hours to complete, so product teams are making decisions on yesterday's data. Decide what to stream, what to batch, and how to handle late-arriving events.

20m · 3 · Architecture Decision
FinOps
💰

Build Interruption-Tolerant ML Training on Spot Instances

An ML team's training jobs fail completely when Spot instances are reclaimed. Design a checkpoint strategy and fault-tolerant architecture to cut compute costs by 70% without sacrificing training reliability.

18m · 3 · Cost Optimization
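As a starting point for the Spot scenario, here is a minimal sketch of checkpoint-and-resume. It is an illustration under assumptions, not a reference design: the function names, the JSON state file, and the `interrupt_at` parameter (which simulates a reclaim) are all hypothetical, and a real job would checkpoint model and optimizer state to durable storage rather than a local file.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then rename, so an interruption never leaves a torn file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss_sum": 0.0}


def train(path: str, total_steps: int, checkpoint_every: int, interrupt_at=None) -> dict:
    """Resume from the last checkpoint and run to total_steps.

    interrupt_at is a hypothetical knob that simulates a Spot reclaim.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if interrupt_at is not None and step == interrupt_at:
            raise InterruptedError("simulated Spot reclaim")
        state["loss_sum"] += 1.0 / (step + 1)  # stand-in for a real training step
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "ckpt.json")
        try:
            train(ckpt, total_steps=10, checkpoint_every=3, interrupt_at=7)
        except InterruptedError:
            print("interrupted; last saved step:", load_checkpoint(ckpt)["step"])
        print("final step after resume:", train(ckpt, 10, 3)["step"])
```

The trade-off to weigh is checkpoint frequency: work done since the last checkpoint is repeated after a reclaim, while more frequent checkpoints add I/O overhead.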
Security
🔍

Audit and Harden IAM for a Multi-Account AWS Organization

A security audit reveals 23 IAM users with AdministratorAccess, cross-account roles with * resource wildcards, and production credentials stored on developer laptops. Design the remediation.

20m · 3 · Design Review
Cloud Infrastructure
🏗️

Design a Global CDN and Traffic Strategy for a Multi-Region SaaS

A video SaaS has 40% of traffic from Southeast Asia with 800ms API latency, and users are complaining and churning. Design a CDN, edge caching, and traffic routing strategy without replicating all backend services.

18m · 3 · Architecture Decision
Cloud Infrastructure
🏗️

Active-Active vs Active-Passive: Design Multi-Region Failover for a Healthcare SaaS

A healthcare platform needs 99.99% uptime to meet HIPAA requirements after a 4-hour regional AWS outage. Choose between active-active and active-passive multi-region, and design the failover automation.

25m · 3 · Architecture Decision
Platform Engineering
🏗️

Design an Internal Developer Platform for 500 Engineers

Engineers at a 500-person company wait 3 weeks for new service provisioning and spend 30% of their time on cloud ops instead of product work. Design an IDP that reduces provisioning to 30 minutes and eliminates toil.

25m · 3 · Architecture Decision
Security
🔍

Harden Software Supply Chain After a Dependency Compromise

A popular npm package your company depends on was compromised: attackers injected malicious code that exfiltrated environment variables. Design a software supply chain security strategy to prevent the next SolarWinds-style attack.

22m · 3 · Design Review
Data & Storage
⚖️

Choose a Data Warehouse: Snowflake vs BigQuery vs Redshift

A fast-growing SaaS must migrate from a 4TB PostgreSQL analytics database that takes 14 hours to run month-end reports. Evaluate Snowflake, BigQuery, and Redshift against specific workload, team, and cost constraints.

22m · 3 · Trade-off Analysis
Cloud Infrastructure
🏗️

Design an API Gateway for 15 Exposed Microservices

An e-commerce company has 15 microservices all directly exposed to the internet. Design an API gateway pattern that adds resilience, security, and observability.

16m · 3 · Architecture Decision
Observability
🔍

Safely Introduce Chaos Engineering for a Payment Service

A payment service claims 99.9% SLO but has never been tested under failure conditions. Design a chaos engineering program that validates resilience safely, starting with steady-state hypothesis definition.

17m · 3 · Design Review
CI/CD
🏗️

Container Registry Chaos: One Repo for 40 Services

A team pushing all Docker images to a single ECR repository is causing tag conflicts, failed deployments, and runaway storage costs. Fix the registry architecture.

14m · 3 · Architecture Decision
FinOps
🚨

Investigate a 40% AWS Bill Spike with No Alerting in Place

An AWS bill jumped 40% overnight with no alerting. Design a cost anomaly detection, root cause investigation, and preventive tagging strategy for an engineering team.

15m · 3 · Incident Response
Cloud Infrastructure
🏗️

Single AWS Account Chaos: Design Multi-Account Architecture

Production, staging, and dev workloads all share a single AWS account. A misconfigured dev IAM policy once gave a contractor access to production RDS. Design a proper multi-account structure.

17m · 3 · Architecture Decision
Data & Storage
🚀

Zero-Downtime PostgreSQL Column Rename on a 50M Row Table

A 50-million row PostgreSQL table needs a column renamed during business hours without downtime. Design the expand-contract migration pattern safely.

17m · 3 · Migration Planning
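The expand-contract pattern named in the column-rename scenario can be sketched as an ordered list of phases plus a batched backfill. This is a rough outline under assumptions: the table and column names are hypothetical, the table is assumed to have a dense integer primary key called `id`, and the batch size is arbitrary. The "deploy" entries are application changes, shown here as SQL comments.

```python
from typing import Iterator, List, Tuple


def expand_contract_steps(table: str, old: str, new: str, col_type: str) -> List[str]:
    """Ordered phases of an expand-contract rename (SQL statements or deploy notes)."""
    return [
        # Expand: add the new column next to the old one (nullable, no default).
        f"ALTER TABLE {table} ADD COLUMN {new} {col_type};",
        # Application change: dual-write so new rows populate both columns.
        f"-- deploy: write to both {old} and {new}",
        # Copy historical values in small batches to keep locks short.
        f"-- backfill {new} from {old} in batches",
        # Application change: switch reads, then stop writing the old column.
        f"-- deploy: read from {new}, stop writing {old}",
        # Contract: drop the old column once nothing references it.
        f"ALTER TABLE {table} DROP COLUMN {old};",
    ]


def backfill_batches(max_id: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (low, high) id ranges covering 1..max_id."""
    start = 1
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield start, end
        start = end + 1


def backfill_sql(table: str, old: str, new: str, low: int, high: int) -> str:
    """One batch of the backfill; skips rows the dual-write already filled."""
    return (
        f"UPDATE {table} SET {new} = {old} "
        f"WHERE id BETWEEN {low} AND {high} AND {new} IS NULL;"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for step in expand_contract_steps("orders", "cust_name", "customer_name", "text"):
        print(step)
    batches = list(backfill_batches(50_000_000, 10_000))
    print(len(batches), "batches; first:", backfill_sql("orders", "cust_name", "customer_name", *batches[0]))
```

Each phase can be deployed and rolled back independently, which is the point of the pattern: the old column keeps working until the final drop.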
Kubernetes
⚖️

Choose Between EKS Fargate and Managed Node Groups

A startup is choosing between EKS Fargate and managed node groups for their microservices. Design a hybrid strategy that optimizes cost, operational burden, and workload compatibility.

15m · 3 · Trade-off Analysis
Cloud Infrastructure
🚀

DB Triggers Are Killing Your Monolith: Move to Events

A monolith uses database triggers for downstream notifications, creating tight coupling, cascading failures, and load spikes. Migrate to an event-driven architecture.

16m · 3 · Migration Planning
CI/CD
🏗️

Stop Manual Drift in an ArgoCD GitOps Environment

Engineers are using kubectl apply directly in a cluster managed by ArgoCD, causing config drift between Git and production. Design a GitOps policy to detect and prevent drift.

15m · 3 · Architecture Decision
Kubernetes
🔍

Containers Running as Root in Production K8s

A Kubernetes cluster has containers running as root with privileged mode enabled on several workloads. A security audit has flagged critical findings. Harden it.

15m · 3 · Design Review
Kubernetes
🚀

EKS 3 Versions Behind: Zero-Downtime Upgrade Strategy

A 200-node production EKS cluster is 3 minor versions behind. AWS is ending support in 60 days. Plan a zero-downtime upgrade strategy.

18m · 3 · Migration Planning
Kubernetes
🔍

Implement Network Isolation Across a Kubernetes Cluster

A Kubernetes cluster where all pods can reach all other pods has failed a security audit. Design NetworkPolicy rules to implement namespace isolation and least-privilege network access.

16m · 3 · Design Review
Cloud Infrastructure
💰

Reduce Lambda Auth Function Cold Starts from 1200ms to Under 200ms

An authentication Lambda runs at 1200ms p99 with half the latency coming from cold starts. Design a cold start optimization strategy using package size reduction, runtime choice, and provisioned concurrency.

16m · 3 · Cost Optimization
Observability
💰

$18K/Month CloudWatch Bill: Fix the Logging Architecture

A platform sending 500GB/day to CloudWatch Logs is spending $18,000/month. 90% of logs are debug noise never queried in production. Redesign the pipeline to cut costs without losing observability.

15m · 3 · Cost Optimization
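The figures in the CloudWatch scenario support a quick back-of-the-envelope estimate. The sketch below assumes a 30-day month and, more importantly, that the bill scales linearly with log volume; a real bill mixes ingestion, storage, and query charges, so treat the result as a rough bound rather than a forecast. The function names are hypothetical.

```python
def implied_cost_per_gb(monthly_bill_usd: float, gb_per_day: float, days: int = 30) -> float:
    """Blended cost per GB implied by the bill and the daily log volume."""
    return monthly_bill_usd / (gb_per_day * days)


def projected_bill(monthly_bill_usd: float, fraction_dropped: float) -> float:
    """Bill after dropping a fraction of volume, assuming cost is linear in volume."""
    return monthly_bill_usd * (1 - fraction_dropped)


if __name__ == "__main__":
    per_gb = implied_cost_per_gb(18_000, 500)  # roughly $1.20 per GB
    after = projected_bill(18_000, 0.90)       # about $1,800 if all debug noise is dropped
    print(f"implied cost: ${per_gb:.2f}/GB, projected bill: ${after:,.0f}/month")
```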
Cloud Infrastructure
🏗️

Fix a Message Queue Losing Orders on Service Restart

An order processing system is dropping messages when the consumer service restarts. Design durable messaging with dead-letter handling and backpressure to fix the data loss.

15m · 3 · Architecture Decision
FinOps
💰

$45K/Month GPU Bill: 80% Idle Time on ML Infrastructure

An ML team is running GPU instances 24/7 for experiments with 80% idle time, costing $45,000/month. Redesign the infrastructure to cut costs while maintaining model training and inference capability.

18m · 3 · Cost Optimization
Observability
🔍

Grafana Dashboard Chaos: Version Control Your Observability

Teams have 47 Grafana dashboards that diverge between teams, alerts get accidentally deleted, and nobody knows which version is "correct." Implement observability-as-code.

16m · 3 · Design Review
Data & Storage
🚨

RDS PostgreSQL at 95% CPU: Diagnose and Fix

An RDS PostgreSQL instance is hitting 95% CPU during peak hours. EXPLAIN shows sequential scans on critical queries. Engineers want to upgrade the instance, but the real fix is elsewhere.

16m · 3 · Incident Response
Data & Storage
🏗️

Fix 2.3-Second Product Page Loads with Redis Caching

A product catalog page makes 15 database queries and loads in 2.3 seconds. Design a Redis caching strategy with correct invalidation, graceful degradation, and protection against the hot key problem.

15m · 3 · Architecture Decision
Security
🔍

Every Pod Uses the Same IAM Role: Fix Workload Identity

All 30 microservices in an EKS cluster share a single node IAM role with broad S3, DynamoDB, and SQS permissions. A single compromised container can access all data. Implement per-pod IAM identity.

17m · 3 · Design Review
Observability
🚨

Fix Invisible 5-Second Latency Spikes Across 12 Services

A 12-service application has intermittent 5-second latency spikes with no visibility into which service is the culprit. Design distributed tracing to identify and fix the root cause.

16m · 3 · Incident Response
Platform Engineering
🔍

Taming 200 Terraform Files: Build a Module Library

Six teams each wrote their own VPC, EKS, and RDS Terraform configs. The result is 200 files of copy-paste with inconsistent security baselines and no shared standards. Fix it with a proper module design.

15m · 3 · Design Review
Security
🚀

Migrate 50 Microservices from Hardcoded DB Passwords to Vault

50 microservices have hardcoded database passwords in Kubernetes Secrets and environment variables. Design a migration to HashiCorp Vault with dynamic secrets and automatic rotation.

18m · 3 · Migration Planning
CI/CD
🏗️

Ship Your First Feature to Production

Your code works on localhost. Now you need to get it to real users. Choose how to deploy, test, and roll back safely.

12m · 3 · Architecture Decision
Observability
⚖️

Your Boss Wants 99.99% Uptime. Now What?

Your manager promised a client "four nines" reliability. Figure out what that actually means, how to measure it, and what to do when the error budget runs out.

14m · 3 · Trade-off Analysis
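To make "four nines" concrete, the allowed downtime follows directly from the availability target. A minimal sketch, assuming a 365-day year and a 30-day month; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime budget, in minutes, for an availability target over a period."""
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99):
        yearly = allowed_downtime_minutes(target)
        monthly = allowed_downtime_minutes(target, 30 * 24 * 60)
        print(f"{target}%: {yearly:.1f} min/year, {monthly:.2f} min per 30 days")
```

At 99.99% the budget works out to about 52.6 minutes per year, or roughly 4.3 minutes in a 30-day month, which is why a single slow incident response can exhaust it.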