Think like an architect.
Production constraints. Hard trade-offs. Expert-graded decisions. Build the judgment that separates senior engineers from architects.
Design a Multi-Tenant Kubernetes Cluster
Consolidate 50 teams onto shared clusters while maintaining isolation, fairness, and compliance.
Kubernetes Pod Won't Start: Debug the CrashLoopBackOff
Production pods are in CrashLoopBackOff after a deploy. Diagnose from logs, events, and metrics, then fix it.
EKS vs AKS vs GKE for a Healthcare Platform
Your healthcare startup needs managed Kubernetes with HIPAA compliance. Compare the three major providers and recommend one.
Design a GitOps Deployment Pipeline
Migrate 15 microservices from Jenkins to GitOps with automated promotion across dev, staging, and production.
Implement Zero-Trust Security for Microservices
Design a zero-trust architecture for a healthcare fintech with 30 microservices, HIPAA compliance, and multi-cloud requirements.
Reduce AWS Bill by 40%
Cut a $45K/month AWS bill by 40% in 90 days for a Series B startup without impacting production reliability.
Migrate a Monolith to Microservices
Decompose an 8-year-old e-commerce monolith into microservices while serving 2M daily active users with zero downtime.
Design an Observability Stack for Distributed Systems
Unify metrics, logs, and traces for 50 microservices to cut P1 triage time from 45 minutes to under 10.
Identify and Eliminate Single Points of Failure
Review a financial platform architecture, find every SPOF, and design a remediation roadmap for 99.99% availability.
Choose the Right Database for a Social Platform
Select the optimal database stack for a developer social platform with profiles, social graph, activity feeds, and messaging.
Design a Serverless Runtime with Firecracker MicroVMs
Build a multi-tenant serverless platform using Firecracker microVMs that handles 10K concurrent functions with sub-200ms cold starts.
Design a Kubernetes Autoscaling Strategy
Choose between HPA, KEDA, and VPA for a 30-service SaaS platform where some services need event-driven scaling and others need metric-based scaling.
Choose a Deployment Strategy for Weekly Releases
Pick the right deployment strategy for an e-commerce platform processing $2M/day that is moving from monthly to weekly releases.
Design a Disaster Recovery Strategy for a FinTech Platform
Design a DR architecture for a payment processing platform with RTO 15 minutes and RPO 1 hour, currently running in a single AWS region with no DR.
Design a Terraform State and Module Strategy for 5 Teams
Fix a Terraform setup where 5 teams share state files, causing corruption and conflicts. Design remote state, workspaces, and a module registry.
Design SLOs and Error Budgets for a B2B SaaS Platform
A payments company has no SLOs. Define meaningful SLIs, set error budgets, and decide what happens when the budget is exhausted.
Cut CI Pipeline from 47 Minutes to Under 10
A monorepo CI pipeline takes 47 minutes end-to-end. Engineers have stopped running tests locally and now batch changes to reduce CI waits. Fix it.
Streaming vs Batch: Redesign a Broken Analytics Pipeline
An e-commerce analytics pipeline runs nightly batch jobs that take 14 hours to complete, so product teams are making decisions on yesterday's data. Decide what to stream, what to batch, and how to handle late-arriving events.
Build Interruption-Tolerant ML Training on Spot Instances
An ML team's training jobs fail completely when Spot instances are reclaimed. Design a checkpoint strategy and fault-tolerant architecture to cut compute costs by 70% without sacrificing training reliability.
Audit and Harden IAM for a Multi-Account AWS Organization
A security audit reveals 23 IAM users with AdministratorAccess, cross-account roles with * resource wildcards, and production credentials stored on developer laptops. Design the remediation.
Design a Global CDN and Traffic Strategy for a Multi-Region SaaS
A video SaaS gets 40% of its traffic from Southeast Asia, where API latency is 800ms and users are complaining and churning. Design a CDN, edge caching, and traffic routing strategy without replicating all backend services.
Active-Active vs Active-Passive: Design Multi-Region Failover for a Healthcare SaaS
A healthcare platform needs 99.99% uptime to meet HIPAA requirements after a 4-hour regional AWS outage. Choose between active-active and active-passive multi-region, and design the failover automation.
Design an Internal Developer Platform for 500 Engineers
Engineers at a 500-person company wait 3 weeks for new service provisioning and spend 30% of their time on cloud ops instead of product work. Design an IDP that reduces provisioning to 30 minutes and eliminates toil.
Harden Software Supply Chain After a Dependency Compromise
A popular npm package your company depends on was compromised: attackers injected malicious code that exfiltrated environment variables. Design a software supply chain security strategy to prevent the next SolarWinds-style attack.
Choose a Data Warehouse: Snowflake vs BigQuery vs Redshift
A fast-growing SaaS must migrate off a 4TB PostgreSQL analytics database that takes 14 hours to run month-end reports. Evaluate Snowflake, BigQuery, and Redshift against its specific workload, team, and cost constraints.
Design an API Gateway for 15 Exposed Microservices
An e-commerce company has 15 microservices all directly exposed to the internet. Design an API gateway pattern that adds resilience, security, and observability.
Safely Introduce Chaos Engineering for a Payment Service
A payment service claims 99.9% SLO but has never been tested under failure conditions. Design a chaos engineering program that validates resilience safely, starting with steady-state hypothesis definition.
Container Registry Chaos: One Repo for 40 Services
A team pushing all Docker images to a single ECR repository is causing tag conflicts, failed deployments, and runaway storage costs. Fix the registry architecture.
Investigate a 40% AWS Bill Spike with No Alerting in Place
An AWS bill jumped 40% overnight with no alerting. Design a cost anomaly detection, root cause investigation, and preventive tagging strategy for an engineering team.
Single AWS Account Chaos: Design Multi-Account Architecture
Production, staging, and dev workloads all share a single AWS account. A misconfigured dev IAM policy once gave a contractor access to production RDS. Design a proper multi-account structure.
Zero-Downtime PostgreSQL Column Rename on a 50M Row Table
A 50-million row PostgreSQL table needs a column renamed during business hours without downtime. Design the expand-contract migration pattern safely.
Choose Between EKS Fargate and Managed Node Groups
A startup is choosing between EKS Fargate and managed node groups for their microservices. Design a hybrid strategy that optimizes cost, operational burden, and workload compatibility.
DB Triggers Are Killing Your Monolith: Move to Events
A monolith uses database triggers for downstream notifications, creating tight coupling, cascading failures, and load spikes. Migrate to an event-driven architecture.
Stop Manual Drift in an ArgoCD GitOps Environment
Engineers are using kubectl apply directly in a cluster managed by ArgoCD, causing config drift between Git and production. Design a GitOps policy to detect and prevent drift.
Containers Running as Root in Production K8s
A Kubernetes cluster has containers running as root with privileged mode enabled on several workloads. A security audit has flagged critical findings. Harden it.
EKS 3 Versions Behind: Zero-Downtime Upgrade Strategy
A 200-node production EKS cluster is 3 minor versions behind, and AWS is ending support in 60 days. Plan a zero-downtime upgrade strategy.
Implement Network Isolation Across a Kubernetes Cluster
A Kubernetes cluster where all pods can reach all other pods has failed a security audit. Design NetworkPolicy rules to implement namespace isolation and least-privilege network access.
Reduce Lambda Auth Function Cold Starts from 1200ms to Under 200ms
An authentication Lambda runs at 1200ms p99 with half the latency coming from cold starts. Design a cold start optimization strategy using package size reduction, runtime choice, and provisioned concurrency.
$18K/Month CloudWatch Bill: Fix the Logging Architecture
A platform sending 500GB/day to CloudWatch Logs is spending $18,000/month. 90% of logs are debug noise never queried in production. Redesign the pipeline to cut costs without losing observability.
Fix a Message Queue Losing Orders on Service Restart
An order processing system is dropping messages when the consumer service restarts. Design durable messaging with dead-letter handling and backpressure to fix the data loss.
$45K/Month GPU Bill: 80% Idle Time on ML Infrastructure
An ML team is running GPU instances 24/7 for experiments with 80% idle time, costing $45,000/month. Redesign the infrastructure to cut costs while maintaining model training and inference capability.
Grafana Dashboard Chaos: Version Control Your Observability
The company's 47 Grafana dashboards diverge between teams, alerts get accidentally deleted, and nobody knows which version is "correct." Implement observability-as-code.
RDS PostgreSQL at 95% CPU: Diagnose and Fix
An RDS PostgreSQL instance is hitting 95% CPU during peak hours. EXPLAIN shows sequential scans on critical queries. Engineers want to upgrade the instance, but the real fix is elsewhere.
Fix 2.3-Second Product Page Loads with Redis Caching
A product catalog page makes 15 database queries and loads in 2.3 seconds. Design a Redis caching strategy with correct invalidation, graceful degradation, and protection against the hot key problem.
Every Pod Uses the Same IAM Role: Fix Workload Identity
All 30 microservices in an EKS cluster share a single node IAM role with broad S3, DynamoDB, and SQS permissions. A single compromised container can access all data. Implement per-pod IAM identity.
Fix Invisible 5-Second Latency Spikes Across 12 Services
A 12-service application has intermittent 5-second latency spikes with no visibility into which service is the culprit. Design distributed tracing to identify and fix the root cause.
Taming 200 Terraform Files: Build a Module Library
Six teams each wrote their own VPC, EKS, and RDS Terraform configs. The result is 200 files of copy-paste with inconsistent security baselines and no shared standards. Fix it with a proper module design.
Migrate 50 Microservices from Hardcoded DB Passwords to Vault
50 microservices have hardcoded database passwords in Kubernetes Secrets and environment variables. Design a migration to HashiCorp Vault with dynamic secrets and automatic rotation.
Ship Your First Feature to Production
Your code works on localhost. Now you need to get it to real users. Choose how to deploy, test, and roll back safely.
Your Boss Wants 99.99% Uptime. Now What?
Your manager promised a client "four nines" reliability. Figure out what that actually means, how to measure it, and what to do when the error budget runs out.