Think like an architect.

30m · 4 · Architecture Decision

Design a Multi-Tenant Kubernetes Cluster

Consolidate 50 teams onto shared clusters while maintaining isolation, fairness, and compliance.

🚨

Kubernetes Pod Won't Start: Debug the CrashLoopBackOff

Production pods are in CrashLoopBackOff after a deploy. Diagnose from logs, events, and metrics, then fix it.

15m · 3 · Incident Response

⚖️

EKS vs AKS vs GKE for a Healthcare Platform

Your healthcare startup needs managed Kubernetes with HIPAA compliance. Compare the three major providers and recommend one.

25m · 4 · Trade-off Analysis

Design a GitOps Deployment Pipeline

Migrate 15 microservices from Jenkins to GitOps with automated promotion across dev, staging, and production.

Implement Zero-Trust Security for Microservices

Design a zero-trust architecture for a healthcare fintech with 30 microservices, HIPAA compliance, and multi-cloud requirements.

Reduce AWS Bill by 40%

Cut a $45K/month AWS bill by 40% in 90 days for a Series B startup without impacting production reliability.

20m · 3 · Cost Optimization

🚀

Migrate a Monolith to Microservices

Decompose an 8-year-old e-commerce monolith into microservices while serving 2M daily active users with zero downtime.

45m · 5 · Migration Planning

Design an Observability Stack for Distributed Systems

Unify metrics, logs, and traces for 50 microservices to cut P1 triage time from 45 minutes to under 10.

Identify and Eliminate Single Points of Failure

Review a financial platform architecture, find every SPOF, and design a remediation roadmap for 99.99% availability.

Choose the Right Database for a Social Platform

Select the optimal database stack for a developer social platform with profiles, social graph, activity feeds, and messaging.

15m · 3 · Trade-off Analysis

30m · 4 · Architecture Decision

Design a Serverless Runtime with Firecracker MicroVMs

Build a multi-tenant serverless platform using Firecracker microVMs that handles 10K concurrent functions with sub-200ms cold starts.

Design a Kubernetes Autoscaling Strategy

Choose between HPA, KEDA, and VPA for a 30-service SaaS platform where some services need event-driven scaling and others need metric-based scaling.

18m · 3 · Architecture Decision

Choose a Deployment Strategy for Weekly Releases

Pick the right deployment strategy for an e-commerce platform processing $2M/day that is moving from monthly to weekly releases.

25m · 3 · Architecture Decision

Design a Disaster Recovery Strategy for a FinTech Platform

Design a DR architecture for a payment processing platform with RTO 15 minutes and RPO 1 hour, currently running in a single AWS region with no DR.

Design a Terraform State and Module Strategy for 5 Teams

Fix a Terraform setup where 5 teams share state files, causing corruption and conflicts. Design remote state, workspaces, and a module registry.

18m · 3 · Architecture Decision

Design SLOs and Error Budgets for a B2B SaaS Platform

A payments company has no SLOs. Define meaningful SLIs, set error budgets, and decide what happens when the budget is exhausted.

Cut CI Pipeline from 47 Minutes to Under 10

A monorepo CI pipeline takes 47 minutes end-to-end. Engineers have stopped running tests locally and now batch changes to reduce CI waits. Fix it.

16m · 3 · Cost Optimization

Data & Storage

Streaming vs Batch: Redesign a Broken Analytics Pipeline

An e-commerce analytics pipeline runs nightly batch jobs that take 14 hours to complete, so product teams are making decisions on yesterday's data. Decide what to stream, what to batch, and how to handle late-arriving events.

FinOps

Build Interruption-Tolerant ML Training on Spot Instances

An ML team's training jobs fail completely when Spot instances are reclaimed. Design a checkpoint strategy and fault-tolerant architecture to cut compute costs by 70% without sacrificing training reliability.

18m · 3 · Cost Optimization

18m · 3 · Architecture Decision

Audit and Harden IAM for a Multi-Account AWS Organization

A security audit reveals 23 IAM users with AdministratorAccess, cross-account roles with * resource wildcards, and production credentials stored on developer laptops. Design the remediation.

Design a Global CDN and Traffic Strategy for a Multi-Region SaaS

A video SaaS gets 40% of its traffic from Southeast Asia, where API latency is 800ms and users are complaining and churning. Design a CDN, edge caching, and traffic routing strategy without replicating all backend services.

25m · 3 · Architecture Decision

Active-Active vs Active-Passive: Design Multi-Region Failover for a Healthcare SaaS

A healthcare platform needs 99.99% uptime to meet HIPAA requirements after a 4-hour regional AWS outage. Choose between active-active and active-passive multi-region, and design the failover automation.

25m · 3 · Architecture Decision

Design an Internal Developer Platform for 500 Engineers

Engineers at a 500-person company wait 3 weeks for new service provisioning and spend 30% of their time on cloud ops instead of product work. Design an IDP that reduces provisioning to 30 minutes and eliminates toil.

Harden Software Supply Chain After a Dependency Compromise

A popular npm package your company depends on was compromised: attackers injected malicious code that exfiltrated environment variables. Design a software supply chain security strategy to prevent the next SolarWinds-style attack.

Choose a Data Warehouse: Snowflake vs BigQuery vs Redshift

A fast-growing SaaS must migrate from a 4TB PostgreSQL analytics database that takes 14 hours to run month-end reports. Evaluate Snowflake, BigQuery, and Redshift against its specific workload, team, and cost constraints.

22m · 3 · Trade-off Analysis

16m · 3 · Architecture Decision

Design an API Gateway for 15 Exposed Microservices

An e-commerce company has 15 microservices all directly exposed to the internet. Design an API gateway pattern that adds resilience, security, and observability.

14m · 3 · Architecture Decision

Safely Introduce Chaos Engineering for a Payment Service

A payment service claims 99.9% SLO but has never been tested under failure conditions. Design a chaos engineering program that validates resilience safely, starting with steady-state hypothesis definition.

Container Registry Chaos: One Repo for 40 Services

A team pushing all Docker images to a single ECR repository is causing tag conflicts, failed deployments, and runaway storage costs. Fix the registry architecture.

FinOps

🚨

Investigate a 40% AWS Bill Spike with No Alerting in Place

An AWS bill jumped 40% overnight with no alerting. Design a cost anomaly detection, root cause investigation, and preventive tagging strategy for an engineering team.

15m · 3 · Incident Response

17m · 3 · Architecture Decision

Single AWS Account Chaos: Design Multi-Account Architecture

Production, staging, and dev workloads all share a single AWS account. A misconfigured dev IAM policy once gave a contractor access to production RDS. Design a proper multi-account structure.

Data & Storage

🚀

Zero-Downtime PostgreSQL Column Rename on a 50M Row Table

A 50-million row PostgreSQL table needs a column renamed during business hours without downtime. Design the expand-contract migration pattern safely.

17m · 3 · Migration Planning

⚖️

Choose Between EKS Fargate and Managed Node Groups

A startup is choosing between EKS Fargate and managed node groups for their microservices. Design a hybrid strategy that optimizes cost, operational burden, and workload compatibility.

15m · 3 · Trade-off Analysis

🚀

DB Triggers Are Killing Your Monolith: Move to Events

A monolith uses database triggers for downstream notifications, creating tight coupling, cascading failures, and load spikes. Migrate to an event-driven architecture.

16m · 3 · Migration Planning

15m · 3 · Architecture Decision

Stop Manual Drift in an ArgoCD GitOps Environment

Engineers are using kubectl apply directly in a cluster managed by ArgoCD, causing config drift between Git and production. Design a GitOps policy to detect and prevent drift.

Containers Running as Root in Production K8s

A Kubernetes cluster has containers running as root with privileged mode enabled on several workloads. A security audit has flagged critical findings. Harden it.

EKS 3 Versions Behind: Zero-Downtime Upgrade Strategy

A 200-node production EKS cluster is 3 minor versions behind, and AWS is ending support in 60 days. Plan a zero-downtime upgrade strategy.

18m · 3 · Migration Planning

Implement Network Isolation Across a Kubernetes Cluster

A Kubernetes cluster where all pods can reach all other pods has failed a security audit. Design NetworkPolicy rules to implement namespace isolation and least-privilege network access.

Reduce Lambda Auth Function Cold Starts from 1200ms to Under 200ms

An authentication Lambda runs at 1200ms p99 with half the latency coming from cold starts. Design a cold start optimization strategy using package size reduction, runtime choice, and provisioned concurrency.

16m · 3 · Cost Optimization

$18K/Month CloudWatch Bill: Fix the Logging Architecture

A platform sending 500GB/day to CloudWatch Logs is spending $18,000/month. 90% of logs are debug noise never queried in production. Redesign the pipeline to cut costs without losing observability.

15m · 3 · Cost Optimization

15m · 3 · Architecture Decision

Fix a Message Queue Losing Orders on Service Restart

An order processing system is dropping messages when the consumer service restarts. Design durable messaging with dead-letter handling and backpressure to fix the data loss.

FinOps

$45K/Month GPU Bill: 80% Idle Time on ML Infrastructure

An ML team is running GPU instances 24/7 for experiments with 80% idle time, costing $45,000/month. Redesign the infrastructure to cut costs while maintaining model training and inference capability.

18m · 3 · Cost Optimization

Grafana Dashboard Chaos: Version Control Your Observability

47 Grafana dashboards have diverged between teams, alerts get accidentally deleted, and nobody knows which version is "correct." Implement observability-as-code.

RDS PostgreSQL at 95% CPU: Diagnose and Fix

An RDS PostgreSQL instance is hitting 95% CPU during peak hours. EXPLAIN shows sequential scans on critical queries. Engineers want to upgrade the instance, but the real fix is elsewhere.

16m · 3 · Incident Response

Data & Storage

15m · 3 · Architecture Decision

Fix 2.3-Second Product Page Loads with Redis Caching

A product catalog page makes 15 database queries and loads in 2.3 seconds. Design a Redis caching strategy with correct invalidation, graceful degradation, and protection against the hot key problem.

Every Pod Uses the Same IAM Role: Fix Workload Identity

All 30 microservices in an EKS cluster share a single node IAM role with broad S3, DynamoDB, and SQS permissions. A single compromised container can access all data. Implement per-pod IAM identity.

Fix Invisible 5-Second Latency Spikes Across 12 Services

A 12-service application has intermittent 5-second latency spikes with no visibility into which service is the culprit. Design distributed tracing to identify and fix the root cause.

16m · 3 · Incident Response

Taming 200 Terraform Files: Build a Module Library

Six teams each wrote their own VPC, EKS, and RDS Terraform configs. The result is 200 files of copy-paste with inconsistent security baselines and no shared standards. Fix it with a proper module design.

Migrate 50 Microservices from Hardcoded DB Passwords to Vault

50 microservices have hardcoded database passwords in Kubernetes Secrets and environment variables. Design a migration to HashiCorp Vault with dynamic secrets and automatic rotation.

18m · 3 · Migration Planning

12m · 3 · Architecture Decision

Ship Your First Feature to Production

Your code works on localhost. Now you need to get it to real users. Choose how to deploy, test, and roll back safely.

⚖️

Your Boss Wants 99.99% Uptime. Now What?

Your manager promised a client "four nines" reliability. Figure out what that actually means, how to measure it, and what to do when the error budget runs out.

14m · 3 · Trade-off Analysis
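
To make "four nines" concrete, here is a minimal Python sketch that converts an availability target into an allowed-downtime budget. The function name `allowed_downtime_minutes` and the 30-day and 365-day windows are illustrative assumptions, not something the scenario itself specifies.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: float, window_days: int) -> float:
    """Return the downtime budget, in minutes, for a target over a window.

    Hypothetical helper: assumes availability is measured as simple
    wall-clock uptime across the whole window.
    """
    if not 0 < availability_pct <= 100:
        raise ValueError("availability_pct must be in (0, 100]")
    unavailable_fraction = 1 - availability_pct / 100
    return window_days * MINUTES_PER_DAY * unavailable_fraction


# Compare "three nines" with "four nines" over two common windows.
for target in (99.9, 99.99):
    monthly = allowed_downtime_minutes(target, 30)
    yearly = allowed_downtime_minutes(target, 365)
    print(f"{target}%: {monthly:.2f} min per 30 days, {yearly:.2f} min per 365 days")
```

Under these assumptions, 99.99% leaves about 4.32 minutes of downtime per 30 days and about 52.56 minutes per 365 days, which is the budget the scenario asks you to reason about.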