Roadmap

Site Reliability Engineer (SRE)

From Linux internals to running planet-scale systems, the complete path to a job-ready Site Reliability Engineer.

20 stages · 133 topics · ~84 hours

Curated from the best sources: MDN · Kubernetes · AWS · OWASP · Google SRE & more

SREs command $180K-$350K+ at FAANG. The role is expanding as every company realizes they need production excellence, not just more features.

The complete path: 12 of 133 topics have lessons here; the other 121 are marked "learn anywhere". We won't pretend we cover everything.

Ready to build? See capstone projects

Stage 1 / 20 · 6 topics · 1 lesson

Foundations: What SRE Is

Understand the discipline, its origins, and how SRE differs from and relates to DevOps and traditional ops.

SRE as a Discipline · Google's framing: operations as a software engineering problem. · Google SRE Book

SRE vs DevOps vs Platform Eng [Start here] · How the roles overlap, differ, and report in real orgs.

Toil and Its Elimination · Defining manual, repetitive, automatable work and capping it. · Google SRE Book

Error Budgets · Trading reliability against feature velocity via a budget. · Google SRE Book

SRE Engagement Models [rec] · Embedded, consulting, and shared-ownership team structures. · Google SRE Book

Blameless Culture · Psychological safety as a prerequisite for reliability. · Google SRE Book

Stage 2 / 20 · 8 topics · 0 lessons

Linux & Operating System Internals

The OS is the substrate of everything SRE. Master processes, memory, I/O, and the kernel boundary.

Shell & Core Utilities · Fluency with bash, pipes, grep/awk/sed, find, xargs. · The Linux Command Line

Processes, Threads & Signals · fork/exec, process states, signal handling, zombies. · Linux man pages

Virtual Memory & OOM · Paging, swap, page cache, the OOM killer. · OSTEP (free book)

Filesystems & Inodes · ext4/xfs, mounts, inodes, permissions, links. · Wikipedia

System Calls & strace [rec] · The user/kernel boundary; tracing with strace/ltrace. · Linux man pages

cgroups & Namespaces · Kernel primitives that make containers possible. · Linux man pages

systemd & init · Service units, journald, boot process, targets. · systemd docs

Performance Tools [rec] · top/htop, vmstat, iostat, sar, perf, eBPF basics. · Brendan Gregg

Stage 3 / 20 · 9 topics · 0 lessons

Networking Fundamentals

Distributed systems are networked systems. Know the stack from cables to TLS to load balancers.

OSI & TCP/IP Models · Layered model and how real packets map to it. · Cloudflare Learning

TCP vs UDP · Handshake, flow control, congestion, retransmission. · Cloudflare Learning

IP Addressing & Subnetting · CIDR, private ranges, NAT, IPv4/IPv6. · Cloudflare Learning

DNS · Resolution flow, record types, TTLs, anycast. · Cloudflare Learning [Lab]

HTTP/1.1, HTTP/2, HTTP/3 · Methods, status codes, keep-alive, multiplexing, QUIC. · MDN [Lab]

TLS & Certificates · Handshake, PKI, cert rotation, mTLS. · Cloudflare Learning

Load Balancing · L4 vs L7, algorithms, health checks, connection draining. · Cloudflare Learning [Lab]

Network Debugging · tcpdump, dig, curl, ss, traceroute, mtr. · Cloudflare Learning

CDNs & Edge [rec] · Caching at the edge, origin shielding, geo-routing. · Cloudflare Learning
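
As a small illustration of the CIDR and private-range ideas in the subnetting topic above, here is a sketch using Python's standard ipaddress module. The helper names split_network and find_subnet and the 10.0.0.0/16 range are hypothetical, chosen only for the example; they are not taken from any lesson in this roadmap.

```python
import ipaddress


def split_network(cidr, new_prefix):
    """Split a CIDR block into equal subnets with the given prefix length."""
    return list(ipaddress.ip_network(cidr).subnets(new_prefix=new_prefix))


def find_subnet(address, subnets):
    """Return the first subnet containing the address, or None if none does."""
    ip = ipaddress.ip_address(address)
    for net in subnets:
        if ip in net:
            return net
    return None


if __name__ == "__main__":
    # Hypothetical VPC range, split into four /18 subnets.
    subnets = split_network("10.0.0.0/16", 18)
    for net in subnets:
        print(net, net.num_addresses, "private" if net.is_private else "public")
    print(find_subnet("10.0.70.5", subnets))
```

Each step up in prefix length halves the number of addresses per subnet, which is why a /16 yields exactly four /18 blocks.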

Stage 4 / 20 · 7 topics · 0 lessons

Programming & Automation

SREs write software. Build solid coding skills plus the scripting that automates operations away.

Python for SRE · Scripting, stdlib, requests, automation tooling. · Python tutorial

Go for Infrastructure [rec] · The lingua franca of cloud-native tooling. · Go docs

Advanced Bash Scripting · Robust scripts with error handling and set -euo pipefail. · GNU Bash manual [Lab]

Data Structures & Algorithms · Enough to pass coding interviews and reason about cost. · GeeksforGeeks

Working with APIs · REST, gRPC, pagination, rate limits, idempotency. · MDN

Git & Version Control · Branching, rebasing, hooks, trunk-based development. · Pro Git book

Testing Your Code [rec] · Unit, integration, and linting for ops code. · pytest docs

Stage 5 / 20 · 7 topics · 0 lessons

Distributed Systems Theory

The mental models behind why large systems fail in surprising ways.

CAP & PACELC · Consistency, availability, partition tolerance trade-offs. · Wikipedia

Consistency Models · Strong, eventual, causal, read-your-writes. · Jepsen

Consensus Algorithms [rec] · Paxos, Raft, leader election, quorums. · Raft

Replication & Sharding · Primary-replica, multi-leader, partitioning strategies. · PostgreSQL docs

Idempotency, Retries & Backoff · Safe retries, exponential backoff with jitter. · AWS Builders Library

Distributed Failure Modes · Partial failure, gray failure, clock skew, split brain. · AWS Builders Library

Queues & Messaging [rec] · Kafka, pub/sub, delivery guarantees, backpressure. · AWS docs
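
To make the "exponential backoff with jitter" topic above concrete, here is a minimal sketch of capped exponential backoff with full jitter (each delay is drawn uniformly between zero and the current ceiling). The function names, the default base and cap values, and the injectable sleep parameter are assumptions made for the example. It also assumes the wrapped operation is idempotent, since it may run more than once.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Delay before retry number `attempt` (0-based): uniform in [0, ceiling)."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retries(operation, max_attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep):
    """Call `operation` until it succeeds or attempts run out.

    Assumes `operation` is idempotent. The last failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        # Hypothetical operation that fails twice, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(call_with_retries(flaky))
```

The jitter spreads retries from many clients over time, so they do not all hit a recovering service at the same instant. A production version would normally retry only on errors known to be transient rather than on every exception.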

Stage 6 / 20 · 8 topics · 1 lesson

Cloud Platforms

Modern SRE runs on the cloud. Know the core service categories and at least one provider deeply.

IaaS / PaaS / SaaS · Service models and the shared responsibility model. · Cloudflare Learning

Compute Services · VMs, autoscaling groups, spot instances, serverless. · AWS docs

Cloud Networking · VPCs, subnets, security groups, peering, transit. · AWS docs

Storage Services · Object, block, file storage and durability tiers. · AWS docs

Managed Databases [rec] · RDS/Cloud SQL, replicas, backups, failover. · AWS docs

Cloud IAM · Roles, policies, least privilege, service accounts. · AWS docs

Regions, Zones & Multi-Region · Failure domains and designing for AZ/region loss. · [Lab]

Cloud Cost Management [rec] · FinOps basics, rightsizing, budgets, tagging. · FinOps Foundation

Stage 7 / 20 · 4 topics · 0 lessons

Containers

Packaging and isolating workloads, the foundation under orchestration.

Docker Fundamentals · Images, layers, Dockerfiles, registries, networking. · Docker docs

Image Optimization & Security [rec] · Multi-stage builds, distroless, scanning, signing. · Docker docs

OCI & Runtimes [optional] · containerd, runc, the OCI spec, image format. · OCI

Container Storage & Networking [rec] · Volumes, bind mounts, bridge/host networking. · Docker docs

Stage 8 / 20 · 9 topics · 0 lessons

Kubernetes & Orchestration

The dominant orchestration platform. Operating it well is central to most SRE roles.

Control Plane Architecture · API server, etcd, scheduler, controller manager, kubelet. · Kubernetes docs

Workloads & Controllers · Pods, Deployments, StatefulSets, DaemonSets, Jobs. · Kubernetes docs

Kubernetes Networking · Services, Ingress, CNI, kube-proxy, DNS. · Kubernetes docs

Config & Secrets · ConfigMaps, Secrets, environment injection. · Kubernetes docs

Resource Mgmt & Scheduling · Requests/limits, QoS, affinity, taints, autoscaling. · Kubernetes docs

Persistent Storage [rec] · PV, PVC, StorageClasses, CSI drivers. · Kubernetes docs

Helm & Packaging [rec] · Charts, templating, releases, repositories. · Helm docs

Operators & CRDs [rec] · Extending the API; the controller/reconcile pattern. · Kubernetes docs

Debugging Clusters · kubectl debug, events, CrashLoopBackOff, OOMKilled. · Kubernetes docs

Stage 9 / 20 · 6 topics · 2 lessons

Infrastructure as Code & Config Mgmt

Declarative, version-controlled infrastructure is non-negotiable at scale.

Terraform / OpenTofu · HCL, providers, state, modules, plan/apply. · Terraform docs [Lab]

State Management · Remote state, locking, workspaces, drift.

Configuration Management [rec] · Ansible (and Chef/Puppet) for mutable infra.

Immutable Infrastructure [rec] · Golden images with Packer; cattle not pets. · CNCF Glossary

Policy as Code [optional] · OPA/Conftest, Sentinel, guardrails in pipelines. · OPA docs

Testing IaC [rec] · Validation, terratest, plan review, modules CI. · Terraform docs

Stage 10 / 20 · 7 topics · 3 lessons

CI/CD & Release Engineering

Safe, fast, repeatable delivery is how reliability ships to production.

CI Pipelines · Build, test, artifact stages in GitHub Actions/GitLab. · GitLab docs

Deployment Strategies · Blue-green, canary, rolling, feature flags. · [Lab]

GitOps [rec] · Argo CD / Flux, git as source of truth.

Progressive Delivery [rec] · Automated canary analysis and rollback. · Martin Fowler

Artifacts & Registries · Versioning, immutability, promotion across envs. · JFrog docs

Supply Chain Security [optional] · SBOM, signing (Sigstore), provenance, SLSA.

Rollback & Release Safety · Fast rollback, kill switches, deploy freezes. · Google SRE Workbook

Stage 11 / 20 · 8 topics · 0 lessons

Observability & Monitoring

You cannot operate what you cannot see. The instrumentation core of SRE.

Metrics, Logs & Traces · The three pillars and when each applies. · OpenTelemetry docs

Prometheus & PromQL · Pull model, exporters, querying, recording rules. · Prometheus docs

Dashboards with Grafana · Visualization, RED/USE method dashboards. · Grafana docs

Centralized Logging · Structured logs, Loki/ELK, retention, sampling. · OpenTelemetry docs

Distributed Tracing [rec] · Spans, context propagation, OpenTelemetry. · OpenTelemetry docs

Alerting & Alertmanager · Symptom-based alerts on actionable signals, routing. · Prometheus docs

Cardinality & Cost [rec] · Label explosion, metric cost, retention trade-offs. · Grafana docs

Synthetic & RUM [rec] · Black-box probes and real-user monitoring. · Grafana docs

Stage 12 / 20 · 6 topics · 0 lessons

SLIs, SLOs & Reliability Engineering

The quantitative heart of SRE: defining, measuring, and budgeting reliability.

Defining SLIs · Choosing good indicators from the user's perspective. · Google SRE Book

Setting SLO Targets · Nines, request vs window-based, realistic targets. · Google SRE Book

Error Budget Policy · What happens when the budget burns; enforcement. · Google SRE Workbook

Burn-Rate Alerting · Multi-window multi-burn-rate alerts. · Google SRE Workbook

SLAs & Contracts [rec] · Customer-facing agreements vs internal objectives. · Google SRE Book

Availability Math · Composing dependencies, MTTR/MTBF, the nines table. · Google SRE Book
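
The arithmetic behind the error budget, burn rate, and availability math topics above fits in a few lines. The sketch below is illustrative only: the function names and the 30-day window are assumptions, the serial composition assumes every dependency is required and fails independently, and the MTBF/MTTR formula is the usual steady-state approximation.

```python
def allowed_downtime_minutes(slo, window_days=30):
    """Error budget for a time-based SLO, expressed as minutes of downtime."""
    return (1.0 - slo) * window_days * 24 * 60


def serial_availability(*components):
    """Availability of a request path that needs every component to be up.

    Assumes independent failures, so availabilities multiply.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


def availability_from_mtbf(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def burn_rate(observed_error_ratio, slo):
    """How fast the budget is being spent; 1.0 spends it exactly over the window."""
    return observed_error_ratio / (1.0 - slo)


if __name__ == "__main__":
    print(allowed_downtime_minutes(0.999))     # minutes per 30 days at 99.9%
    print(serial_availability(0.999, 0.999))   # two required 99.9% dependencies
    print(availability_from_mtbf(999, 1))
    print(burn_rate(0.0144, 0.999))
```

Two required dependencies at 99.9% each compose to roughly 99.8%, which is why a service cannot realistically promise more than its critical dependencies deliver.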

Stage 13 / 20 · 8 topics · 0 lessons

Incident Management & On-Call

When things break, this is the SRE's defining moment. Respond, mitigate, and learn.

On-Call Practices · Rotations, escalation, sustainable load, handoffs. · Google SRE Book

Incident Command System · Roles: IC, comms, ops lead; coordination at scale. · Google SRE Book

Severity & Triage · Classifying impact and prioritizing response. · PagerDuty Response

Mitigation First · Stop the bleeding before root-causing. · PagerDuty Response

Incident Communication [rec] · Status pages, stakeholder updates, customer comms. · Google SRE Book

Blameless Postmortems · Timeline, root cause, action items, follow-through. · Google SRE Book

Runbooks & Playbooks · Actionable, tested operational documentation. · Google SRE Workbook

Paging & Tooling [rec] · PagerDuty/Opsgenie, alert hygiene, fatigue. · Prometheus docs

Stage 14 / 20 · 6 topics · 0 lessons

Capacity, Performance & Scalability

Ensuring systems have the headroom to serve load, and finding bottlenecks when they don't.

Capacity Planning · Forecasting demand, headroom, provisioning lead time. · Google SRE Book

Load & Stress Testing · k6/Locust/JMeter, finding the knee of the curve. · Grafana k6 docs

Performance Analysis [rec] · Profiling, flame graphs, USE method, latency tails. · Brendan Gregg

Scaling Patterns · Horizontal vs vertical, autoscaling, sharding. · Azure Architecture Center

Caching Strategies [rec] · CDN, Redis, cache invalidation, stampede protection. · Redis docs

Queueing Theory & Little's Law [optional] · Why latency explodes near saturation. · Wikipedia
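
A quick numeric sketch of the queueing topic above. Little's Law (L = λW) holds for any stable system; the latency formula below additionally assumes the textbook M/M/1 model (one server, Poisson arrivals, exponential service times), which is a simplification and not a claim about any real service. The function names are made up for the example.

```python
def littles_law_concurrency(arrival_rate, mean_latency):
    """Little's Law: average requests in the system = arrival rate * mean latency."""
    return arrival_rate * mean_latency


def mm1_mean_latency(service_time, utilization):
    """Mean time in an M/M/1 system: service_time / (1 - utilization)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    # 200 requests/s at 50 ms mean latency keeps about 10 requests in flight.
    print(littles_law_concurrency(200, 0.05))

    # With a 10 ms service time, mean latency grows sharply as load nears 100%.
    for rho in (0.5, 0.9, 0.99):
        print(rho, mm1_mean_latency(0.010, rho))
```

Under these assumptions, going from 50% to 90% utilization multiplies mean latency by five, and going from 90% to 99% multiplies it by ten again. This is the "knee of the curve" that load tests look for.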

Stage 15 / 20 · 6 topics · 2 lessons

Resilience & Chaos Engineering

Designing systems that degrade gracefully and proving it deliberately.

Resilience Patterns · Timeouts, circuit breakers, bulkheads, fallbacks. · Azure Architecture Center

Graceful Degradation · Load shedding, feature degradation under stress. · Google SRE Book

Rate Limiting & Throttling · Token bucket, quotas, protecting downstreams. · Cloudflare Learning

Chaos Engineering [rec] · Hypothesis-driven fault injection; Chaos Monkey.

Game Days & DR Drills [rec] · Rehearsing failure and validating runbooks. · AWS Well-Architected

Disaster Recovery & BCP · RTO/RPO, backups, failover, region evacuation. · [Lab]
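
The token bucket named in the rate limiting topic above can be sketched in a few lines. This is a minimal single-threaded illustration, not a production limiter: the class name, parameters, and the injectable clock are assumptions made for the example, and a real implementation would need locking or a shared store.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if available, else return False."""
        now = self.clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5, capacity=10)
    accepted = sum(bucket.allow() for _ in range(20))
    print(f"accepted {accepted} of 20 back-to-back requests")
```

The capacity bounds the burst a client can send at once, while the rate bounds sustained throughput, so a downstream service sees a limited load even when callers misbehave.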

Stage 16 / 20 · 6 topics · 0 lessons

Databases & Stateful Systems

Stateful services are the hardest to operate reliably, and where outages hurt most.

Relational DB Operations · Indexes, query plans, connection pools, locks. · PostgreSQL docs

NoSQL Systems [rec] · Key-value, document, wide-column trade-offs. · MongoDB

Replication & Failover · Read replicas, automatic failover, lag handling. · PostgreSQL docs

Backups & Restore Testing · PITR, regular restore drills, the 3-2-1 rule. · PostgreSQL docs

Safe Schema Migrations [rec] · Online migrations, backward compatibility, expand/contract. · Flyway docs

Data Integrity & Corruption [rec] · Checksums, detecting and recovering from corruption. · Google SRE Book

Stage 17 / 20 · 6 topics · 2 lessons

Security & Compliance

Reliability includes security. SREs own the operational side of keeping systems safe.

Secrets Management · Vault, KMS, rotation, no secrets in code. · Vault docs

Least Privilege & RBAC · Scoped access, just-in-time, audit trails. · AWS docs

Network Security [rec] · Firewalls, segmentation, zero trust, WAF. · Cloudflare Learning

Vulnerability Management [rec] · Scanning, patching cadence, CVE response.

Security Incident Response [rec] · Detection, containment, forensics basics.

Compliance & Auditing [optional] · SOC2, PCI, GDPR and their operational impact. · Google Cloud docs

Stage 18 / 20 · 5 topics · 1 lesson

Platform Engineering & Service Mesh

Advanced infrastructure: SREs increasingly build internal platforms and run meshes.

Service Mesh [rec] · Istio/Linkerd: traffic mgmt, mTLS, observability.

Internal Developer Platforms [optional] · Golden paths, self-service, Backstage portals. · internaldeveloperplatform.org

Multi-Cluster & Fleet Mgmt [optional] · Federation, cluster API, fleet-wide rollouts. · Cluster API docs

API Gateways [rec] · Routing, auth, rate limiting at the edge. · Azure Architecture Center

eBPF in Production [optional] · Cilium, observability and security via eBPF. · ebpf.io

Stage 19 / 20 · 5 topics · 0 lessons

Automation, AIOps & Modern Practice

Where SRE is heading: heavy automation, ML-assisted ops, and running AI systems.

Auto-Remediation [rec] · Self-healing systems, event-driven automation. · Google SRE Book

AIOps & Anomaly Detection [optional] · ML for alerting, noise reduction, correlation. · IBM Think

Operating ML/LLM Systems [optional] · GPU fleets, model serving, inference reliability. · Hugging Face docs

Systematic Toil Automation · Measuring toil and building tools to kill it. · Google SRE Book

Observability & Alerts as Code [rec] · Versioned dashboards, SLOs, and alert definitions. · Grafana docs

Stage 20 / 20 · 6 topics · 0 lessons

Career, Interviews & Soft Skills

Landing and thriving in an SRE role takes more than technical depth.

System Design Interviews · Designing scalable, reliable systems on a whiteboard. · System Design Primer

Coding & Scripting Interviews · Algorithmic plus practical ops scripting rounds. · Tech Interview Handbook

Debugging Interviews · The 'why is this slow/down' live-debug round. · GitHub

Portfolio & Home Lab [rec] · Demonstrable projects: a cluster, monitoring, IaC. · Google SRE Book

Communication & Influence · Writing docs, driving reviews, partnering with devs. · Google SRE Book

Reliability Leadership [rec] · Driving culture, prioritization, and reliability roadmaps. · Google SRE Book

You're job-ready.

Clear every stage, earn the certificate, and walk into interviews prepared. The complete path, nothing hidden, no gaps.

Destination reached