Deployment & Infrastructure: From Code to Production

How senior engineers ship code safely and reliably: automated pipelines, container orchestration, zero-downtime deployments, and the IaC practices that prevent configuration drift.

What you'll learn

CI/CD pipelines are mandatory: manual deployments create fear and risk
Tag Docker images with the commit SHA: "latest" is mutable, so it cannot identify a known-good build to roll back to
Zero-downtime deployments: rolling update (default), blue-green (instant rollback), canary (gradual rollout)
Kubernetes resource requests and limits are mandatory: without them, one pod can starve the others on its node
Infrastructure as Code with Terraform: never make manual cloud console changes
readinessProbe gates traffic; livenessProbe triggers a restart. Configure both in production

The cost of manual deployments

Every manual deployment step is a risk: the wrong branch checked out, a step skipped, a config not updated. Teams that deploy manually accumulate "deployment fear" — they delay releases because deployments are risky. This creates a vicious cycle: longer delays → bigger changes → higher risk.

Automated CI/CD breaks this cycle. Every commit is built, tested, and deployable. You deploy small, frequent changes, and when something goes wrong, rollback is one click. This is why high-performing engineering teams (as measured by the DORA metrics) deploy far more often and recover far faster than low performers: the 2017 State of DevOps research reported 46x more frequent deployments.

CI/CD pipeline design

A CI/CD pipeline is a series of automated stages that transform source code into a running service:

1. Trigger: push to branch or PR opened
2. Build: compile, type-check, lint (fast feedback, under 2 minutes)
3. Test: unit tests, integration tests (parallel where possible)
4. Security scan: SAST (static analysis), dependency audit, container image scan
5. Build artifact: Docker image, tagged with commit SHA
6. Deploy to staging: update staging environment, run smoke tests
7. Deploy to production: gate on approval (if needed) or automatic on main
8. Post-deploy: verify SLOs, automated rollback if error rate spikes

Golden rule: fail fast. Put the fastest checks first (lint, type-check) and run tests in parallel. A pipeline that takes 45 minutes gets bypassed: developers batch up changes or skip it.
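The fail-fast rule can be sketched in a few lines. This is an illustration only, not how any particular CI system schedules work: the `Stage` type, its `est_seconds` field, and `run_checks` are hypothetical names, and the sketch assumes the checks are independent of each other so they can be reordered by cost.

```python
from typing import Callable, NamedTuple


class Stage(NamedTuple):
    """One independent check in the pipeline (hypothetical type)."""
    name: str
    est_seconds: int          # rough expected duration, used only for ordering
    run: Callable[[], bool]   # returns True when the check passes


def run_checks(stages: list[Stage]) -> tuple[bool, list[str]]:
    """Run the cheapest checks first and stop at the first failure.

    Returns (passed, names of the stages that actually ran).
    """
    executed: list[str] = []
    for stage in sorted(stages, key=lambda s: s.est_seconds):
        executed.append(stage.name)
        if not stage.run():
            return False, executed   # fail fast: skip every slower stage
    return True, executed
```

With a failing 10-second lint and a 5-minute test suite, only the lint stage runs, so the developer gets the failure in seconds rather than minutes.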

Tag images with commit SHA, not "latest"

"latest" is mutable: it points at whatever was pushed most recently, so it cannot identify the build you want to roll back to. Tag every image with the commit SHA: registry/app:abc1234. This makes rollbacks precise: "deploy the image from commit abc1234."

.github/workflows/deploy.yml

name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  # Placeholder value: replace with your own registry host and namespace
  REGISTRY: registry.example.com/my-team

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - name: Install dependencies
        # npm ci installs exactly what package-lock.json pins, so builds are reproducible
        run: npm ci

      - name: Type check
        run: npm run typecheck  # Fast checks run first for quick feedback

      - name: Lint
        run: npm run lint

      - name: Unit tests (parallel)
        run: npm run test -- --maxWorkers=4

      - name: Build Docker image
        # Tagged with the commit SHA, which enables precise rollbacks
        run: docker build -t ${{ env.REGISTRY }}/app:${{ github.sha }} .

      - name: Push Docker image
        # Push only for commits on main, never for pull requests.
        # Assumes a registry login step (not shown) has already run.
        if: github.event_name == 'push'
        run: docker push ${{ env.REGISTRY }}/app:${{ github.sha }}

  deploy-staging:
    needs: build-and-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4  # Needed for the smoke-test script below

      - name: Deploy to staging
        # Assumes kubectl is already authenticated against the staging cluster
        run: |
          kubectl set image deployment/app-staging \
            app=${{ env.REGISTRY }}/app:${{ github.sha }}
          kubectl rollout status deployment/app-staging --timeout=120s

      - name: Smoke test staging
        run: |
          npm ci
          npm run test:smoke -- --env=staging

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    # Manual approval gate: the "production" environment must have required reviewers configured
    environment: production
    steps:
      - uses: actions/checkout@v4  # Needed for the SLO verification script below

      - name: Deploy to production (rolling update)
        run: |
          kubectl set image deployment/app-prod \
            app=${{ env.REGISTRY }}/app:${{ github.sha }}
          kubectl rollout status deployment/app-prod --timeout=300s

      - name: Verify SLOs post-deploy
        run: |
          npm ci
          npm run verify:slos -- --window=5m
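The workflow above tags every image with the commit SHA. As a rough sketch of why that helps at rollback time, the helpers below build an immutable image reference and pick a rollback target from a deploy history. Both function names and the idea of keeping the history as a simple list are assumptions made for illustration.

```python
import re

# Short (7 chars) through full (40 chars) hexadecimal Git commit SHAs.
SHA_PATTERN = re.compile(r"[0-9a-f]{7,40}")


def image_ref(registry: str, name: str, commit_sha: str) -> str:
    """Return 'registry/name:sha', refusing mutable or malformed tags."""
    sha = commit_sha.strip().lower()
    if not SHA_PATTERN.fullmatch(sha):
        raise ValueError(f"not a commit SHA: {commit_sha!r}")
    return f"{registry}/{name}:{sha}"


def rollback_target(deploy_history: list[str], steps_back: int = 1) -> str:
    """Pick the image deployed `steps_back` releases before the current one.

    `deploy_history` is ordered oldest to newest; the last entry is live.
    """
    if steps_back < 1 or steps_back >= len(deploy_history):
        raise IndexError("not enough deploy history to roll back that far")
    return deploy_history[-1 - steps_back]
```

Because each entry names one exact build, "roll back one release" is a lookup rather than a guess. A history made of "latest" entries would carry no such information.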

Zero-downtime deployment strategies

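Rolling update: Replace old pods with new ones incrementally (by default Kubernetes allows up to 25% extra pods and up to 25% unavailable pods at a time). Default Kubernetes strategy. Zero downtime when readiness probes are configured, but old and new versions serve traffic at the same time, so only backward-compatible API changes are safe.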
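To make the rolling-update trade-off concrete, here is a small sketch that computes how many pods can exist and how many must stay available during a rollout. The function name is hypothetical. The rounding follows the documented Kubernetes behavior: a percentage `maxSurge` rounds up and a percentage `maxUnavailable` rounds down.

```python
import math


def rollout_bounds(replicas: int, max_surge, max_unavailable) -> tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rollout.

    Each setting may be an absolute number (1) or a percentage string ("25%").
    """
    def resolve(value, round_up: bool) -> int:
        if isinstance(value, str) and value.endswith("%"):
            exact = replicas * int(value[:-1]) / 100
            return math.ceil(exact) if round_up else math.floor(exact)
        return int(value)

    surge = resolve(max_surge, round_up=True)
    unavailable = resolve(max_unavailable, round_up=False)
    return replicas - unavailable, replicas + surge
```

For 3 replicas with `maxSurge: 1` and `maxUnavailable: 0`, the result is `(3, 4)`: capacity never drops below the desired count, at the cost of one temporary extra pod.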

Blue-green deployment: Maintain two identical environments (blue = current, green = new). Switch traffic from blue to green. Instant rollback by switching back. Double the infrastructure cost.

Canary deployment: Route a small percentage of traffic (1-5%) to the new version. Monitor SLOs. Gradually increase to 100% if metrics are healthy. Automatically roll back if the error rate spikes. Best for high-risk changes.
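The canary loop described above can be sketched as a pure decision function. This is a minimal illustration, not a real controller: the step percentages, the 1% error budget, and the minimum sample size are assumed values, and `CanaryPolicy` and `next_action` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryPolicy:
    steps: tuple[int, ...] = (1, 5, 25, 50, 100)  # traffic percentages (assumed)
    max_error_rate: float = 0.01                  # assumed error budget for the canary
    min_requests: int = 500                       # do not judge on too small a sample


def next_action(policy: CanaryPolicy, current_percent: int,
                requests: int, errors: int) -> tuple[str, int]:
    """Decide what to do with the canary after one observation window.

    Returns (action, new traffic percentage), where action is one of
    "hold", "rollback", "promote", or "done".
    """
    if requests < policy.min_requests:
        return ("hold", current_percent)        # not enough data yet
    if errors / requests > policy.max_error_rate:
        return ("rollback", 0)                  # drain the canary entirely
    later_steps = [s for s in policy.steps if s > current_percent]
    if not later_steps:
        return ("done", current_percent)        # already serving all traffic
    return ("promote", later_steps[0])
```

Keeping the decision separate from the traffic-shifting mechanism makes it easy to test the rollback thresholds without a cluster.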

Feature flags: Ship code dark (disabled). Enable for specific users or cohorts. Separate deployment from feature launch. Roll back features without rolling back code.

Strategy	Rollback Speed	Infrastructure Cost	Risk	Best For
Rolling update	Minutes (rollback deployment)	1x	Low	Most deployments
Blue-green	Instant (switch LB)	2x	Medium	Database migrations, major changes
Canary	Instant (drain canary)	1.1x	Very low	High-risk changes, algorithm updates
Feature flags	Instant (toggle off)	1x	Very low	A/B tests, gradual rollouts
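For the feature-flag strategy, a common way to enable a feature for a stable cohort is to hash the user ID into a bucket. The sketch below assumes that approach. `bucket` and `is_enabled` are hypothetical helpers, not the API of any flag service.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int,
               allowlist: frozenset[str] = frozenset()) -> bool:
    """Return True if the flag is on for this user.

    Users on the allowlist always get the feature. Everyone else gets it
    when their bucket falls below the rollout percentage.
    """
    if user_id in allowlist:
        return True
    return bucket(flag, user_id) < rollout_percent
```

Because the bucket depends only on the flag name and user ID, a user who sees the feature at 10% keeps seeing it at 50%, and setting the percentage to 0 turns the feature off without a redeploy.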

Kubernetes fundamentals for full stack engineers

Kubernetes (k8s) is a container orchestration platform that handles: scheduling (which node runs this container?), scaling (how many replicas?), self-healing (restart crashed containers), service discovery (how do services find each other?), and rolling updates.

Core objects: Pod (one or more containers sharing network/storage), Deployment (manages desired replica count and rolling updates), Service (stable DNS + load balancing for a set of pods), ConfigMap (non-secret config), Secret (sensitive config; base64-encoded, which is an encoding and not encryption), Ingress (HTTP routing from outside the cluster to Services).

Resource limits are mandatory in production: Without CPU/memory limits, one runaway pod can starve all other pods on the node. Set requests (what the scheduler reserves for the pod when placing it on a node) and limits (the most the pod is allowed to use).
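To see how requests translate into scheduling capacity, the sketch below parses the quantity strings used in manifests (such as 250m and 256Mi) and estimates how many pods fit on a node. It is a simplification under stated assumptions: it handles only the m, Ki, Mi, and Gi suffixes, and it ignores the resources a real node reserves for system daemons. All function names are hypothetical.

```python
def parse_cpu(quantity: str) -> int:
    """Convert a CPU quantity ("250m" or "0.5") to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity ("256Mi", "8Gi", or plain bytes) to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)


def pods_that_fit(node_cpu: str, node_memory: str,
                  request_cpu: str, request_memory: str) -> int:
    """Estimate how many identical pods the scheduler could place on one node.

    Placement is driven by requests, not limits, so only requests are used.
    """
    by_cpu = parse_cpu(node_cpu) // parse_cpu(request_cpu)
    by_memory = parse_memory(node_memory) // parse_memory(request_memory)
    return min(by_cpu, by_memory)
```

For a node with 4 cores and 8Gi of memory and pods requesting 250m and 256Mi, CPU is the binding constraint: 16 pods fit by CPU and 32 by memory, so the estimate is 16.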

k8s-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
  labels:
    app: orders-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod during the update
      maxUnavailable: 0  # Zero downtime: never remove an old pod before its replacement is ready

  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: api
          image: registry/orders-api:abc1234  # Always use a specific tag, never 'latest'
          ports:
            - containerPort: 3000

          # Without requests and limits, one pod can starve the others on its node
          resources:
            requests:
              cpu: 250m       # 0.25 CPU cores reserved by the scheduler
              memory: 256Mi   # 256 MiB of RAM reserved by the scheduler
            limits:
              cpu: 500m       # Throttled above 0.5 cores
              memory: 512Mi   # OOMKilled if exceeded, so set carefully

          readinessProbe:   # Pod receives traffic only while this passes
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3

          livenessProbe:    # Container is restarted if this fails
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 20

          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: database-url
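The manifest above points the two probes at different endpoints because they answer different questions. The sketch below shows one way the application side might implement them. The `HealthChecks` class and its field names are hypothetical, and the dependency checks are stand-ins for real calls such as a database ping.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class HealthChecks:
    """Status codes for /health/live and /health/ready (hypothetical helper)."""
    dependencies: dict[str, Callable[[], bool]] = field(default_factory=dict)
    shutting_down: bool = False

    def live(self) -> int:
        # Liveness only says "the process can still answer requests".
        # It deliberately ignores dependencies: restarting the pod would
        # not fix a database outage.
        return 200

    def ready(self) -> int:
        # Readiness says "send me traffic". Fail it during shutdown or
        # whenever a required dependency is unavailable.
        if self.shutting_down:
            return 503
        for check in self.dependencies.values():
            try:
                healthy = check()
            except Exception:
                healthy = False
            if not healthy:
                return 503
        return 200
```

With this split, a database outage takes pods out of the load balancer (readiness fails) without triggering a restart loop (liveness keeps passing).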

Infrastructure as Code with Terraform

Infrastructure as Code means your cloud resources (VPCs, databases, load balancers, Kubernetes clusters) are defined in code, version-controlled, peer-reviewed, and applied automatically.

Terraform workflow: `terraform plan` (shows what will change — review before applying), `terraform apply` (make the changes), `terraform destroy` (tear down). State is stored remotely (S3 + DynamoDB lock) — never commit terraform.tfstate.

Why IaC matters: Reproducible environments (staging matches production), drift detection (catch manual changes), disaster recovery (rebuild from scratch in minutes), audit trail (who changed what and when).

Never make manual cloud console changes

Every manual cloud console change creates drift between your IaC code and reality. The next terraform apply may silently revert your manual change. All infrastructure changes go through IaC and code review.
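Drift is simply a difference between what the code declares and what exists in the cloud. The toy sketch below compares two flat attribute maps to make that idea concrete. It is not how Terraform is implemented, and `detect_drift` is a hypothetical name; in practice, `terraform plan` performs this comparison against real provider state.

```python
_MISSING = "<missing>"


def detect_drift(desired: dict, actual: dict) -> dict:
    """Return {attribute: (desired value, actual value)} for every mismatch.

    Attributes present on only one side are reported with a "<missing>"
    placeholder on the other side.
    """
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key, _MISSING)
        have = actual.get(key, _MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift
```

For example, if the code declares `multi_az = True` but someone switched it off in the console, the result contains `{"multi_az": (True, False)}`, which is exactly the change the next apply would undo.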

terraform/main.tf

# Terraform: production RDS setup (ECS resources not shown)
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/app/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"  # State lock: prevents two engineers from applying at once
    encrypt        = true
  }
}

# Fetch the password from Secrets Manager; never hardcode it in Terraform.
# The secret name is a placeholder. The value is still written to state, so keep state encrypted.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/postgres/password"
}

# RDS PostgreSQL
resource "aws_db_instance" "postgres" {
  identifier        = "prod-postgres"
  engine            = "postgres"
  engine_version    = "16.1"
  instance_class    = "db.r6g.xlarge"
  allocated_storage = 100
  storage_type      = "gp3"

  db_name  = "production"
  username = "postgres"
  password = data.aws_secretsmanager_secret_version.db_password.secret_string

  multi_az                  = true   # High availability: standby in another AZ (a single AZ is a single point of failure)
  deletion_protection       = true   # Prevent accidental destroy
  backup_retention_period   = 30     # 30-day automated backups
  skip_final_snapshot       = false  # Take a snapshot before destroy
  final_snapshot_identifier = "prod-postgres-final"  # Required when skip_final_snapshot is false

  # The security group and subnet group are assumed to be defined elsewhere in the configuration
  vpc_security_group_ids = [aws_security_group.rds.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name

  tags = {
    Environment = "production"
    Terraform   = "true"
  }
}

How this might come up in interviews

Deployment questions test whether you understand reliability engineering, not just Docker commands.

Common questions:

Walk me through your CI/CD pipeline design.
How would you deploy a breaking API change with zero downtime?
What is the difference between a readiness probe and a liveness probe?
How do you handle database migrations in a Kubernetes deployment?

Strong answers include:

Knows canary vs blue-green vs rolling and when to use each
Understands expand-contract for breaking changes
Mentions resource limits for Kubernetes
Uses IaC — never manually edits cloud console

Red flags:

Deploys by SSH-ing into servers
Does not know what a readiness probe does
Uses "latest" as the Docker tag
Cannot explain how to deploy a breaking API change safely
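The expand-contract pattern mentioned above can be sketched as a serializer that changes behavior by phase. This is an illustration under assumptions: the `name` and `display_name` fields, the `Phase` enum, and the 30-day observation window are all hypothetical.

```python
from enum import Enum


class Phase(Enum):
    EXPAND = "expand"      # add the new field, keep serving the old one
    MIGRATE = "migrate"    # clients move to the new field; both are still served
    CONTRACT = "contract"  # old field removed once nobody reads it


def serialize_user(user: dict, phase: Phase) -> dict:
    """Build the API response for a user according to the rollout phase."""
    body = {"id": user["id"], "display_name": user["display_name"]}
    if phase is not Phase.CONTRACT:
        # Deprecated alias kept so old clients keep working while old and
        # new server versions run side by side.
        body["name"] = user["display_name"]
    return body


def safe_to_contract(old_field_reads_last_30_days: int) -> bool:
    """Only remove the old field after usage of it has dropped to zero."""
    return old_field_reads_last_30_days == 0
```

Each phase ships as its own backward-compatible deployment, so a rolling update never exposes a client to a response it cannot parse.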

Quick check · Deployment & Infrastructure: From Code to Production

You need to deploy a breaking API change (removes a field that old clients depend on). What is the safest strategy?

Key takeaways

CI/CD pipelines are mandatory: manual deployments create fear and risk
Tag Docker images with the commit SHA: "latest" is mutable, so it cannot identify a known-good build to roll back to
Zero-downtime deployments: rolling update (default), blue-green (instant rollback), canary (gradual rollout)
Kubernetes resource requests and limits are mandatory: without them, one pod can starve the others on its node
Infrastructure as Code with Terraform: never make manual cloud console changes
readinessProbe gates traffic; livenessProbe triggers a restart. Configure both in production

From the books

Accelerate: The Science of Lean Software and DevOps — Forsgren, Humble, Kim (2018)

Chapter 2: Measuring Performance

The four DORA metrics (deployment frequency, lead time for changes, time to restore service, change failure rate) measure software delivery performance, which the book's research links to organizational performance. High performers deploy multiple times per day and restore service far faster than low performers.
