Linux cgroups: Resource Governance for Every Container
Control groups (cgroups) are the kernel mechanism that limits, accounts for, and isolates the resource usage of process groups. Every Kubernetes resource request and limit is ultimately a cgroup rule. When a pod is OOM-killed, a container is CPU-throttled, or a node goes NotReady from memory pressure, the cgroup hierarchy is where it starts and where you diagnose it.
Why this matters at your level
Know what resource requests and limits do in a pod spec. Know that requests affect scheduling and limits enforce cgroup rules. Always set both memory requests AND limits for production workloads.
Understand QoS classes (Guaranteed, Burstable, BestEffort) and their eviction ordering. Read cgroup v2 files to debug resource throttling. Know how CPU throttling differs from OOM killing.
Design resource quota strategies for namespaces. Tune OOM score adjustments for critical system pods. Debug CPU throttle rate (container_cpu_cfs_throttled_seconds_total) for high-p99 latency incidents.
Define cluster-wide resource policies using LimitRange and ResourceQuota. Evaluate cgroup v2 migration impact on existing workloads. Design multi-tenant resource isolation strategies preventing noisy-neighbor problems.
Pod deployed without memory limits -- running in BestEffort QoS class
Pod memory reaches 14GB on a 16GB node -- only 2GB free for OS and kubelet
Kernel OOM killer fires -- kills the leaking container process
Kubelet restarts container (restartPolicy:Always) -- leak resumes immediately
OOM killer targets other containers with higher oom_score_adj -- collateral evictions begin
47 pods killed across multiple deployments -- node goes NotReady
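When a cascade like this is underway, every kill leaves a line in the kernel log. The line below is a representative sample of the kernel's OOM output (PID and process name are invented); the same grep works against `dmesg` or `journalctl -k` on the affected node:

```shell
# Representative kernel OOM log line (sample text, not captured from a real
# node) and the grep you would run against dmesg/journalctl -k to find kills:
oom_line='Memory cgroup out of memory: Killed process 31337 (java) total-vm:14680064kB, anon-rss:13631488kB'
echo "$oom_line" | grep -o 'Killed process [0-9]* ([a-z]*)'
# -> Killed process 31337 (java)
```

Counting these lines per minute tells you whether you are looking at a single OOM kill or a cascade.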
The question this raises
Why did the 46 innocent pods die instead of just the leaking pod — and how does the cgroup hierarchy determine who the kernel kills first?
A Java service shows intermittent 500ms p99 latency spikes. Logs show no errors, GC pauses are normal, and CPU utilization metrics report only 40%. What cgroup metric should you check first?
What Problem cgroups Solve
The noisy neighbor problem
Without resource limits, any process on a shared host can consume all available CPU or memory. One poorly written service can starve every other workload. cgroups solve this by creating a hierarchy of resource budgets -- each group of processes gets a quota, and the kernel enforces it at the hardware scheduler and memory allocator level, not at the application level.
memory controller
Use for: Enforces the memory limit (memory.max in cgroup v2, memory.limit_in_bytes in v1). Triggers the OOM killer when the group exceeds its limit. Tracks working set, RSS, cache, and swap usage per cgroup.
cpu controller (CFS)
Use for: Enforces CPU limits via CFS bandwidth control (a hard throttle: cpu.max in cgroup v2, cpu.cfs_quota_us in v1) and CPU requests via relative scheduling weight (cpu.weight in v2, cpu.shares in v1). Limits create time-slice quotas within 100ms periods.
blkio controller (the io controller in cgroup v2)
Use for: Controls block I/O bandwidth and IOPS per cgroup. Prevents one container from saturating disk bandwidth and starving I/O-sensitive workloads.
pids controller
Use for: Limits the number of processes in a cgroup. Prevents fork bombs from spawning unlimited child processes and exhausting the kernel process table.
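To see which of these controllers a cgroup v2 host actually exposes, read cgroup.controllers at the hierarchy root. The value below is a typical one, inlined so the snippet runs anywhere -- the exact set depends on kernel version and config. Note the v2 names: blkio appears as io, the pid controller as pids.

```shell
# Sketch: controllers enabled at the cgroup v2 root. On a real node you would
# run `cat /sys/fs/cgroup/cgroup.controllers`; a typical result is inlined
# here (assumed, kernel-dependent) so the snippet is self-contained.
controllers='cpuset cpu io memory hugetlb pids'
for c in $controllers; do
  echo "controller: $c"
done
```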
The System View: cgroup Hierarchy on a Kubernetes Node
/sys/fs/cgroup/                               (cgroup v2 unified hierarchy)
+-- kubepods/
|   +-- Guaranteed/
|   |   +-- pod<uid-api-server>/
|   |       +-- memory.max: 1073741824        (1Gi -- req==limit)
|   |       +-- cpu.max: "100000 100000"      (1 CPU / 100ms period)
|   |       +-- oom_score_adj: -997           (last to be killed)
|   |
|   +-- Burstable/
|   |   +-- pod<uid-web-app>/
|   |       +-- memory.max: 2147483648        (2Gi limit, 512Mi request)
|   |       +-- cpu.max: "200000 100000"      (2 CPU limit)
|   |       +-- cpu.weight: 51                (0.5 CPU request -> shares)
|   |       +-- oom_score_adj: ~500           (killed after BestEffort)
|   |
|   +-- BestEffort/
|       +-- pod<uid-batch>/
|           +-- memory.max: max               (NO LIMIT -- can use all node memory!)
|           +-- cpu.weight: 2                 (minimum scheduling weight)
|           +-- oom_score_adj: 1000           (FIRST to be killed by OOM killer)
|
+-- system.slice/                             (kubelet, containerd, OS services)
+-- user.slice/
The kubelet creates this cgroup hierarchy at pod creation time. QoS class determines the subtree, oom_score_adj determines kill order. BestEffort pods with no memory.max are the most dangerous -- they can consume all available node memory.
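The numeric values in the Burstable branch above are plain unit conversions, which you can reproduce without node access (names and variables below are illustrative):

```shell
# Reproduce the cgroup v2 values the kubelet writes for a pod with a 2Gi
# memory limit and a 2-CPU limit (pure arithmetic -- runnable anywhere).
mem_limit_bytes=$(( 2 * 1024 * 1024 * 1024 ))    # 2Gi limit
echo "memory.max: $mem_limit_bytes"              # -> 2147483648

cpu_limit_millicores=2000                        # 2 CPU limit
period_us=100000                                 # default CFS period
quota_us=$(( cpu_limit_millicores * period_us / 1000 ))
echo "cpu.max: \"$quota_us $period_us\""         # -> "200000 100000"
```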
How CPU limits actually work
A pod has CPU limit set to 500m (0.5 CPU)
Misconception: “The pod can use at most 50% of one CPU core at any given moment. If the pod needs more, it waits its turn.”
Reality: “The kernel creates cpu.cfs_quota_us=50000 (50ms) per cpu.cfs_period_us=100000 (100ms). The pod gets 50ms of CPU time per 100ms window. If the pod uses all 50ms in the first 10ms of a window, it is PAUSED for the remaining 90ms -- even if the node CPU is completely idle. This is throttling, and it causes p99 latency spikes invisible in average utilization metrics.”
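That worst case is simple arithmetic: a multi-threaded pod can drain the whole quota early in the period and then sit paused. A sketch with the same 500m numbers (the thread count is assumed for illustration):

```shell
# 500m limit => 50ms of CPU time per 100ms period (cpu.max in cgroup v2,
# cfs_quota_us/cfs_period_us in v1). Assume 5 runnable threads.
quota_ms=50
period_ms=100
threads=5
burn_wall_ms=$(( quota_ms / threads ))    # 5 threads drain 50ms of quota in 10ms of wall time
pause_ms=$(( period_ms - burn_wall_ms ))  # throttled for the rest of the period
echo "runnable for ${burn_wall_ms}ms, then paused for ${pause_ms}ms"
# -> runnable for 10ms, then paused for 90ms
```

Any request that arrives during those 90ms waits for the next period boundary, which is exactly the shape of a p99 spike.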
A pod is consuming 40% average CPU with a 500m limit
Misconception: “The pod is well within its CPU budget. There should be no performance issues.”
Reality: “40% average can mean 100% in some 100ms periods and 0% in others. In the 100% periods, the cgroup quota runs out and the pod is throttled. Prometheus's minute-level CPU metrics average these out -- the throttle spikes look invisible. Check container_cpu_cfs_throttled_seconds_total instead.”
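The throttle ratio can also be computed straight from the cgroup's cpu.stat file with awk; the sample values below are invented to illustrate a 65.4% throttle rate:

```shell
# Sample cpu.stat contents (invented values); on a node the same awk would
# read the real file: awk '...' /sys/fs/cgroup/.../cpu.stat
cpu_stat='nr_periods 10000
nr_throttled 6540
throttled_usec 65400000'
echo "$cpu_stat" | awk '/^nr_periods/{p=$2} /^nr_throttled/{t=$2} END{printf "throttle rate: %.1f%%\n", 100*t/p}'
# -> throttle rate: 65.4%
```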
How It Actually Works: Inside the Kubelet cgroup Setup
From pod spec to kernel enforcement
1. Kubelet classifies the pod into a QoS class -- Guaranteed if all containers have equal req==limit for both CPU and memory; Burstable if any container has requests or limits but not equal; BestEffort if no container has any requests or limits. This determines the cgroup subtree.
2. Kubelet creates cgroup directories -- for each pod, creates /sys/fs/cgroup/kubepods/<QoS>/<podUID>/ and per-container subdirectories. Writes limit values: memory.max for memory limit, cpu.max for CPU limit (in cgroup v2 format "quota period"), cpu.weight for CPU request.
3. Container runtime launches process into cgroup -- containerd starts the container process and adds it to the correct cgroup by writing the PID to /sys/fs/cgroup/.../cgroup.procs. From this point the kernel tracks all resource usage and enforces limits for that PID and all its children.
4. Kernel enforces CPU limits in 100ms windows -- the CPU scheduler checks cpu.max every 100ms period. If the cgroup has exhausted its quota in that period, ALL processes in the cgroup are paused until the next period starts. This creates the throttle spikes that cause latency issues.
5. Kernel OOM killer uses oom_score_adj for victim selection -- when system memory is exhausted, the OOM killer selects a process to kill using oom_score + oom_score_adj. The kubelet sets oom_score_adj based on QoS: BestEffort=1000 (kill first), Burstable=2-999 (scaled by the pod's memory request relative to node capacity), Guaranteed=-997 (kill last).
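For the Burstable band in step 5, the kubelet's calculation is (to a close approximation) 1000 minus the pod's memory request as a fraction of node capacity, clamped to [2, 999] -- so larger requests buy more OOM protection. A sketch with illustrative numbers:

```shell
# Sketch of the kubelet's Burstable oom_score_adj formula (approximation of
# the kubelet QoS policy; request and node size are illustrative).
req_bytes=$(( 512 * 1024 * 1024 ))        # 512Mi memory request
cap_bytes=$(( 16 * 1024 * 1024 * 1024 ))  # 16Gi node capacity
adj=$(( 1000 - (1000 * req_bytes) / cap_bytes ))
if [ "$adj" -lt 2 ];   then adj=2;   fi   # clamp into the Burstable band
if [ "$adj" -gt 999 ]; then adj=999; fi
echo "oom_score_adj: $adj"
# -> oom_score_adj: 969
```

A pod requesting only 3% of node memory keeps a score near 1000 and dies almost as readily as BestEffort; requesting half the node pushes it toward Guaranteed-like safety.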
# 1. Check if CPU throttling is causing latency (run on the node)
$ cat /sys/fs/cgroup/kubepods/burstable/pod<uid>/cpu.stat
usage_usec 5400000000
user_usec 4800000000
system_usec 600000000
nr_periods 10000
nr_throttled 6540          # 65.4% of periods were throttled!
throttled_usec 65400000    # 65.4 seconds of throttle time

# nr_throttled is the number of 100ms periods where the cgroup ran out of
# CPU quota: 6540 out of 10000 = 65.4% throttle rate.

# 2. Check throttle rate via Prometheus (no node access needed). This ratio
# is the fraction of CPU periods the container was paused; >25% is the
# production threshold to investigate.
# rate(container_cpu_cfs_throttled_seconds_total[5m]) /
#   rate(container_cpu_cfs_periods_total[5m])
# Value > 0.25 (25%) explains p99 latency spikes

# 3. Fix: raise the CPU limit (or remove it for latency-sensitive services)
$ kubectl set resources deployment/my-api --limits=cpu=2000m
# OR remove the CPU limit entirely (keep the request, delete the limits.cpu
# field) -- the pod can then burst to full node CPU
What Breaks in Production: Blast Radius
Blast radius when cgroup configuration is wrong
- No memory limit (BestEffort pod) — Pod can consume all node memory -- OOM kills collateral pods in Burstable and BestEffort classes, potentially taking node NotReady
- CPU limit == CPU request on latency-sensitive service — Pod is Guaranteed QoS (good for eviction) but CPU throttled whenever it briefly exceeds limit -- invisible p99 latency spikes
- Memory limit below JVM heap or Go runtime overhead — Continuous OOM kills, pod in CrashLoopBackOff, never stabilizes -- must profile actual memory footprint first
- No LimitRange in namespace — Developers forget limits, BestEffort pods proliferate, first noisy workload takes down the node
- cgroup v2 migration without runtime update — Resource accounting breaks, kubelet reports wrong memory usage, eviction fires incorrectly -- test on a canary node group first
- OOM kill of kubelet itself — Node goes NotReady, all pods rescheduled simultaneously onto remaining nodes -- thundering herd cascades the problem
No LimitRange -- BestEffort pods proliferate
# Namespace with no LimitRange -- all pods default to BestEffort
# A developer deploys this and it quietly consumes all node memory:
apiVersion: apps/v1
kind: Deployment
metadata:
name: data-processor
spec:
template:
spec:
containers:
- name: processor
image: my-processor:v1
# No resources block -- QoS: BestEffort
# Can use 100% of node memory
# First to be evicted under any memory pressure

# LimitRange enforces default limits for every pod in namespace
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: production
spec:
limits:
- type: Container
default: # applied when pod has no limits
memory: "512Mi"
cpu: "500m"
defaultRequest: # applied when pod has no requests
memory: "256Mi"
cpu: "100m"
max: # hard ceiling -- admission rejects pods above this
memory: "4Gi"
cpu: "4000m"LimitRange is a namespace admission object that injects default resource constraints into pods that do not specify them. Without it, any pod without an explicit resources block is BestEffort and can consume unlimited memory. With LimitRange, every pod has at least a default limit, preventing the OOM cascade scenario while still allowing teams to override with explicit resource specs.
Decision Guide: How to Set Resource Requests and Limits
Cost and Complexity: Resource Configuration Trade-offs
| Configuration | QoS Class | Eviction Risk | Latency Risk | Recommendation |
|---|---|---|---|---|
| No limits (BestEffort) | BestEffort | First evicted under any pressure | OOM kills cause restart latency | Never in production -- use LimitRange |
| Req < Limit (Burstable) | Burstable | Evicted after BestEffort | CPU throttle possible at limit | Default for most services |
| Req == Limit (Guaranteed) | Guaranteed | Last evicted | CPU throttle if limit too low | Required for databases and stateful services |
| CPU limit removed, memory Guaranteed | Mixed (Burstable) | Low memory eviction risk | No CPU throttle -- can burst freely | Best for latency-sensitive if cluster has capacity |
Exam Answer vs. Production Reality
Memory limits vs CPU limits: what actually happens
📖 What the exam expects
Memory limit creates cgroup memory.limit_in_bytes. When exceeded, the kernel OOM killer fires and kills a process in the cgroup. CPU limit creates cpu.cfs_quota_us -- the container is PAUSED (throttled) when it exhausts its quota in a 100ms window.
How this might come up in interviews
Common in Kubernetes performance debugging interviews and CKA/CKS certification. Shows up as "why is my pod at 40% CPU but still slow?" or "explain QoS classes and their eviction order".
Common questions:
- What is the difference between a CPU request and a CPU limit in Kubernetes?
- What are the three QoS classes and which pods are evicted first?
- Why might a pod at 40% CPU utilization have high p99 latency?
- How do cgroups enforce the memory limit in a Kubernetes pod?
- What happens when a pod exceeds its memory limit vs its CPU limit?
Strong answer: Knowing container_cpu_cfs_throttled_seconds_total as the throttle metric. Understanding memory request==limit == Guaranteed QoS. Mentioning LimitRange to enforce default limits. Knowing BestEffort pods are killed first under memory pressure.
Red flags: Thinking CPU limits prevent the pod from using more than X% of the node globally (they throttle within a 100ms window). Believing OOM kills happen at exactly the memory limit (OOM killer fires when system memory is exhausted, prioritized by cgroup hierarchy and oom_score_adj).