Linux cgroups: Resource Governance for Every Container
Control groups (cgroups) are the kernel mechanism that limits, accounts for, and isolates the resource usage of process groups. Every Kubernetes resource request and limit is ultimately a cgroup rule. When a pod is OOM-killed, a container is CPU-throttled, or a node goes NotReady from memory pressure, the cgroup hierarchy is where it starts and where you diagnose it.
Why this matters at your level
Know what resource requests and limits do in a pod spec. Know that requests affect scheduling and limits enforce cgroup rules. Always set both memory requests AND limits for production workloads.
Understand QoS classes (Guaranteed, Burstable, BestEffort) and their eviction ordering. Read cgroup v2 files to debug resource throttling. Know how CPU throttling differs from OOM killing.
Design resource quota strategies for namespaces. Tune OOM score adjustments for critical system pods. Debug CPU throttle rate (container_cpu_cfs_throttled_seconds_total) for high-p99 latency incidents.
Define cluster-wide resource policies using LimitRange and ResourceQuota. Evaluate cgroup v2 migration impact on existing workloads. Design multi-tenant resource isolation strategies preventing noisy-neighbor problems.
Pod deployed without memory limits -- running in BestEffort QoS class
Pod memory reaches 14GB on a 16GB node -- only 2GB free for OS and kubelet
Kernel OOM killer fires -- kills the leaking container process
Kubelet restarts container (restartPolicy:Always) -- leak resumes immediately
OOM killer targets other containers with higher oom_score_adj -- collateral evictions begin
47 pods killed across multiple deployments -- node goes NotReady
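When a cascade like this is underway, every kill leaves a line in the kernel log. The line below is a representative sample of the kernel's OOM output (PID and process name are invented); the same grep works against `dmesg` or `journalctl -k` on the affected node:

```shell
# Representative kernel OOM log line (sample text, not captured from a real
# node) and the grep you would run against dmesg/journalctl -k to find kills:
oom_line='Memory cgroup out of memory: Killed process 31337 (java) total-vm:14680064kB, anon-rss:13631488kB'
echo "$oom_line" | grep -o 'Killed process [0-9]* ([a-z]*)'
# -> Killed process 31337 (java)
```

Counting these lines per minute tells you whether you are looking at a single OOM kill or a cascade.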
The question this raises
Why did the 46 innocent pods die instead of just the leaking pod — and how does the cgroup hierarchy determine who the kernel kills first?
A Java service shows intermittent 500ms p99 latency spikes. Logs show no errors, GC pauses are normal, and CPU utilization metrics report only 40%. What cgroup metric should you check first?
What Problem cgroups Solve
The noisy neighbor problem
Without resource limits, any process on a shared host can consume all available CPU or memory. One poorly written service can starve every other workload. cgroups solve this by creating a hierarchy of resource budgets -- each group of processes gets a quota, and the kernel enforces it at the hardware scheduler and memory allocator level, not at the application level.
memory controller
Use for: Enforces the memory limit (memory.max in cgroup v2, memory.limit_in_bytes in v1). Triggers the OOM killer when the group exceeds its limit. Tracks working set, RSS, cache, and swap usage per cgroup.
cpu controller (CFS)
Use for: Enforces CPU limits via CFS bandwidth control (a hard throttle: cpu.max in cgroup v2, cpu.cfs_quota_us in v1) and CPU requests via relative scheduling weight (cpu.weight in v2, cpu.shares in v1). Limits create time-slice quotas within 100ms periods.
blkio controller (the io controller in cgroup v2)
Use for: Controls block I/O bandwidth and IOPS per cgroup. Prevents one container from saturating disk bandwidth and starving I/O-sensitive workloads.
pids controller
Use for: Limits the number of processes in a cgroup. Prevents fork bombs from spawning unlimited child processes and exhausting the kernel process table.
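To see which of these controllers a cgroup v2 host actually exposes, read cgroup.controllers at the hierarchy root. The value below is a typical one, inlined so the snippet runs anywhere -- the exact set depends on kernel version and config. Note the v2 names: blkio appears as io, the pid controller as pids.

```shell
# Sketch: controllers enabled at the cgroup v2 root. On a real node you would
# run `cat /sys/fs/cgroup/cgroup.controllers`; a typical result is inlined
# here (assumed, kernel-dependent) so the snippet is self-contained.
controllers='cpuset cpu io memory hugetlb pids'
for c in $controllers; do
  echo "controller: $c"
done
```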
The System View: cgroup Hierarchy on a Kubernetes Node
/sys/fs/cgroup/                               (cgroup v2 unified hierarchy)
+-- kubepods/
|   +-- Guaranteed/
|   |   +-- pod<uid-api-server>/
|   |       +-- memory.max: 1073741824        (1Gi -- req==limit)
|   |       +-- cpu.max: "100000 100000"      (1 CPU / 100ms period)
|   |       +-- oom_score_adj: -997           (last to be killed)
|   |
|   +-- Burstable/
|   |   +-- pod<uid-web-app>/
|   |       +-- memory.max: 2147483648        (2Gi limit, 512Mi request)
|   |       +-- cpu.max: "200000 100000"      (2 CPU limit)
|   |       +-- cpu.weight: 51                (0.5 CPU request -> shares)
|   |       +-- oom_score_adj: ~500           (killed after BestEffort)
|   |
|   +-- BestEffort/
|       +-- pod<uid-batch>/
|           +-- memory.max: max               (NO LIMIT -- can use all node memory!)
|           +-- cpu.weight: 2                 (minimum scheduling weight)
|           +-- oom_score_adj: 1000           (FIRST to be killed by OOM killer)
|
+-- system.slice/                             (kubelet, containerd, OS services)
+-- user.slice/
The kubelet creates this cgroup hierarchy at pod creation time. QoS class determines the subtree, oom_score_adj determines kill order. BestEffort pods with no memory.max are the most dangerous -- they can consume all available node memory.
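The numeric values in the Burstable branch above are plain unit conversions, which you can reproduce without node access (names and variables below are illustrative):

```shell
# Reproduce the cgroup v2 values the kubelet writes for a pod with a 2Gi
# memory limit and a 2-CPU limit (pure arithmetic -- runnable anywhere).
mem_limit_bytes=$(( 2 * 1024 * 1024 * 1024 ))    # 2Gi limit
echo "memory.max: $mem_limit_bytes"              # -> 2147483648

cpu_limit_millicores=2000                        # 2 CPU limit
period_us=100000                                 # default CFS period
quota_us=$(( cpu_limit_millicores * period_us / 1000 ))
echo "cpu.max: \"$quota_us $period_us\""         # -> "200000 100000"
```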
How CPU limits actually work
A pod has CPU limit set to 500m (0.5 CPU)
Misconception: “The pod can use at most 50% of one CPU core at any given moment. If the pod needs more, it waits its turn.”
Reality: “The kernel creates cpu.cfs_quota_us=50000 (50ms) per cpu.cfs_period_us=100000 (100ms). The pod gets 50ms of CPU time per 100ms window. If the pod uses all 50ms in the first 10ms of a window, it is PAUSED for the remaining 90ms -- even if the node CPU is completely idle. This is throttling, and it causes p99 latency spikes invisible in average utilization metrics.”
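That worst case is simple arithmetic: a multi-threaded pod can drain the whole quota early in the period and then sit paused. A sketch with the same 500m numbers (the thread count is assumed for illustration):

```shell
# 500m limit => 50ms of CPU time per 100ms period (cpu.max in cgroup v2,
# cfs_quota_us/cfs_period_us in v1). Assume 5 runnable threads.
quota_ms=50
period_ms=100
threads=5
burn_wall_ms=$(( quota_ms / threads ))    # 5 threads drain 50ms of quota in 10ms of wall time
pause_ms=$(( period_ms - burn_wall_ms ))  # throttled for the rest of the period
echo "runnable for ${burn_wall_ms}ms, then paused for ${pause_ms}ms"
# -> runnable for 10ms, then paused for 90ms
```

Any request that arrives during those 90ms waits for the next period boundary, which is exactly the shape of a p99 spike.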
A pod is consuming 40% average CPU with a 500m limit
Misconception: “The pod is well within its CPU budget. There should be no performance issues.”
Reality: “40% average can mean 100% in some 100ms periods and 0% in others. In the 100% periods, the cgroup quota runs out and the pod is throttled. Prometheus's minute-level CPU metrics average these out -- the throttle spikes look invisible. Check container_cpu_cfs_throttled_seconds_total instead.”
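The throttle ratio can also be computed straight from the cgroup's cpu.stat file with awk; the sample values below are invented to illustrate a 65.4% throttle rate:

```shell
# Sample cpu.stat contents (invented values); on a node the same awk would
# read the real file: awk '...' /sys/fs/cgroup/.../cpu.stat
cpu_stat='nr_periods 10000
nr_throttled 6540
throttled_usec 65400000'
echo "$cpu_stat" | awk '/^nr_periods/{p=$2} /^nr_throttled/{t=$2} END{printf "throttle rate: %.1f%%\n", 100*t/p}'
# -> throttle rate: 65.4%
```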
How It Actually Works: Inside the Kubelet cgroup Setup
From pod spec to kernel enforcement
1. Kubelet classifies the pod into a QoS class -- Guaranteed if all containers have equal req==limit for both CPU and memory; Burstable if any container has requests or limits but not equal; BestEffort if no container has any requests or limits. This determines the cgroup subtree.
2. Kubelet creates cgroup directories -- for each pod, creates /sys/fs/cgroup/kubepods/<QoS>/<podUID>/ and per-container subdirectories. Writes limit values: memory.max for memory limit, cpu.max for CPU limit (in cgroup v2 format "quota period"), cpu.weight for CPU request.
3. Container runtime launches process into cgroup -- containerd starts the container process and adds it to the correct cgroup by writing the PID to /sys/fs/cgroup/.../cgroup.procs. From this point the kernel tracks all resource usage and enforces limits for that PID and all its children.
4. Kernel enforces CPU limits in 100ms windows -- the CPU scheduler checks cpu.max every 100ms period. If the cgroup has exhausted its quota in that period, ALL processes in the cgroup are paused until the next period starts. This creates the throttle spikes that cause latency issues.
5. Kernel OOM killer uses oom_score_adj for victim selection -- when system memory is exhausted, the OOM killer selects a process to kill using oom_score + oom_score_adj. The kubelet sets oom_score_adj based on QoS: BestEffort=1000 (kill first), Burstable=2-999 (scaled by the pod's memory request relative to node capacity), Guaranteed=-997 (kill last).
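For the Burstable band in step 5, the kubelet's calculation is (to a close approximation) 1000 minus the pod's memory request as a fraction of node capacity, clamped to [2, 999] -- so larger requests buy more OOM protection. A sketch with illustrative numbers:

```shell
# Sketch of the kubelet's Burstable oom_score_adj formula (approximation of
# the kubelet QoS policy; request and node size are illustrative).
req_bytes=$(( 512 * 1024 * 1024 ))        # 512Mi memory request
cap_bytes=$(( 16 * 1024 * 1024 * 1024 ))  # 16Gi node capacity
adj=$(( 1000 - (1000 * req_bytes) / cap_bytes ))
if [ "$adj" -lt 2 ];   then adj=2;   fi   # clamp into the Burstable band
if [ "$adj" -gt 999 ]; then adj=999; fi
echo "oom_score_adj: $adj"
# -> oom_score_adj: 969
```

A pod requesting only 3% of node memory keeps a score near 1000 and dies almost as readily as BestEffort; requesting half the node pushes it toward Guaranteed-like safety.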
# 1. Check if CPU throttling is causing latency (run on the node)
$ cat /sys/fs/cgroup/kubepods/burstable/pod<uid>/cpu.stat
usage_usec 5400000000
user_usec 4800000000
system_usec 600000000
nr_periods 10000
nr_throttled 6540          # 65.4% of periods were throttled!
throttled_usec 65400000    # 65.4 seconds of throttle time

# nr_throttled is the number of 100ms periods where the cgroup ran out of
# CPU quota: 6540 out of 10000 = 65.4% throttle rate.

# 2. Check throttle rate via Prometheus (no node access needed). This ratio
# is the fraction of CPU periods the container was paused; >25% is the
# production threshold to investigate.
# rate(container_cpu_cfs_throttled_seconds_total[5m]) /
#   rate(container_cpu_cfs_periods_total[5m])
# Value > 0.25 (25%) explains p99 latency spikes

# 3. Fix: raise the CPU limit (or remove it for latency-sensitive services)
$ kubectl set resources deployment/my-api --limits=cpu=2000m
# OR remove the CPU limit entirely (keep the request, delete the limits.cpu
# field) -- the pod can then burst to full node CPU
What Breaks in Production: Blast Radius
Blast radius when cgroup configuration is wrong
- No memory limit (BestEffort pod) — Pod can consume all node memory -- OOM kills collateral pods in Burstable and BestEffort classes, potentially taking node NotReady
- CPU limit == CPU request on latency-sensitive service — Pod is Guaranteed QoS (good for eviction) but CPU throttled whenever it briefly exceeds limit -- invisible p99 latency spikes
- Memory limit below JVM heap or Go runtime overhead — Continuous OOM kills, pod in CrashLoopBackOff, never stabilizes -- must profile actual memory footprint first
- No LimitRange in namespace — Developers forget limits, BestEffort pods proliferate, first noisy workload takes down the node
- cgroup v2 migration without runtime update — Resource accounting breaks, kubelet reports wrong memory usage, eviction fires incorrectly -- test on a canary node group first
- OOM kill of kubelet itself — Node goes NotReady, all pods rescheduled simultaneously onto remaining nodes -- thundering herd cascades the problem
No LimitRange -- BestEffort pods proliferate
# Namespace with no LimitRange -- all pods default to BestEffort
# A developer deploys this and it quietly consumes all node memory:
apiVersion: apps/v1
kind: Deployment
metadata:
name: data-processor
spec:
template:
spec:
containers:
- name: processor
image: my-processor:v1
# No resources block -- QoS: BestEffort
# Can use 100% of node memory
# First to be evicted under any memory pressure

# LimitRange enforces default limits for every pod in namespace
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: production
spec:
limits:
- type: Container
default: # applied when pod has no limits
memory: "512Mi"
cpu: "500m"
defaultRequest: # applied when pod has no requests
memory: "256Mi"
cpu: "100m"
max: # hard ceiling -- admission rejects pods above this
memory: "4Gi"
cpu: "4000m"LimitRange is a namespace admission object that injects default resource constraints into pods that do not specify them. Without it, any pod without an explicit resources block is BestEffort and can consume unlimited memory. With LimitRange, every pod has at least a default limit, preventing the OOM cascade scenario while still allowing teams to override with explicit resource specs.
Decision Guide: How to Set Resource Requests and Limits
Cost and Complexity: Resource Configuration Trade-offs
| Configuration | QoS Class | Eviction Risk | Latency Risk | Recommendation |
|---|---|---|---|---|
| No limits (BestEffort) | BestEffort | First evicted under any pressure | OOM kills cause restart latency | Never in production -- use LimitRange |
| Req < Limit (Burstable) | Burstable | Evicted after BestEffort | CPU throttle possible at limit | Default for most services |
| Req == Limit (Guaranteed) | Guaranteed | Last evicted | CPU throttle if limit too low | Required for databases and stateful services |
| CPU limit removed, memory Guaranteed | Mixed (Burstable) | Low memory eviction risk | No CPU throttle -- can burst freely | Best for latency-sensitive if cluster has capacity |
Exam Answer vs. Production Reality
Memory limits vs CPU limits: what actually happens
📖 What the exam expects
Memory limit creates cgroup memory.limit_in_bytes. When exceeded, the kernel OOM killer fires and kills a process in the cgroup. CPU limit creates cpu.cfs_quota_us -- the container is PAUSED (throttled) when it exhausts its quota in a 100ms window.
How this might come up in interviews
Common in Kubernetes performance debugging interviews and CKA/CKS certification. Shows up as "why is my pod at 40% CPU but still slow?" or "explain QoS classes and their eviction order".
Common questions:
- What is the difference between a CPU request and a CPU limit in Kubernetes?
- What are the three QoS classes and which pods are evicted first?
- Why might a pod at 40% CPU utilization have high p99 latency?
- How do cgroups enforce the memory limit in a Kubernetes pod?
- What happens when a pod exceeds its memory limit vs its CPU limit?
Strong answer: Knowing container_cpu_cfs_throttled_seconds_total as the throttle metric. Understanding memory request==limit == Guaranteed QoS. Mentioning LimitRange to enforce default limits. Knowing BestEffort pods are killed first under memory pressure.
Red flags: Thinking CPU limits prevent the pod from using more than X% of the node globally (they throttle within a 100ms window). Believing OOM kills happen at exactly the memory limit (OOM killer fires when system memory is exhausted, prioritized by cgroup hierarchy and oom_score_adj).