The Open Container Initiative (OCI) defines the Image Spec and Runtime Spec that make containers portable across runtimes. containerd and CRI-O implement the Kubernetes CRI. runc is the low-level OCI runtime that performs the actual namespace and cgroup setup. Understanding this stack is essential for debugging container startup failures, evaluating security vulnerabilities, and planning runtime migrations.
Know that Kubernetes has relied on CRI runtimes such as containerd since 1.24, when dockershim was removed. Know that the CRI interface separates Kubernetes from the runtime. Understand that OCI images are pulled and extracted by containerd.
Understand the full stack: kubelet -> CRI (gRPC) -> containerd -> containerd-shim -> runc -> namespaces+cgroups. Use crictl to debug container issues below the kubectl layer. Read containerd logs for image pull and startup failures.
Evaluate runtime security: gVisor intercepts syscalls in userspace, Kata Containers runs a full VM kernel per pod. Understand OCI Runtime Spec and runc implementation. Know when to use RuntimeClass and sandbox runtimes for tenant isolation.
Design runtime strategy for multi-tenant clusters. Evaluate containerd vs CRI-O security properties. Plan containerd version upgrades without node disruption. Implement RuntimeClass policies enforcing sandbox runtimes for untrusted workloads.
Attacker controls a malicious image in a registry accessible to the cluster
Kubernetes schedules a pod using the malicious image -- containerd pulls and processes the OCI manifest
Path traversal in containerd resolves to a host path outside the container rootfs during snapshot setup
Container process reads /etc/shadow, kubelet service account tokens, or other sensitive host files
Attacker exfiltrates secrets -- no host access required, no privilege escalation needed
The question this raises
Which layer in the container runtime stack -- image processing, snapshot management, mount setup, or runc execution -- was bypassed, and what does this tell you about defence-in-depth for container security?
kubectl describe pod shows "ContainerCreating" for 10 minutes and events show "failed to create containerd task". Where do you look next?
Lesson outline
Why the two-layer design exists
Kubernetes needed to support multiple container runtimes without coupling its code to any single one. The CRI (gRPC API) creates a stable interface. High-level runtimes implement CRI and handle image management. Low-level OCI runtimes actually execute the container. This separation allows switching runtimes (containerd -> CRI-O, runc -> gVisor) without changing a single line of Kubernetes code.
kubelet (Kubernetes layer)
Use for: Makes CRI gRPC calls to the high-level runtime. ImageService (pull, list, remove images) and RuntimeService (create/start/stop/delete containers and sandboxes). Knows nothing about how containers are actually run.
containerd (CRI layer)
Use for: Implements CRI. Pulls images from registries, manages image layers via snapshotter, creates container config, and delegates execution to containerd-shim and then runc. Handles the lifecycle between Kubernetes requests and actual container processes.
containerd-shim (persistence layer)
Use for: A per-container process that stays alive even if containerd restarts. Keeps stdin/stdout/stderr open and reports exit codes. Enables daemonless container execution -- existing containers survive containerd crashes.
runc (OCI runtime layer)
Use for: Reads the OCI Runtime Spec (config.json). Calls clone() to create namespaces, writes the container PID into cgroup files, performs the mounts from config.json and pivot_roots into the prepared rootfs (the overlayFS merge itself is set up beforehand by containerd's snapshotter), drops capabilities, applies the seccomp filter, then exec()s the container entrypoint. Exits after the container starts -- the shim takes over.
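To make the Runtime Spec concrete, here is a heavily trimmed, illustrative config.json (a real file generated by containerd carries many more fields -- mounts, hooks, masked paths; the values below are assumptions for the sketch):

```shell
# Write a trimmed-down, illustrative OCI Runtime Spec. A real config.json
# generated by containerd contains many more fields (mounts, hooks, sysctls).
cat > config.json <<'EOF'
{
  "ociVersion": "1.0.2",
  "process": {
    "args": ["/bin/sh"],
    "capabilities": { "bounding": ["CAP_NET_BIND_SERVICE"] }
  },
  "root": { "path": "rootfs" },
  "linux": {
    "namespaces": [
      { "type": "pid" }, { "type": "network" }, { "type": "mount" }
    ],
    "cgroupsPath": "/kubepods/pod-example"
  }
}
EOF

# These are the fields runc consumes at `runc create` time; inspect them:
python3 -c "
import json
c = json.load(open('config.json'))
print('entrypoint:', c['process']['args'])
print('namespaces:', [n['type'] for n in c['linux']['namespaces']])
"
```

Every field maps to a kernel action: `linux.namespaces` drives the clone() flags, `process.capabilities` the capability drop, `linux.cgroupsPath` where the PID is written.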
kubectl apply -f pod.yaml
|
v
[kube-apiserver] -- stores pod spec in etcd
|
v (kubelet watches for pods on this node)
[kubelet] -- calls CRI gRPC: RunPodSandbox() then CreateContainer()
|
v (gRPC over unix socket /var/run/containerd/containerd.sock)
[containerd] -- implements CRI
|-- Pulls image layers from registry (content-addressable, deduplicated)
|-- Unpacks layers via snapshotter to /var/lib/containerd/
|-- Generates OCI bundle: rootfs/ + config.json (Runtime Spec)
|
v (exec containerd-shim binary)
[containerd-shim] -- per-container process (survives containerd restarts)
|-- calls runc create, then runc start
|
v (runc reads config.json and executes kernel calls)
[runc] -- does the actual kernel work then exits
|-- clone(CLONE_NEWPID|CLONE_NEWNET|CLONE_NEWNS|...)
|-- pivot_root to container rootfs
|-- write PID to cgroup controller files
|-- apply seccomp filter, drop capabilities
|-- exec() into container entrypoint (becomes PID 1 in container)
v
[Container Process] PID 1 in new namespaces, limited by cgroups

runc exits after the container starts -- the containerd-shim holds the container alive. This is why existing containers survive a containerd crash.
The Docker removal misconception
Kubernetes 1.24 removed dockershim
“Kubernetes no longer supports Docker containers. Images built with Docker will not work. All container orchestration changes.”
“Only the runtime path changed: kubelet now talks to containerd directly via CRI instead of going through the dockershim adapter. OCI images (built by docker build, Buildah, Kaniko, or any other builder) work identically because containerd implements the same OCI Image Spec. Application containers are completely unaffected.”
containerd crashes on a node
“All containers on the node stop immediately. The node goes NotReady and all pods are rescheduled.”
“Existing containers keep running via their containerd-shim processes. New containers cannot be started. The node may show degraded status but existing workloads continue. Only after a timeout does the kubelet mark the node NotReady -- giving time to restart containerd before rescheduling begins.”
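The survival property can be sketched with a toy analogy -- plain process reparenting, not containerd itself: a child whose parent exits keeps running, just as a container keeps running when containerd (which is not its direct parent -- the shim is) crashes.

```shell
# Toy analogy for shim-based survival: start a long-running child from a
# short-lived parent shell, let the parent exit, and observe that the
# child is reparented rather than killed.
child=$(bash -c 'sleep 30 >/dev/null 2>&1 & echo $!')  # parent bash exits at once
sleep 1
kill -0 "$child" && echo "child $child still running after its parent exited"
kill "$child"   # clean up
```

In the real stack the shim additionally holds the container's stdio file descriptors and collects its exit code, which is why a shim per container (rather than one daemon owning all containers) is the design.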
Container start-up: step by step
1. Pod scheduled to node -- kubelet receives pod assignment from API server watch and begins the container creation sequence by calling CRI ImageService.PullImage() if the image is not cached, then RuntimeService.RunPodSandbox() to create the pause container (which holds namespaces), then RuntimeService.CreateContainer().
2. Image pulled via containerd -- containerd checks its local image store (content-addressable by SHA256 digest). Missing layers are pulled from the registry. Layers shared between images are deduplicated in /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/ -- a node running 10 pods using ubuntu:22.04 as base stores the base layer once.
3. OCI bundle creation -- containerd creates an OCI bundle: a directory with rootfs/ (the merged overlayFS view of all image layers) and config.json (the OCI Runtime Spec generated from the pod spec, including namespace config, capabilities, seccomp profile, cgroup paths, mounts, and the entrypoint).
4. runc executes the OCI bundle -- containerd-shim calls runc create (applies namespaces, cgroups, mounts) then runc start (exec()s the entrypoint). runc reads every field of config.json: process.capabilities for capability dropping, linux.seccomp for the syscall filter, linux.namespaces for which namespace types to create, linux.cgroupsPath for where to write cgroup membership.
5. shim reports container running -- once the container process is running, runc exits. The containerd-shim keeps the container alive via its file descriptors. The shim reports the running state back through containerd to kubelet to the API server. kubectl get pod shows Running.
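The deduplication in step 2 follows directly from content addressing. A toy model in a few lines of shell (not containerd's actual store layout -- the directory name below is invented for the sketch): blobs are keyed by the SHA-256 of their content, so two images sharing a base layer map to the same key and the bytes exist once.

```shell
# Toy content-addressable store: blobs are named by their SHA-256 digest.
mkdir -p blob-store
printf 'ubuntu 22.04 base layer bytes' > layer-from-image-a
printf 'ubuntu 22.04 base layer bytes' > layer-from-image-b  # identical content

digest_a=$(sha256sum layer-from-image-a | cut -d' ' -f1)
digest_b=$(sha256sum layer-from-image-b | cut -d' ' -f1)
[ "$digest_a" = "$digest_b" ] && echo "same digest -> same blob"

cp layer-from-image-a "blob-store/$digest_a"
cp layer-from-image-b "blob-store/$digest_b"  # overwrites the same path
ls blob-store | wc -l   # one blob on disk, referenced by both images
```

This is also why image references by digest (`image@sha256:...`) are immutable while tags are not: the digest is the content.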
```shell
# 1. Find which node the stuck pod is on
$ kubectl get pod <pod-name> -o wide
NAME      READY   STATUS              NODE
my-pod    0/1     ContainerCreating   node-3

# 2. SSH to node-3, use crictl (the CRI debug tool, not docker).
#    crictl speaks directly to containerd via gRPC -- use it when
#    kubectl fails or containers have not started.
$ crictl ps -a          # Show all containers including stopped/exited
$ crictl pods           # Show pod sandboxes (pause containers)
$ crictl inspect <id>   # Full OCI spec + state for a container

# 3. Check containerd logs for the error. These three error patterns
#    cover 90% of ContainerCreating failures -- search for them first:
$ journalctl -u containerd --since "10 minutes ago" --no-pager
# "failed to pull image"             -> registry auth, DNS resolution, rate limits
# "failed to create containerd task" -> runc error, OCI spec issue
# "error unpacking image"            -> /var/lib/containerd disk full

# 4. Check disk space for containerd image storage. When /var/lib/containerd
#    fills to 100%, all image pulls fail -- a common cause of
#    ContainerCreating on nodes with many unique images.
$ df -h /var/lib/containerd
Filesystem   Size   Used   Avail   Use%
/dev/sda     100G   97G    0G      100%   <- THIS is the problem

# 5. Clean unused images to free space (caution: check what is safe)
$ crictl rmi --prune    # Remove images with no running containers
```
Blast radius when the container runtime layer fails
Untrusted workload using default runc instead of sandbox runtime
# Untrusted tenant workload using default runc runtime.
# runc uses the host kernel directly -- a kernel exploit in this
# pod can escape to the node.
apiVersion: v1
kind: Pod
metadata:
name: tenant-job
spec:
# No runtimeClassName -- defaults to runc
containers:
- name: job
image: untrusted-tenant/job:latest

# Untrusted tenant workload uses gVisor (runsc) sandbox runtime.
# gVisor intercepts syscalls in userspace -- kernel exploits are
# contained within the gVisor sandbox, not reaching the host kernel.
apiVersion: v1
kind: Pod
metadata:
name: tenant-job
spec:
runtimeClassName: gvisor # Selects the RuntimeClass named 'gvisor'
containers:
- name: job
image: untrusted-tenant/job:latest
---
# RuntimeClass must be pre-created (cluster admin task):
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: gvisor
handler: runsc # containerd handler name for the gVisor runtime

For multi-tenant clusters running untrusted code, gVisor or Kata Containers provide a stronger isolation boundary than runc. gVisor intercepts system calls in userspace, preventing kernel exploits from reaching the host kernel. Kata Containers runs a full VM kernel per pod for the strongest isolation. RuntimeClass tells containerd which handler to use -- the same pod spec field works transparently.
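For a RuntimeClass with handler runsc to work, each node's containerd must have a matching handler registered. A sketch of that node-side wiring, assuming containerd 1.x with the CRI plugin and gVisor's runsc shim already installed (verify the section names against your containerd version before applying):

```shell
# Register the runsc handler with containerd's CRI plugin. The table name
# 'runtimes.runsc' is what the RuntimeClass field 'handler: runsc' refers to.
cat <<'EOF' | sudo tee -a /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
EOF
sudo systemctl restart containerd   # pick up the new runtime handler
```

If the handler name and the containerd runtime table name do not match, pods using the RuntimeClass fail at sandbox creation with an "unknown runtime handler" error from the CRI layer.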
| Runtime | Isolation Level | Performance Overhead | Syscall Compat | Best For |
|---|---|---|---|---|
| runc (default) | Namespace + cgroup | ~1% overhead | 100% -- direct kernel | Trusted workloads, all environments |
| gVisor (runsc) | Userspace kernel intercept | 10-30% CPU overhead | ~85% -- not all syscalls | Untrusted code, sandboxed jobs |
| Kata Containers | Full VM kernel per pod | 5-15% startup, ~5% runtime | 100% -- own kernel | Strictest multi-tenant isolation |
| CRI-O + runc | Same as runc | ~1% overhead | 100% | Minimal-footprint runtime (OpenShift default) |
The two-layer runtime stack
📖 What the exam expects
High-level runtimes (containerd, CRI-O) implement the Kubernetes CRI gRPC API and handle image management. Low-level OCI runtimes (runc, kata-runtime, gVisor runsc) implement the OCI Runtime Spec and actually execute the container. The separation allows switching runtimes without changing Kubernetes.
Common in Kubernetes administrator and CKA exam contexts. "What changed when dockershim was removed?" and "how do you debug a ContainerCreating pod?" are the most frequent entry points.
Common questions:
Strong answer: Mentioning containerd-shim keeps containers running if containerd crashes. Knowing /var/lib/containerd can fill up. Discussing gVisor or Kata for multi-tenant isolation. Understanding OCI images are content-addressable layers that deduplicate across images.
Red flags: Saying "Kubernetes runs Docker containers" without qualification. Not knowing what to do when kubectl logs returns nothing. Thinking all pods use the same runtime without knowing RuntimeClass exists.
Related concepts
Suggested next
Container Networking Fundamentals: How Packets Move Between Pods