The Open Container Initiative (OCI) defines the Image Spec and Runtime Spec that make containers portable across runtimes. containerd and CRI-O implement the Kubernetes CRI. runc is the low-level OCI runtime that performs the actual namespace and cgroup setup. Understanding this stack is essential for debugging container startup failures, evaluating security vulnerabilities, and planning runtime migrations.
Know that Kubernetes has relied on CRI runtimes such as containerd since 1.24, when dockershim was removed. Know that the CRI interface separates Kubernetes from the runtime. Understand that OCI images are pulled and extracted by containerd.
Understand the full stack: kubelet -> CRI (gRPC) -> containerd -> containerd-shim -> runc -> namespaces+cgroups. Use crictl to debug container issues below the kubectl layer. Read containerd logs for image pull and startup failures.
Evaluate runtime security: gVisor intercepts syscalls in userspace, Kata Containers runs a full VM kernel per pod. Understand OCI Runtime Spec and runc implementation. Know when to use RuntimeClass and sandbox runtimes for tenant isolation.
Design runtime strategy for multi-tenant clusters. Evaluate containerd vs CRI-O security properties. Plan containerd version upgrades without node disruption. Implement RuntimeClass policies enforcing sandbox runtimes for untrusted workloads.
Attacker controls a malicious image in a registry accessible to the cluster
Kubernetes schedules a pod using the malicious image -- containerd pulls and processes the OCI manifest
Path traversal in containerd resolves to a host path outside the container rootfs during snapshot setup
Container process reads /etc/shadow, kubelet service account tokens, or other sensitive host files
Attacker exfiltrates secrets -- no host access required, no privilege escalation needed
The question this raises
Which layer in the container runtime stack -- image processing, snapshot management, mount setup, or runc execution -- was bypassed, and what does this tell you about defence-in-depth for container security?
kubectl describe pod shows "ContainerCreating" for 10 minutes and events show "failed to create containerd task". Where do you look next?
Lesson outline
Why the two-layer design exists
Kubernetes needed to support multiple container runtimes without coupling its code to any single one. The CRI (gRPC API) creates a stable interface. High-level runtimes implement CRI and handle image management. Low-level OCI runtimes actually execute the container. This separation allows switching runtimes (containerd -> CRI-O, runc -> gVisor) without changing a single line of Kubernetes code.
kubelet (Kubernetes layer)
Use for: Makes CRI gRPC calls to the high-level runtime. ImageService (pull, list, remove images) and RuntimeService (create/start/stop/delete containers and sandboxes). Knows nothing about how containers are actually run.
containerd (CRI layer)
Use for: Implements CRI. Pulls images from registries, manages image layers via snapshotter, creates container config, and delegates execution to containerd-shim and then runc. Handles the lifecycle between Kubernetes requests and actual container processes.
containerd-shim (persistence layer)
Use for: A per-container process that stays alive even if containerd restarts. Keeps stdin/stdout/stderr open and reports exit codes. Enables daemonless container execution -- existing containers survive containerd crashes.
runc (OCI runtime layer)
Use for: Reads the OCI Runtime Spec (config.json). Calls clone() to create namespaces, writes the container PID into cgroup files, performs the mounts from config.json and pivot_roots into the prepared rootfs (the overlayFS merge itself is set up beforehand by containerd's snapshotter), drops capabilities, applies the seccomp filter, then exec()s the container entrypoint. Exits after the container starts -- the shim takes over.
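To make the Runtime Spec concrete, here is a heavily trimmed, illustrative config.json (a real file generated by containerd carries many more fields -- mounts, hooks, masked paths; the values below are assumptions for the sketch):

```shell
# Write a trimmed-down, illustrative OCI Runtime Spec. A real config.json
# generated by containerd contains many more fields (mounts, hooks, sysctls).
cat > config.json <<'EOF'
{
  "ociVersion": "1.0.2",
  "process": {
    "args": ["/bin/sh"],
    "capabilities": { "bounding": ["CAP_NET_BIND_SERVICE"] }
  },
  "root": { "path": "rootfs" },
  "linux": {
    "namespaces": [
      { "type": "pid" }, { "type": "network" }, { "type": "mount" }
    ],
    "cgroupsPath": "/kubepods/pod-example"
  }
}
EOF

# These are the fields runc consumes at `runc create` time; inspect them:
python3 -c "
import json
c = json.load(open('config.json'))
print('entrypoint:', c['process']['args'])
print('namespaces:', [n['type'] for n in c['linux']['namespaces']])
"
```

Every field maps to a kernel action: `linux.namespaces` drives the clone() flags, `process.capabilities` the capability drop, `linux.cgroupsPath` where the PID is written.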
kubectl apply -f pod.yaml
|
v
[kube-apiserver] -- stores pod spec in etcd
|
v (kubelet watches for pods on this node)
[kubelet] -- calls CRI gRPC: RunPodSandbox() then CreateContainer()
|
v (gRPC over unix socket /var/run/containerd/containerd.sock)
[containerd] -- implements CRI
|-- Pulls image layers from registry (content-addressable, deduplicated)
|-- Unpacks layers via snapshotter to /var/lib/containerd/
|-- Generates OCI bundle: rootfs/ + config.json (Runtime Spec)
|
v (exec containerd-shim binary)
[containerd-shim] -- per-container process (survives containerd restarts)
|-- calls runc create, then runc start
|
v (runc reads config.json and executes kernel calls)
[runc] -- does the actual kernel work then exits
|-- clone(CLONE_NEWPID|CLONE_NEWNET|CLONE_NEWNS|...)
|-- pivot_root to container rootfs
|-- write PID to cgroup controller files
|-- apply seccomp filter, drop capabilities
|-- exec() into container entrypoint (becomes PID 1 in container)
v
[Container Process] PID 1 in new namespaces, limited by cgroups

runc exits after the container starts -- the containerd-shim holds the container alive. This is why existing containers survive a containerd crash.
The Docker removal misconception
Kubernetes 1.24 removed dockershim
“Kubernetes no longer supports Docker containers. Images built with Docker will not work. All container orchestration changes.”
“Only the runtime path changed: kubelet now talks to containerd directly via CRI instead of going through the dockershim adapter. OCI images (built by docker build, Buildah, Kaniko, or any other builder) work identically because containerd implements the same OCI Image Spec. Application containers are completely unaffected.”
containerd crashes on a node
“All containers on the node stop immediately. The node goes NotReady and all pods are rescheduled.”
“Existing containers keep running via their containerd-shim processes. New containers cannot be started. The node may show degraded status but existing workloads continue. Only after a timeout does the kubelet mark the node NotReady -- giving time to restart containerd before rescheduling begins.”
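The survival property can be sketched with a toy analogy -- plain process reparenting, not containerd itself: a child whose parent exits keeps running, just as a container keeps running when containerd (which is not its direct parent -- the shim is) crashes.

```shell
# Toy analogy for shim-based survival: start a long-running child from a
# short-lived parent shell, let the parent exit, and observe that the
# child is reparented rather than killed.
child=$(bash -c 'sleep 30 >/dev/null 2>&1 & echo $!')  # parent bash exits at once
sleep 1
kill -0 "$child" && echo "child $child still running after its parent exited"
kill "$child"   # clean up
```

In the real stack the shim additionally holds the container's stdio file descriptors and collects its exit code, which is why a shim per container (rather than one daemon owning all containers) is the design.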
Container start-up: step by step
1. Pod scheduled to node -- kubelet receives pod assignment from API server watch and begins the container creation sequence by calling CRI ImageService.PullImage() if the image is not cached, then RuntimeService.RunPodSandbox() to create the pause container (which holds namespaces), then RuntimeService.CreateContainer().
2. Image pulled via containerd -- containerd checks its local image store (content-addressable by SHA256 digest). Missing layers are pulled from the registry. Layers shared between images are deduplicated in /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/ -- a node running 10 pods using ubuntu:22.04 as base stores the base layer once.
3. OCI bundle creation -- containerd creates an OCI bundle: a directory with rootfs/ (the merged overlayFS view of all image layers) and config.json (the OCI Runtime Spec generated from the pod spec, including namespace config, capabilities, seccomp profile, cgroup paths, mounts, and the entrypoint).
4. runc executes the OCI bundle -- containerd-shim calls runc create (applies namespaces, cgroups, mounts) then runc start (exec()s the entrypoint). runc reads every field of config.json: process.capabilities for capability dropping, linux.seccomp for the syscall filter, linux.namespaces for which namespace types to create, linux.cgroupsPath for where to write cgroup membership.
5. shim reports container running -- once the container process is running, runc exits. The containerd-shim keeps the container alive via its file descriptors. The shim reports the running state back through containerd to kubelet to the API server. kubectl get pod shows Running.
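The deduplication in step 2 follows directly from content addressing. A toy model in a few lines of shell (not containerd's actual store layout -- the directory name below is invented for the sketch): blobs are keyed by the SHA-256 of their content, so two images sharing a base layer map to the same key and the bytes exist once.

```shell
# Toy content-addressable store: blobs are named by their SHA-256 digest.
mkdir -p blob-store
printf 'ubuntu 22.04 base layer bytes' > layer-from-image-a
printf 'ubuntu 22.04 base layer bytes' > layer-from-image-b  # identical content

digest_a=$(sha256sum layer-from-image-a | cut -d' ' -f1)
digest_b=$(sha256sum layer-from-image-b | cut -d' ' -f1)
[ "$digest_a" = "$digest_b" ] && echo "same digest -> same blob"

cp layer-from-image-a "blob-store/$digest_a"
cp layer-from-image-b "blob-store/$digest_b"  # overwrites the same path
ls blob-store | wc -l   # one blob on disk, referenced by both images
```

This is also why image references by digest (`image@sha256:...`) are immutable while tags are not: the digest is the content.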
```shell
# 1. Find which node the stuck pod is on
$ kubectl get pod <pod-name> -o wide
NAME      READY   STATUS              NODE
my-pod    0/1     ContainerCreating   node-3

# 2. SSH to node-3, use crictl (the CRI debug tool, not docker).
#    crictl speaks directly to containerd via gRPC -- use it when
#    kubectl fails or containers have not started.
$ crictl ps -a          # Show all containers including stopped/exited
$ crictl pods           # Show pod sandboxes (pause containers)
$ crictl inspect <id>   # Full OCI spec + state for a container

# 3. Check containerd logs for the error. These three error patterns
#    cover 90% of ContainerCreating failures -- search for them first:
$ journalctl -u containerd --since "10 minutes ago" --no-pager
# "failed to pull image"             -> registry auth, DNS resolution, rate limits
# "failed to create containerd task" -> runc error, OCI spec issue
# "error unpacking image"            -> /var/lib/containerd disk full

# 4. Check disk space for containerd image storage. When /var/lib/containerd
#    fills to 100%, all image pulls fail -- a common cause of
#    ContainerCreating on nodes with many unique images.
$ df -h /var/lib/containerd
Filesystem   Size   Used   Avail   Use%
/dev/sda     100G   97G    0G      100%   <- THIS is the problem

# 5. Clean unused images to free space (caution: check what is safe)
$ crictl rmi --prune    # Remove images with no running containers
```
Blast radius when the container runtime layer fails
Untrusted workload using default runc instead of sandbox runtime
# Untrusted tenant workload using default runc runtime.
# runc uses the host kernel directly -- a kernel exploit in this
# pod can escape to the node.
apiVersion: v1
kind: Pod
metadata:
name: tenant-job
spec:
# No runtimeClassName -- defaults to runc
containers:
- name: job
image: untrusted-tenant/job:latest

# Untrusted tenant workload uses gVisor (runsc) sandbox runtime.
# gVisor intercepts syscalls in userspace -- kernel exploits are
# contained within the gVisor sandbox, not reaching the host kernel.
apiVersion: v1
kind: Pod
metadata:
name: tenant-job
spec:
runtimeClassName: gvisor # Selects the RuntimeClass named 'gvisor'
containers:
- name: job
image: untrusted-tenant/job:latest
---
# RuntimeClass must be pre-created (cluster admin task):
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: gvisor
handler: runsc # containerd handler name for the gVisor runtime

For multi-tenant clusters running untrusted code, gVisor or Kata Containers provide a stronger isolation boundary than runc. gVisor intercepts system calls in userspace, preventing kernel exploits from reaching the host kernel. Kata Containers runs a full VM kernel per pod for the strongest isolation. RuntimeClass tells containerd which handler to use -- the same pod spec field works transparently.
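For a RuntimeClass with handler runsc to work, each node's containerd must have a matching handler registered. A sketch of that node-side wiring, assuming containerd 1.x with the CRI plugin and gVisor's runsc shim already installed (verify the section names against your containerd version before applying):

```shell
# Register the runsc handler with containerd's CRI plugin. The table name
# 'runtimes.runsc' is what the RuntimeClass field 'handler: runsc' refers to.
cat <<'EOF' | sudo tee -a /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
EOF
sudo systemctl restart containerd   # pick up the new runtime handler
```

If the handler name and the containerd runtime table name do not match, pods using the RuntimeClass fail at sandbox creation with an "unknown runtime handler" error from the CRI layer.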
| Runtime | Isolation Level | Performance Overhead | Syscall Compat | Best For |
|---|---|---|---|---|
| runc (default) | Namespace + cgroup | ~1% overhead | 100% -- direct kernel | Trusted workloads, all environments |
| gVisor (runsc) | Userspace kernel intercept | 10-30% CPU overhead | ~85% -- not all syscalls | Untrusted code, sandboxed jobs |
| Kata Containers | Full VM kernel per pod | 5-15% startup, ~5% runtime | 100% -- own kernel | Strictest multi-tenant isolation |
| CRI-O + runc | Same as runc | ~1% overhead | 100% | Minimal-footprint runtime (OpenShift default) |
The two-layer runtime stack
📖 What the exam expects
High-level runtimes (containerd, CRI-O) implement the Kubernetes CRI gRPC API and handle image management. Low-level OCI runtimes (runc, kata-runtime, gVisor runsc) implement the OCI Runtime Spec and actually execute the container. The separation allows switching runtimes without changing Kubernetes.
Common in Kubernetes administrator and CKA exam contexts. "What changed when dockershim was removed?" and "how do you debug a ContainerCreating pod?" are the most frequent entry points.
Common questions:
Strong answer: Mentioning containerd-shim keeps containers running if containerd crashes. Knowing /var/lib/containerd can fill up. Discussing gVisor or Kata for multi-tenant isolation. Understanding OCI images are content-addressable layers that deduplicate across images.
Red flags: Saying "Kubernetes runs Docker containers" without qualification. Not knowing what to do when kubectl logs returns nothing. Thinking all pods use the same runtime without knowing RuntimeClass exists.
Related concepts
Suggested next
Container Networking Fundamentals: How Packets Move Between Pods