Linux Namespaces: The Kernel Primitive Behind Every Container
Linux namespaces wrap a global kernel resource so that processes inside a namespace see their own isolated instance of it. Every container runtime — Docker, containerd, CRI-O — creates namespaces at container start. Understanding namespaces is the difference between cargo-culting "containers are secure" and knowing exactly what isolation you have and where it ends.
Why this matters at your level
- Know that containers are not VMs. Know the 6 namespace types by name and what each isolates. Be able to explain why a container process cannot see sibling containers via ps.
- Inspect namespaces via /proc/<pid>/ns/. Trace what clone() flags the container runtime passes. Understand user namespaces and why they matter for rootless containers.
- Understand namespace escape attack vectors (CVE-2019-5736). Know the difference between namespace isolation and seccomp/AppArmor/SELinux mandatory access control. Design container security policies that layer all three.
- Evaluate whether new workloads require user namespaces. Audit the blast radius of host-mounted volumes and hostPID:true pods. Own the threat model for the container runtime layer across the platform.
The CVE-2019-5736 runc escape shows where namespace isolation ends:
1. Attacker controls a malicious image accessible to the cluster.
2. Container runs as UID 0 and reads /proc/self/exe — a symlink to the host runc binary.
3. A race loop opens /proc/self/exe for writing while runc is executing the container.
4. The host runc binary is overwritten with the attacker's payload via the open file descriptor.
5. The next container launch executes the attacker's payload as root on the node — full host compromise.
The question this raises
If containers use namespace isolation, how can a process inside one escape to the host — and which of the 6 namespace types failed to prevent it?
Your security team reports a container can run ps aux and see processes from other containers and the host. Which Pod spec field is most likely causing this?
Lesson outline
What Problem Namespaces Solve
The core problem: global kernel resources
Before namespaces, all processes on a Linux host shared one global view of every resource — PID 1 was always init, /proc showed every process, every program could see every network interface. Running isolated workloads on one host required full VMs. Namespaces give each group of processes its own isolated view of a specific kernel resource without spawning a separate kernel.
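In fact, even a plain host shell already belongs to one namespace of each type — the initial namespaces created at boot. A read-only sketch, safe to run on any Linux host:

```shell
# Every process lists its namespace memberships under /proc/<pid>/ns/.
# On an ordinary host shell these are the initial (global) namespaces.
ls -l /proc/self/ns/

# Each entry is a symlink such as pid:[4026531836]; the bracketed inode
# number is the namespace's identity.
readlink /proc/self/ns/pid
```

Containers do not add isolation to an un-isolated world; they move processes from the initial namespaces into freshly created ones.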
PID Namespace
Isolates process ID numbers. Container processes see their own PID 1 as the init process. Host processes are invisible. Signal delivery is confined within the namespace.
Network Namespace
Isolates network interfaces, IP routing tables, firewall rules, and socket state. Each container gets its own eth0 with its own IP. A host veth pair bridges into the container.
Mount Namespace
Isolates the filesystem mount table. The container sees its own rootfs. The host /etc, /var, and /proc are not visible unless explicitly bind-mounted into the container.
User Namespace
Maps container UIDs/GIDs to different host UIDs. UID 0 inside maps to an unprivileged host UID — enabling rootless container runtimes where container root does not equal host root.
UTS Namespace
Isolates the hostname and domain name. The container can set its own hostname (e.g., the pod name) without affecting the host or other containers.
IPC Namespace
Isolates System V IPC objects (shared memory, semaphores, message queues) and POSIX message queues. Prevents cross-container shared-memory attacks.
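The user-namespace mapping in particular is directly inspectable at /proc/&lt;pid&gt;/uid_map. A read-only sketch: on a host shell (or in a container that does not use user namespaces) the file shows the identity mapping over all UIDs, while a remapped rootless container would show something like `0 100000 65536` instead:

```shell
# Show the UID mapping of the current process's user namespace.
# Format: <uid inside ns> <uid outside ns> <range length>
# "0 0 4294967295" is the identity mapping -- no remapping in effect.
cat /proc/self/uid_map
```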
The System View: What Isolation Actually Looks Like
HOST KERNEL (one shared kernel for all containers)
+------------------------------------------------------------------+
| Global Namespace (init_ns) |
| PID: 1(systemd) 2(kthreadd) 847(containerd) 901(kubelet)... |
| NET: eth0(192.168.1.10) lo docker0(172.17.0.1) |
| |
| Container A Namespaces Container B Namespaces |
| +-----------------------------+ +-----------------------------+ |
| | PID_NS: 1(nginx) 7(sh) | | PID_NS: 1(redis) 4(sh) | |
| | NET_NS: eth0(10.244.0.5) | | NET_NS: eth0(10.244.0.6) | |
| | MNT_NS: /etc/nginx rootfs | | MNT_NS: /etc/redis rootfs | |
| | UTS_NS: hostname=pod-a | | UTS_NS: hostname=pod-b | |
| | USER_NS: uid0->uid65534 | | USER_NS: uid0->uid65535 | |
| +-----------------------------+ +-----------------------------+ |
| veth0 <---> cni0 bridge <---> veth1 |
+------------------------------------------------------------------+
NOTE: hostPID:true removes the PID_NS boundary entirely.
Container process sees ALL host processes — kubelet (PID 901) included.

Each container has its own namespace for each resource type. The host kernel is shared — namespaces only restrict what each container can see, not kernel execution itself.
Container isolation: misconception vs reality
Scenario: A container is started with docker run or as a Kubernetes pod.
Misconception: "The container is isolated like a VM — it has its own kernel, its own process table, and is completely separated from the host and other containers."
Reality: "The container is a process group running on the HOST kernel with a restricted view of specific kernel resources. One kernel, many namespace views. Anything the kernel exposes that is not namespace-aware is shared by all containers."
Scenario: A pod spec includes hostPID: true.
Misconception: "The pod gets some extra visibility but is still isolated. It can see more processes but is still safely sandboxed."
Reality: "hostPID: true completely removes PID namespace isolation. The container can see, signal, and ptrace every process on the node, including kubelet and containerd. This is effectively host access for process operations."
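One way to make "one kernel, many views" concrete: the kernel release is not namespace-aware, so every process on a node reports the same value whether it runs in a container or not. A read-only sketch:

```shell
# uname -r and /proc/sys/kernel/osrelease report the host kernel release
# for every process -- containerized or not -- because neither value is
# namespaced. Run this inside any container on the node and compare.
uname -r
cat /proc/sys/kernel/osrelease
```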
How It Actually Works: Namespace Creation Internals
From container start to isolated process
1. Runtime calls clone() with namespace flags — containerd calls clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC) to create a new process that is the first member of fresh namespaces for each requested type.
2. New namespaces start empty (or forked from parent) — the new PID namespace starts empty, the cloned process becomes PID 1 inside it. The new NET namespace starts with only loopback. MNT namespace is a copy of the parent's mount table, then the runtime unmounts host mounts and mounts the container rootfs.
3. veth pair bridges network namespace to host — the runtime creates a virtual ethernet pair: one end (eth0) placed inside the container's NET namespace, the other (vethXXX) stays on the host and bridges to cni0. This is how pod traffic flows to other pods and the outside world.
4. /proc/<pid>/ns/ files pin the namespace alive — while the process runs, its namespaces are pinned by files under /proc/<pid>/ns/<type>. Even if all processes in a namespace exit, the namespace stays alive if another process holds an open fd to its ns file. This is how Kubernetes pause containers work — holding namespaces open while app containers restart.
5. setns() enters an existing namespace — any process can enter an existing namespace by opening /proc/<pid>/ns/<type> and calling setns(fd). This is how kubectl exec works — it enters the container's namespaces to provide a shell inside the container's view of the world.
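The clone() flags above have a userspace counterpart in unshare(1), which lets you watch namespace creation without any container runtime. A sketch — it assumes either root or a kernel that permits unprivileged user namespaces:

```shell
# Record the current UTS namespace identity, then create a fresh UTS
# namespace and read the identity from inside it. --user piggybacks a new
# user namespace so this can work without root on permissive kernels.
before=$(readlink /proc/self/ns/uts)
after=$(unshare --user --uts readlink /proc/self/ns/uts)
echo "host UTS namespace: $before"
echo "new  UTS namespace: $after"   # different inode = different namespace
```

A hostname set inside the new UTS namespace would vanish with it and never touch the host — the same property containers rely on.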
# Get the PID of a running container on a node (requires node access)
$ crictl inspect <container-id> | grep '"pid"'
    "pid": 12345

# List namespace files -- the inode number in brackets uniquely identifies
# the namespace; two processes sharing an inode are in the same namespace
$ ls -la /proc/12345/ns/
lrwxrwxrwx /proc/12345/ns/ipc -> ipc:[4026532456]
lrwxrwxrwx /proc/12345/ns/mnt -> mnt:[4026532457]
lrwxrwxrwx /proc/12345/ns/net -> net:[4026532458]
lrwxrwxrwx /proc/12345/ns/pid -> pid:[4026532459]
lrwxrwxrwx /proc/12345/ns/uts -> uts:[4026532460]
lrwxrwxrwx /proc/12345/ns/user -> user:[4026531837]
# The user inode matches the host init process, so user namespaces are
# NOT active: UID 0 inside the container = UID 0 on the host for file
# operations and kernel calls.

# Enter the container's network namespace from the host -- nsenter uses
# the same setns() mechanism as kubectl exec
$ nsenter --target 12345 --net -- ip addr
1: lo: <LOOPBACK,UP>
2: eth0@if45: <BROADCAST,UP> inet 10.244.0.5/24

# Check whether two containers share a PID namespace (dangerous!)
$ stat --format="%i" /proc/12345/ns/pid /proc/67890/ns/pid
# Same inode number = shared PID namespace -- they see each other's processes
What Breaks in Production: Blast Radius
Blast radius when namespace isolation is removed or bypassed
- hostPID:true — Container can ptrace, kill, or read /proc/<pid>/mem of every process on the node including kubelet and containerd
- hostNetwork:true — Container bypasses network namespace entirely — can sniff all unencrypted node traffic and bypass all NetworkPolicy rules
- hostIPC:true — Container can read shared memory segments from other containers and host processes — IPC-based credential theft
- privileged:true — Disables seccomp and AppArmor, grants 40+ Linux capabilities including CAP_SYS_ADMIN, mounts /dev — treat as full host access
- Volume mount of /proc or /sys — Gives container write access to host kernel state — can modify kernel parameters, trigger crashes, or read sensitive host data
- User namespace not enabled — UID 0 inside container equals UID 0 on host for most file operations and kernel calls — running as root in container is running as root on host
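A first-pass audit for the flags above can be as crude as grepping rendered manifests. The manifest below is a hypothetical example written to /tmp purely for demonstration; in practice you would feed in real cluster output (e.g. from kubectl get pods -A -o yaml):

```shell
# Hypothetical pod manifest for demonstration only.
cat > /tmp/audit-demo.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: suspicious-pod
spec:
  hostPID: true
  containers:
  - name: app
    image: busybox
    securityContext:
      privileged: true
EOF

# Flag the host-namespace and privileged settings called out above.
grep -nE 'hostPID: *true|hostNetwork: *true|hostIPC: *true|privileged: *true' \
  /tmp/audit-demo.yaml
```

This is a sketch, not a policy engine — an admission controller (e.g. Pod Security admission) is the production answer; the grep just shows how little it takes to surface the dangerous fields.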
Debugging pod with full host namespace access

# DO NOT use in production -- grants host-level access
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod
spec:
  hostPID: true        # sees ALL host processes
  hostNetwork: true    # bypasses all NetworkPolicy
  hostIPC: true        # reads host shared memory
  containers:
  - name: debug
    image: busybox
    securityContext:
      privileged: true # disables seccomp and AppArmor, grants CAP_SYS_ADMIN

# Safe debugging pod -- isolated to its own namespaces
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod
spec:
  # No hostPID, hostNetwork, or hostIPC
  containers:
  - name: debug
    image: busybox
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      readOnlyRootFilesystem: true

The wrong version combines three host namespace flags with privileged mode -- this gives the pod visibility into the host kernel equivalent to a root shell. The correct version drops all capabilities and refuses privilege escalation: even if the container image is compromised, it cannot reach the host kernel in any meaningful way.
Decision Guide: Which Namespace Permissions Are Safe?
Cost and Complexity: Namespace Permission Trade-offs
| Permission | Legitimate Use Case | Attack Surface Added | Risk Level |
|---|---|---|---|
| hostPID:true | Node-level profiling (perf, strace) | ptrace/kill ANY process including kubelet | Critical -- treat as host access |
| hostNetwork:true | CNI plugins, ingress controllers needing host ports | Sniff all unencrypted traffic, bypass NetworkPolicy | High -- limit to infra DaemonSets |
| privileged:true | Kernel module loading, device drivers | Disables all MAC (AppArmor/SELinux/seccomp), grants /dev access | Critical -- avoid entirely |
| User namespace enabled | Rootless container runtimes | Minimal -- UID 0 maps to unprivileged host UID | Low -- this is the target default |
| runAsUser:0, no user NS | Legacy applications expecting root | UID 0 inside = UID 0 on host for file operations | High -- enforce runAsNonRoot in policy |
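The privileged-versus-dropped-capabilities rows can be checked from inside any running container by reading the effective capability mask. A read-only sketch: a privileged container shows a full mask (e.g. 000001ffffffffff on recent kernels), while drop: ["ALL"] with no additions shows all zeros:

```shell
# Effective capabilities of the current process as a hex bitmask.
# Full mask  -> privileged-container territory
# All zeros  -> capabilities fully dropped
grep CapEff /proc/self/status
```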
Exam Answer vs. Production Reality
What namespaces actually isolate
📖 What the exam expects
Linux provides 6 namespace types that container runtimes rely on: PID (process trees), Network (interfaces, routing), Mount (filesystem view), UTS (hostname), IPC (shared memory), User (UID/GID mappings); newer kernels add cgroup and time namespaces. Created via clone() with CLONE_NEW* flags. Visible at /proc/<pid>/ns/.
How this might come up in interviews
Asked in Kubernetes security interviews, CKS exam, and senior platform engineering roles as "what is the difference between a container and a VM" or "what does privileged:true actually do".
Common questions:
- What is the difference between a Linux namespace and a cgroup?
- What does hostPID:true in a Pod spec actually enable?
- How does a rootless container runtime use user namespaces?
- Why is running a container as root dangerous even with namespace isolation?
- What is a namespace escape and how do you prevent it at the platform level?
Strong answer: Mentioning /proc/<pid>/ns/ and the ability to inspect or enter namespaces. Knowing the nsenter command. Understanding that user namespace UID 0 maps to unprivileged host UID. Discussing CVE-2019-5736 as a namespace escape vector.
Red flags: Saying "containers are isolated from the host" without qualification. Believing privileged:true only adds capabilities without understanding it disables seccomp and AppArmor. Not knowing what hostPID:true enables.