Copy Fail: From Pod to Host.
Two weeks ago, we disclosed Copy Fail, a new and exceptionally dangerous Linux local-privilege escalation vulnerability.
Copy Fail exploits a kernel memory corruption flaw without injecting code into a running kernel, which makes it small and unusually portable. Copy Fail gives attackers a repeatable, controlled 4-byte write into the Linux page cache backing any readable file; in other words, it allows attackers to rewrite the cached contents of files on a Linux filesystem.
To help operators determine their susceptibility to Copy Fail, we published a proof-of-concept exploit and a model attack path. Our model attack targets the su binary present on most Linux systems. Because su is setuid root, an attacker who can rewrite it and then execute it can escalate to root. Instead of having it ask for and check a root password, the rewritten su skips the paperwork and drops the caller straight into a root shell.
Our proof-of-concept led some to believe that rewriting setuid binaries like su was the extent of the attack. Not so! The capability that Copy Fail and related page cache writing exploits extend to attackers is powerful and versatile. As an example, let’s walk through how to use it to break out of a namespaced container.
To understand this new exploit pattern, you have to understand a little bit about what’s happening under the hood in Copy Fail.
Copy Fail works by confusing the kernel code that handles IPSec ESP Extended Sequence Numbers (authencesn). This code is exposed to unprivileged users via AF_ALG sockets, which are userland’s interface to Linux’s kernel cryptography subsystem.
Specifically, Copy Fail sets the authencesn code up to think it’s looking at disposable scratch memory when it’s really handling a mutable reference to the page cache. It tells the kernel’s cryptography code to decrypt a ciphertext blob, using bytes supplied by a zero-copy transfer from a pipe via splice(2).
Because the wire format for IPSec ESNs isn’t the implicit format the crypto code operates on, the authencesn code shuffles sequence numbers around. But the code isn’t handling a disposable buffer from a packet; Copy Fail has tricked it into operating on a reference to a cached file.
Cross-container kernel attacks usually corrupt kernel memory: race windows, UAFs, version-bound payloads. These primitives are powerful, as they can allow code execution at the kernel level. But they’re fragile. Copy Fail is deterministic. It’s a more reliable primitive for cross-pod compromise or runtime poisoning, without relying on kernel code execution.
There are two primary attack scenarios:
Scenario 1: cross-container poisoning. From a compromised pod, or from a freshly launched attacker pod (only create pods rights required), an attacker can potentially backdoor co-located pods that access the same vulnerable lower-layer file through the same underlying address_space. Image references can differ; only a layer hash needs to match. The compromise lives only in the kernel page cache, so on-disk bytes are unchanged and it is invisible to agent-less disk scanners.
Scenario 2: container escape. From inside an unprivileged container, or from a compromised DaemonSet with host-filesystem mounts, an attacker can get a root shell on the host.
Why the Page Cache Crosses Container Boundaries
The page cache is shared across containers.
No matter what namespace you’re in, every struct file the kernel handles carries an f_mapping pointer, which usually comes from the underlying inode’s i_mapping. That means that any two file descriptors sharing an f_mapping share the same cache data.
The kernel’s representation of contiguous pages of memory is called a “folio”. For ordinary buffered I/O on regular files, a write through one fd updates the cached folios. Subsequent reads, on every related fd, see the updated data (subject to normal concurrency and ordering rules). Copy Fail mutates the same folios via the AF_ALG/splice() path described in Part 1, bypassing the regular write accounting. The visibility property is unchanged: any fd whose f_mapping points at the affected address_space reads the modified bytes on its next page cache hit.
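The visibility property for ordinary buffered I/O is easy to observe from userland. The sketch below is only an illustration of that normal behavior, not of Copy Fail itself: it opens the same file twice, writes through one fd, and reads the same range back through the other. The function name and the scratch file are our own illustrative choices.

```python
import os
import tempfile


def write_then_read_via_other_fd(path: str, offset: int, patch: bytes) -> bytes:
    """Write `patch` through one fd, then read the same range through another."""
    writer = os.open(path, os.O_WRONLY)
    reader = os.open(path, os.O_RDONLY)  # separate open file, same inode
    try:
        os.pwrite(writer, patch, offset)  # ordinary buffered write
        # The read is served from the same cached pages the write just updated.
        return os.pread(reader, len(patch), offset)
    finally:
        os.close(writer)
        os.close(reader)


def demo() -> bytes:
    """Run the check against a throwaway scratch file."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"A" * 16)
    try:
        return write_then_read_via_other_fd(tmp.name, 4, b"BBBB")
    finally:
        os.unlink(tmp.name)


if __name__ == "__main__":
    print(demo())  # b'BBBB'
```

Neither fd knows about the other; they agree only because both resolve to the same cached data. Nothing in that path consults a namespace.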
All of this is independent of containers. Container isolation lives in mount, network, PID, user, and IPC namespaces. None of them creates a per-container address_space or page cache. Containers share cached folios when their file accesses reach the same underlying address_space.
A Kubernetes container's root filesystem is commonly an overlayfs mount stitched together from a writable upper layer (usually per-container scratch) and one or more read-only lower layers (image layers). Container runtimes (containerd, CRI-O, others) deduplicate layers by content hash: if two containers on the same node use the same unpacked layer/snapshot, the corresponding lower-layer files can be backed by the same host inode/address_space, regardless of what the images are named. Sharing common layers this way reduces the storage that images require. For example, python:3.12-slim and xint-flask-app:v1 (built FROM python:3.12-slim) share the Python layer. Both share debian:bookworm-slim underneath. A redis:7-bookworm pod on the same node shares the Debian layer with both.
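The layer relationships in that example can be expressed as a small inventory check. This is a sketch under stated assumptions: the per-image layer lists are supplied by the caller (for instance, taken from image manifests), and the digests shown are made-up placeholders rather than real layer hashes.

```python
from collections import defaultdict


def images_by_shared_layer(image_layers: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each layer digest to the images containing it, keeping only shared layers."""
    users = defaultdict(list)
    for image, layers in image_layers.items():
        for digest in layers:
            users[digest].append(image)
    return {digest: sorted(images) for digest, images in users.items() if len(images) > 1}


# Placeholder digests for illustration only; real ones are sha256 content hashes.
EXAMPLE_LAYERS = {
    "python:3.12-slim": ["sha256:debian-base", "sha256:python-runtime"],
    "xint-flask-app:v1": ["sha256:debian-base", "sha256:python-runtime", "sha256:app"],
    "redis:7-bookworm": ["sha256:debian-base", "sha256:redis"],
}

if __name__ == "__main__":
    for digest, images in images_by_shared_layer(EXAMPLE_LAYERS).items():
        print(digest, "->", ", ".join(images))
```

Any layer that appears in the result is a candidate for a shared address_space when the listed images run on the same node.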
In normal operation on an overlayfs mount, opening a lower-layer file for write access or truncation triggers overlayfs copy-up before writes proceed, allocating a new inode in the pod's upper layer so the change is private. By storing only this small set of differences, containers can reuse their lower layers efficiently while still allowing a writable copy to be presented to applications. However, Copy Fail skips the standard write path entirely. The folios it mutates belong to the lower-layer address_space itself, shared host-wide, rather than the upper layers that were meant to store write deltas.
The pods' overlayfs mounts each present what looks like a private /usr/local/lib/python3.12/site-packages/foo.py (or /lib/x86_64-linux-gnu/libc.so.6), but overlayfs delegates file I/O to the real lower backing file. If those backing files are the same lower inode/address_space, the cached folios are shared.
Copy Fail's 4-byte write goes into that one underlying entry. Anything that subsequently reads the same lower-layer file through the same underlying address_space can read the poisoned bytes, until the page is evicted or the layer is dropped.
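Whether two lower-layer files really are the same backing inode can be checked from the host. The sketch below assumes you already have the two host-side paths into the runtime's unpacked snapshot directories (their location depends on the runtime and snapshotter, so we do not hard-code one); paths seen through a container's overlayfs mount are not suitable inputs. The function name is ours.

```python
import os


def share_backing_inode(path_a: str, path_b: str) -> bool:
    """Return True if both paths resolve to the same inode on the same device."""
    a = os.stat(path_a)
    b = os.stat(path_b)
    # Same device and inode number means one inode, hence one address_space.
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```

This is the same comparison os.path.samefile performs; it is spelled out here to make the device/inode pair explicit. Two files with identical content but different inodes return False and do not share cached folios.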
The on-disk inode is unchanged, so image-registry scanners, file-integrity monitors that hash the on-disk file, and offline, snapshot, or block-level scanners that bypass the affected running kernel's page cache all see the original content.
Scenario 1: Cross-Container Poisoning
Threat model. Unprivileged attacker, no privileged capabilities, no node access, no admission rights to mutate other workloads. Two ways to start: code execution in a pod the attacker already controls (1-1), or just create pods rights (1-2).
Target. Pick a file in a layer widely shared on the node: a Python site-packages/ module if the node hosts Python-derived workloads; a shared object such as glibc for broader reach, subject to executable mapping, patch alignment, and crash-safety constraints; or anything else inside a Debian/Ubuntu/Alpine base layer. We will use a Python source file for this demo. Pick a module imported during interpreter startup or during a common framework's init, so target pods load it early.
The write. Python files are a good target for a demo because they are easier to read and more portable than shellcode. Arbitrary changes to a Python file can be made with Copy Fail by chaining 4-byte writes together. The specific choice of file and of replacement contents is an important part of weaponization. We will only provide a simple proof of concept here.
The trigger. The patch fires the next time any pod whose image includes the targeted overlayfs layer hash imports the target module. CPython opens the .py file, reads source bytes from the page cache, and compiles the patched bytes. The redirected dispatch resolves to attacker-controlled code already reachable in the image, or to a payload staged via additional Copy Fail invocations elsewhere in the layer.
1-1: Compromised pod sharing a base layer
The "sandbox each microservice in its own container" model assumes that code execution in any one container is bounded by that container's image and its supply chain. Shared lower-layer page cache can break that assumption. An attacker who compromises one pod can also compromise co-located pods that read the same targeted file from a shared lower layer. The target can be the most legitimate workload in the cluster: a metrics exporter, a log shipper, a CI runner, an unaudited debug sidecar. What matters is that it shares a base layer (debian:bookworm-slim, python:3.12-slim) with a hardened backend.
Demo. Pod A is the compromised pod, image python:3.12-slim. Pod B is an unrelated, security-hardened backend on the same node, image payments-api:v1, built FROM python:3.12-slim. The image references differ; the Python layer Pod A poisons is in Pod B's stack. Pod B's deployment imports a library that pulls in the targeted module on init.
A node-scoped disk scanner running outside both pods, hashing files via the host filesystem, sees nothing. A registry scan against the image digest sees nothing. A runtime EDR that hashes resident pages of the running python3 after import, or that watches execve and child processes inside Pod B, is the best bet for detecting the compromise.
1-2: Pod creation rights
In this scenario, the attacker has no existing access to any pod on the cluster. They only have create pods rights in some namespace. This is common in multi-tenant clusters, CI runners, build agents, and shared-cluster tenancy patterns, as many CI, build, and multi-tenant service accounts are intentionally granted it.
The attack does not depend on luck about layer overlap. If the attacker can read victim pod specs, or otherwise infer the victim image and node placement, they can pull the relevant base image into the attacker pod and request scheduling on the victim's node via nodeAffinity or nodeName. Container runtime layer-hash dedup makes the attacker's overlayfs lower layer the same host inode as the victim's; the page cache write follows.
The interesting consequence is that this attack reaches across namespace and tenant boundaries that RBAC was meant to enforce. A service account with pods/create in its own namespace, no direct rights over the victim workload, can poison a backend in a different namespace by inheriting that backend's base image and landing on the same node.
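To gauge how far this reaches in a given cluster, an operator could inventory pods that sit on the same node, belong to different namespaces, and share at least one image layer. The following is a minimal sketch of that check. It assumes placement and layer digests have already been collected by some other means; the PodInfo record, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PodInfo:
    """Hypothetical inventory record for one pod."""
    name: str
    namespace: str
    node: str
    layers: frozenset  # layer digests across the pod's images


def cross_namespace_layer_sharing(pods):
    """List (pod_a, pod_b, shared_layers) for co-located pods in different namespaces."""
    findings = []
    for a, b in combinations(pods, 2):
        if a.node != b.node or a.namespace == b.namespace:
            continue  # not co-located, or same tenant boundary
        shared = a.layers & b.layers
        if shared:
            findings.append((a.name, b.name, sorted(shared)))
    return findings


# Illustrative data only.
SAMPLE_PODS = [
    PodInfo("ci-runner", "ci", "node-1", frozenset({"sha256:debian-base", "sha256:python-runtime"})),
    PodInfo("payments-api", "payments", "node-1", frozenset({"sha256:debian-base", "sha256:python-runtime", "sha256:app"})),
    PodInfo("cache", "payments", "node-2", frozenset({"sha256:debian-base"})),
]
```

Each finding is a pair of workloads for which RBAC separation alone does not prevent shared cached folios.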
Sub-case: compromise inside a DaemonSet. A compromised DaemonSet is a higher-leverage exploit path. Most production DaemonSets ship with hostPath mounts for legitimate reasons (CNI (Container Network Interface) agents, CSI (Container Storage Interface) drivers, log forwarders, monitoring agents, security agents). The page cache shared with the host filesystem is therefore directly within the attacker's reach. With the same exploit primitive, the lateral target set now includes host-side binaries (/usr/sbin/ipset, iptables, kubelet-spawned helpers). Poisoning them can lead to host-root execution the next time the host invokes the affected file, provided the patch is weaponized correctly. Code execution in a DaemonSet is effectively pod-to-host without going through Scenario 2's mechanics.
Scenario 2: Container Escape
Threat model. Same starting position: an unprivileged container with code execution and no privileged capabilities. The goal is now a shell on the host.
The shared inode. When runc patched CVE-2019-5736, the original fix copied the runc binary to a memfd before each execve so it could not be overwritten from inside the container. A follow-up commit replaced the copy with a read-only bind mount of the host's runc into every container, to take advantage of kernel page cache sharing across the spawn fan-out. Of course, the kernel page cache is the very thing that we are overwriting, which means that design can again expose the host runc mapping to a page-cache write primitive.
The chain. Same shape as Datadog's Dirty Pipe container-escape PoC, with Copy Fail as the write primitive instead of Dirty Pipe.
Step 1: Force runc to run.
When a user execs into the container (for example, with kubectl exec), kubelet implements it via runc exec into the already-running container, which is the window we need. Container starts, restarts, and init steps don't work: they run on a fresh filesystem, before the entrypoint has planted the trap. To make the wait deterministic, Datadog's PoC plants the trap by overwriting /bin/sh in the container with #!/proc/self/exe. The next time anyone (or anything) execs a shell inside, the kernel resolves the shebang and re-execs /proc/self/exe, which still points to runc mid-exec. That leaves a runc process pinned in the container's PID namespace, alive long enough for Step 2.
Step 2: Locate the runc PID. Once runc appears in the container's PID namespace, scan /proc for the process whose /proc/<pid>/exe symlink resolves through the bind mount to the host runc inode.
Step 3: Poison via /proc/<runc_pid>/exe. Open that path to get an fd. Run Copy Fail against it; the page cache write lands in the first page backing runc, replacing its ELF header and the rest of the binary with a small malicious ELF. The cached pages are now poisoned and staged for the next invocation.
Step 4: Wait for the next runc. Any subsequent runc invocation on the host maps the cached pages and executes the modified code as root. This includes kubectl exec from an admin, the next pod start, the next probe, and so on. Attackers can often force this to happen by terminating the pod, forcing a restart.
This exploit path follows that of Dirty Pipe very closely. However, Copy Fail covers every kernel from the 2017 in-place commit (72548b093ee3) through the 2026 fix (a664bf3d603d).
PoC. A reverse shell from host context, captured on the listener:
ubuntu@ip-172-26-6-67:~$ nc -l 1234 -v
Listening on 0.0.0.0 1234
Connection received on ec2-43-202-13-255.ap-northeast-2.compute.amazonaws.com 52450
[cwd] /run/containerd/io.containerd.runtime.v2.task/k8s.io/880c5f77aa39359e231f5ea709148f7914584c9986f8636b1201430f842e94c2
[listdir: cwd]
.
..
init.pid
log.json
runtime
options.json
bootstrap.json
shim-binary-path
log
config.json
work
rootfs
[listdir: /]
.bottlerocket
bin
boot
dev
etc
...
x86_64-bottlerocket-linux-gnu
Two things to read off this output. The connecting peer is ec2-43-202-13-255.ap-northeast-2.compute.amazonaws.com, an AWS public DNS name in ap-northeast-2, an EKS worker EC2 instance. The shell's working directory is /run/containerd/io.containerd.runtime.v2.task/k8s.io/880c5f77aa3.../, the containerd shim's per-container runtime state directory on the host. That path does not exist inside any pod's mount namespace; only the host (or a container with the runtime state explicitly mounted) can chdir into it. The shell is on the node, not in a container.
Detection and Mitigation
What does and doesn't catch this:
Control | Effective? |
Image registry scanning (Trivy, Clair, similar) | No. Image bytes unchanged. |
Agent-less disk scanning (sensor-less node scans, snapshot-based scanners) | No. On-disk file unchanged. |
File-integrity monitoring on disk (AIDE, Tripwire) | No. On-disk hash unchanged. |
Runtime EDR hashing in-memory pages of running processes | Yes, in principle. |
Runtime EDR monitoring | Partial. Catches post-exec behavior, not the page cache write. |
Seccomp profile blocking AF_ALG | Yes. Removes the primitive. |
gVisor | Yes. Separate user-space kernel, no shared host page cache. |
Kata Containers | Yes. Per-pod VM, separate kernel, separate page cache. |
Managed per-pod microVM (EKS Fargate; equivalents on other clouds) | Yes, by the same mechanism as Kata. Each pod runs in its own microVM with its own kernel and page cache. |
Patched host kernel | Yes. Root cause fixed. |
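Whether the seccomp control in the table is actually in force for a given workload can be probed from inside it. The sketch below only tries to create an AF_ALG socket and reports whether that succeeded. The function name is ours, and the fallback value 38 is the Linux AF_ALG family number, used when Python's socket module does not export the constant.

```python
import socket


def af_alg_reachable() -> bool:
    """Return True if this process can create an AF_ALG socket."""
    family = getattr(socket, "AF_ALG", 38)  # 38 is AF_ALG on Linux
    try:
        sock = socket.socket(family, socket.SOCK_SEQPACKET, 0)
    except OSError:
        # Denied by seccomp or an LSM, or the kernel lacks AF_ALG support.
        return False
    sock.close()
    return True


if __name__ == "__main__":
    print("AF_ALG reachable:", af_alg_reachable())
```

Read the result narrowly. False means the entry point used by Copy Fail is closed to this process. True says nothing about whether the kernel is vulnerable, since a patched kernel still exposes AF_ALG.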
For mitigating this and future issues:
Patch the host kernel. Pull the fix (a664bf3d603d) through your managed platform's node-image update or via in-place node-OS patching. This is the best way to fix the Copy Fail bug.
Block AF_ALG. A pod seccomp profile that denies socket(AF_ALG, ...) removes the primitive used in Copy Fail. Most production workloads do not need AF_ALG, but validate this against cryptographic or VPN/storage workloads before enforcing globally. Blocking it also reduces attack surface against a kernel component that has had several issues besides Copy Fail.
Use VMs or similar for tenant boundaries. Workloads that need hard isolation should never rely on containers as a security boundary. Migrating to VMs, microVMs, or systems like gVisor greatly reduces the attack surface for guest-to-host attacks.
Agent-less disk scanners are unlikely to catch this compromise because the affected bytes live only in the running kernel's page cache.
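The seccomp rule from the Block AF_ALG item could be generated as follows. This is a sketch, not a vetted production profile: it uses the JSON seccomp profile format that container runtimes accept, allows every other syscall for brevity, and the function name is ours. In practice the rule would more likely be merged into whatever default profile your runtime already applies.

```python
import json

AF_ALG = 38  # Linux address-family number for AF_ALG


def af_alg_deny_profile() -> dict:
    """Build a seccomp profile that allows everything except socket(AF_ALG, ...)."""
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {
                "names": ["socket"],
                "action": "SCMP_ACT_ERRNO",
                # Match only when the first argument (the address family) is AF_ALG.
                "args": [{"index": 0, "value": AF_ALG, "op": "SCMP_CMP_EQ"}],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(af_alg_deny_profile(), indent=2))
```

The rule matches the socket syscall only. 32-bit x86 processes can also reach socket creation through socketcall(2), which this filter does not inspect, so account for that separately if such binaries can run on your nodes.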