gbrown_ | 3 years ago

Naive question: what are the non-nefarious reasons for needing to be able to do this? I get that people have used it to work around things (for good and ill), but what's the vanilla answer?

cyphar | 3 years ago

This is technically still a work-around, but in container runtimes (specifically runc and LXC) we do this to defend against a class of attacks where the container can overwrite the host runc binary (meaning that the next time some container operation is done, the attacker's code is executed as root on the host)[1]. Making a fresh copy of the binary each time a container starts ensures that any such attack can only overwrite its own (short-lived) copy, and it lets us do this without needing write access to any filesystem that allows exec.

Unfortunately, we don't use this all the time, because some Kubernetes unit tests started failing when we first added this protection: the size of the binary is added to the memory usage of each container, which caused those tests to use more memory than they did before. Ironically, this exact protection would've protected us from Dirty COW and other such bugs, but it's disabled by default. Instead, we make a temporary read-only bind-mount that we then exec, which is slightly less safe but doesn't add ~10MB to every container's memory usage.

But the actual answer to your question is that this was not originally intended behaviour (when we mentioned we were doing this to the mm and fs folks, they weren't happy), and patches have been posted recently to make this feature something you have to explicitly opt in to.

[1]: https://lwn.net/Articles/781013/