What is the benefit of using a remote cache instead of a local ~/.cache directory? Is it only for sharing build results among team members? How do you make sure the build results are not spoofed?
Not just team members; if you make your cache publicly readable, contributors to e.g. your GitHub/GitLab/Whatever project can also use it and get really fast builds the first time they try to contribute. So a remote cache is nice to have, if it's seamless.
Nix works this way by default (and much of the community operates caches like this) and it can be a massive, massive time saver.
> How do you make sure the build results are not spoofed?
What do you mean "spoofed?" As in, someone put an evil artifact in the cache? Or overwrote an existing artifact with a new one? Or someone just stole your developer's access and started shoving shit in there? There's a whole bunch of small details here that really matter to understand what security/integrity properties you want the cache to uphold.
FWIW, I've been looking into this in Buck2/Bazel land, and my understanding is that most large orgs just use some kind of terminating auth proxy that the underlying connection/flow/build artifacts can be correlated back to. So you know this cache artifact was first inserted by build B, done by user X, who authenticated with their key K, etc etc.
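To make the "inserted by build B, done by user X, with key K" idea concrete, here is a minimal sketch of a cache that records who first wrote each entry. It is a toy in-memory model, not how any particular proxy or Buck2/Bazel backend works; `Provenance`, `AuditedCache`, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    build_id: str  # build B that produced the artifact
    user: str      # user X who ran the build
    key_id: str    # key K the user authenticated with


class AuditedCache:
    """Toy in-memory cache that remembers who first inserted each entry."""

    def __init__(self):
        self._blobs = {}
        self._first_writer = {}

    def put(self, cache_key, blob, who):
        # Assumption: entries are write-once, so a later writer cannot
        # silently replace an artifact that is already in the cache.
        if cache_key in self._blobs:
            return False
        self._blobs[cache_key] = blob
        self._first_writer[cache_key] = who
        return True

    def get(self, cache_key):
        return self._blobs.get(cache_key)

    def inserted_by(self, cache_key):
        # Returns the Provenance of the first insert, or None if unknown.
        return self._first_writer.get(cache_key)
```

This does not stop a bad artifact from being inserted, but it does let you trace one back to a build, a user, and a key after the fact.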
Exactly, just like Git, everything is ultimately identified with a key, which can tie back to a stable identity through OIDC or similar mechanisms. At least that's how we did it.
Sharing with team members, sharing with CI, and the ability to pull from more than just what's on your machine (i.e. a larger addressable cache than you are willing to keep on disk). Cache objects also compound across projects, so it's nice to ship them up somewhere and have them nearby when you need them.
Re: spoofing, obviously it's all protected with API keys and tokens, and we're working on mechanisms to perform end-to-end encryption. In general, build cache objects are usually addressed by a content-addressable hash, so that also helps because your build typically knows the content it's looking for and can verify.
That isn't true for all tools, though, so we're working to understand where the gaps are and fix them.
IIUC the actual computation (e.g. compiling, linking, ...) happens on client (CI or developer) machines and the results are written to the server-side cache.
By spoofing I meant to say that an authenticated but malicious client (intentionally or not, e.g. a clueless intern) may be able to write malicious contents to the cache. For example, their build toolchain could be contaminated and the resulting build outputs are contaminated. The "action" per se and its hash are still legit, but the hash is only used as the lookup key -- the corresponding value is "spoofed."
The only safe way I can imagine to use such a remote cache is for CI to publish its build results so that they could be reused by developers. The direction from developers to developers or even to CI seems difficult to handle and has less value. But I might be missing some important insights here so my conclusion could be wrong.
But if that's the case, is the most valuable use case to just configure the CI to read from / write to the remote cache, and developers to only read from the remote cache? And given such an assumption, is it much easier to design/implement a remote cache product?
> In general, build cache objects are usually addressed by a content-addressable hash
How does that work? I would think the simplest case of a build object that needs to be cached is a .o file created from a .c file. The compiler sees the .c file and can determine its hash, but how can the compiler determine the hash of the .o file to know what to look up in the cache? I think the compiler would need to perform the lookup using the hash of the .c file, which isn't a hash of the data in the cache.
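One common answer, sketched here under assumptions, is a two-step lookup: an "action cache" keyed by a hash of the inputs and command points at the digest of the output, and a separate content-addressed store holds the output under its own hash. Bazel-style remote caches are organized roughly like this, but the class and function names below are hypothetical and the model is purely in-memory.

```python
import hashlib


def digest(data):
    """SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def action_key(command, inputs):
    """Hash the command line plus the digest of every input file."""
    h = hashlib.sha256()
    for arg in command:
        h.update(arg.encode() + b"\0")
    for path in sorted(inputs):
        h.update(path.encode() + b"\0" + digest(inputs[path]).encode() + b"\0")
    return h.hexdigest()


class RemoteCache:
    def __init__(self):
        self.action_cache = {}  # action key -> digest of the output
        self.cas = {}           # digest of content -> content

    def store(self, key, output):
        d = digest(output)
        self.cas[d] = output
        self.action_cache[key] = d

    def lookup(self, key):
        d = self.action_cache.get(key)
        if d is None:
            return None  # cache miss: the client has to build locally
        blob = self.cas.get(d)
        if blob is None or digest(blob) != d:
            return None  # missing or corrupted blob, treat it as a miss
        return blob
```

So the compiler wrapper never needs to know the hash of the .o file up front; it computes the action key from the .c file and flags, and the cache tells it which output digest to fetch. Note the client can verify the blob against its digest, but it cannot verify from the key alone that the action-key-to-digest mapping is honest, which is the gap raised upthread.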
(Fwiw, group conversation encryption tech like MLS is somewhat applicable, and that's the sort of pattern we're looking at, but it would be cool to know if that's moving to you on the problem of safety w.r.t. builds.)
It's for sharing and aggregating. Ccache is useful locally, but really shines when combined with Distcc, a distributed compiler. Every host contributes cache objects that other hosts can use, and every host can use the cache objects contributed by other hosts. So you don't even have to build it once yourself to benefit from everyone else's cache. It therefore speeds up builds across multiple hosts and users, distributed builds, and the dev experience of individuals.
I built my own build system that does something similar.
I've set it up at work with two S3 buckets: trusted and untrusted. CI/CD read/write from trusted only. Developers read/write from untrusted, and read-only from trusted.
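A rough sketch of that access policy, with plain dicts standing in for the two buckets. The role names and the `TwoTierCache` class are hypothetical, and having developers check trusted before untrusted is an assumption on my part, not something stated above.

```python
class TwoTierCache:
    """In-memory stand-in for a 'trusted' and an 'untrusted' bucket."""

    def __init__(self):
        self.trusted = {}
        self.untrusted = {}

    def get(self, role, key):
        if role == "ci":
            # CI/CD never reads developer-produced artifacts.
            return self.trusted.get(key)
        if role == "developer":
            # Assumption: prefer the trusted copy when both exist.
            if key in self.trusted:
                return self.trusted[key]
            return self.untrusted.get(key)
        raise ValueError("unknown role: " + role)

    def put(self, role, key, blob):
        if role == "ci":
            self.trusted[key] = blob
        elif role == "developer":
            # Developers can only ever write to the untrusted bucket.
            self.untrusted[key] = blob
        else:
            raise ValueError("unknown role: " + role)
```

With this split, a contaminated developer machine can at worst poison other developers' local builds, never what CI/CD consumes.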
aseipp|2 years ago
sgammon|2 years ago
yjftsjthsd-h|2 years ago
sgammon|2 years ago
xjia|2 years ago
Thorrez|2 years ago
sgammon|2 years ago
throwawaaarrgh|2 years ago
mgaunard|2 years ago
sgammon|2 years ago
Or, maybe the blobs you’re dealing with are on the bigger end? That would also make sense