As someone who spent way too much time chasing this rabbit, the real answer is Just Don't. GitHub Actions is a CI system that makes it easy to get started with simple CI needs but runs into hard problems as soon as you have more advanced needs. Docker caching is one of those advanced needs. If you have non-trivial Docker builds then you simply need on-disk local caching, period.
Either use Depot or switch to self-hosted runners with large disks.
If there's one thing I've learned over the years, it's that we really seldom have advanced needs. Mostly we just want things to work a certain way, and will fight systems to make them behave so. It's easier to just leave it be. Like maven vs gradle; yes, gradle can do everything, but if you need that, it's worth taking a step back and assessing why the normal maven flow won't work. What's so special about our app compared to the millions working just fine out of the box?
Thanks for the shout-out regarding Depot, really appreciate it. We came to the same conclusion regarding Docker layer cache, which is why we created Depot in the first place. The limitations and performance of the GitHub Actions cache leave a lot to be desired.
> GitHub Actions is a CI system that makes it easy to get started
It's not even that! Coming from GitLab I was quite surprised at how poor the "getting started" experience was. Rather than a simple "on push, run command X" you first have to do a deep dive into actions/events/workflows/jobs/runs, and then figure out what kind of weird tooling is used for trivial things like checking out your code, or storing artifacts.
And then you try to unify your pipeline across several projects because that's what Github is heavily promoting with the whole "uses: actions/checkout" reuse thing - but it turns out to be a huge hassle to get it working because nothing works the way you'd expect it to work.
In the end I did get GHA to do what I was already doing in GitLab, but it took me ten times as long as the original setup did. I believe GHA is flexible and powerful enough to be well-suited for medium-sized companies, but it's neither easy enough for small companies, nor powerful enough for large companies. It's one of the few GitHub features I genuinely dislike using.
I got it working, with intermediate layers, too. All to find that I didn’t see a material performance benefit after taking into account how long it takes to pull from and push to the cache.
On one project that was a bit more involved, I pulled the latest image I've built from the registry before starting the build. That worked well enough for caching in my case.
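A minimal sketch of that pull-then-build approach, written as a Python helper that only assembles the commands. The image reference is hypothetical, and the sketch assumes the earlier image was built with inline cache metadata (`BUILDKIT_INLINE_CACHE=1`), which BuildKit generally needs before a pulled image is usable as a cache source.

```python
from typing import List


def pull_then_build_commands(image: str, context: str = ".") -> List[List[str]]:
    """Return the commands for 'pull the last image, then build using it as cache'.

    `image` is a hypothetical reference such as "registry.example.com/app:latest".
    """
    return [
        # Pull the previously pushed image. On the very first build this fails,
        # so the caller should tolerate a non-zero exit code for this step.
        ["docker", "pull", image],
        # Reuse layers from the pulled image and embed cache metadata in the
        # new image so that the next build can do the same.
        [
            "docker", "build",
            "--cache-from", image,
            "--build-arg", "BUILDKIT_INLINE_CACHE=1",
            "--tag", image,
            context,
        ],
    ]
```

Each inner list can be passed to `subprocess.run`; the pull step would be run without `check=True` so a missing image does not fail the job.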
totally agree, github actions has done an excellent job at this lowest layer of the build pipeline today but is woefully inadequate the minute your org hits north of 50 engineers
Additionally, docker build refuses to cause any side effects to the host system. This makes any kind of caching difficult by design. IMO, if possible, consider doing your build outside of docker and just copying it into a scratch container...
I use self-hosted runners. It wasn't even because we could have large disks for caching. GitHub's pricing for their runners is so bad it was a no-brainer to host our own.
This is wild. I've spent the last three weeks working on this stuff for two separate clients.
Important note if you're taking advice: cache-from and cache-to both accept multiple values. cache-to just outputs the cache data to all the destinations specified. cache-from looks for cache hits in the sources in order. You can do some clever stuff to maximize cache hits with the least amount of downloading using the right combination.
I've spent days trying all of these solutions at my company. All of these solutions suck: they are slow, and only successful builds get their layers cached. This is a dead end. The only workable solution is to have a self-hosted runner with a big disk.
This is definitely a direction to try. But if it's faster Docker image builds and a layer caching system that actually works you're after, you should definitely try out Depot. We automatically persist layer cache to persistent NVMe devices and orchestrate that to be immediately available across builds.
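One possible combination, sketched as a Python helper that assembles `docker buildx build` arguments: look in the current branch's cache first, then fall back to the default branch's cache. The repository name and the branch-to-tag scheme are assumptions made for illustration, and the sketch exports to a single destination to stay simple.

```python
from typing import List


def buildx_cache_args(cache_repo: str, branch: str, default_branch: str = "main") -> List[str]:
    """Build --cache-from/--cache-to arguments for `docker buildx build`.

    `cache_repo` is a hypothetical registry repository used only for cache data.
    """
    def ref(name: str) -> str:
        # Image tags cannot contain '/', so flatten names like "feature/x".
        return f"{cache_repo}:{name.replace('/', '-')}"

    # Sources are listed in lookup order: most specific first.
    sources = [ref(branch)]
    if branch != default_branch:
        sources.append(ref(default_branch))

    args: List[str] = []
    for source in sources:
        args += ["--cache-from", f"type=registry,ref={source}"]
    # Export only to the current branch's cache tag.
    args += ["--cache-to", f"type=registry,ref={ref(branch)},mode=max"]
    return args
```

A feature branch then starts from the default branch's cache on its first build and from its own cache afterwards.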
I use namespace’s action runners for this (just a customer, not affiliated in any way). They’re a company with a pretty good product stack. Although the web UI is annoyingly barebones.
Can you share an example of your GitHub Actions setup?
When I use docker/setup-buildx-action on a local runner, I can't make it use the cache.
I think it's the "docker-container" driver's fault.
There I also explain that IF you use a registry cache import/export, you should use the same registry to which you are also pushing your actual image, and use the "image-manifest=true" option (especially if you are targeting GHCR - on DockerHub "image-manifest=true" would not be necessary).
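As a rough illustration of that advice, the helper below composes the value of a registry `--cache-to` option that points at the same repository as the image itself. The repository and the `buildcache` tag are hypothetical names chosen for the example.

```python
def registry_cache_export(image_repo: str, tag: str = "buildcache",
                          image_manifest: bool = True) -> str:
    """Compose a registry cache export string for `docker buildx build --cache-to`.

    `image_repo` is assumed to be the repository the real image is pushed to,
    so the cache and the image live in the same registry.
    """
    options = ["type=registry", f"ref={image_repo}:{tag}", "mode=max"]
    if image_manifest:
        # Ask BuildKit to write the cache as an image manifest; per the
        # comments here, GHCR and Artifactory needed this, Docker Hub did not.
        options.append("image-manifest=true")
    return ",".join(options)
```

The same `ref` (without the export-only options) would be used for the matching `--cache-from type=registry,ref=...` import.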
After years of lurking, I made an account to reply to this
"image-manifest=true" was the magic parameter that I needed to make this work with a non-DockerHub registry (Artifactory). I spent a lot of time fighting this, and non-obvious error messages. Thank you!!
We use a multi-stage build for a DevContainer environment, and the final image is quite large (for various reasons), so a better caching strategy really helps in our use case (smaller incremental image updates, smaller downloads for developers, less storage in the repository, etc)
Docker has been among us for years. Why isn’t efficient caching already implemented out of the box? It’s a leaking abstraction that users have to deal with. Annoying at best.
Please no! Do not use Bazel unless you have a platform team with multiple people who know how to use it - e.g. large Google-like teams.
We had “the Bazel guy” in our mid-sized company that Bazelified so many build processes, then left.
It has been an absolute nightmare to maintain because no normal person has any experience with this tooling. It’s very esoteric. People in our company have reluctantly had to pick up Bazel tech debt tasks, like how the rules_docker package got randomly deprecated and replaced with rules_oci with a different API, which meant we could no longer update our Golang services to new versions of Go.
In the process we’ve broken CI, builds on Mac, had production outages, and all kinds of peculiarities and rollbacks needed that have been introduced because of an over-engineered esoteric build system that no one really cares about or wanted.
Docker layer caching is one of the reasons I moved to Jenkins 2 years ago and have been very happy with it for the most part.
I only need to install utils once and all build time goes to building my software. It even integrates nicely with Github. Result: 50% faster feedback.
However, it needs a bit of initial housekeeping and discipline to use correctly. For example, using Jenkinsfiles is a must and using containers as agents is desirable.
this is pretty neat—it’s been a while since i’ve tried caching layers with gha. it used to be quite frustrating.
my previous experience was that in nearly all situations the time spent sending and retrieving cache layers over the network wound up making a shorter build step moot. ultimately we said “fuck it” and focused on making builds faster without (docker layer) caching.
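The trade-off described here is simple arithmetic, sketched below in Python. All timings in the example are made-up numbers, not measurements from anyone in this thread.

```python
def net_seconds_saved(cold_build_s: float, warm_build_s: float,
                      cache_pull_s: float, cache_push_s: float) -> float:
    """Return seconds saved per build by remote layer caching.

    A negative result means the network transfer costs more than the
    build time the cache saves.
    """
    cached_total = warm_build_s + cache_pull_s + cache_push_s
    return cold_build_s - cached_total


# Hypothetical numbers: a 5-minute cold build, a 1-minute warm build,
# 150 s to pull the cache and 120 s to push it back.
print(net_seconds_saved(300, 60, 150, 120))  # -30: caching is slower here
```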
it's unfortunate the amount of expertise / tinkering required to get "incrementalism" in docker builds in github actions. we're hoping to solve this with some of the stuff we have in the pipeline in the near future.
The fact that GitHub don't provide a better solution here has to be actually costing them money with the network usage and extra agent time consumed. Right?
It would be possible to offload the caching to Docker Build Cloud transparently. It’s part of the Docker subscription service, and every account gets 50 free minutes a month, so depending on usage, you may be able to get this at zero cost.
With this approach, you’d use buildx and, remotely, they would manage and maintain the cache, amongst other benefits.
It does require a credit card signup (which takes $0 to mitigate fraud). Full transparency, I’m a Docker Captain and helped test this whilst it was called Hydrobuild.
Smaller images are another way to go. At my last company, image sizes were like 2-3 GB. I was able to prune that down to ~1.5 GB. Boost and a custom clang/llvm build were particularly major offenders here.
I have this set up in our pipeline, we also build the image early and use assets to move it between jobs. We've also just switched to self-hosted runners, so might look into shared disk.
But in the long run, as annoying as it is, our build pipelines were reduced by quite a few minutes per build.
I weep for this period of time where we don't have sticky disks readily available for builds. Uploading the layer cache each time is such a coarse and time-consuming way to cache things.
Maybe building from scratch all the time is a good correctness decision? Maybe stale values in disks is a tricky enough issue to want to avoid entirely?
If you keep a stack of disks around and grab a free one when the job starts you'd end up with good speedup a lot of the time. If cost is an issue you can expire them quickly. I regularly see CI jobs spending >50% of their time downloading the same things, or compiling the same things, over and over. How many times have I triggered an action that compiled the exact same sqlite source code? Tens of thousands?
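A toy sketch of that idea in Python: an in-memory pool that hands out a free disk when a job starts, creates a new one when none is free, and drops idle disks after a time-to-live. The class names and the "prefer the most recently used disk" policy are assumptions for illustration; a real system would attach actual volumes.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Disk:
    disk_id: str
    last_used: float
    in_use: bool = False


class DiskPool:
    """Hands out cache disks to jobs and expires idle ones after a TTL."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self.disks: Dict[str, Disk] = {}
        self._counter = 0

    def expire(self, now: float) -> None:
        # Drop free disks that have been idle longer than the TTL.
        stale = [disk_id for disk_id, disk in self.disks.items()
                 if not disk.in_use and now - disk.last_used > self.ttl_seconds]
        for disk_id in stale:
            del self.disks[disk_id]

    def acquire(self, now: Optional[float] = None) -> Disk:
        now = time.time() if now is None else now
        self.expire(now)
        free = [disk for disk in self.disks.values() if not disk.in_use]
        if free:
            # Prefer the most recently used disk: its cache is likely warmest.
            disk = max(free, key=lambda d: d.last_used)
        else:
            # No free disk: start a new, empty one (a cold build).
            self._counter += 1
            disk = Disk(f"disk-{self._counter}", now)
            self.disks[disk.disk_id] = disk
        disk.in_use = True
        return disk

    def release(self, disk: Disk, now: Optional[float] = None) -> None:
        disk.in_use = False
        disk.last_used = time.time() if now is None else now
```

Because a disk is only ever attached to one job at a time, concurrent jobs never write to the same cache; the cost is that some jobs start cold.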
This is simply false. For starters, GitHub actions by default run on Intel Haswell chips from 2014 (in some cases). Secondly, hardware being faster doesn't obviate the need for caching, especially for docker builds where your layer pulls are purely network bound.
"Computers are very fast now" is largely because of caching. The CPU has a cache, the disk drive has a cache, the OS has a cache, the HTTP client has a cache, the CDN serving the content has a cache, etc. There may be better ways to cache than at the level of Docker image layers, but no caching is the same as a cache miss on every request, which can be dozens, hundreds, or even thousands of times slower than a cache hit.
[+] [-] solatic|2 years ago|reply
[+] [-] Arbortheus|2 years ago|reply
GitLab CI is leaps and bounds ahead.
[+] [-] matsemann|2 years ago|reply
[+] [-] kylegalbraith|2 years ago|reply
[+] [-] crote|2 years ago|reply
[+] [-] cqqxo4zV46cp|2 years ago|reply
[+] [-] mhitza|2 years ago|reply
[+] [-] adityamaru|2 years ago|reply
[+] [-] candiddevmike|2 years ago|reply
[+] [-] aayushshah15|2 years ago|reply
[+] [-] bushbaba|2 years ago|reply
[+] [-] lispisok|1 year ago|reply
[+] [-] ithkuil|2 years ago|reply
[+] [-] cpfohl|2 years ago|reply
[+] [-] Arbortheus|2 years ago|reply
[+] [-] adityamaru|2 years ago|reply
[+] [-] boronine|2 years ago|reply
[+] [-] kylegalbraith|2 years ago|reply
[+] [-] paholg|1 year ago|reply
Yes, nix is complex. But its caching story is soooo much better than docker's, and all the other docker issues just disappear.
https://nix.dev/tutorials/nixos/building-and-running-docker-...
[+] [-] teaearlgraycold|2 years ago|reply
[+] [-] mysza|1 year ago|reply
[+] [-] aayushshah15|2 years ago|reply
[+] [-] mshekow|2 years ago|reply
[+] [-] daulis|1 year ago|reply
"image-manifest=true" was the magic parameter that I needed to make this work with a non-DockerHub registry (Artifactory). I spent a lot of time fighting this, and non-obvious error messages. Thank you!!
We use a multi-stage build for a DevContainer environment, and the final image is quite large (for various reasons), so a better caching strategy really helps in our use case (smaller incremental image updates, smaller downloads for developers, less storage in the repository, etc)
[+] [-] remram|2 years ago|reply
Is there really no way to cache the 'cachemount' directories?
[+] [-] tkiolp4|2 years ago|reply
[+] [-] omeid2|2 years ago|reply
What most people need but don't use is base layers that are upstream of their code repo and released regularly, not at each commit.
Containerisation has made reproducible environments so easy that people want to reproduce it at each CI run, a bit too much.
[+] [-] joe0|2 years ago|reply
[+] [-] jpgvm|2 years ago|reply
Use tools like Bazel + rules_oci or Gradle + jib and never spend time thinking about image builds taking time at all.
[+] [-] Arbortheus|2 years ago|reply
[+] [-] user-|2 years ago|reply
[+] [-] ValtteriL|2 years ago|reply
[+] [-] notnmeyer|2 years ago|reply
[+] [-] aayushshah15|2 years ago|reply
[+] [-] damianh|2 years ago|reply
[+] [-] spurin|1 year ago|reply
[+] [-] manx|2 years ago|reply
They rethink Dockerfiles with really good caching support.
[+] [-] oftenwrong|2 years ago|reply
A possible exception is the "auto skip" feature for Earthly Cloud, since I do not know how that is implemented.
[+] [-] glenjamin|2 years ago|reply
CircleCI has an implementation that used to use a detachable disk, but that had issues with concurrency
It’s since been replaced with an approach that uses a docker plugin under the hood to store layers in object storage
https://circleci.com/docs/docker-layer-caching/
[+] [-] remram|2 years ago|reply
[+] [-] SJC_Hacker|1 year ago|reply
There's quite a bit of cruft that can be pruned.
[+] [-] tanepiper|2 years ago|reply
[+] [-] adityajp|2 years ago|reply
Glad it worked really well for you.
What made you switch to self-hosted runners?
[+] [-] ikekkdcjkfke|2 years ago|reply
[+] [-] maxmcd|2 years ago|reply
Maybe building from scratch all the time is a good correctness decision? Maybe stale values in disks is a tricky enough issue to want to avoid entirely?
If you keep a stack of disks around and grab a free one when the job starts you'd end up with good speedup a lot of the time. If cost is an issue you can expire them quickly. I regularly see CI jobs spending >50% of their time downloading the same things, or compiling the same things, over and over. How many times have I triggered an action that compiled the exact same sqlite source code? Tens of thousands?
Maybe this is fine, I dunno.
[+] [-] Cloudef|2 years ago|reply
[+] [-] bagels|2 years ago|reply
[+] [-] dboreham|2 years ago|reply
So much time spent debunking such broken "caching" solutions.
Computers are very fast now. Use proper package/versioning systems (part of the problem here is that those are often also broken/badly designed).
[+] [-] aayushshah15|2 years ago|reply
[+] [-] kbolino|2 years ago|reply