
A one-line change decreased our build times by 99%

462 points | luord | 5 years ago | medium.com | reply

243 comments

[+] JOnAgain|5 years ago|reply
I think it takes some real humility to post this. No doubt someone will follow up with an “of course...” or “if you don’t understand the tech you use...” comment.

But thank you for this. It takes a bit of courage to point out you’ve been doing something grotesquely inefficient for years and years.

[+] projektfu|5 years ago|reply
I'd be interested to know how they came to realize what was missing. Did they read the Jenkins docs more thoroughly? Post on a mailing list? See something on StackOverflow? Hire a consultant?

[+] bald42|5 years ago|reply
I don't get why they have to clone their repo frequently in the first place. It seems like a brute-force use of a version control system that was bound to be costly.
[+] segfaultbuserr|5 years ago|reply
Better title: A one-line change decreased our "git clone" times by 99%.

It's a bit misleading to use "build time" to describe this improvement, as it makes people think of build systems, compilers, header files, or caches. The alternative title, on the other hand, is descriptive and helpful to all developers, not just people maintaining builds: anyone who simply needs to clone a branch from a large repository can benefit from this tip as well.

[+] dada78641|5 years ago|reply
This reminds me of my first programming job in 2005, working with Macromedia Flash. They had one other Flash programmer who only worked there every once in a while because he was actually studying in college, and he was working on some kind of project from hell that, among other problems, took about two minutes to build to SWF.

Eventually they stopped asking him to come because he couldn't get anything done, and so I had a look at it. In the Movie Clip library of the project I found he had an empty text field somewhere that was configured to include a copy of almost the entire Unicode range, including thousands of CJK characters, so each time you built the SWF it would collect and compress numerous different scripts from different fonts as vectors for use by the program. And it wasn't even being used by anything.

Once I removed that one empty text field, builds went down to about 3 seconds.

[+] dusted|5 years ago|reply
This is the most I've ever gotten out of Pinterest. Other than this, it's just the "wrong site that Google turns up, that I can't use because it wants me to create an account just to view the image I searched for".
[+] saagarjha|5 years ago|reply
Can we not do the thing where we pick an organization from an article and then bring up the most generic complaint you can about it in a way that is entirely irrelevant to the post? We get it, you don't like Pinterest showing up in search results, nobody does. But this has absolutely nothing to do with the article other than it being pattern matching on the word "Pinterest", which is about the least informative comment you can make aside from outright trolling or spam. There are threads that come up from time to time where such comments would be appropriate, if not particularly substantive.
[+] csunbird|5 years ago|reply
I am not sure why google does not penalize this behavior in their search ranking.
[+] randunel|5 years ago|reply
The most frequent search keyword that I use is "-pinterest"
[+] bufferoverflow|5 years ago|reply
That's my experience too. Imagine how many views they have lost over the years, just because they require a login.

And shame on you, Google, for playing along and indexing their shit, when it's not visible when I click through.

[+] jeromenerf|5 years ago|reply
This is one situation where a DuckDuckGo search objectively has a better signal-to-noise ratio.
[+] sercankd|5 years ago|reply
Yeah, I always believed it was some kind of lone evil AI that lives through search results.
[+] Joker_vD|5 years ago|reply
Y'know, I actually made a Pinterest account once because of one particular picture I really wanted. Guess what, even with an account you can't have it. Oh well, guess I'll just let it go.
[+] syncsynchalt|5 years ago|reply
They also created/maintain the kotlin linter, "ktlint".
[+] mcv|5 years ago|reply
On my first job, 20 years ago, we used a custom Visual C framework that generated one huge .h file that connected all sorts of stuff together. Amongst other things, that .h file contained a list of 10,000 const uints, which were included in every file, and compiled in every file. Compiling that project took hours. At some point I wrote a script that changed all those const uints to #define, which cut our build time to a much more manageable half hour.

Project lead called it the biggest productivity improvement in the project; now we could build over lunch instead of over the weekend.

If there's a step in your build pipeline that takes an unreasonable amount of time, it's worth checking why. In my current project, the slowest part of our build pipeline is the Cypress tests. (They're also the most unreliable part.)

[+] ravishi|5 years ago|reply
At my second job in the industry I worked on a Python project that had to be deployed in a kind of sandboxed production environment where we had no internet access.

Deploys were painful, as any missing dependency had to be searched for on our laptops over 3G, copied to external storage, plugged into a Windows machine, uploaded to the production server through SCP, and then deployed manually over SSH. Sometimes we spent hours doing this again and again until all dependencies were finally resolved.

I worked there for almost a year, did many cool gigs and learned a lot. But my most valuable contribution came when, tired of the unpredictable torture that deploys had become, I started researching solutions. I set up a PyPI proxy on one of our spare office machines and routed all my daily package installs through it. Then I copied the entire proxy contents to the production machine before every deploy, and voila, no more surprises.

I left this job a few weeks later, but have heard that this solution was very useful for many devs that joined the team afterwards.

[+] renke1|5 years ago|reply
> If there's a step in your build pipeline that takes an unreasonable amount of time, it's worth checking why. In my current project, the slowest part of our build pipeline is the Cypress tests. (They're also the most unreliable part.)

Would you say the (slow and unreliable) Cypress tests are worth it still?

[+] holtalanm|5 years ago|reply
> In my current project, the slowest part of our build pipeline is the Cypress tests

Oh man, I feel your pain.

[+] aidanhs|5 years ago|reply
I sympathise a lot with this post! Git cloning can be shockingly slow.

As a personal anecdote, clones of the Rust repository in CI used to be pretty slow, and on investigating we found out that one key problem was cloning the LLVM submodule (which Rust has a fork of).

In the end we put in place a hack to download the tar.gz of our LLVM repo from github and just copy it in place of the submodule, rather than cloning it. [0]

Also, as a counterpoint to some other comments in this thread - it's really easy to just shrug off CI getting slower. A few minutes here and there adds up. It was only because our CI would hard-fail after 3 hours that the infra team really started digging in (on this and other things) - had we left it, I suspect we might be at around 5 hours by now! Contributors want to do their work, not investigate "what does a git clone really do".

p.s. our first take on this was to have the submodules cloned and stored in the CI cache, then use the rather neat `--reference` flag [1] to grab objects from this local cache when initialising the submodule - incrementally updating the CI cache was way cheaper than recloning each time. Sadly the CI provider wasn't great at handling multi-GB caches, so we went with the approach outlined above.

[0] https://github.com/rust-lang/rust/blob/1.47.0/src/ci/init_re...

[1] https://github.com/rust-lang/rust/commit/0347ff58230af512c95...
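For readers who haven't used the `--reference` flag mentioned above, here is a minimal sketch of the idea against a throwaway local repo (all paths and the commit message are illustrative, not from the Rust setup):

```shell
# A clone with --reference borrows objects from a local cache repository
# instead of transferring them again over the network.
set -e
work=$(mktemp -d)
git init -q "$work/upstream"                 # stand-in for the real remote
git -C "$work/upstream" -c user.email=ci@example.com -c user.name=ci \
  commit -q --allow-empty -m "initial commit"
# The CI cache: a full mirror kept between builds, updated incrementally.
git clone -q --mirror "file://$work/upstream" "$work/cache.git"
# A fresh checkout that borrows objects from the cache.
git clone -q --reference "$work/cache.git" "file://$work/upstream" "$work/checkout"
# The borrowed-objects link is recorded here:
cat "$work/checkout/.git/objects/info/alternates"
```

Running `git -C "$work/cache.git" fetch -q` before each build keeps the cache current; `--dissociate` can be added to the clone if the cache might be deleted later.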

[+] sxp|5 years ago|reply
> Even though we’re telling Git to do a shallow clone, to not fetch any tags, and to fetch the last 50 commits ...

What is the reason for cloning 50 commits? Whenever I clone a repo off GitHub for a quick build and don't care about sending patches back, I always use --depth=1 to avoid any history or stale assets. Is there a reason to get more commits if you don't care about having a local copy of the history? Do automated build pipelines need more info?
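To make the depth question concrete, a minimal sketch against a throwaway local repo (paths and commit messages are illustrative) showing what `--depth` actually limits:

```shell
# Build a throwaway 10-commit repo, then compare shallow-clone depths.
set -e
work=$(mktemp -d)
git init -q "$work/origin"
cd "$work/origin"
git config user.email ci@example.com
git config user.name ci
for i in $(seq 1 10); do
  echo "$i" > file.txt
  git add file.txt
  git commit -qm "commit $i"
done
cd "$work"
# --depth=1 fetches only the tip commit of the default branch.
git clone -q --depth=1 "file://$work/origin" depth1
# --depth=50 exceeds the history here, so it behaves like a full clone.
git clone -q --depth=50 "file://$work/origin" depth50
echo "depth=1:  $(git -C depth1 rev-list --count HEAD) commit(s)"
echo "depth=50: $(git -C depth50 rev-list --count HEAD) commit(s)"
```

In a CI pipeline the extra commits only matter if a later step needs history, e.g. to diff against a merge base, as the replies below discuss.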

[+] mehrdadn|5 years ago|reply
Some tools (like linters) might need to look at the actual changes that occurred for various reasons, such as to avoid doing redundant work on unmodified files. To do that, you need all the merge bases... which can present a kind of chicken-and-egg problem because, to figure this out with git, you need the commits to be there locally to begin with. I'm sure you can find a way around it if you put enough effort into scripting against the remote git server, but you might need to deal with git internals in the process, and it's kind of a pain compared to just cloning the whole repo.
[+] MarkSweep|5 years ago|reply
I can’t speak for the original post, but I’ve seen other people[1] increase the commit count because part of the build process looks for a specific commit to checkout after cloning. If you have pull requests landing concurrently and you only clone the most recent commit, there is a race condition between when you queue the build with a specific commit id and when you start the clone.

All that being said, I don't know why you would need your build agents to clone the whole damn repo for every build. Why not keep a copy around? That's what TFS does.

One other thing I've seen to reduce the Git clone bottleneck is to clone from Git once, create a Git bundle from the clone, upload the bundle to cloud storage, and then have the subsequent steps use the bundle instead of cloning directly. See these two files for the .NET Runtime repo[2][3]. I assume they do this because the clone step is slow or unreliable and then the subsequent moving around of the bundle is faster and more reliable. It also makes every node get the exact same clone (they build on macOS, Windows, and Linux).

Lastly, be careful with the depth option when cloning. It causes a higher CPU burden on the remote. You can see this in the console output when the remote says it is compressing objects. And if you subsequently do a normal fetch after a shallow clone, you can cause the server to do even more work[4].

1: https://github.com/dotnet/runtime/pull/35109

2: https://github.com/dotnet/runtime/blob/693c1f05188330e270b01...

3: https://github.com/dotnet/runtime/blob/693c1f05188330e270b01...

4: https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomm...
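The bundle trick described above can be sketched with plain git on a throwaway repo (paths are illustrative; the actual .NET Runtime setup in [2][3] is more elaborate):

```shell
set -e
work=$(mktemp -d)
git init -q "$work/repo"
git -C "$work/repo" -c user.email=ci@example.com -c user.name=ci \
  commit -q --allow-empty -m "first"
# Pack HEAD and all refs into one file; this is what gets uploaded to
# cloud storage once, instead of every agent hitting the git server.
git -C "$work/repo" bundle create "$work/repo.bundle" HEAD --all
# Later pipeline steps clone from the bundle file.
git clone -q "$work/repo.bundle" "$work/from-bundle"
git -C "$work/from-bundle" log --oneline
```

A bundle is an ordinary file, so copying it around is as reliable as the blob store serving it, and every node starts from byte-identical history.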

[+] globular-toast|5 years ago|reply
Tags. All of my builds use `git describe` to get a meaningful version number for the build.
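For anyone who hasn't used it, a quick sketch of what `git describe` returns, and why it needs tags to be present locally (tag names are illustrative):

```shell
set -e
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.email ci@example.com
git config user.name ci
git commit -q --allow-empty -m "release"
git tag -a v1.0.0 -m "release v1.0.0"      # annotated tag marks the release
git commit -q --allow-empty -m "fix"
git commit -q --allow-empty -m "another fix"
# Output format: <nearest tag>-<commits since tag>-g<abbreviated hash>
git describe
```

A clone made with `--no-tags` (or one shallow enough to miss the tagged commit) breaks this versioning scheme, which is one reason a CI checkout sometimes has to fetch more than the tip commit.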
[+] AdamJacobMuller|5 years ago|reply
I expected this to be some micro-optimization of moving a thing from taking 10 seconds to 100ms.

> Cloning our largest repo, Pinboard went from 40 minutes to 30 seconds.

This is both very impressive and very disheartening. If a process in my CI were taking 40 minutes, I would have started investigating long before it got that bad.

I don't mean to throw shade on the Pinterest engineering team, but it speaks to an institutional complacency with things like this.

I'm sure everyone was happy when the clone took 1 second.

I doubt anyone noticed when the clone took 1 minute.

Someone probably started to notice when the clone took 5 minutes but didn't look.

Someone probably tried to fix it when the clone was taking 10 minutes and failed.

I wonder what 'institutional complacencies' we have. Problems we assume are unsolvable but are actually very trivial to solve.

[+] nemothekid|5 years ago|reply
I'm not sure this is complacency - this just seems like regular old tech debt. The build takes 40 minutes but everyone has other things to do and there is no time to tend to the debt. Then one day someone has some cycles and discovers a one line change fixes the underlying issue.

I'm sure many engineering projects have similar improvements that just get a ticket opened and are never revisited due to the mountain of other seemingly pressing issues. From the IPO to the start of the year, Pinterest's stock price had been trending downwards, so I'm sure there was more external pressure to increase profitability than to fix CI build times. The stock has completely turned around since COVID, so I'm sure that changes things.

[+] fn1|5 years ago|reply
> I wonder what 'institutional complacencies' we have. Problems we assume are unsolvable but are actually very trivial to solve.

I spend a lot of time optimizing builds, because the effect is a multiplier for everything else in development.

But it is not an easy task. One issue with performance work is that you have to plan it carefully, or you will spend a lot of time sitting around waiting for results:

Try the build: 40 minutes. Maybe add profiling statements, because you forgot them: another 40 minutes. Change something and try it out: no change, 40 minutes. Find another optimization which decreases time locally and try it out: 39.5 minutes, because on the build-server that optimization does not work that well. etc.

You just spent 160 minutes and shaved 0.5 minutes off the build.

I'm not saying it's not worth it, but that line of work is not often rewarding.

On the flip side, I once took two hours to write a Java agent that caches File.exists for class loading, and made local startup five times faster, because the corporate virus scanner got triggered less often.

[+] innagadadavida|5 years ago|reply
Considering the build host does this hundreds of times every day, a better solution would be to simply keep a git repo cache locally; it should be secure and reliable, given git's object store design, right?

Any simple wrappers for git that can do this transparently?
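I don't know of a standard wrapper, but the idea can be sketched in a few lines of shell; the function name, cache location, and demo paths below are all hypothetical:

```shell
set -e
CACHE_DIR="${CACHE_DIR:-$HOME/.cache/git-mirrors}"

# cached_clone <url> <dest>: keep a per-URL mirror, clone with --reference.
cached_clone() {
  url=$1; dest=$2
  mirror="$CACHE_DIR/$(echo "$url" | tr -c 'A-Za-z0-9' _).git"
  if [ -d "$mirror" ]; then
    git -C "$mirror" fetch -q --prune      # incremental update, cheap
  else
    mkdir -p "$CACHE_DIR"
    git clone -q --mirror "$url" "$mirror" # one-time full clone
  fi
  # --dissociate copies the borrowed objects so the checkout stays
  # valid even if the cache is pruned later.
  git clone -q --reference "$mirror" --dissociate "$url" "$dest"
}

# Demo against a throwaway local "upstream".
work=$(mktemp -d)
CACHE_DIR="$work/cache"
git init -q "$work/upstream"
git -C "$work/upstream" -c user.email=ci@example.com -c user.name=ci \
  commit -q --allow-empty -m "initial"
cached_clone "file://$work/upstream" "$work/build1"
cached_clone "file://$work/upstream" "$work/build2"  # second call reuses the mirror
```

Every call after the first pays only for an incremental fetch into the mirror plus a local object copy, not a full network clone.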

[+] NikhilVerma|5 years ago|reply
I doubt they started off with a 40-minute delay. It probably crept up slowly as the repo got bigger, and nobody noticed because of the gentle gradient. And they didn't have the time/resources to look into it.
[+] edoceo|5 years ago|reply
You're confusing a full clone, which for a huge repo is expected to take a long time, with the fix, which was to specify a single refspec so they don't clone the full repo in CI.
[+] bluedino|5 years ago|reply
People probably did complain, but they were met with, "We're cloning a 20GB repo! It's not going to happen in an instant!"
[+] gpapilion|5 years ago|reply
I’ve found as an industry we’ve moved to more complex tools, but haven’t built the expertise in them to truly engineer solutions using them. I think lots of organizations could find major optimizations, but it requires really learning about the technology you’re utilizing.
[+] chinhodado|5 years ago|reply
When I first joined one of my previous jobs, the build process had a checkout stage that blew away the git folder and checked out the whole repo from scratch every time (!). Since the build machine was reserved for that build job, I simply made some changes to do `git clean -dfx && git reset --hard && git checkout origin/branch` instead. It shaved something like 15 minutes off the build time, which was about 50% of the total.
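The reuse-the-workspace pattern described in this comment, sketched on a throwaway repo (file and path names are illustrative):

```shell
set -e
work=$(mktemp -d)
git init -q "$work/origin"
cd "$work/origin"
git config user.email ci@example.com
git config user.name ci
echo 'int main(void) { return 0; }' > app.c
git add app.c
git commit -qm "initial"
branch=$(git symbolic-ref --short HEAD)   # master or main, depending on git version
# The persistent build workspace, cloned exactly once.
git clone -q "file://$work/origin" "$work/ws"
cd "$work/ws"
echo junk > build-artifact.o              # simulate leftovers from a previous build
git fetch -q origin                       # refresh refs; no re-clone
git clean -qdfx                           # drop untracked and ignored files
git reset -q --hard "origin/$branch"      # force the tree to the target commit
```

The fetch transfers only new objects, so the per-build cost scales with the size of the change rather than the size of the repository.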
[+] SamuelAdams|5 years ago|reply
> In the case of Pinboard, that operation would be fetching more than 2,500 branches.

Ok, I'll ask: why does a single repository have over 2,500 branches? Why not delete the ones you no longer use?

[+] dahfizz|5 years ago|reply
Where I work, we don't delete branches, because there is no reason to. Git branches have essentially zero overhead, and deleting them is just extra complexity in the CI toolchain. Deleting branches also deletes context in some scenarios. When dealing with an old codebase, it's nice to be able to check out the exact version of the code at some point in time without having to dig through the log for hashes and then deal with a detached head.

The example in the article is a bit of a special case. It is a huge, and old, monorepo. In the typical case, fetching everything and fetching master is equivalent because all commits in all branches make their way into master anyway. If you have a weird branching strategy where you maintain multiple, significantly diverged branches at once, but only care about one of those branches at build time, then this optimization would save you time.

[+] tetris11|5 years ago|reply
If you have several releases with different targets, and want to make future security updates accessible to all of them.
[+] DoingIsLearning|5 years ago|reply
They could already be doing that.

That is, if we assume they copy Google's philosophy of a single monolithic repository.

Pinterest has about 2,000 employees. Assuming 20% are active developers, that's about 400 people, which gives you roughly 6 branches per developer, which wouldn't be outrageous.

[+] hn_throwaway_99|5 years ago|reply
Because they use a monorepo. With monorepos at large companies the individual git repositories will be much larger and contain a ton more branches than if you have a repository-per-project model.
[+] casperb|5 years ago|reply
Probably because they have 1600 employees and the 2500 branches are the active ones.
[+] est|5 years ago|reply
monorepo culture.
[+] jniedrauer|5 years ago|reply
One of the (many) things that drives me batty about Jenkins is that there are two different ways to represent everything. These days the "declarative pipelines" style seems to be the first class citizen, but most of the documentation still shows the old way. I can't take the code in this example and compare it trivially to my pipelines because the exact same logic is represented in a completely different format. I wish they would just deprecate one or the other.
[+] chrisweekly|5 years ago|reply
I find the self-congratulatory tone in the post kind of off-putting, akin to "I saved 99% on my heating bill when I started closing doors and windows in the middle of winter."

If your repos weigh in at 20GB in size, with 350k commits, subject to 60k pulls in a single day, having someone with half a devops clue take a look at what your Jenkinsfile is doing with git is not exactly rocket science or a needle in a haystack. (Here's hoping they discover branch pruning too; how many of those 2500 branches are active?)

As a consultant I've seen plenty of appallingly poor workflows and practices, so this isn't all that remarkable... but to me the post seems kind of pointless.

[+] YokoZar|5 years ago|reply
Can someone explain the intended meaning behind calling six different repositories "monorepos"?

It sounds to me like you don't have a monorepo at all and instead have six repositories for six project areas.

[+] muststopmyths|5 years ago|reply
I'm a git noob, so I'm sorry if this sounds dumb, but wouldn't

git clone --single-branch

achieve the same thing (i.e, check out only the branch you want to build) ?

Also, why would you not check out only one branch when doing CI?

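Roughly, yes: `--single-branch` limits which refs are fetched, which is the same idea as the refspec fix in the article. A sketch against a throwaway repo (branch names are illustrative):

```shell
set -e
work=$(mktemp -d)
git init -q "$work/origin"
cd "$work/origin"
git config user.email ci@example.com
git config user.name ci
git commit -q --allow-empty -m "initial"
main=$(git symbolic-ref --short HEAD)
for i in $(seq 1 5); do git branch "feature-$i"; done
cd "$work"
git clone -q "file://$work/origin" all-branches
git clone -q --single-branch --branch "$main" "file://$work/origin" one-branch
echo "default clone sees:"; git -C all-branches branch -r
echo "--single-branch sees:"; git -C one-branch branch -r
```

Note the distinction between cloning and checking out: a default clone checks out only one branch but still transfers the objects for every branch, which is exactly the cost the article's fix avoids.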
[+] tracer4201|5 years ago|reply
I truly appreciate articles like this. It's heartening to see other companies running into the kinds of issues I've run into or had to deal with, and more so that their culture openly discusses and shares these learnings with the broader community.

The most effective organizations I’ve worked at built mechanisms and processes to disseminate these kinds of learnings and have regular brown bags on how a particular problem was solved or how others can apply their lessons.

Keep it up Pinterest engineering folks.

[+] uglycoyote|5 years ago|reply
He says that "Pinboard has more than 350K commits and is 20GB in size when cloned fully." I'm not clear though, exactly what "cloned fully" means in context of the unoptimized/optimized situation.

He says it went from 40 minutes to 30 seconds. Does this mean they found a way to grab the whole 20GB repo in 30 seconds? seems pretty darn fast to grab 20GB, but maybe on fast internal networks?

Or maybe they meant that it was 20GB if you grabbed all of the many thousands of garbage branches, when Jenkins really only needed to test "master", and finding a solution that allowed them to only grab what they needed made things faster.

I'm also curious about the incremental vs "cloning fully" aspect of it. Does each run of Jenkins clone the repo from scratch or does it incrementally pull into a directory where it has been cloned before? I could see how in a cloning-from-scratch situation the burden of cloning every branch that ever existed would be large, whereas incrementally I would think it wouldn't matter that much.

[+] bluedino|5 years ago|reply
My similar story goes like this: we had CRM software that let you set up user-defined menu options. Someone at our organization decided to make a set of nested menu options where you could configure a product, with every possible combination being assigned a value!

So if you had a large, blue, second-generation widget with a foo accessory and option buzz, you were value 30202, and if it was the same one except red, it was 26420...

Every time the CRM software started up, it cycled through the options and generated a new XML file with all the results; this took about a minute and created something like a 60MB file.

The fix was to basically version the XML file and the options definition file. If someone had already generated that file, just load the XML file instead of parsing and looping through the options file. Started up in 5 seconds!

What was the excuse that it took so long in the first place? "The CRM software is written in Java, so it's slow."

[+] saagarjha|5 years ago|reply
Seems like there's a lot of hostility towards the title, which might be considered the engineering blog equivalent of clickbait. If the authors are around, the post was quite informative and interesting to read, but I'm sure it would have been much more palatable with a more descriptive title.

But back on topic: does anyone have any insight into when git fetches things, and what it chooses to grab? Is it just a case of "when we were writing git, we chose these commands as ones where it's useful to implicitly run a 'please update things' step first"? For example, git pull seems to run a fetch for you, etc.

[+] sambe|5 years ago|reply
Ok, I'll ask the obvious question: why did setting the branches option to master not already do this?

EDIT

https://www.jenkins.io/doc/pipeline/steps/workflow-scm-step/ makes it sounds like the branches option specifies which branches to monitor for changes, after which all branches are fetched. This still seems like a counter-intuitive design that doesn't fit the most common cases.