This is super interesting, as I maintain a 1M-commit / 10GB repo at work, and I'm researching ways to have it cloned by users faster. Basically, for now I do a very similar thing manually: storing a "seed" repo in S3 and having a custom script fetch from S3 instead of doing `git clone`. (It's faster than cloning from GitHub, as apart from not having to enumerate millions of objects, S3 doesn't throttle the download, while GitHub seems to throttle at 16MiB/s.)
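For anyone curious, the seed-repo trick can be sketched end to end with plain git. Everything below runs against throwaway local repos; the S3 transfer is shown only as comments, and the bucket name is made up:

```shell
export GIT_AUTHOR_NAME=t GIT_AUTHOR_EMAIL=t@example.com \
       GIT_COMMITTER_NAME=t GIT_COMMITTER_EMAIL=t@example.com
tmp=$(mktemp -d)
cd "$tmp"

# "Server" side: the big repo, plus a seed bundle built from it.
git init -q --bare origin.git
git clone -q origin.git work
(cd work \
  && git commit -q --allow-empty -m "history" \
  && git push -q origin HEAD:main)
git -C origin.git bundle create "$tmp/seed.bundle" --branches
# In production: aws s3 cp seed.bundle s3://my-seed-bucket/seed.bundle

# Client side: clone from the seed, then top up from the real remote,
# so only objects newer than the bundle cross the wire.
# In production: aws s3 cp s3://my-seed-bucket/seed.bundle seed.bundle
git clone -q -b main "$tmp/seed.bundle" fast-clone
git -C fast-clone remote set-url origin "$tmp/origin.git"
git -C fast-clone fetch -q origin
git -C fast-clone log --oneline -1
```

The key point is that a bundle is a dumb file, so any blob store or CDN can serve it at full speed; git only has to negotiate the (small) delta afterwards.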
Semi-related: I always wondered but never got time to dig into what exactly the contents of the exchange between server and client are; I sometimes notice that when creating a new branch off main (still talking about the 1M-commit repo), with just one new tiny commit, the amount of data the client sends is way bigger than I expected (tens of MBs). I always assumed the client somehow established with the server that it has a certain sha, and only uploads the missing commit, but it seems that's not exactly the case when creating a new branch.
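You can actually watch that exchange: `GIT_TRACE_PACKET=1` dumps every pkt-line the client and server send, including the ref advertisement and the have/want negotiation before the pack is built. A tiny local sketch (on the 1M-commit repo the interesting part is how much the trace shows being advertised and sent for a one-commit branch):

```shell
export GIT_AUTHOR_NAME=t GIT_AUTHOR_EMAIL=t@example.com \
       GIT_COMMITTER_NAME=t GIT_COMMITTER_EMAIL=t@example.com
tmp=$(mktemp -d)
cd "$tmp"
git init -q --bare remote.git
git clone -q remote.git local
cd local
git commit -q --allow-empty -m "c1"
git push -q origin HEAD:main
git checkout -q -b feature
git commit -q --allow-empty -m "tiny change"
# Dump every protocol line (ref advertisement, haves/wants, pack) to a log:
GIT_TRACE_PACKET=1 git push origin feature 2>"$tmp/push-trace.log"
grep "packet:" "$tmp/push-trace.log" | head -n 5
```

`GIT_TRACE_CURL=1` (for HTTPS remotes) similarly shows how many bytes each request actually carried.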
Funny you say this. At my last job I managed a 1.5TB Perforce depot with hundreds of thousands of files and had the problem of "how can we speed up CI". We were on AWS, so I synced the repo, created an EBS snapshot, and used that to make a volume, with the intention of reusing it (as we could shove build intermediates in there too).
It was faster to just sync the workspace over the internet than it was to create the volume from the snapshot, and a clean build was quicker from the just-synced workspace than the snapshotted one, presumably because EBS volumes restored from snapshots lazily load their blocks on first read.
We just moved our build machines to the same VPC as the server and our download speeds were no longer an issue.
Have you looked into Scalar? It's built into Microsoft's fork of Git (microsoft/git) and designed to handle repos that are much larger than yours.
microsoft/git is focused on addressing these performance woes and making the monorepo developer experience first-class. The Scalar CLI packages all of these recommendations into a simple set of commands.
To try this feature out, you could have the server advertise a bundle URI pointing at a bundle made with `git bundle create [bundle-file] --branches`, hosted on a server within your network - it _should_ make a pretty big difference in local clone times.
I can't imagine you haven't looked at this, but I'm curious: Do shallow clones help at all, or if not what was the problem with them? I'm willing to believe that there are usecases that actually use 1M commits of history, but I'd be interested to hear what they are.
I have a vague recollection that GitHub is optimized for whole-repo cloning, and that they were asking projects not to do shallow fetching automatically, for performance reasons.
I believe there is a bit of a footgun here because if you don't git clone then you don't fetch all branches, just the default. Can be very confusing and annoying if you know a branch exists on remote but don't have it locally (the first time you hit it, at least).
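That footgun is fixable after the fact: a shallow single-branch clone only configures a fetch refspec for the default branch, but you can add the missing branch to the remote's config and fetch it. A sketch with throwaway local repos:

```shell
export GIT_AUTHOR_NAME=t GIT_AUTHOR_EMAIL=t@example.com \
       GIT_COMMITTER_NAME=t GIT_COMMITTER_EMAIL=t@example.com
tmp=$(mktemp -d)
cd "$tmp"
git init -q --bare remote.git
git clone -q remote.git work
(cd work \
  && git commit -q --allow-empty -m "one" \
  && git commit -q --allow-empty -m "two" \
  && git push -q origin HEAD:main HEAD:topic)

# Shallow, single-branch clone: one commit of history, only 'main'.
# (file:// matters: --depth is ignored for plain local-path clones.)
git clone -q --depth 1 --branch main "file://$tmp/remote.git" shallow
git -C shallow branch -r            # only origin/main is known
# The fix: teach the remote config about the other branch, then fetch.
git -C shallow remote set-branches --add origin topic
git -C shallow fetch -q --depth 1 origin
git -C shallow branch -r            # now origin/topic as well
```
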
A commit which does nothing more than change permissions of a file would probably beat that, from an information theory perspective.
You might say, "nay! the octal triple of a file's unix permissions requires 3+3+3 bits, which is 9, which is greater than the 8 bits of a single ascii character!"
But, actually, Git does not support any file permissions other than 644 and 755. So a change from one to the other could theoretically be represented in just one bit of information.
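You can see this in a throwaway repo: flipping the executable bit changes only the mode column in the tree entry (100644 vs 100755 are the only modes Git records for regular files), while the blob hash stays identical:

```shell
export GIT_AUTHOR_NAME=t GIT_AUTHOR_EMAIL=t@example.com \
       GIT_COMMITTER_NAME=t GIT_COMMITTER_EMAIL=t@example.com
tmp=$(mktemp -d)
cd "$tmp"
git init -q modes
cd modes
echo 'echo hi' > script.sh
git add script.sh
git commit -q -m "add script"
# Flip only the executable bit; the file's contents are unchanged:
git update-index --chmod=+x script.sh
git commit -q -m "make executable"
git show --raw HEAD   # the mode column changes, the blob hashes match
```
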
I did find it a little funny that my patch was so small but my commit message was so long. Also, I haven't successfully landed it yet, I keep being too lazy to roll more versions.
As someone who used and administered p4 for ages, I regard git as a regression in this regard. Making git a fully distributed system is really expensive for certain use cases. My current employer still uses p4 for large integrated-circuit workflow assets.
A previous workplace was trying to migrate from svn to git when we realized that every previous official build had checked in the resulting binaries. A sane thing to do in svn, where the cost falls only on the server, but a naive conversion would have cost 50 GB on every client.
Not the only Mercurial feature where that's the case, sadly. I keep rooting for the project to implement a Mercurial frontend over a git object database, but they seem to be limited by missing git features.
How did it solve them, and how are Mercurial's bundles better than git's?
if I am reading the manpage right, the feature set seems pretty compatible. "hg bundle" looks pretty identical to "git bundle".. and "hg clone"'s "-R" option seems pretty similar to "git clone"'s "--reference".
One consequence of git clone is that if you have mega repos, it kind of ejects everything else from your cache for no win.
You'd actually rather special case full clones and instruct the storage layer to avoid adding to the cache for the clone. But this isn't always possible to do.
Git bundles seem like a good way to improve the performance of other requests, since they punt off to a CDN and protect the cache.
captn3m0 | 1 year ago:
[0]: https://www.kernel.org/best-way-to-do-linux-clones-for-your-...
[1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/mricon/k...
bastardoperator | 1 year ago:
https://github.com/microsoft/git
ks2048 | 1 year ago:
According to SO, newer versions of git can do:
autarch | 1 year ago:
Hah, got you beat: https://github.com/eki3z/mise.el/pull/12/files
It's one ASCII character, so a one-byte patch. I don't think you can get smaller than that.
retroflexzy | 1 year ago:
It took the group several years to narrow in on.
pR0Ps | 1 year ago:
I feel like a bit of a fraud because this was the PR that got me the "Mars 2020 Contributor" badge...
geenat | 1 year ago:
mercurial had it for ages.
svn had it for ages.
perforce had it for ages.
just keep the latest binary, or last x versions. Let us purge the rest easily.
dgfitz | 1 year ago:
It is superior, and it’s not even much of a comparison.