This is a really well written partial set of release notes. I was curious and looked at the full release notes [1], and I think these are pretty well written as well. I'm very impressed, especially given that git has such a large set of contributors.
They were off by a factor of 10 on the likelihood of being struck and killed by lightning, according to the NWS website.
To clarify: the likelihood of being merely struck by lightning is ~1/1,000,000 per year. The likelihood of being struck and killed is 1/10,000,000, or about 1/2^23.25.
Given this, you would only have to be struck and killed by lightning 6.8 years in a row to equal a sha1 hash collision probability.
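A quick back-of-the-envelope check of that arithmetic (a sketch using only the NWS figures quoted above):

```python
import math

p_killed = 1 / 10_000_000            # annual odds of being struck and killed (NWS)

# Express the annual odds as a power of two:
bits_per_year = -math.log2(p_killed)             # about 23.25 bits

# A collision between two specific SHA-1 hashes is 1 in 2^160, so count
# how many consecutive fatal years it takes to accumulate 160 bits:
years = 160 / bits_per_year
print(f"{bits_per_year:.2f} bits/year, {years:.1f} years in a row")
# 23.25 bits/year, 6.9 years in a row
```

Which comes out at just under seven consecutive years, in line with the figure above.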
More importantly, the comparison is useless. The odds of running into issues with SHA-1 collisions in Git is a very different question from just the odds of two random SHA-1 hashes colliding.
The situational comparison originally used by Linus for a sha1 collision was all the members of your development team being killed and eaten by wolves. I'm not sure if he gave a time frame in which that would need to occur.
That is a nice writeup. One of the interesting things for me was to see which topics you decided to cover and which to omit. For instance, I noted `clone --reference --recurse-submodules` as a potential topic of interest, but I am afraid to point anybody to the `--reference` option due to its hidden dangers.
I'm also curious how you came up with 19,290 for a birthday paradox on a 7-hex hash. I think it's 16,384, but probability can sometimes be tricky. :)
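For reference, both numbers can be derived (a sketch: 16,384 is the plain square root of the space, while 19,290 is the usual ~50%-collision point from the birthday approximation):

```python
import math

N = 16 ** 7                         # 2^28 possible 7-hex-digit abbreviations

# Naive "square root of the space" rule of thumb:
rule_of_thumb = math.isqrt(N)       # 16384

# Draws needed for a ~50% chance of at least one collision, from the
# birthday approximation 1 - exp(-k^2 / 2N) = 1/2, i.e. k = sqrt(2 ln 2 * N):
k = math.sqrt(2 * math.log(2) * N)  # about 19290.7
print(rule_of_thumb, int(k))
```

So 19,290 is the correction factor sqrt(2 ln 2) ≈ 1.177 applied to 16,384.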
I'm curious but not motivated enough to really search for it, but for ambiguous hash abbreviations, why not select the oldest, since presumably it was unique at the time it was created?
edit: I guess that information must not exist or I assume they'd be doing it.
As others have noted, there's not always an unambiguous date for some object types (the best you can do for blobs is to find the first commit in which they appeared, and use its date).
However, there's a more complicated issue with timestamps, which is that you care about what was in the repository of the person who generated the sha1, at the time of generation. So you could merge in history that includes older commits, and invalidate your sha1s with "older" objects.
So the timestamp of interest is not the one in the objects themselves, but when they entered some particular repository (and not even some well-known repository; the local clone of whoever happened to generate the sha1). That being said, those two things correlate a lot in practice, and auto-picking the oldest commit might be a useful heuristic.
It would be a fun project to implement as an option for `core.disambiguate`.
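As a rough illustration only (this is not how `core.disambiguate` works today, and it only covers commits, which is exactly the limitation discussed above), the heuristic could be sketched like this, with the date logic factored out:

```python
import subprocess

def pick_oldest(prefix: str, listing: str):
    """Toy heuristic: given `git rev-list --all --format='%H %ct'` output,
    return the commit with the oldest committer date matching `prefix`.
    Blobs and trees are ignored entirely."""
    candidates = []
    for line in listing.splitlines():
        if line.startswith("commit "):
            continue  # rev-list emits a "commit <sha>" header before each format line
        sha, timestamp = line.split()
        if sha.startswith(prefix):
            candidates.append((int(timestamp), sha))
    return min(candidates)[1] if candidates else None

def oldest_commit_for_prefix(prefix: str):
    out = subprocess.run(
        ["git", "rev-list", "--all", "--format=%H %ct"],
        capture_output=True, text=True, check=True,
    ).stdout
    return pick_oldest(prefix, out)
```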
That heuristic would work most of the time. But it rests on the assumption that "if commit A has an older timestamp than commit B, then any user who saw commit B must have also seen commit A", which is not reliable in a distributed version control system.
It seems safer to just explicitly tell the user when they're trying to work with an ambiguous hash.
Yeah, I guess dates are only available for commits but not blobs or trees.
This is suggested by the disambiguation listing; if there were dates, I'd hope they would be displayed.
I think approximate dates might be inferred, but since it might be misleading and costlier to determine it makes sense to leave it out - at least in this version of git.
The issue is a different one: I believe you're considering one specific situation while there are others to ponder. If someone copy-and-pasted part of the hash, or had some tool that always reduced the output to the first few digits, or any number of similar situations, how would you be able to tell that the user was actually after the oldest commit?
It seems much easier to indicate there's a problem, a conflict, and let the user solve it.
My name is Angela and I do research for Bitbucket. I’m kicking off a round of discussions with people who use Git tools. Ideally, I’d like to talk to people that sit on a team of 3 or more. If this is you, I would love to talk to you about your experience using Git tools, or just some of the pain points that are keeping you up at night when doing your jobs.
We’ll just need 30 mins of your time, and as a token of my thanks to those that participate, I’d like to offer a US$50 Amazon gift voucher.
If you’re interested, just shoot me an email with your availability over the next few weeks and we can set up a time to chat for 30 minutes. Please also include your timezone so we can schedule a suitable time (as I’m located in San Francisco). Hope to talk to you soon!
I'm curious if anyone knows if the optimizations Twitter made to improve fetch performance for large, active repos have made it upstream yet? I don't work there anymore and neither do any of the people who were originally doing that work, but it was a pretty impressive speed up (I could git pull thousands of commits and be done in under a second on a 3GB repo with no large objects). I know the watchman support made it in, which was the other half of what made large repos perform well, but I haven't seen mention of the log-structured patch queue stuff that helped the server by eliminating most of the work to calculate what to send on a fetch. Anyone know?
Case-insensitivity is important for some to be able to reliably remember a string. I won't easily retain the difference between 'b4dQbFs31' and 'b4DqBfs31'.
Same thing when speaking it out loud. 'B four D capital Q B capital F s thirty-one' is way more convoluted and error-prone than 'B four D Q B F S thirty-one'.
The best thing I've found that fits this criterion is Crockford's Base 32 [1], basically the extension of hex digits, removing letters ILOU.
But Base 32 (case-insensitive by construction) only gets us 5 bits per character, a mere 20% reduction in length over the 4 bits per character of base 16. So instead of the 20-bit `1ab2f` we could express the same value as something like `1qm3`.
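To make the trade-off concrete, here's a tiny encoder for Crockford's alphabet (a sketch; note that the exact digits you get differ from the `1qm3` example above, which was illustrative):

```python
# Crockford's Base 32 alphabet: hex-style digits extended, minus I, L, O, U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def b32_crockford(n: int) -> str:
    digits = ""
    while True:
        n, r = divmod(n, 32)
        digits = CROCKFORD[r] + digits
        if n == 0:
            return digits

# The 20-bit value from above: 5 hex characters become 4 base-32 ones.
print(f"{0x1ab2f:x} -> {b32_crockford(0x1ab2f)}")   # 1ab2f -> 3ASF
```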
Still better is grouped alphanumeric without potentially ambiguous characters (i.e. generate number 1s but not letter Is, number 0s but not letter Os). I wrote a disambiguation library based upon the checksums present in IBAN at https://github.com/globalcitizen/php-iban ... it's surprising how accurate the mistranscription suggestions are.
In general, pure-number systems are better if possible, but where you have issues squashing in enough data in suitably compact form, transitioning to alphanumeric is better.
You can also consider the use of prefix-based systems, either utilizing temporal epochs or node-specific prefixes, both of which can utilize readable aliases.
Finally, for anything expecting human transcription, checksum systems are awesome!
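On the checksum point, a minimal sketch of an ISO 7064 MOD 97-10 check (the check-digit scheme behind IBAN; this is the idea only, not the php-iban implementation itself):

```python
# ISO 7064 MOD 97-10: append two check digits so the whole number is ≡ 1 mod 97.
def add_check_digits(digits: str) -> str:
    check = 98 - (int(digits) * 100) % 97
    return f"{digits}{check:02d}"

def is_valid(candidate: str) -> bool:
    return int(candidate) % 97 == 1

code = add_check_digits("123456789")
print(code, is_valid(code))                 # 12345678978 True
print(is_valid("12345688978"))              # False: one mistyped digit is caught
```

Because 97 is prime and does not divide 10, every single-digit transcription error changes the value mod 97, so all such errors are detected.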
It's interesting that Git 2.11 shortened the delta chains on aggressive repacks, when Mercurial happily creates chains of > 1000 deltas (AFAIK it doesn't have a hard limit; it stops using deltas when the size of the required deltas is larger than the full text).
Although it's worth noting mercurial and git use different delta formats.
Master Coder: Hmm. We refer to changes by a long, totally non-human-parseable string of characters that nobody can memorize,
and when we abbreviate it, it doesn't work 100% of the time. What can we do about it?
Novice Apprentice: Well... how about we stop using a long totally non-human-parseable string of characters that no
human can memorize just to briefly refer to specific changes in human-readable output?
MC: What?! HERESY. Making a human use a cryptographic hash to reference a single random logical reference point in a mass of
logical binary objects among millions of others is clearly the best way to go. We just need some quick fixes.
NA: But... you can't reference it via speech, it doesn't work reliably via text when abbreviated, and it gives absolutely no
context whatsoever as to what it is. What's the point of using a cumbersome, inhuman reference for something you
only need to talk about briefly through a computer interface?
MC: SILENCE FOOL! ME DESIGN GOOD. YOU MAKE CODE MASTER ANGRY.
NA: Err... but what if we just let the program rename the references temporarily to human-parseable short strings, and resolve
what they are in between logs and commits?
MC: I SAID SILENCE!! Just for that, I'm going to make you explain to a new user why we force people to regularly clean
out their repositories after doing complicated things with them, like merging.
> Git is full of some of the worst design decisions in modern software history.
It's also full of some of the best software design decisions. The internals of Git are simple and elegant and they work like a charm. There have been very few changes to the internal workings since the first commit of Git.
I agree that the user interface is inconsistent, ugly and hard to grasp. But if you have a solid understanding of the internals, with the help of the git manpages it's pretty easy to achieve what you want.
There's software that's intended to work like a black box, just poke around the user interface and you can get stuff done. Git is not one of those. You need to understand the internal model, accept that the UI sucks, embrace the manpages, quit whining and get shit done.
I disagree with you, I think the hash is a very good auto-generated unique identifier to every commit. What do you suggest to use instead? You do have the option to tag commits with human friendly names.
Imagine if you merge a branch of old commits from another repo or something, which introduce short hash collisions. Then you copy/paste a short hash, and Git doesn't know when that reference is from or which branch it might refer to.
It would defeat the purpose of a content-addressable storage system. The fact that the same file and the same tree will always have the same hash is important for speeding up diffs, merges, and other operations.
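That content addressing is easy to demonstrate: a blob's id is the SHA-1 of a documented "blob <size>\0" header plus the content, so identical bytes get an identical id in every repository (a sketch using only the standard library):

```python
import hashlib

# Git's blob id: SHA-1 over "blob <size>\0" followed by the raw content.
def blob_id(content: bytes) -> str:
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

print(blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

This is the same value `git hash-object` reports for that content.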
mixedmath | 9 years ago
[1]: https://github.com/git/git/blob/v2.11.0/Documentation/RelNot...
stablemap | 9 years ago
https://news.ycombinator.com/item?id=13066516
jakub_g | 9 years ago
> The code that we have used for the past 10+ years to cycle 4-element ring buffers turns out to be not quite portable in theoretical world.
chriscool | 9 years ago
- Changing the default for “core.abbrev”?
- Prepare the sequencer for the upcoming rebase -i patches
(I am a Git Rev News editor.)
godson_drafty | 9 years ago
OJFord | 9 years ago
peff | 9 years ago
There are a lot of other caveats too, such as the idea of each year being an independent probability.
cakoose | 9 years ago
lutorm | 9 years ago
alkonaut | 9 years ago
godson_drafty | 9 years ago
kannonboy | 9 years ago
I've put together another write-up of the Git 2.11 release that discusses some of the other new features (and goes into a little more detail on some of the 'sundries'): https://medium.com/@kannonboy/whats-new-in-git-2-11-64860aea...
peff | 9 years ago
boundlessdreamz | 9 years ago
elevensies | 9 years ago
peff | 9 years ago
teraflop | 9 years ago
emmelaich | 9 years ago
ludbb | 9 years ago
transfire | 9 years ago
based2 | 9 years ago
CJefferson | 9 years ago
The only problem would be that by now too much code probably expects hex, so I'm not sure the gain is big enough to go through the pain of the switch.
guomanmin | 9 years ago
Cheers, Angela Guo [email protected]
MBCook | 9 years ago
The release notes mentioned protocol improvements with git-filter that can dramatically speed up git-LFS (the large file storage plugin).
Does anyone know if there are any plans to make git-LFS part of the base instead of an add-on?
chriscool | 9 years ago
https://github.com/git-lfs/git-lfs/issues/1702
(I am working on this for GitLab.)
shakna | 9 years ago
sulam | 9 years ago
unknown | 9 years ago
[deleted]
atemerev | 9 years ago
Better alternatives:
Base64 without padding: compact.
Grouped decimals: slightly less compact than hexadecimal, but extremely easy to type and pronounce. E.g. 577-467-341-467
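A sketch of that grouped-decimal idea (the function name and the choice of 4 hash bytes are made up for illustration):

```python
import hashlib

# Hypothetical helper: render ~32 bits of a hash as grouped decimal
# digits, in the spirit of the 577-467-341-467 example above.
def grouped_decimal(data: bytes) -> str:
    n = int.from_bytes(hashlib.sha1(data).digest()[:4], "big")
    s = f"{n:012d}"                               # zero-pad to 12 digits
    return "-".join(s[i:i + 3] for i in range(0, 12, 3))

print(grouped_decimal(b"example"))                # four groups of three digits
```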
joallard | 9 years ago
Or we could be using words...
[1]: http://www.crockford.com/wrmg/base32.html
gabrielhn | 9 years ago
contingencies | 9 years ago
glandium | 9 years ago
Edit: this is apparently what decides whether or not to store a delta in Mercurial: https://www.mercurial-scm.org/repo/hg/file/9e29d4e4e08b/merc... (self._maxchainlen is not set by default).
peterwwillis | 9 years ago
exDM69 | 9 years ago
nickez | 9 years ago
epberry | 9 years ago
emmelaich | 9 years ago
OJFord | 9 years ago
farnsworth | 9 years ago
algesten | 9 years ago
I guess if you only work with rebases instead of merges, it should be possible, right?
nhaehnle | 9 years ago
Manishearth | 9 years ago
The issue is with collisions in the truncated hashes which are often used to refer to objects in emails and such, not the full SHA.
supercoder | 9 years ago