This is a really well written partial set of release notes. I was curious and looked at the full release notes [1], and I think these are pretty well written as well. I'm very impressed, especially given that git has such a large set of contributors.
They were off by a factor of 10 on the likelihood of being struck and killed by lightning, according to the NWS website.
To clarify: the likelihood of being merely struck by lightning is ~1/1,000,000 per year. The likelihood of being struck and killed is 1/10,000,000, or about 1/2^23.25.
Given this, you would only have to be struck and killed by lightning 6.8 years in a row to equal a sha1 hash collision probability.
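A quick back-of-the-envelope check of that arithmetic (a sketch using only the NWS figures quoted above):

```python
import math

p_killed = 1 / 10_000_000            # annual odds of being struck and killed (NWS)

# Express the annual odds as a power of two:
bits_per_year = -math.log2(p_killed)             # about 23.25 bits

# A collision between two specific SHA-1 hashes is 1 in 2^160, so count
# how many consecutive fatal years it takes to accumulate 160 bits:
years = 160 / bits_per_year
print(f"{bits_per_year:.2f} bits/year, {years:.1f} years in a row")
# 23.25 bits/year, 6.9 years in a row
```

Which comes out at just under seven consecutive years, in line with the figure above.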
More importantly, the comparison is useless. The odds of running into issues with SHA-1 collisions in Git is a very different question from just the odds of two random SHA-1 hashes colliding.
The situational comparison originally used by Linus for a sha1 collision was all the members of your development team being killed and eaten by wolves. I'm not sure if he gave a time frame in which that would need to occur.
That is a nice writeup. One of the interesting things for me was to see which topics you decided to cover and which to omit. For instance, I noted `clone --reference --recurse-submodules` as a potential topic of interest, but I am afraid to point anybody to the `--reference` option due to its hidden dangers.
I'm also curious how you came up with 19,290 for a birthday paradox on a 7-hex hash. I think it's 16,384, but probability can sometimes be tricky. :)
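For reference, both numbers can be derived (a sketch: 16,384 is the plain square root of the space, while 19,290 is the usual ~50%-collision point from the birthday approximation):

```python
import math

N = 16 ** 7                         # 2^28 possible 7-hex-digit abbreviations

# Naive "square root of the space" rule of thumb:
rule_of_thumb = math.isqrt(N)       # 16384

# Draws needed for a ~50% chance of at least one collision, from the
# birthday approximation 1 - exp(-k^2 / 2N) = 1/2, i.e. k = sqrt(2 ln 2 * N):
k = math.sqrt(2 * math.log(2) * N)  # about 19290.7
print(rule_of_thumb, int(k))
```

So 19,290 is the correction factor sqrt(2 ln 2) ≈ 1.177 applied to 16,384.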
I'm curious but not motivated enough to really search for it, but for ambiguous hash abbreviations, why not select the oldest, since presumably it was unique at the time it was created?
edit: I guess that information must not exist or I assume they'd be doing it.
As others have noted, there's not always an unambiguous date for some object types (the best you can do for blobs is to find the first commit in which they appeared, and use its date).
However, there's a more complicated issue with timestamps, which is that you care about what was in the repository of the person who generated the sha1, at the time of generation. So you could merge in history that includes older commits, and invalidate your sha1s with "older" objects.
So the timestamp of interest is not the one in the objects themselves, but when they entered some particular repository (and not even some well-known repository; the local clone of whoever happened to generate the sha1). That being said, those two things correlate a lot in practice, and auto-picking the oldest commit might be a useful heuristic.
It would be a fun project to implement as an option for `core.disambiguate`.
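As a rough illustration only (this is not how `core.disambiguate` works today, and it only covers commits, which is exactly the limitation discussed above), the heuristic could be sketched like this, with the date logic factored out:

```python
import subprocess

def pick_oldest(prefix: str, listing: str):
    """Toy heuristic: given `git rev-list --all --format='%H %ct'` output,
    return the commit with the oldest committer date matching `prefix`.
    Blobs and trees are ignored entirely."""
    candidates = []
    for line in listing.splitlines():
        if line.startswith("commit "):
            continue  # rev-list emits a "commit <sha>" header before each format line
        sha, timestamp = line.split()
        if sha.startswith(prefix):
            candidates.append((int(timestamp), sha))
    return min(candidates)[1] if candidates else None

def oldest_commit_for_prefix(prefix: str):
    out = subprocess.run(
        ["git", "rev-list", "--all", "--format=%H %ct"],
        capture_output=True, text=True, check=True,
    ).stdout
    return pick_oldest(prefix, out)
```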
That heuristic would work most of the time. But it rests on the assumption that "if commit A has an older timestamp than commit B, then any user who saw commit B must have also seen commit A", which is not reliable in a distributed version control system.
It seems safer to just explicitly tell the user when they're trying to work with an ambiguous hash.
Yeah, I guess dates are only available for commits but not blobs or trees.
This is suggested by the disambiguation listing; if there were dates, I'd hope they would be displayed.
I think approximate dates might be inferred, but since it might be misleading and costlier to determine it makes sense to leave it out - at least in this version of git.
The issue is a different one: I believe you're considering one specific situation while there are others to ponder. If someone copy-and-pasted part of the hash, or had some tool that always reduced the output to the first few digits, or any number of similar situations, how would you be able to tell that the user was actually after the oldest commit?
It seems much easier to indicate there's a problem, a conflict, and let the user solve it.
My name is Angela and I do research for Bitbucket. I’m kicking off a round of discussions with people who use Git tools. Ideally, I’d like to talk to people that sit on a team of 3 or more. If this is you, I would love to talk to you about your experience using Git tools, or just some of the pain points that are keeping you up at night when doing your jobs.
We’ll just need 30 mins of your time, and as a token of my thanks to those that participate, I’d like to offer a US$50 Amazon gift voucher.
If you’re interested, just shoot me an email with your availability over the next few weeks and we can set up a time to chat for 30 minutes. Please also include your timezone so we can schedule a suitable time (as I’m located in San Francisco). Hope to talk to you soon!
I'm curious if anyone knows if the optimizations Twitter made to improve fetch performance for large, active repos have made it upstream yet? I don't work there anymore and neither do any of the people who were originally doing that work, but it was a pretty impressive speed up (I could git pull thousands of commits and be done in under a second on a 3GB repo with no large objects). I know the watchman support made it in, which was the other half of what made large repos perform well, but I haven't seen mention of the log-structured patch queue stuff that helped the server by eliminating most of the work to calculate what to send on a fetch. Anyone know?
Case-insensitivity is important for some to be able to reliably remember a string. I won't easily retain the difference between 'b4dQbFs31' and 'b4DqBfs31'.
Same thing when speaking it out loud. 'B four D capital Q B capital F s thirty-one' is way more convoluted and error-prone than 'B four D Q B F S thirty-one'.
The best thing I've found that fits this criterion is Crockford's Base 32 [1], basically the extension of hex digits, removing letters ILOU.
But Base 32 (case-insensitive by construction) only gets us 5 bits per character, a mere 20% reduction in length over the 4 bits per character of base 16. So instead of the 20-bit `1ab2f` we could express the same value as something like `1qm3`.
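To make the trade-off concrete, here's a tiny encoder for Crockford's alphabet (a sketch; note that the exact digits you get differ from the `1qm3` example above, which was illustrative):

```python
# Crockford's Base 32 alphabet: hex-style digits extended, minus I, L, O, U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def b32_crockford(n: int) -> str:
    digits = ""
    while True:
        n, r = divmod(n, 32)
        digits = CROCKFORD[r] + digits
        if n == 0:
            return digits

# The 20-bit value from above: 5 hex characters become 4 base-32 ones.
print(f"{0x1ab2f:x} -> {b32_crockford(0x1ab2f)}")   # 1ab2f -> 3ASF
```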
Still better is grouped alphanumeric without potentially ambiguous characters (i.e. generate number 1s but not letter Is, number 0s but not letter Os). I wrote a disambiguation library based upon the checksums present in IBAN at https://github.com/globalcitizen/php-iban ... it's surprising how accurate the mistranscription suggestions are.
In general, pure-number systems are better if possible, but where you have issues squashing in enough data in suitably compact form, transitioning to alphanumeric is better.
You can also consider the use of prefix-based systems, either utilizing temporal epochs or node-specific prefixes, both of which can utilize readable aliases.
Finally, for anything expecting human transcription, checksum systems are awesome!
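On the checksum point, a minimal sketch of an ISO 7064 MOD 97-10 check (the check-digit scheme behind IBAN; this is the idea only, not the php-iban implementation itself):

```python
# ISO 7064 MOD 97-10: append two check digits so the whole number is ≡ 1 mod 97.
def add_check_digits(digits: str) -> str:
    check = 98 - (int(digits) * 100) % 97
    return f"{digits}{check:02d}"

def is_valid(candidate: str) -> bool:
    return int(candidate) % 97 == 1

code = add_check_digits("123456789")
print(code, is_valid(code))                 # 12345678978 True
print(is_valid("12345688978"))              # False: one mistyped digit is caught
```

Because 97 is prime and does not divide 10, every single-digit transcription error changes the value mod 97, so all such errors are detected.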
It's interesting that Git 2.11 shortened the delta chains on aggressive repacks, when Mercurial happily creates chains of > 1000 deltas (AFAIK it doesn't have a hard limit; it stops using deltas when the size of the required deltas is larger than the full text).
Although it's worth noting mercurial and git use different delta formats.
Master Coder: Hmm. We refer to changes by a long, totally non-human-parseable string of characters that nobody can memorize,
and when we abbreviate it, it doesn't work 100% of the time. What can we do about it?
Novice Apprentice: Well... how about we stop using a long totally non-human-parseable string of characters that no
human can memorize just to briefly refer to specific changes in human-readable output?
MC: What?! HERESY. Making a human use a cryptographic hash to reference a single random logical reference point in a mass of
logical binary objects among millions of others is clearly the best way to go. We just need some quick fixes.
NA: But... you can't reference it via speech, it doesn't work reliably via text when abbreviated, and it gives absolutely no
context whatsoever as to what it is. What's the point of using a cumbersome, inhuman reference for something you
only need to talk about briefly through a computer interface?
MC: SILENCE FOOL! ME DESIGN GOOD. YOU MAKE CODE MASTER ANGRY.
NA: Err... but what if we just let the program rename the references temporarily to human-parseable short strings, and resolve
what they are in between logs and commits?
MC: I SAID SILENCE!! Just for that, I'm going to make you explain to a new user why we force people to regularly clean
out their repositories after doing complicated things with them, like merging.
> Git is full of some of the worst design decisions in modern software history.
It's also full of some of the best software design decisions. The internals of Git are simple and elegant and they work like a charm. There have been very few changes to the internal workings since the first commit of Git.
I agree that the user interface is inconsistent, ugly and hard to grasp. But if you have a solid understanding of the internals, with the help of the git manpages it's pretty easy to achieve what you want.
There's software that's intended to work like a black box, just poke around the user interface and you can get stuff done. Git is not one of those. You need to understand the internal model, accept that the UI sucks, embrace the manpages, quit whining and get shit done.
I disagree with you, I think the hash is a very good auto-generated unique identifier to every commit. What do you suggest to use instead? You do have the option to tag commits with human friendly names.
Imagine if you merge a branch of old commits from another repo or something, which introduce short hash collisions. Then you copy/paste a short hash, and Git doesn't know when that reference is from or which branch it might refer to.
It would defeat the purpose of a content-addressable storage system. The fact that the same file and the same tree will always have the same hash is important for speeding up diffs, merges, and other operations.
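That content addressing is easy to demonstrate: a blob's id is the SHA-1 of a documented "blob <size>\0" header plus the content, so identical bytes get an identical id in every repository (a sketch using only the standard library):

```python
import hashlib

# Git's blob id: SHA-1 over "blob <size>\0" followed by the raw content.
def blob_id(content: bytes) -> str:
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

print(blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

This is the same value `git hash-object` reports for that content.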
mixedmath | 9 years ago
[1]: https://github.com/git/git/blob/v2.11.0/Documentation/RelNot...
stablemap | 9 years ago
https://news.ycombinator.com/item?id=13066516
jakub_g | 9 years ago
> The code that we have used for the past 10+ years to cycle 4-element ring buffers turns out to be not quite portable in theoretical world.
chriscool | 9 years ago
- Changing the default for “core.abbrev”?
- Prepare the sequencer for the upcoming rebase -i patches
(I am a Git Rev News editor.)
godson_drafty | 9 years ago
OJFord | 9 years ago
peff | 9 years ago
There are a lot of other caveats too, such as the idea of each year being an independent probability.
cakoose | 9 years ago
lutorm | 9 years ago
alkonaut | 9 years ago
godson_drafty | 9 years ago
kannonboy | 9 years ago
I've put together another write-up of the Git 2.11 release that discusses some of the other new features (and goes into a little more detail on some of the 'sundries'): https://medium.com/@kannonboy/whats-new-in-git-2-11-64860aea...
peff | 9 years ago
boundlessdreamz | 9 years ago
elevensies | 9 years ago
peff | 9 years ago
teraflop | 9 years ago
emmelaich | 9 years ago
ludbb | 9 years ago
transfire | 9 years ago
based2 | 9 years ago
CJefferson | 9 years ago
The only problem would be that by now too much code probably expects hex, so I'm not sure the gain is big enough to go through the pain of the switch.
guomanmin | 9 years ago
Cheers, Angela Guo [email protected]
MBCook | 9 years ago
The release notes mentioned protocol improvements with git-filter that can dramatically speed up git-LFS (the large file storage plugin).
Does anyone know if there are any plans to make git-LFS part of the base instead of an add-on?
chriscool | 9 years ago
https://github.com/git-lfs/git-lfs/issues/1702
(I am working on this for GitLab.)
shakna | 9 years ago
sulam | 9 years ago
unknown | 9 years ago
[deleted]
atemerev | 9 years ago
Better alternatives:
Base64 without padding: compact.
Grouped decimals: slightly less compact than hexadecimal, but extremely easy to type and pronounce. E.g. 577-467-341-467
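A sketch of that grouped-decimal idea (the function name and the choice of 4 hash bytes are made up for illustration):

```python
import hashlib

# Hypothetical helper: render ~32 bits of a hash as grouped decimal
# digits, in the spirit of the 577-467-341-467 example above.
def grouped_decimal(data: bytes) -> str:
    n = int.from_bytes(hashlib.sha1(data).digest()[:4], "big")
    s = f"{n:012d}"                               # zero-pad to 12 digits
    return "-".join(s[i:i + 3] for i in range(0, 12, 3))

print(grouped_decimal(b"example"))                # four groups of three digits
```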
joallard | 9 years ago
Or we could be using words...
[1]: http://www.crockford.com/wrmg/base32.html
gabrielhn | 9 years ago
contingencies | 9 years ago
glandium | 9 years ago
Edit: this is apparently what decides whether or not to store a delta in Mercurial: https://www.mercurial-scm.org/repo/hg/file/9e29d4e4e08b/merc... (self._maxchainlen is not set by default).
peterwwillis | 9 years ago
exDM69 | 9 years ago
nickez | 9 years ago
epberry | 9 years ago
emmelaich | 9 years ago
OJFord | 9 years ago
farnsworth | 9 years ago
algesten | 9 years ago
I guess if you only work with rebases instead of merges, it should be possible, right?
nhaehnle | 9 years ago
Manishearth | 9 years ago
The issue is with collisions in the truncated hashes which are often used to refer to objects in emails and such, not the full SHA.
supercoder | 9 years ago