Linus Torvalds proposes a change to the Git commit object format

153 points| avar | 14 years ago |spinics.net | reply

45 comments

[+] breckinloggins|14 years ago|reply
OK, finally found a decent explanation of what a Git generation number actually IS:

http://www.spinics.net/lists/git/msg161165.html

[+] pak|14 years ago|reply
Besides analyzing properties of the commit graph, which doesn't seem to be a common task, what about them is so useful that they need to be stored in commit headers now? For instance, what day-to-day operations do they actually make faster? At first, it sounded like it might be for easier navigability (e.g. expanding git-rev-parse) but I don't see that in this description.
[+] tonfa|14 years ago|reply
Interesting. It's also nice to see that Mercurial got this right from the start (it is not generation numbers, but the rev numbers can be used in the same way to stop the exploration).

Some mistakes were made while designing Mercurial, this wasn't one of them.

[+] andrewflnr|14 years ago|reply
So it's almost, but not quite, like the revision numbers everyone else has always had?
[+] rlpb|14 years ago|reply
What I like about git is that it stores only the minimum amount of information, and this makes it easy to explain. A commit hash is a hash of canonical information, not of derived information.

It seems really ugly to store derived information in a commit (specifically, that the hash would be altered by it).

It seems that Jeff has said the same thing, but Linus disagrees. Vocally.

http://www.spinics.net/lists/git/msg161336.html

[+] ryannielsen|14 years ago|reply
From my understanding, they're essentially adding this as an additional bit of information that's minimally required. The currently used timestamps are error prone and thus will be replaced by generation numbers which are more robust. They're still adhering to the principle of only storing the minimum amount of information, they're just adding generation numbers to that set.

In fact, you could make the argument that timestamps are the derived information that git has been storing all along while generation numbers are the canonical information which should have been stored from the beginning. Generation numbers are a result of the state of the tree, while timestamps are derived from the ambient (and potentially incorrect!) environment from which the commit was made.

[+] derrickpetzold|14 years ago|reply
>It seems really ugly to store derived information in a commit

I don't understand how generation numbers are derived information. They are used to find the position of one commit in relation to another. That makes them information that is essential to the commit. The problem was that, to get around them not being there, timestamps were compared, and that is not reliable for obvious reasons. So I really don't understand why anyone would complain about this.

[+] mscarborough|14 years ago|reply
I don't generally come across Linus' dev threads, and when I do it's usually in the context of some linkbaity 'watch Linus smack this dude down' or something of that nature.

This reads like a really productive thread from my limited understanding of git internals. It's pretty cool how much good engineering thought is going into this proposal.

Maybe that's why git rocks so hard.

[+] pyre|14 years ago|reply
I like the suggestion of storing the generation numbers in the pack index. When you generate a pack you're already parsing the entire tree. That makes more sense than requiring all future git objects to have 'generation numbers' jammed into them. Especially because it introduces an incompatibility with current git objects, which it would probably be best to avoid.
[+] nplusone|14 years ago|reply
Change last name to 'Torvalds' (edit: name in title changed)
[+] cypherpunks01|14 years ago|reply
What operations would be sped up by having generation numbers?

I see Jeff King's message that they would make certain bounding traversals faster, but when do bounding traversals need to be computed when I'm using git day-to-day?

[+] Rauchg|14 years ago|reply
It's also about making git less error-prone, which the current timestamp approach seems not to be.
[+] derrickpetzold|14 years ago|reply
I was wondering how they got along without generation numbers for so long. It was by comparing timestamps and those are unreliable because systems can be misconfigured. How they are going to handle legacy repos with that problem I still don't get. I am guessing that history is f'd.
[+] mdwrigh2|14 years ago|reply
New versions of git will actually go back and generate this information for old commits. This will lead to git being slightly slower when in old repositories until all the commits contain the generation information, but that should happen fairly quickly.
[+] breckinloggins|14 years ago|reply
Can someone explain what generation numbers are? Googling "git generation numbers" pulls up mostly this discussion thread.

I'm assuming they're easy-to-remember incremental numbers tied to commit? Like 1, 2, 3, or tied to commit and branch, like master/1, etc.?

[+] Kliment|14 years ago|reply
Here is how I understand the problem.

At the moment, each commit stores a reference to the parent tree. By parsing that tree and reading the entire history you can obtain a hierarchy of commits. Because you need to order commits in many situations, reading the entire history is extremely inefficient, so git uses timestamps to determine the ordering of commits. This of course fails if the system clock on a given machine is off. With a generation number, you can get an ordering locally from the latest commits, without having to rely on timestamps or read the entire tree.

When you have a commit with generation n, any later commits that include it would have generation >n, so to tell the relation between commits, you only need to look as far back as n, and you can immediately get the order of any intermediate commits. It has nothing to do with "easy to remember". It's about making git more efficient and robust.
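
To make the "only look as far back as n" point concrete, here is a minimal sketch in Python of an ancestry check that stops walking once it drops to the candidate's generation. It is an illustration under assumptions, not git's actual code: the commit names, the PARENTS and GEN tables, and the is_ancestor helper are all hypothetical, and the generation values assume roots are 0 and every other commit is one more than its highest parent.

```python
from typing import Dict, List

# Hypothetical history (commit -> list of parents); names are made up.
#   root -- a -- b -- c
#      \
#       x
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "a": ["root"],
    "b": ["a"],
    "c": ["b"],
    "x": ["root"],
}

# Assumed generation numbers: 0 for a root, else 1 + max over parents.
GEN: Dict[str, int] = {"root": 0, "a": 1, "b": 2, "c": 3, "x": 1}


def is_ancestor(candidate: str, descendant: str,
                parents: Dict[str, List[str]], gen: Dict[str, int]) -> bool:
    """Return True if `candidate` is reachable from `descendant` via parent links.

    A commit counts as its own ancestor here, for simplicity.
    """
    stack = [descendant]
    seen = set()
    while stack:
        commit = stack.pop()
        if commit == candidate:
            return True
        # Every ancestor has a strictly smaller generation than its
        # descendants, so a commit at or below the candidate's generation
        # cannot have the candidate in its history: stop walking here.
        if commit in seen or gen[commit] <= gen[candidate]:
            continue
        seen.add(commit)
        stack.extend(parents[commit])
    return False


if __name__ == "__main__":
    print(is_ancestor("a", "c", PARENTS, GEN))
    print(is_ancestor("x", "c", PARENTS, GEN))
```

In the second call the walk from c gives up at a, because a and x share generation 1, and it never has to read root. Without generation numbers the walk would have to continue to the root (or trust timestamps) before answering "no".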

[+] Peaker|14 years ago|reply
Why call it a "generation number" and not "depth"?
[+] caf|14 years ago|reply
Perhaps because "Generation" continues the Parent/Child allegorical language.
[+] rs|14 years ago|reply
Think they're using "generation" here in the context of number of parents (yes, "depth" would work fine as well, but is a more general term)
[+] beza1e1|14 years ago|reply
Since gitk and others put new commits on top, I'd propose "height" instead of "depth" ;)
[+] macrael|14 years ago|reply
I'd love an explanation of what "generation numbers" are.
[+] pufuwozu|14 years ago|reply
Looking at the diff, it seems like a generation number is just the number of parents that a commit has behind it. For example:

Commit a553af has no parents, so it has a generation number of 0.

Commit c464e0 is the next commit, it has generation number 1.

And so on. Branches count independently of one another. When commits have multiple parents (e.g. merges), the generation number starts counting from the previous maximum.
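
A rough Python sketch of that rule, for illustration only: a commit with no parents gets 0, and any other commit gets one more than the largest generation among its parents. The generation_numbers function and every commit name other than a553af and c464e0 are hypothetical, and this is not taken from the actual patch.

```python
from typing import Dict, List


def generation_numbers(parents: Dict[str, List[str]]) -> Dict[str, int]:
    """Compute a generation number for every commit in `parents`.

    `parents` maps each commit to its list of parent commits and is
    assumed to describe a DAG with no missing entries.
    """
    gen: Dict[str, int] = {}
    for start in parents:
        # Explicit stack instead of recursion, so long histories do not
        # hit Python's recursion limit.
        stack = [start]
        while stack:
            commit = stack[-1]
            if commit in gen:
                stack.pop()
                continue
            pending = [p for p in parents[commit] if p not in gen]
            if pending:
                # Parents must be numbered first; revisit this commit later.
                stack.extend(pending)
            else:
                # A root has no parents: max(...) falls back to -1, giving 0.
                gen[commit] = 1 + max((gen[p] for p in parents[commit]), default=-1)
                stack.pop()
    return gen


# Hypothetical history: two branches fork from c464e0 and are merged again.
HISTORY: Dict[str, List[str]] = {
    "a553af": [],
    "c464e0": ["a553af"],
    "feat-1": ["c464e0"],
    "feat-2": ["feat-1"],
    "main-1": ["c464e0"],
    "merge": ["feat-2", "main-1"],
}

if __name__ == "__main__":
    for commit, number in generation_numbers(HISTORY).items():
        print(commit, number)
```

Here feat-1 and main-1 both get 2 because each branch counts on its own, and the merge gets 4 because it continues from the higher of its two parents (feat-2 at 3).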