top | item 32637643

Git’s database internals I: packed object store

216 points| todsacerdoti | 3 years ago |github.blog | reply

64 comments

[+] masklinn|3 years ago|reply

This article gives me whiplash, it starts at a high level, dives, stops just before it touches upon the implementation details (and all the weird stuff which makes you realise the developers were human after all) then shoots back up again.

> Git does not currently have the capability to update a packfile in real time without shutting down concurrent reads from that file. Such a change could be possible, but it would require updating Git’s storage significantly. I think this is one area where a database expert could contribute to the Git project in really interesting ways.

I'm not sure it would be super useful, because of the delta-ification process. Adding "literal" (non-delta) files to a pack isn't much of a gain (file contents are zlib-compressed either way).

[+] kevincox|3 years ago|reply

Also presumably "shutting down concurrent reads from that file" is not much of a problem because the file can just be unlinked and it will be GCed when the last reader finishes naturally. Other than hung processes the only downside is extra disk space usage and some cache.

[+] spapas82|3 years ago|reply

I have written a small python library to read info directly from the .git folder (without using the git executable): https://github.com/spapas/python-git-info/

Decoding the pack files was notoriously difficult, probably there are still cases that I don't handle properly. The packfile is a really complex/messy format and in the top of that it lacks any proper documentation!

This article explains a lot of the concepts very good but doesn't go into implementation details (where the big snakes are). Another great resource for unpacking packfiles is this post: https://codewords.recurse.com/issues/three/unpacking-git-pac... and of course reading the source code of other git clients.

Unfortunately it seems that the official packfile format documentation is the git source code :|

[+] masklinn|3 years ago|reply

> The packfile is a really complex/messy format and in the top of that it lacks any proper documentation!

Maybe it’s changed since you'd checked it, but I found the pack files to be documented, through nowhere near as well as index files (now that was a pleasure, the index file format is straightforward, pretty logical, and very well documented).

The problem I had with the pack files documentation is that it’s non-linear, so if you read it linearly as you implement it you hit areas which turn out to be documented a few screens later. Furthermore it doesn’t necessarily define or spell out what it’s talking about, so it can take a while to realise that it has 3 (.5?) different formats of varints, or that the size it provides is for decompressed content, and that it relies on zlib to discover the end of the compressed stream (and good luck to you if your implementation doesn’t expose that).

But in my experience, it’s really nothing compared to the documentation of the wire format. That leaves even more details out, some of the explanations are outright misleading (I spent hours convinced the v2 protocol was using an HTTP connection as a half-duplex socket and wondering how I would manage), and with TODOs covering half the protocol.

[+] josephg|3 years ago|reply

One of the things that surprises me about Git is that it doesn’t store diffs. It just stores the full content of every file every time it’s modified and committed.

I work on concurrent editing systems (using CRDTs). People always ask about how we deal with the problem that if we store changes forever, the filesize grows indefinitely. Modern CRDTs are massively more space efficient than git but nobody seems to notice or care about the size of our git repositories. I think git supports shallow clones but I don’t know anyone who bothers with them.

[+] pocketarc|3 years ago|reply

It's a reminder that sometimes it's OK to ignore hypothetical problems.

"How does git deal with the problem that if we store changes forever, the file size grows indefinitely?"

Even though Git uses deltas for diffs to reduce space usage, it still technically grows indefinitely, same as CRDTs. So the answer is it doesn't, and it doesn't matter.

If the size of something actually becomes a problem, -then- it can be handled (e.g. Git LFS).

With CRDTs, depending on what you're doing, it might never become a problem. Keeping all versions of a document is entirely inconsequential, as an example. The size might grow indefinitely but how many edits is a document going to have, realistically?

[+] tomstuart|3 years ago|reply

I know what you mean, but — as the article explains [0] — Git’s packfile format uses delta compression, which is essentially “storing diffs”.

[0] https://github.blog/2022-08-29-gits-database-internals-i-pac...

[+] Nzen|3 years ago|reply

Ah, thanks. You've reminded me about the 2016 debacle [0] wherein github rate limited the cocoapods repository because their practice of using a shallow clone and later fetches impacted github's performance. I'm going to warc that now, so I don't forget again.

[0] https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomm...

[+] https://github.com/orgs/Homebrew/discussions/225

[+] masklinn|3 years ago|reply

> Modern CRDTs are massively more space efficient than git but nobody seems to notice or care about the size of our git repositories.

The packing process explained in the article is why: rather than make compression an intrinsic part of the system's model (which most VCS do / have done), git moves it to a later "optimisation" phase. This is an additional and somewhat odd costs, but on the other hand it can be extremely efficient (at $DAYJOB we've got files whose compression ratio is 99%).

An other interesting bit (which makes it very annoying to inspect the raw contents of a git repo) is that git DEFLATEs object contents when it stores them on-disk. But while that's clearly documented for packfiles (although the lack of end-of-data marker is implicit and extremely annoying), it's not really spelled out for loose object, you get to find out when you can't make heads or tails of loose objects, and git barfs at your hand-crafted loose objects.

[+] giancarlostoro|3 years ago|reply

It's easy to forget that code is... so small, but I guess if your codebase on its own was hundreds of megabytes big, then that would become noticed more quickly. I never realized it stored a copy every time, I assumed it used diffs. So git uses the same solution some of us use when we're about to do something on git we're unfamiliar with... you copy the entire directory housing your code + git repo into a new location just in case it fails miserably.

[+] MonkeyMalarky|3 years ago|reply

I have used shallow clones to speed up and shrink what gets pulled into a docker image when building it. Of course it was a hack to make up for past mistakes but I was glad the option was available.

[+] HWR_14|3 years ago|reply

I still don't grok how CRDTs deal with conflicts. Or is it just "this happens so rarely I'll just use last update"?

[+] Existenceblinks|3 years ago|reply

It kinda store diff in `rr-cache/` the preimage | postimage | thisimage combined is a kind of diff.

[+] mandeepj|3 years ago|reply

> It just stores the full content of every file every time it’s modified and committed.

That's not true; otherwise each git branch folder would be huge, but it's not. It stores the snapshot of changes in branch folders

[+] vaibhavsagar|3 years ago|reply

I also have my own implementation of packfiles [1], which I implemented as part of a toy Git implementation in Haskell [2]. I personally found the packfile format underdocumented, e.g. the official reference for the packfile generation strategy is an IRC transcript [3].

1: https://github.com/vaibhavsagar/duffer/blob/bad9c6c6cf09f717...

2: https://vaibhavsagar.com/blog/2017/08/13/i-haskell-a-git/

3: https://git-scm.com/docs/pack-heuristics/

[+] srathi|3 years ago|reply

Shameless plug: I gave a tech talk on this at my last company, and implemented some of these in Golang as a learning exercise [0]. While it is not mandatory to know the internals, but doing so helps a lot when you occasionally encounter a git-gotcha!

[0] https://github.com/ssrathi/gogit

[+] legulere|3 years ago|reply

> What if Git had a long-running daemon that could satisfy queries on-demand, but also keep that in-memory representation of data instead of needing to parse objects from disk every time

I guess that would be performance-wise the same as switching from running the compiler in a one-off mode to a LSP analyzer.

[+] tex0|3 years ago|reply

Packfiles give me PTSD. They are so notoriously hard to maintain on the large scale server side.

I really wish git wasn't using them.

[+] gbrown_|3 years ago|reply

Could you elaborate on "maintain" please?