top | item 39721589


V-2 | 1 year ago

PGN is a highly redundant format, but it has the inherent advantage of being human-readable. The problem is interesting, but I think it falls on the side of "fun" more than "profit". Storage is cheap, and PGN files are still small: an average PGN is below 1 kilobyte, so one Blu-ray-quality movie's worth of storage holds about 20 million games. That's a lot. The practical problem is not storage but computation: basically, querying the game database quickly. Compression gets in the way of that.
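As a back-of-the-envelope sanity check on those numbers (the ~1 KB average game size and the 25 GB single-layer Blu-ray capacity are assumed round figures, not measurements):

```python
# Rough arithmetic: how many ~1 KB PGN games fit in one Blu-ray's
# worth of storage? Both constants are assumed round numbers.
AVG_PGN_BYTES = 1_000          # ~1 KB per average game
BLURAY_BYTES = 25_000_000_000  # single-layer Blu-ray, ~25 GB

games_per_disc = BLURAY_BYTES // AVG_PGN_BYTES
print(f"{games_per_disc:,} games")  # 25,000,000 -- same ballpark as "about 20 million"
```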

For example, I've just played a game, and now I want to go through the opening and fetch all games from the database that went through the same initial moves/positions (not the same thing, since a game may arrive at the same position through a different order of moves, a.k.a. a transposition). Say, all the way to move 15 or 20, because it's only around that point that a decent game finally becomes unique by deviating from all the recorded games in the database (a.k.a. a novelty was played).
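One common way to handle transpositions is to index games by position rather than by move sequence, e.g. keyed on a truncated FEN. A minimal sketch (the two hard-coded FENs are the position after 1.d4 d5 2.c4 and after 1.c4 d5 2.d4; note the raw FENs differ in the en-passant field, which is exactly why the key drops it):

```python
# Index games by position instead of move order, so transpositions
# collide on the same key. The key keeps the first three FEN fields
# (piece placement, side to move, castling rights) and drops the
# en-passant square and move counters, since different move orders can
# reach the "same" position with different values there.

def position_key(fen: str) -> str:
    return " ".join(fen.split()[:3])

# The Queen's Gambit position reached via two different move orders:
fen_a = "rnbqkbnr/ppp1pppp/8/3p4/2PP4/8/PP2PPPP/RNBQKBNR b KQkq c3 0 2"  # 1.d4 d5 2.c4
fen_b = "rnbqkbnr/ppp1pppp/8/3p4/2PP4/8/PP2PPPP/RNBQKBNR b KQkq d3 0 2"  # 1.c4 d5 2.d4

index: dict[str, list[int]] = {}
for game_id, fen in [(1, fen_a), (2, fen_b)]:
    index.setdefault(position_key(fen), []).append(game_id)

# Both games land under the same key despite different move orders:
print(index[position_key(fen_a)])  # [1, 2]
```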

Or I want to find all games where an endgame of a Queen and a pawn against a lone Queen occurred. There is actually a query language for that, named (surprise, surprise) Chess Query Language: https://www.chessprogramming.org/Chess_Query_Language
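Without CQL, a crude version of that endgame query can be done by scanning positions for a matching material signature. A hand-rolled sketch (the sample FEN is a made-up Q+P vs Q position, not taken from any real game):

```python
from collections import Counter

# Crude material-signature filter: does a position contain exactly
# queen + pawn vs a lone queen (kings aside)? CQL expresses this kind
# of query declaratively; here it's a manual scan over FEN strings.

def material(fen: str) -> Counter:
    board = fen.split()[0]  # piece-placement field of the FEN
    return Counter(ch for ch in board if ch.isalpha())

def is_qp_vs_q(fen: str) -> bool:
    m = material(fen)
    white = {p: n for p, n in m.items() if p.isupper() and p != "K"}
    black = {p.upper(): n for p, n in m.items() if p.islower() and p != "k"}
    qp, q = {"Q": 1, "P": 1}, {"Q": 1}
    return (white == qp and black == q) or (white == q and black == qp)

print(is_qp_vs_q("8/8/4k3/8/3QP3/8/5q2/4K3 w - - 0 1"))  # True
print(is_qp_vs_q("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"))  # False
```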

I feel that whatever a superior alternative to PGN might be, its strength would likely be better queryability rather than higher storage efficiency as such.


marcusbuffett|1 year ago

The problem I’m facing with storing roughly 600 million shorter PGNs is that the database is 100GB or so, and I’m grabbing thousands of them more or less at random. This makes the query IO-bound, even though finding the pages they’re on is virtually instant with the indexes. So a smaller database means fewer pages read when I do these large reads, ideally. I also have other ideas on ordering the database in a smarter way, but I’m hoping this part helps.
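That ordering idea can be illustrated with a toy model: if games that share an opening prefix are stored next to each other, a prefix query touches far fewer pages than when they're scattered. All keys and sizes below are made up for illustration:

```python
# Toy model of clustering records to cut page reads. Records are
# (opening_prefix, game_id) pairs; a "page" holds PAGE_SIZE records.
# We count how many distinct pages a query for one opening prefix
# touches under a scattered layout vs a layout sorted by prefix.
PAGE_SIZE = 8

# 4 opening prefixes x 16 games each, interleaved (worst-case scatter):
records = [(f"opening-{i % 4}", i) for i in range(64)]

def pages_touched(layout, prefix):
    return {pos // PAGE_SIZE for pos, (p, _) in enumerate(layout) if p == prefix}

scattered = len(pages_touched(records, "opening-0"))
clustered = len(pages_touched(sorted(records), "opening-0"))
print(scattered, clustered)  # 8 2 -- clustering cuts page reads 4x here
```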

V-2|1 year ago

Sure, I see your point. Obviously a wasteful format also gets in the way of queryability. My point is that the main goal should be optimizing for queryability, which inherently requires some optimizing for storage, but that's secondary. As opposed to optimizing exclusively for data size.

Because in the former case it may still be best to accept some compromise (in the form of redundancy/simplicity) to hit the sweet spot.

Especially in the context of many comments that seem to have taken an extremely "code golf"-like approach towards the problem.