DrFrugal
|
1 year ago
|
on: Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)
thanks for the link, but i think i found my favored solution for now, which is extracting the archive into raw files (plus header, hashtable, blocktable) and then reassembling them on demand (or via a virtual file system) into a byte-for-byte identical archive.
this will block align everything, give people access to raw assets and is flexible and performant on the filesystem because of hardlinking.
appreciate your help though :)
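A minimal sketch of that explode/reassemble round trip, assuming the part boundaries are already known. The names `explode` and `reassemble`, the `manifest.json` file and the `(name, offset, length)` tuples are hypothetical; real boundaries would come from parsing the MPQ header, hash table and block table, which is not shown here.

```python
import json
from pathlib import Path


def explode(archive_path, parts, out_dir):
    """Write each (name, offset, length) span of the archive to its own file.

    Assumption: `parts` covers the archive contiguously and in order.
    Each part is read into memory in one go, which is fine for a sketch
    but would need chunking for multi-GiB parts.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    archive_size = Path(archive_path).stat().st_size
    expected = 0
    with open(archive_path, "rb") as src:
        for name, offset, length in parts:
            if offset != expected:
                raise ValueError(f"gap or overlap before part {name!r}")
            src.seek(offset)
            (out / name).write_bytes(src.read(length))
            expected = offset + length
    if expected != archive_size:
        raise ValueError("parts do not cover the whole archive")
    manifest = out / "manifest.json"
    # The manifest only records the order of the parts.
    manifest.write_text(json.dumps([name for name, _, _ in parts]))
    return manifest


def reassemble(manifest_path, target_path, chunk_size=1 << 20):
    """Concatenate the parts listed in the manifest back into one file."""
    manifest = Path(manifest_path)
    names = json.loads(manifest.read_text())
    with open(target_path, "wb") as dst:
        for name in names:
            with open(manifest.parent / name, "rb") as part:
                while chunk := part.read(chunk_size):
                    dst.write(chunk)
```

Because the parts are ordinary files, identical ones across client versions can be hardlinked, and the manifest is the only per-archive data that has to stay unique.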
DrFrugal
|
1 year ago
|
on: Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)
jdupes is quite powerful and allows you to create different types of links and even more.
i used it for creating hardlinks, since this was the most efficient for my use case.
from what i read, there is currently also a rewrite of jdupes in progress, which will introduce significant performance improvements... you can see a post about it here:
https://www.jdupes.com/
Working toward jdupes 2.0
DrFrugal
|
1 year ago
|
on: Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)
shouldn't be the case, and i would implement a proper verify step when exploding the MPQ file, by running a reassemble and a hash comparison right afterwards.
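A rough sketch of such a verify step, assuming that concatenating the extracted parts in order reproduces the archive. The function names are hypothetical, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib


def _update_from_file(hasher, path, chunk_size=1 << 20):
    """Feed a file into the hasher in chunks to keep memory use flat."""
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)


def sha256_of_file(path):
    hasher = hashlib.sha256()
    _update_from_file(hasher, path)
    return hasher.hexdigest()


def sha256_of_parts(part_paths):
    """Hash the concatenation of the parts without writing it to disk."""
    hasher = hashlib.sha256()
    for path in part_paths:
        _update_from_file(hasher, path)
    return hasher.hexdigest()


def verify_explode(original_path, part_paths):
    """True if the parts, in order, are byte-for-byte the original file."""
    return sha256_of_file(original_path) == sha256_of_parts(part_paths)
```

Running this before the original MPQ is deleted means a bad split is caught while the source data still exists.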
DrFrugal
|
1 year ago
|
on: Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)
a BTRFS subvolume can be exported as a snapshot, and this can be passed around rather easily.
i know the concern about mounting BTRFS, but there are windows drivers for that as well, and you can also mount it via WSL nowadays, to have proper linux tooling.
the approach of having a custom storage blob with pointer references is something i am considering as well; i will play around with that during the holidays and do some experimenting.
thanks for your input.
DrFrugal
|
1 year ago
|
on: Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)
thanks for mentioning those 2 projects, will check them out over the holidays and do some experimenting ^^
DrFrugal
|
1 year ago
|
on: Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)
it's a question of both "i can store this on my 4 TiB NVMe disk" and "i can share this with other people over the internet in a timely manner"
DrFrugal
|
1 year ago
|
on: Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)
This is correct - the main goal is to keep this rather compact while still having good read times.
This will allow me to store it on my new 4 TiB NVMe drive.
A lot of iterative scanning will happen, because I search for interesting information, which helps reverse engineering.
Also it allows me to share this with other people over the internet before I kick the bucket in a few decades... transferring 10.4 TiB would be rather boring :D
DrFrugal
|
1 year ago
|
on: Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)
World of Warcraft - from the earliest (publicly leaked) Alpha up to and including the very last Wrath of the Lich King version.
Other people are already working on getting backups of the next 2 expansions working again to add those client versions to the archive as well.
DrFrugal
|
1 year ago
|
on: Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)
will probably write the MPQ blobs down to disk and deduplicate via hardlinks and additionally on block level.
i don't know restic (or borg, which was also recommended), but i will read up on it and do some tests with it regardless, since it seems to be a very nice tool for a lot of problem scenarios.
thanks for the input!
DrFrugal
|
1 year ago
|
on: Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)
thanks for the link - i will write my own extraction code though, since the format is very simplistic, and it gives me fine grained control over how things are done.
appreciate your help though!
DrFrugal
|
1 year ago
|
on: Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)
prying apart the MPQ file into its parts and writing them down to disk is at the moment probably the strongest contender for my solution.
it will cause the parts to be block aligned and hardlinkable, which cuts down on metadata and improves performance (same inode when hardlinked).
the only thing i need in this case is a script to reassemble the extracted parts into the original MPQ archives, which have to match the original content byte-by-byte ofc.
extracting the archives into their distinct parts also allows accessing the contents directly, if so desired, without needing to extract them on demand (some people wanna look up assets in specific versions).
these distinct parts can then additionally be deduplicated on block level as well.
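A small illustration of the hardlinking step, assuming the parts already sit as plain files under one directory on a single filesystem. `hardlink_duplicates` is a hypothetical helper written for this sketch; a dedicated tool such as jdupes does the same job with far more care.

```python
import hashlib
import os
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.digest()


def hardlink_duplicates(root):
    """Replace byte-identical files under `root` with hardlinks.

    Returns the number of files that were replaced by a link.
    Assumption: everything under `root` lives on one filesystem.
    """
    seen = {}  # (size, digest) -> first path seen with that content
    linked = 0
    for path in sorted(Path(root).rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        key = (path.stat().st_size, file_digest(path))
        first = seen.setdefault(key, path)
        if first is path or os.path.samefile(first, path):
            continue  # first occurrence, or already the same inode
        # Link under a temporary name, then rename over the duplicate,
        # so the path never disappears if something fails halfway.
        tmp = path.with_name(path.name + ".linktmp")
        os.link(first, tmp)
        os.replace(tmp, path)
        linked += 1
    return linked
```

After this pass, all copies of a part share one inode, so a block-level deduplicator only has to look at that data once.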
DrFrugal
|
1 year ago
|
on: Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)
even though i do not like it too much, i think i will have to pry apart the MPQ files into their distinct parts, write them down to the filesystem individually, and then delete the original file - basically what i wanted to do with the extents, but with distinct files instead.
this then can be reversed via script to assemble the original archive file on demand to get a byte-by-byte equal file again.
writing the parts down to the filesystem will cause the parts to be properly block aligned, and be able to be hardlinked, if they exist multiple times on the filesystem - this cuts down on metadata even more and also boosts performance when doing block/extent deduplication, since a single inode is only processed once in most proper deduplication programs.
the MPQ files range from a few MiB to around 2.5 GiB.
since the access should be rather fast, pairing them as an archive file is not an option for me.
thanks for the hint about merkle trees, i will read up on what they are... always good to know about different approaches to a problem :)
DrFrugal
|
1 year ago
|
on: Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)
i tried FICLONERANGE via a python wrapper btw - it turns out, that i can only clone ranges aligned to block boundaries :(
BTRFS is very neat per se, but documentation and help (most of all in very niche cases like this one here lol) are not that easy to come by.
my plan would be to properly process the data set, and then make it available as a BTRFS snapshot... you can also write the btrfs send stream to a file for storage etc.
if all my tries to use BTRFS fail, i might have to write my own tooling and virtual filesystem as well, but optimized for my use case (MPQ files and such).
thanks for your input so far.
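For reference, a hedged sketch of what a FICLONERANGE call from Python can look like, plus a pre-check for the alignment rule described above. The helper names are hypothetical. The request number is the value of `_IOW(0x94, 13, struct file_clone_range)` with the generic Linux ioctl encoding (x86, ARM); a few architectures encode ioctls differently. The block size is passed in by the caller; 4096 is a common value but is an assumption here.

```python
import fcntl
import struct

# _IOW(0x94, 13, struct file_clone_range) with the generic ioctl encoding.
FICLONERANGE = 0x4020940D


def is_cloneable(src_offset, dst_offset, length, block_size, src_size):
    """Pre-check the alignment rule before issuing the ioctl.

    Both offsets must be block aligned; the length must be block aligned
    too unless the range ends exactly at the end of the source file.
    """
    if src_offset % block_size or dst_offset % block_size:
        return False
    reaches_eof = src_offset + length == src_size
    return length % block_size == 0 or reaches_eof


def pack_clone_range(src_fd, src_offset, length, dst_offset):
    # struct file_clone_range { __s64 src_fd; __u64 src_offset;
    #                           __u64 src_length; __u64 dest_offset; }
    return struct.pack("=qQQQ", src_fd, src_offset, length, dst_offset)


def clone_range(src_file, dst_file, src_offset, length, dst_offset):
    """Share `length` bytes of src_file with dst_file (reflink).

    Raises OSError if the filesystem refuses, e.g. EINVAL for
    unaligned ranges or EOPNOTSUPP without reflink support.
    """
    arg = pack_clone_range(src_file.fileno(), src_offset, length, dst_offset)
    fcntl.ioctl(dst_file.fileno(), FICLONERANGE, arg)
```

The pre-check makes the limitation concrete: a file stored at an arbitrary byte offset inside an MPQ almost never satisfies it, which is why extracting the parts into their own files helps.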
DrFrugal
|
1 year ago
|
on: Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)
you are absolutely on point - i would prefer having a real filesystem with deduplication (not compression), which offers data in a compact form, with good read speed for further processing.
i was already brainstorming about writing a custom purpose-built archive format, which would allow me to have more fine grained control over how i can lay out data and reference it.
the thing is that this archive is most likely not absolutely final (additional versions being added) - having a plain filesystem allows for easier adding of new entries.
an archive file might have to be rewritten.
if i go the route of a custom archive, i can in theory write a virtual filesystem for it to access it read only as if it were a real filesystem... and if i design it properly, maybe even write to it.
still would prefer to use a btrfs filesystem tbh ^^
will brainstorm a bit more over the next days - thanks for your input!
DrFrugal
|
1 year ago
|
on: Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)
duperemove has "--dedupe-options=partial" which also enables this, not just full extents.
the issue still is that the data within the archive is not block aligned, which prevents me from deduplicating it properly.
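A toy demonstration of why alignment matters for block-level deduplication, assuming a 4096-byte block size. The "archives" here are made up for illustration: the same payload is embedded after some archive-specific filler, once at offsets that agree modulo the block size and once at an offset that is one byte different.

```python
import hashlib
import random

BLOCK = 4096  # assumed filesystem block size


def random_bytes(n, seed):
    """Deterministic pseudo-random bytes, so runs are repeatable."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


def block_hashes(data, block=BLOCK):
    """Hashes of every full block, cut at fixed block boundaries."""
    full = len(data) - len(data) % block
    return {hashlib.sha256(data[i:i + block]).digest()
            for i in range(0, full, block)}


def shared_blocks(a, b, block=BLOCK):
    """Number of distinct blocks that occur in both byte strings."""
    return len(block_hashes(a, block) & block_hashes(b, block))


def fake_archive(payload, offset, seed):
    """The payload preceded by `offset` bytes of archive-specific filler."""
    return random_bytes(offset, seed) + payload


if __name__ == "__main__":
    payload = random_bytes(8 * BLOCK, seed=0)
    a = fake_archive(payload, 100, seed=1)
    same_phase = fake_archive(payload, 100 + BLOCK, seed=2)
    off_by_one = fake_archive(payload, 101, seed=3)
    print(shared_blocks(a, same_phase))  # offsets agree modulo BLOCK
    print(shared_blocks(a, off_by_one))  # offsets differ by one byte
```

In the first comparison most of the payload's blocks are found in both archives; in the second, the one-byte shift means no block matches even though the payload is identical.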
DrFrugal
|
1 year ago
|
on: Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)
this is either a very big coincidence, or you are in the datamining discord as well.
the original archive i base my project on uses RMAN to store everything :D
---
thanks for the hint about the FICLONERANGE ioctl... it seems to be fine grained enough to allow me to deduplicate on arbitrary offsets, not just whole blocks.
will give it a go.
DrFrugal
|
1 year ago
|
on: Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)
this was not helpful at all, and i think you also did not read the goals of this project