Hello Hacker News! I've been developing this piece of software for about a week now, to serve as a fast and easy-to-use replacement for par2cmdline and zfec. Now that it's in a good and presentable state, I'm releasing it to the world to get users, feedback, testing on architectures that aren't x86[-64], etc. If you have any feedback, questions, or find any bugs/problems, do let me know.
You should at least be benchmarking against par2cmdline-turbo instead (stock par2cmdline isn't exactly performance-oriented). Also, you need to list the parameters used, as they significantly impact the performance of PAR2.
Your benchmark also doesn't list the redundancy %, as well as how resilient it is against corruption.
One thing I note is that both ISA-L and zfec use GF(2^8), whilst PAR2 uses GF(2^16). The latter is around twice as slow to compute, but allows for significantly more blocks/shards.
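To put rough numbers on that (assuming classic Reed-Solomon, where the total shards per stripe is capped at the field size minus one):

```shell
# Shard ceilings per stripe for each field size (illustrative arithmetic).
echo $(( (1 << 8) - 1 ))     # GF(2^8):  255 shards max
echo $(( (1 << 16) - 1 ))    # GF(2^16): 65535 shards max
```

So the 16-bit field buys roughly 256x more blocks per stripe, at about twice the compute cost mentioned above.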
I'd recommend explaining, even a tiny bit, what erasure coding is; I had to look it up as I didn't know the term. It's really cool, so explain it yourself and say why you're excited about it!
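For anyone else who had to look it up: an erasure code turns k data shards into k+m total shards such that any k of them suffice to rebuild the original. The simplest instance is a single XOR parity shard (k=2, m=1); a toy sketch in shell arithmetic, with made-up values:

```shell
# Two data "shards" plus one XOR parity shard: any single lost shard can be
# rebuilt from the other two.
d1=203; d2=89
p=$(( d1 ^ d2 ))           # parity shard = d1 XOR d2 = 146
# Suppose d1 is lost; XOR the survivors to recover it:
recovered=$(( p ^ d2 ))
echo "$recovered"           # prints 203
```

Reed-Solomon (what liberasurecode, PAR2, and zfec implement) generalizes this to arbitrary k and m over a Galois field.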
Somebody could compile the (probably extremely short, if not nil) list of pronounceable sequences of phonemes that are not vulgar, sexist, demeaning, insulting, "hurtful", or otherwise objectionable in any language.
I was not, as I am an English speaker with no knowledge of Dutch. I find it a funny coincidence, but if I change the name bef now, it'll mess with my muscle memory.
I'm not familiar with zbackup, but from a Google search it appears to be a tool to deduplicate and encrypt data. The process I envisioned while making this was to use a Unix-style series of pipes to make the backup, e.g.
tar c dir | zstd | gpg -e | bef -c -o backup.tar.zst.gpg.bef
and then to get back that file with the terribly long filename
bef -d -i backup.tar.zst.gpg.bef | unzstd | gpg -d | tar x
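Since bef reads stdin when -i is omitted and writes stdout when -o is omitted (which is what the two pipelines above rely on), each stage can be sanity-checked independently. A minimal round trip of the same pattern, with gzip standing in for zstd and the gpg/bef stages left out so it runs anywhere (file names are illustrative):

```shell
# Create a tiny tree, archive and compress it, then list the round-tripped
# archive to confirm the pipeline is lossless end to end.
mkdir -p dir
printf 'hello\n' > dir/file.txt
tar cf - dir | gzip > backup.tar.gz
gunzip < backup.tar.gz | tar tf -    # lists dir/ and dir/file.txt
```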
You mention how the parameters are all customizable, but I want to ask almost the opposite: is there a recommended set of defaults for xxx situation that the user can apply, so they don't have to be experts to figure out usage?
e.g. a recommended option for "sharing over the internet" vs "burning to a dvd" vs "writing to tape"
(I'm aware that these have their own redundancies/error control, but obviously I do not consider them sufficient.)
Currently there is just one set of defaults, but I could very well add multiple defaults dedicated to certain use cases such as the ones you have; I imagine it'd be something like a --default flag naming the use case.
This is pretty cool and I appreciate the comparison to par2. I have a (personal) backup workflow using par2 now, and this looks like an interesting replacement.
The dependency for doing erasure codes is itself pretty interesting[1]. It has a number of backends. I've used one of those, ISA-L, in the past at a major storage vendor for Reed Solomon parity blocks.
Sadly, my tool currently doesn't account for that type of corruption, as it doesn't know which data is good and which is bad when reading. So if bad data is inserted between symbols/fragments, rather than corrupting the symbols/fragments themselves, the tool will read them naively and exit with an error when the hashes don't match. I'm sure there's a clever way of defending against that, but at the moment I'm not entirely certain how best to do so.
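The difference between the two failure modes can be seen with plain shell tools (file names are illustrative): overwriting a byte in place keeps every fragment at its expected offset, which is the kind of damage parity can repair, while inserting a byte shifts everything after it, so fragment boundaries no longer line up:

```shell
printf 'abcdefghij' > sample.bin

# Overwrite in place: length unchanged, later bytes keep their offsets.
dd if=/dev/zero of=sample.bin bs=1 seek=3 count=1 conv=notrunc 2>/dev/null

# Insertion: splice one stray byte in; every byte after it shifts by one.
{ head -c 3 sample.bin; printf 'X'; tail -c +4 sample.bin; } > shifted.bin

wc -c < sample.bin     # still 10 bytes
wc -c < shifted.bin    # now 11 bytes; every offset past byte 4 is wrong
```

Tools like SeqBox address the insertion case by making each block self-locating; a fixed-size scan for per-fragment magic numbers would be one way to resynchronize.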
nehal3m|2 years ago
I don't know how that figures into your decision to name it this, but at least now you're aware.
alchemist1e9|2 years ago
What is a better sequence of steps in your opinion?
https://github.com/MarcoPon/SeqBox
gbletr42|2 years ago
bef -c --default share -i input -o output
bef -c --default dvd -i input -o output
bef -c --default tape -i input -o output
It seems like a good idea and wouldn't exactly be hard to implement.
[1]: https://github.com/openstack/liberasurecode
doubloon|2 years ago
edit: ok, 10 minutes later I have persuaded automake/conf/etc to create a makefile. Now xxhash won't compile because src/bef.c:278:2: error: unknown type name ‘XXH128_hash_t’; did you mean ‘XXH32_hash_t’?
edit: ok, many more minutes later I purged Ubuntu's xxhash, installed my own copy, and re-negotiated with automake/conf/etc.
edit: lol, downvoted for asking how to build.
edit: ok, now that it's built, I haven't the foggiest how to use it. No example or hello world is given in the README.
edit: nevermind, figured it out: ./bef -c -i bef -o bef.test
edit: so I still don't understand it. I bef'ed the Makefile, removed a character, tried to 'deconstruct' it, and the output is zero bytes.
gbletr42|2 years ago
I can't reproduce this. These are the commands I used, with it freshly compiled on a ubuntu docker container, both the v0.1 release and the master tree.
./bef -c -i Makefile -o Makefile.bef
dd if=/dev/zero of=Makefile.bef bs=1 seek=300 count=1 conv=notrunc
./bef -d -i Makefile.bef -o Makefile2
cmp Makefile Makefile2 || echo "failed!"
edit: oh, I see, you 'removed a character'. Depending on which character you removed or corrupted from the output, you either hit the issue described above with inserting noise (but this time removing information), or you corrupted the header by damaging the magic number, hash type, or hash itself. The command-line utility automatically truncates the output before calling the deconstruct function. The header is sadly the biggest single point of failure in the tool/format, which is why I introduced the --raw flag for those who don't want it.