Show HN: bef – a tool that encodes/decodes interleaved erasure coded streams

gbletr42|2 years ago

Hello Hacker News! I'm been developing this piece of software for about a week now, to serve as a fast and easy to use replacement for par2cmdline and zfec. Now that it is in good and presentable state, I'm releasing it to the world to get users, feedback, testing on architectures that aren't x86[-64], etc. If you have any feedback, questions, or find any bugs/problems, do let me know.

Nyan|2 years ago

You should at least be benchmarking against par2cmdline-turbo instead (stock par2cmdline isn't exactly performance-oriented). Also, you need to list the parameters used as they significantly impact performance of PAR2.

Your benchmark also doesn't list the redundancy %, as well as how resilient it is against corruption.

One thing I note is that both ISA-L and zfec use GF8, whilst PAR2 uses GF16. The latter is around twice as slow to compute, but allows for significantly more blocks/shards.

speps|2 years ago

I'd recommend explaining even a tiny bit what erasure coding is. I had to look it up as I didn't know the term. It's really cool, explain it yourself, why you're excited about it!

nehal3m|2 years ago

I am genuinely not trying to be a vulgar dingus but in my native language (Dutch) this is the verb for orally pleasuring women.

I don’t know how that figures into your decision to name it this but at least now you’re aware.

stragies|2 years ago

Somebody could compile the (probably extremely short, if not nil) list of pronounceable sequences of phonemes, that are not vulgar, sexist, demeaning, insulting, "hurtful", or otherwise objectionable in any language.

gbletr42|2 years ago

I was not, as I am a english speaker with no knowledge of Dutch. I find it a funny coincidence, but if I change the name bef now it'll mess with my muscle memory.

frutiger|2 years ago

Do you happen to know its etymology?

rizky05|2 years ago

[deleted]

alchemist1e9|2 years ago

Super neat!! I currently take encrypted zbackup files par2 split them and then seqbox and burn them to M-discs.

What is a better sequence of steps in your opinion?

https://github.com/MarcoPon/SeqBox

gbletr42|2 years ago

I'm not familiar with zbackup, but from a google search appears to be a tool to deduplicate and encrypt data. The process envisioned while making this was to use a series of pipes unix style to make the backup, e.g.

tar c dir | zstd | gpg -e | bef -c -o backup.tar.zst.gpg.bef

and then to get back that file with the terribly long filename

bef -d -i backup.tar.zst.gpg.bef | unzstd | gpg -d | tar x

ComputerGuru|2 years ago

This is really nice work, great job.

You mention how the parameters are all customizable but I want to ask almost the opposite: is there a recommend set of defaults for xxx situation that the user can apply, so they don't have to be experts to figure out usage?

e.g. a recommended option for "sharing over the internet" vs "burning to a dvd" vs "writing to tape"

(I'm aware that these have their own redundancies/error control, but obviously I do not consider them sufficient.)

gbletr42|2 years ago

Currently there is just one set of defaults, but I could very well add multiple different defaults dedicated to certain use cases such as the ones you have. I imagine it'd be something like this

bef -c --default share -i input -o output, bef -c --default dvd -i input -o output, bef -c --default tape -i input -o output, etc.

It seems like a good idea and wouldn't exactly be hard to implement.

loeg|2 years ago

This is pretty cool and I appreciate the comparison to par2. I have a (personal) backup workflow using par2 now, and this looks like an interesting replacement.

The dependency for doing erasure codes is itself pretty interesting[1]. It has a number of backends. I've used one of those, ISA-L, in the past at a major storage vendor for Reed Solomon parity blocks.

[1]: https://github.com/openstack/liberasurecode

mkeedlinger|2 years ago

Hey this is very cool! And something I've looked for multiple times before

zcw100|2 years ago

This sounds similar to fountain codes https://en.m.wikipedia.org/wiki/Fountain_code

loeg|2 years ago

Fountain codes are a class of erasure code, but not a specific tool.

PhilipRoman|2 years ago

Thats a really cool program. Is it possible to do it the opposite way - recover data with occasional noise inserted between symbols?

gbletr42|2 years ago

Sadly my tool currently doesn't account for that type of corruption, as it doesn't know what is good data or bad data when reading. So if bad data is inserted between symbols/fragments, rather than corrupting the symbols/fragments themselves, the tool would read them naively and exit with an error when the hashes don't add up. I'm sure there's a clever way of defending against that, but as of this moment I'm not entirely certain how best to do so.

doubloon|2 years ago

how do i build it?

edit. ok 10 minutes later if have persuaded automake/conf/etc to create a makefile. now xxhash wont compile because src/bef.c:278:2: error: unknown type name ‘XXH128_hash_t’; did you mean ‘XXH32_hash_t’?

edit ok many more minutes later i purged ubuntu xxhash and installed my own copy and re-negotiatied with automake/conf/etc

edit lol downvoted for asking how to build.

edit ok now that its built, havent the foggiest how to use it. no example or hello world is given in readme.

edit nevermind figured it out. ./bef -c -i bef -o bef.test

edit so i still dont understand it. i bef'ed the Makefile, removed a character, tried to 'deconstruct' it, the output is zero bytes

gbletr42|2 years ago

> i bef'ed the Makefile, removed a character, tried to 'deconstruct' it, the output is zero bytes

I can't reproduce this. These are the commands I used, with it freshly compiled on a ubuntu docker container, both the v0.1 release and the master tree.

./bef -c -i Makefile -o Makefile.bef

dd if=/dev/zero of=Makefile.bef bs=1 oseek=300 count=1

./bef -d -i Makefile.bef -o Makefile2

cmp Makefile Makefile2 || echo "failed!"

edit: oh, I see, you 'removed a character'. Depending on what character you removed or corrupted from the output, you could've either hit the issue described above with inserting noise, but this time removing information, or you caused corruption in the header by either corrupting the magic number, hash type, or hash itself. The command line utility automatically truncates the output before calling the deconstruct function. The header is sadly the biggest single point of failure in the tool/format, which is why I introduced the --raw flag for those who don't want it.

unknown|2 years ago

[deleted]

44 comments