(no title)
jonas-w | 1 year ago
But seletskiy is, as you can see in his nonce which tells us that he uses an RTX4090 and has 18 Giga Hashes per second. I'm really curious how he wrote that program, as I have a Rx 7900 XTX and would love to use it for this competition haha
seletskiy|1 year ago
The code runs at steady rate of 18.00-18.40 GH/s at cloud GPU. In fact it's not hashes-per-second, but actually messages-per-second checked.
It launches a 64⁶ kernels in a loop, where each launch checks first two bytes of the SHA of a message concatenated with unique 6-byte nonce per kernel + 6-byte incremental nonce for each launch. There is only one block, so SHA algorithm is heavily trimmed. Also, most of the message is hard-coded, so pre-calculated SHA state is used; it's 10 loops less than needed to encode whole block. Since we only 2 bytes of the hash to check, last two loops also unrolled by hand to exclude parts that we wouldn't need. All code also in big-endian since SHA is, so message hardcoded in big-endian as well.
Base64-encoding is pretty time-consuming, so I've optimized it a bit by reordering the alphabet to be in ASCII-ascending order. I've got to the point where single binary-op optimization can earn 100-500 MH/s speed-up, and I don't really know what else here is remaining.
I don't have RTX4090, so instead I just rented 4090 GPU to run code on. It's less than $0.3 per hour.
I've tried GPT-4 to get some directions for optimization, but every single proposal was useless or straight wrong.
I by no means a C/GPU programmer, so probably it can be optimized much more by someone who more knowledgeable of CUDA.
GPU's are ridiculously fast. It freaks me out that I can compute >18,000,000,000 non-trivial function calls per second.
Anyways, if you want to chat, my e-mail is in the profile.
jffry|1 year ago
If you limit your variable portion to a base16 alphabet like A-P, radix encoding would just be `nibble + 0x41`. Sure you're encoding 4x as many values, but with full branch predictability. You'd have to experimentally determine if that performs any better.
You could also do something theoretically clever with bitmasking the 4 nibbles then adding a constant like 0x4141 or 0x00410041 or something (I'm too tired to think this through) and then you can encode twice as many per op assuming 32-bit bitwise ops are similar in speed to 16-bit bitwise ops.
Anyways this has been a cool challenge. You might also enjoy hunting for amulets - short poems whose sha256 hash contains a substring that's all 8: https://news.ycombinator.com/item?id=26960729
jonas-w|1 year ago
wdfx|1 year ago
Also have been using ChatGPT to do the code translations.
I'm currently stuck in yak shaving, I cannot compile CUDA programs here yet as on fedora 40 I need an earlier gcc (13.2) which I am now also having to compile from source.
jonas-w|1 year ago