(no title)
icelancer | 1 month ago
I performed a similar analysis to you and found it very difficult to imagine sub-1000. Your comment I think convinced me that it may be possible, though. Interesting.
I'm below the threshold for recruiting but not below Claude at the moment. Not sure where I am going wrong.
amirhirsch|1 month ago
For the first several rounds (when every tree value is in use) Combine the stage 5 XOR with the subsequent round’s tree XORs. You can determine even/odd in hash stage 5 starting with a ^ (a>>16) without Xoring the constant, then you can only need one XOR, this saves you a ton of XORs
Create separate instruction bundles for the first round, rounds 1-5 (combining hash stages 5 XOR with next round tree XORs) and 6-9 (not every tree node is used anymore), round 10 round 11-14 and round 15 and combine them.
you can use add_imm in parallel to load consts. stage 0 you have to do load the tree first and the vals, by later stages when everything is in scratch, you could use 12 scalar XORs and 6 vector XORs on scratch. once you vload vals, you can start to do XORs but can only advance so much at a time, so I’m starting to work on getting hash stages moving to different rounds faster to hide the initial vloads and get to the heavy load section sooner and spread the load pain.