isaacimagine | 9 months ago
I profiled just now:
| opt level | instrs (aarch64) | time 100k (s) | conway samples (%) |
|---|---|---|---|
| -O0 | 606 | 19.10 | 78.50% |
| -O3 | 135 | 3.45 | 90.52% |
The 3.45 s surprises me, because it's faster than the 4.09 s I measured earlier. Maybe I had a P core vs. an E core. For -O0, the compiler is emitting machine code like:

```
0000000100002d6c ldr x8, [sp, #0x4a0]
0000000100002d70 ldr x9, [sp, #0x488]
0000000100002d74 orn x8, x8, x9
0000000100002d78 str x8, [sp, #0x470]
```
Which is comically bad. If I try with e.g. -Og, I get the same disassembly as -O3. Even -O1 gives me the same disassembly as -O3. The assembly (-Og, -O1, -O3) looks like a pretty direct translation of the C. Better, but also nothing crazy (e.g. no autovectorization):

```
0000000100003744 orr x3, x3, x10
0000000100003748 orn x1, x1, x9
000000010000374c and x1, x3, x1
0000000100003750 orr x3, x8, x17
```
Looking more closely, there's actually surprisingly little register spilling. I think the real question you're asking is, as I wrote:
> If we assume instruction latency is 1 cycle, we should expect 2,590 fps. But we measure a number nearly 10× higher! What gives?
Part of this is due to miscounting the instructions in the disassembly. In the blog post I used 349 instructions, going off Godbolt, but in reality it's 135. If I redo the calculations with these new numbers, I get 2.11 instructions per bit and 0.553 million instructions per step; dividing that into 3.70 Gcycles/s gives 6,690 fps. That's better than 2,590 fps, but still 3.6× slower than 24,400. I think 3.6× is a factor you can chalk up to instruction-level parallelism.
Hope that answers your questions. Love your writing Gwern.
gwern | 9 months ago
A full-strength symbolic optimization framework like an SMT solver might be able to boil the logic gates down into something truly optimal, which would then be a very interesting proof of concept to certain people, but I expect that might be an entire project in its own right for you and not something you could quickly check.
Still, something to keep in mind: there's an interesting neurosymbolic research direction here in training logic gates to try to extract learned 'lottery tickets', which could then be turned into hyper-optimized symbolic code achieving the same task performance while possibly being far more energy-efficient or formally verifiable.
JonChesterfield | 9 months ago
isaacimagine | 9 months ago
Re: neurosymbolics, I'm sympathetic to wake-sleep program synthesis and that branch of research. In a draft of this blog post I had an aside about the possibility of extracting circuits and reusing them, and another about doing student-teacher training to replace stable subnets of standard (e.g. dense ReLU) networks with optimized DLGNs during training, to free up parameters for other things.