Except that the optimization seem to be quite spectacular.
Compiling pantalaimon's code, there doesn't seem to be any significative difference between the two algorithms with the default compilation parameters of gcc.
However compiling with gcc -O3, Knuth's algorithm is almost twice as fast on an Intel i7-7700HQ, old tricks still work well:
naive search
found: 0, took 597719 ns
found: 0, took 601291 ns
found: 0, took 593332 ns
found: 0, took 592513 ns
found: 0, took 592799 ns
found: 0, took 592409 ns
found: 0, took 592563 ns
found: 0, took 592289 ns
found: 0, took 631602 ns
found: 0, took 592928 ns
average: 597944.50 ns
knuth search
found: 0, took 300860 ns
found: 0, took 318691 ns
found: 0, took 317502 ns
found: 0, took 317348 ns
found: 0, took 351612 ns
found: 0, took 317625 ns
found: 0, took 318217 ns
found: 0, took 312106 ns
found: 0, took 397532 ns
found: 0, took 318284 ns
average: 326977.69 ns
Just for fun, here is the dump of assembler code for function search_knuth:
0x0000000000001220 <+0>: add %rsi,%rdx
0x0000000000001223 <+3>: mov %edi,%eax
0x0000000000001225 <+5>: mov %dil,(%rdx)
0x0000000000001228 <+8>: cmp (%rsi),%dil
0x000000000000122b <+11>: je 0x1238 <search_knuth+24>
0x000000000000122d <+13>: nopl (%rax)
0x0000000000001230 <+16>: add $0x1,%rsi
0x0000000000001234 <+20>: cmp %al,(%rsi)
0x0000000000001236 <+22>: jne 0x1230 <search_knuth+16>
0x0000000000001238 <+24>: cmp %rsi,%rdx
0x000000000000123b <+27>: setne %al
0x000000000000123e <+30>: retq
And here is the dump of assembler code for function search_naive:
0x00000000000011e0 <+0>: add %rsi,%rdx
0x00000000000011e3 <+3>: mov %edi,%eax
0x00000000000011e5 <+5>: cmp %rdx,%rsi
0x00000000000011e8 <+8>: je 0x1205 <search_naive+37>
0x00000000000011ea <+10>: cmp (%rsi),%dil
0x00000000000011ed <+13>: jne 0x11fc <search_naive+28>
0x00000000000011ef <+15>: jmp 0x1210 <search_naive+48>
0x00000000000011f1 <+17>: nopl 0x0(%rax)
0x00000000000011f8 <+24>: cmp %al,(%rsi)
0x00000000000011fa <+26>: je 0x1210 <search_naive+48>
0x00000000000011fc <+28>: add $0x1,%rsi
0x0000000000001200 <+32>: cmp %rsi,%rdx
0x0000000000001203 <+35>: jne 0x11f8 <search_naive+24>
0x0000000000001205 <+37>: xor %eax,%eax
0x0000000000001207 <+39>: retq
0x0000000000001208 <+40>: nopl 0x0(%rax,%rax,1)
0x0000000000001210 <+48>: mov $0x1,%eax
0x0000000000001215 <+53>: retq
Compiling pantalaimon's code, there doesn't seem to be any significative difference between the two algorithms with the default compilation parameters of gcc.
However compiling with gcc -O3, Knuth's algorithm is almost twice as fast on an Intel i7-7700HQ, old tricks still work well:
naive search found: 0, took 597719 ns found: 0, took 601291 ns found: 0, took 593332 ns found: 0, took 592513 ns found: 0, took 592799 ns found: 0, took 592409 ns found: 0, took 592563 ns found: 0, took 592289 ns found: 0, took 631602 ns found: 0, took 592928 ns average: 597944.50 ns knuth search found: 0, took 300860 ns found: 0, took 318691 ns found: 0, took 317502 ns found: 0, took 317348 ns found: 0, took 351612 ns found: 0, took 317625 ns found: 0, took 318217 ns found: 0, took 312106 ns found: 0, took 397532 ns found: 0, took 318284 ns average: 326977.69 ns
Just for fun, here is the dump of assembler code for function search_knuth:
And here is the dump of assembler code for function search_naive: