top | item 44272242

(no title)

pjz | 8 months ago

The problem with the 'characters are just numbers' approach is that they're not _just_ numbers.. with the advent of unicode, they're _sequences of numbers_... so bytes thus can mean different things when part of a a sequence than when standalone.

That said, since they're numbers, we should use the most efficient checks for them... which are likely vectorized SIMD assembly instructions particular to your hardware. And which I've seen no one mention.

discuss

glangdale|8 months ago

Yes, that's it. Vectorized SIMD annihilates this problem, a space I've been working in since 2006 and it wasn't all that new even then. A close second would be a heavily optimized (pipelined and less branchy) table or bitvector lookup. Doing anything that involves lots of control flow, like the grandparent post, will be slow as a wet week with our without bit manipulation tricks due to the inherently unpredictable nature of the branches (subject to our input).