top | item 44271330

(no title)

kiru_io | 8 months ago

Can you elaborate or show a better way?

discuss

order

fedsocpuppet|8 months ago

   julia> using Random, BenchmarkTools
   julia> function isvowel(c)
            idx = (c | 0x20) - Int('a')
            return (0x00104111 & (1 << idx)) != 0
         end

   julia> hasvowel(str) = any(c -> isvowel(Int(c)), str)

   julia> @btime hasvowel(s) setup=(s=randstring("bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ0123456789", 10))
   15.739 ns (0 allocations: 0 bytes)
   false
some 77 thousand times faster

camel-cdr|8 months ago

The usual way to do this in SIMD is via vpermb(lut, str) == str, or potentially slightly faster vpshufb(str>>1, lut) == str.

str|0x32 if you want case-insensitive.

This works because the lowest five bits, more specifically bits two to five, of the vowels are distinct.

The lut is zeroed except for the indices corresponding to the vowels, which contain it self: lut[vowel&31]=vowel or lut[(vowel&31)>>1]=vowel

teo_zero|8 months ago

> The lut is zeroed except for the indices corresponding to the vowels, which contain it self

So in the comparison vs. the original str, vowels get a true and consonants get a false. Doesn't the "==" then returns true if all the chars are vowels?

ninkendo|8 months ago

I was curious if there were any interesting bit-masking patterns in ASCII for vowels that could be exploited, so I shamelessly asked ChatGPT and it gave a pretty nice insight:

- Setting bit 5 high forces lowercase for the input letter

- masking with `& 31` gives an index from 0-25 for the input letter

Then you can the `bt` instruction (in x86_64) to bit-test against the set of bits for a,e,i,o,u (after lowercasing) and return whether it matches, in a single instruction.

It came up with this, which I thought was pretty nice: https://godbolt.org/z/KjMdz99be

I'm sure there's other cool ways to test multiple vowels at once using AVX2 or AVX-512, I didn't really get that far. I just thought the bit-test trick was pretty sweet.

Chat transcript is here (it failed pretty spectacularly the first couple times, tripping over AT&T syntax and getting an off-by-one error, but still pretty good) https://chatgpt.com/share/684c8b39-a9c4-8012-8bb6-74e1f8b6d0...

mystified5016|8 months ago

See my other comment.

You can do extremely trivial binary checks on ASCII, UTF-8 (and most? other) encoding schemes. All vowels including W and Y contain a 1 in the lowest bit. Then by comparing bits 1-4, you can do a trivial case-insensitive comparison. You detect case by checking bit 5. 0 is upper, 1 is lower.

rurban|8 months ago

How does that fly with foreign scripts? It certainly won't work with Greek, Arabic, Korean, Chinese, Japanese and the Indian languages.

tough|8 months ago

Rust, Go, C?

dmoy|8 months ago

Or even Lua if you don't want to go to a statically compiled typed language?