High speed Unicode routines using SIMD

126 points| Peter5 | 3 years ago |github.com

20 comments

[+] jansan|3 years ago|reply
I do not know about the real-world implications of this, but just reading about a 20x performance increase for standard cases makes me excited.
[+] dhosek|3 years ago|reply
There are some basic things that can give huge performance increases without even parallelization: I wrote a Unicode crate because I needed a different interface to getting grapheme clusters from a string than the existing crates offered. I wrote a character category crate as well because I thought I would need it for this purpose (I didn’t). I managed to get double the performance of the existing segmentation crate and 10–20x the performance of the existing Unicode category crate.

https://crates.io/crates/finl_unicode

[+] vanderZwan|3 years ago|reply
The readme has a link to a technical paper on arxiv that was uploaded last year[0], has that perhaps been discussed before?

(not meant as a complaint that this might have been submitted before already, I'm just curious about what might have already been said about it)

[0] https://arxiv.org/abs/2109.10433

[+] Peter5|3 years ago|reply
I searched for it and couldn't find a match. I chose to submit the code instead of the paper, since the repository links to the paper anyway, and code is almost always more welcome than a paper.
[+] dragontamer|3 years ago|reply
I dunno much about Unicode, but I imagine it is a regular language? (Aka: acceptance / rejection can be determined by regular expressions / finite state machines)

If so, regular languages are one of those 'surprising things that can be parallelized'. It doesn't seem possible at first thought though.

[+] lifthrasiir|3 years ago|reply
Just in case, here "Unicode routines" refers to Unicode encoding routines, not other more complex things like normalization. Those are indeed regular, and their automata are relatively simple. (By the way, SIMD-accelerated regular expression parsing is indeed a thing [1], if anyone wondered.)
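
To make the "relatively simple automaton" point concrete, here is a hedged sketch (not code from the linked repository) of a scalar UTF-8 structure check as a tiny state machine, where the only state is the number of continuation bytes still expected. It deliberately checks byte structure only; a full validator would also reject overlong encodings and surrogate code points, which this sketch handles only partially (by rejecting 0xC0/0xC1 lead bytes).

```python
def utf8_plausible(data: bytes) -> bool:
    """Structural UTF-8 check as a finite state machine.

    State = number of continuation bytes (0x80..0xBF) still expected.
    This is a simplified sketch: it does not catch every overlong or
    surrogate encoding that a strict validator must reject.
    """
    need = 0  # continuation bytes still expected
    for b in data:
        if need:
            if 0x80 <= b <= 0xBF:   # valid continuation byte
                need -= 1
            else:
                return False
        elif b < 0x80:              # ASCII, stays in state 0
            pass
        elif 0xC2 <= b <= 0xDF:     # 2-byte lead (0xC0/0xC1 are overlong)
            need = 1
        elif 0xE0 <= b <= 0xEF:     # 3-byte lead
            need = 2
        elif 0xF0 <= b <= 0xF4:     # 4-byte lead (0xF5+ exceed U+10FFFF)
            need = 3
        else:
            return False
    return need == 0                # must not end mid-sequence
```

Because every step depends only on the current byte and a small state, this kind of check maps naturally onto table lookups that SIMD instructions can batch.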

[1] https://www.microsoft.com/en-us/research/wp-content/uploads/...

[+] yxhuvud|3 years ago|reply
That a regular language may be parallelized does not mean all can be.

That said, there are techniques around to evaluate context-free grammars (and thus also regular grammars) in parallel. One way to do it is as follows:

Break up the input string into n separate strings. The processing of the first string starts at a single state, the starting state. Each other string starts at the set of possible states, with the output being a mapping between state at the first character to the state after the last.

Then at the end the output is stitched together.

[+] Genbox|3 years ago|reply
I'm at the point where I can no longer see any reason to use UTF-16. UTF-8 is used everywhere today, and constantly converting between the two is not only inefficient but also introduces a risk of bugs/corruption.
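
One concrete corruption risk (my illustration, not from the parent comment, using Python only as an example runtime): UTF-16-based string types can hold an unpaired surrogate as a lone code unit, but no valid UTF-8 encoding of it exists, so converting either fails or silently replaces data.

```python
# A string containing an unpaired high surrogate, as UTF-16-based
# runtimes can produce, e.g. by slicing between surrogate halves.
s = "abc\ud800def"

# Strict conversion to UTF-8 rejects it outright.
try:
    s.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Lossy conversion silently replaces the surrogate.
lossy = s.encode("utf-8", errors="replace")
```

Either outcome, a hard error deep in an I/O path or silent replacement, is the kind of bug that pure-UTF-8 pipelines avoid by construction.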
[+] jahewson|3 years ago|reply
Yes it’s annoying that we’re stuck with UTF-16 inside many programming languages. Still, I think it’s better than having two types of string (C, C++) or introducing a breaking change (looking at you Python 3).