top | item 46366687

A SIMD coding challenge: First non-space character after newline

3 points| zokrezyl | 2 months ago

I’m working on a SIMD parser for a YAML-like language and ran into what feels like a good SIMD coding challenge.

The task is intentionally minimal:

detect newlines (\n)

for each newline, identify the first non-space character that follows

Scanning for newlines alone is trivial and runs at memory bandwidth. As soon as I add “find the first non-space after each newline,” throughput drops sharply.

There’s no branching, no backtracking, no variable-length tokens. In theory this should still be a linear, bandwidth-bound pass, but adding this second condition introduces a dependency I don’t know how to express efficiently in SIMD.

I’m interested in algorithmic / data-parallel approaches to this problem — not micro-optimizations. If you treat this as a SIMD coding challenge, what approach would you try?

Another formulation:

# Bit-Parallel Challenge: O(1) "First Set Bit After Each Set Bit"

Given two 64-bit masks `A` and `B`, count positions where `B[i]=1` and the nearest set bit in `A|B` before position `i` is in `A`.

Equivalently: for each segment between consecutive bits in `A`, does `B` have any bit set?

*Example:* `A=0b10010000`, `B=0b01100110` → answer is 2 (positions 1 and 5)

Newline scan alone: 90% memory bandwidth. Adding this drops to 50%.

Is there an O(1) bit-parallel solution using x86 BMI/AVX2, or is O(popcount(A)) the lower bound?

13 comments

clausecker|2 months ago

You can do it like this, assuming A is the mask of newlines and B is the mask of non-spaces.

1. Compute M1 = ~A & ~B, which is the mask of all spaces that are not newlines 2. Compute M2 = M1 + (A << 1) + 1, which is the first non-space or newline after each newline and then additional bits behind each such newline. 3. Compute M3 = M2 & ~M1, which removes the junk bits, leaving only the first match in each section

Here is what it looks like:

    10010000 = A
    01100110 = B
    00001001 = M1 = ~A & ~B
    00101010 = M2 = M1 + (A << 1) + 1
    00100010 = M3 = M2 & ~M1

Note that this code treats newlines as non-spaces, meaning if a line comprises only spaces, the terminating NL character is returned. You can have it treat newlines as spaces (meaning a line of all spaces is not a match) by computing M4 = M3 & ~A.

zokrezyl|2 months ago

Thanks for the suggestion. Indeed, that would be the first approach. That is how I started. This is however not considering the state (am I inside a 'statement' or not)

statement meaning string from first non-space till next EOL or EOF.

Problem starts when you need to cover the "corner cases". Without the corner cases the algo is not algo.

camel-cdr|2 months ago

The problem should be equivalent to: https://www.reddit.com/r/simd/comments/1hmwukl/mask_calculat...

Falvyu's and bremac's solution seems to be the best.

zokrezyl|2 months ago

thanks for pointing out! I tried the borrowing trick from the previous segment, was pretty obvious, but for some reason failed as could not avoid at least one conditinonal... will try again.

pestatije|2 months ago

wheres the code?...have a look at codereview[5], the whole site is geared for this kind of challenges

[5] codereview.stackexchange.com

zokrezyl|2 months ago

I do not have one "implementation" but have been trying with different approaches that all delivered under 50% of memory bandwith... I guess if anyone can purpose a solution should be from scratch... The problem is that all approaches I tried end up generating unpredictable branches that do not allow the CPU to optimally keep loading text from memory.

zokrezyl|2 months ago

https://godbolt.org/z/3YMbaeEGh

One approach....

zokrezyl|2 months ago

[deleted]