top | item 37473673

(no title)

rntz | 2 years ago

This depends heavily on how repetition is implemented.

With a backtracking-search implementation of regexes, bounded iteration is pretty easy.

But the linked webpage appears to compile regexes to finite state machines (it shows you their finite-state-machine, for instance), and eg [a-z]{1,256} will have 256 states: 256 times the 1 state needed for [a-z]. If [a-z] were a complex regex, you could get a combinatorial explosion.

This alone probably isn't the issue? 256 is not a very large number. But I suspect there are follow-on algorithmic issues. This is just speculation, but I wouldn't be surprised if that 256-state machine were computed by applying DFA minimization, an algorithm with worst-case exponential running time, to a more naively generated machine.

discuss

order

d66|2 years ago

you're right. inclusion/intersection/etc. aren't actually computed via DFA but instead are computed directly on the regular expression representation itself. and large disjunctions (with 256 branches) are what is very heavy.

(it's possible to instead do these operations on DFAs but at the time i found it hard to get from an automata back to a reasonable-looking regular expression.)