osaariki | 3 years ago
PCRE semantics were indeed the big thing that required new techniques. Basically, you have to modify the derivation function so that the derivatives model which paths through the pattern a backtracking engine would consider before stopping at the current position.
The big thing derivatives buy you is the ability to apply rewrite rules lazily during construction. For example, when we can prove that a regex R subsumes a regex T, then an alternation R|T can be rewritten to just R, since T is already included in it. These kinds of rewrites often result in the constructed DFA being minimal or close to it. Of course, you do pay somewhat for the machinery to do this, so best-case construction times suffer compared to traditional NFA + lazy-DFA engines like RE2 or Rust's, but a larger class of patterns gets to stay in the DFA world with derivatives.
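To make that concrete, here's a minimal sketch (illustrative only, not our actual implementation) of Brzozowski derivatives where rewrite rules are applied lazily inside "smart constructors", so redundant states collapse as they are built:

```python
# Minimal Brzozowski-derivative sketch with lazy rewrites in smart
# constructors. Names and structure are illustrative, not the .NET code.
from dataclasses import dataclass

class Re: pass

@dataclass(frozen=True)
class Empty(Re): pass   # matches nothing

@dataclass(frozen=True)
class Eps(Re): pass     # matches only the empty string

@dataclass(frozen=True)
class Chr(Re):
    c: str

@dataclass(frozen=True)
class Cat(Re):
    a: Re
    b: Re

@dataclass(frozen=True)
class Alt(Re):
    a: Re
    b: Re

@dataclass(frozen=True)
class Star(Re):
    a: Re

def alt(a, b):
    # Rewrite rules applied during construction; a real engine would
    # plug richer subsumption checks in here.
    if isinstance(a, Empty): return b
    if isinstance(b, Empty): return a
    if a == b: return a          # simplest subsumption: R|R -> R
    return Alt(a, b)

def cat(a, b):
    if isinstance(a, Empty) or isinstance(b, Empty): return Empty()
    if isinstance(a, Eps): return b
    if isinstance(b, Eps): return a
    return Cat(a, b)

def nullable(r):
    if isinstance(r, (Eps, Star)): return True
    if isinstance(r, (Empty, Chr)): return False
    if isinstance(r, Cat): return nullable(r.a) and nullable(r.b)
    return nullable(r.a) or nullable(r.b)   # Alt

def deriv(r, ch):
    if isinstance(r, (Empty, Eps)): return Empty()
    if isinstance(r, Chr): return Eps() if r.c == ch else Empty()
    if isinstance(r, Alt): return alt(deriv(r.a, ch), deriv(r.b, ch))
    if isinstance(r, Star): return cat(deriv(r.a, ch), r)
    # Cat: if the head is nullable, the derivative may also skip into the tail
    d = cat(deriv(r.a, ch), r.b)
    return alt(d, deriv(r.b, ch)) if nullable(r.a) else d

def matches(r, s):
    for ch in s:
        r = deriv(r, ch)
    return nullable(r)

# Equal alternation branches collapse immediately:
assert alt(Chr('a'), Chr('a')) == Chr('a')
assert matches(cat(Chr('a'), Star(Chr('b'))), 'abb')
assert not matches(cat(Chr('a'), Star(Chr('b'))), 'ba')
```

A real engine additionally memoizes derivatives to form DFA states; the point here is just that the rewrites happen inside the constructors, so simplification is interleaved with determinization rather than run as a separate pass.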
I hope our work ignites a new interest in regex matching with derivatives. I believe the ability to apply these syntactic rewrites on-the-fly is really powerful and I'd love to see how far folks like you with extensive experience in optimizing regex matchers can take this.
burntsushi | 3 years ago
> For example, when we can prove that a regex R subsumes a regex T, then an alternation R|T can be rewritten to just R, since T is already included in it.
This doesn't compose though, right? For example, if you have `sam|samwise`, then you can do that transformation, but if you have `\b(?:sam|samwise)\b` then you can't.
> but a larger class of patterns get to stay in the DFA world with derivatives.
Can you say more about this?
osaariki | 3 years ago
> This doesn't compose though, right? For example, if you have `sam|samwise`, then you can do that transformation, but if you have `\b(?:sam|samwise)\b` then you can't.
You'd get subsumption when you have something like '(?:sam)?wise|wise' and in fact this kind of subsumption due to a "nullable prefix" is the main one currently detected (because we encountered patterns that motivated it). And you're right that all these rewrites should compose regardless of context so that they can be eagerly applied in the AST constructors.
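Here's a hypothetical, tuple-encoded sketch (names and encoding mine) of that "nullable prefix" rule: in an alternation (P R) | R where the prefix P can match the empty string, the right branch R is redundant, since any match of R alone is also a match of P R with P matching empty:

```python
# Illustrative sketch of nullable-prefix subsumption applied eagerly in
# an alternation constructor; not the actual .NET implementation.

def nullable(r):
    """Can this regex node match the empty string?"""
    op = r[0]
    if op in ('eps', 'opt', 'star'):
        return True
    if op in ('chr', 'str'):
        return False
    if op == 'cat':
        return nullable(r[1]) and nullable(r[2])
    if op == 'alt':
        return nullable(r[1]) or nullable(r[2])

def smart_alt(a, b):
    # Rewrite applied in the constructor:
    # (P R) | R  ->  P R   when P is nullable.
    if a[0] == 'cat' and nullable(a[1]) and a[2] == b:
        return a
    return ('alt', a, b)

# '(?:sam)?wise|wise' collapses to '(?:sam)?wise':
lhs = ('cat', ('opt', ('str', 'sam')), ('str', 'wise'))
assert smart_alt(lhs, ('str', 'wise')) == lhs

# Unrelated branches are left alone:
assert smart_alt(('chr', 'x'), ('chr', 'y')) == ('alt', ('chr', 'x'), ('chr', 'y'))
```

Because the check is purely local to the alternation node, it stays valid regardless of what surrounds the pattern, which is what lets it be applied eagerly in the AST constructors.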
> > but a larger class of patterns get to stay in the DFA world with derivatives.
> Can you say more about this?
Yeah, the easiest example I can point at is from that tool paper [1], where a subsumption-based rewrite for counted repetitions can turn an exponential blow-up into a linear one. Off the top of my head I think a pattern like '.*a[ab]{0,1000}' would have a 2^1000 blow-up when determinized into a DFA but stays linear in size with derivatives. However, that loop subsumption rule isn't valid as-is under PCRE semantics, so it still needs some work to be ported to the .NET engine.
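One rough way to see the blow-up (my own toy model of subset construction, not the .NET engine) is to count DFA states for '.*a[ab]{0,k}'. A subset-construction state is the set of remaining-iteration budgets of active copies of the bounded loop; the subsumption rewrite observes that [ab]{0,i} is contained in [ab]{0,j} for i <= j, so only the largest budget needs to be kept:

```python
def reachable_states(k, subsume):
    """Count reachable subset-construction states for '.*a[ab]{0,k}'.

    A state is a frozenset of remaining-iteration budgets of active copies
    of the bounded loop. Reading 'a' advances existing copies and spawns a
    fresh copy with budget k; reading 'b' only advances. With subsume=True,
    a copy with a larger budget subsumes any smaller one, so only the
    maximum budget is kept. (Toy model for illustration.)
    """
    start = frozenset()
    seen = {start}
    stack = [start]
    while stack:
        state = stack.pop()
        for ch in 'ab':
            nxt = {c - 1 for c in state if c > 0}   # advance, drop exhausted
            if ch == 'a':
                nxt.add(k)                          # spawn a fresh loop copy
            if subsume and nxt:
                nxt = {max(nxt)}                    # loop-subsumption rewrite
            nxt = frozenset(nxt)
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen)

print(reachable_states(3, False))  # 16 = 2^(k+1): exponential in k
print(reachable_states(3, True))   # 5 = k+2: linear in k
```

With k = 1000 the unsubsumed construction would need on the order of 2^1001 states, while the subsumed one needs 1002, matching the exponential-vs-linear contrast above.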
Before we get that PLDI paper out the best resource is probably just the code under [2]. It's fairly well commented, but of course that's no substitute for a good write-up.
[1]: https://www.microsoft.com/en-us/research/uploads/prod/2019/0...
[2]: https://github.com/dotnet/runtime/tree/main/src/libraries/Sy...