top | item 35059706

(no title)

osaariki | 3 years ago

We have an early tool paper [1] for a previous version of the engine, but that's short and with POSIX semantics, so doesn't include a lot of the interesting stuff. The most relevant bit there is the handling of Unicode.

>This doesn't compose though, right? For example, if you have `sam|samwise`, then you can do that transformation, but if you have `\b(?:sam|samwise)\b` then you can't.

You'd get subsumption when you have something like '(?:sam)?wise|wise' and in fact this kind of subsumption due to a "nullable prefix" is the main one currently detected (because we encountered patterns that motivated it). And you're right that all these rewrites should compose regardless of context so that they can be eagerly applied in the AST constructors.

>> but a larger class of patterns get to stay in the DFA world with derivatives. >Can you say more about this?

Yeah, the easiest example I can point at is from that tool paper [1], where a subsumption-based rewrite for counted repetitions can turn an exponential blow-up into a linear one. Off the top of my head I think a pattern like '.*a[ab]{0,1000}' would have a 2^1000 blow-up when determinized into a DFA but stays linear in size with derivatives. However, that loop subsumption rule isn't valid as-is under PCRE semantics, so it still needs some work to be ported to the .NET engine.

Before we get that PLDI paper out the best resource is probably just the code under [2]. It's fairly well commented, but of course that's no substitute for a good write-up.

[1]: https://www.microsoft.com/en-us/research/uploads/prod/2019/0... [2]: https://github.com/dotnet/runtime/tree/main/src/libraries/Sy...

discuss

burntsushi|3 years ago

Nice, thank you!