This and the associated series of regex posts from Russ Cox are probably the best content on practical regular expression engine internals you can get. If you have a cursory understanding of NFAs and DFAs, it's some of the best technical writing I've ever had the pleasure to read.
These posts are responsible for my love of regular expressions :)
Many years ago, as a fresh-out-of-college young coder, I had to come up with a way to check types in a text input file in C++ -- long before C++ had a standard string type, let alone regex libraries like PCRE. The standard conversion functions at the time expected you to have done some type detection beforehand.
Each field in the file could contain a string, an int, or a float. To figure out the floats in particular, I remember sitting down for a few hours, coming up with a regular expression, testing it in Perl on a bunch of test cases, and then going about the hard work of slowly converting the regex into an NFA, then a DFA, and finally an adjacency graph in a 2D matrix. A little bit of code to read things in a byte at a time, some simple arithmetic, and I had a stupid-fast
bool isFloat(char *)
Nobody at my work understood the code; they just sort of knew that if they submitted a character array with a field value in it, something happened with a big matrix, and it let them know whether it was a float or not, and then they could call the appropriate conversion function -- I think stof() or whatever was in fashion back then.
I had to make one revision to it for an edge case I missed and then it found its way into pretty much all code for the company after that.
The tuition for the class that taught me how to do that paid for itself with that one function.
And if you don't have a cursory understanding of NFAs and DFAs, they're still great for getting that cursory understanding.
These articles were my introduction to finite automata (and almost the only place I've learned about them aside from practical use), and my intuition for them often seems to be better than my peers'.
> As mentioned earlier, no one knows how to implement regular expressions with backreferences efficiently, though no one can prove that it's impossible either. (Specifically, the problem is NP-complete, meaning that if someone did find an efficient implementation, that would be major news to computer scientists and would win a million dollar prize.)
It really needs to be noted that NP-completeness and practical efficiency have nothing to do with each other. If an algorithm takes C * (1.000...eighty zeros...1)^n time, it is exponential, but for any n you will encounter in real life the running time will be indistinguishable from C.
And the converse as well -- an algorithm that runs in 10,000,000 * n^17 time is effectively intractable but falls neatly into P.
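To put rough numbers on both points, taking the constants as given:

```latex
% Exponential with a tiny base: indistinguishable from constant.
(1 + 10^{-80})^{n} = e^{n\ln(1+10^{-80})} \approx e^{n\cdot 10^{-80}}
\approx 1 + n\cdot 10^{-80}
% Even at n = 10^{9}, the total growth factor is about 1 + 10^{-71}.

% Polynomial with a huge constant: intractable despite being in P.
10^{7}\cdot n^{17}\quad\text{at } n = 100:\qquad
10^{7}\cdot 100^{17} = 10^{41}\ \text{steps}
\approx 3\times 10^{24}\ \text{years at } 10^{9}\ \text{steps/s}
```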
And the reality is that most NP-complete problems are well-solved by heuristics except in strange combinatorial corners. It's why we can't just create encryption systems based on knowing the correct answer to an NP-complete problem; most of the problem instances will be too easy to solve.
This is a fine introduction to automata-based regexes, but the framing of comparing "good" automata with "bad" Perl-style engines is misguided. Automata and Perl-style patterns are different beasts and solve different problems. The problem seems to be one of terminology: the Perl style should never have been called regexes. That's not what they are. It's a pattern language that happens to include a variant of regexes as a subset.
I've recently discovered Lua Parsing Expression Grammars (LPEG) [1]. They are built on a different approach to parsing and are much more capable than conventional regexes: they can handle recursive structures, can include code, have debugging utilities, are often much more performant than regexes, and are an absolute delight to work with. There is also a module called re [2] which uses a syntax similar to regexes.
It's easy to miss in the article, but the reason most regex engines exhibit Perl's behavior is features like backreferences (and, I assume, lookbehinds and lookaheads).
PCRE and theoretically pure regular expressions are different things.
FWIW, it seems pretty likely the same multi-NFA algorithm described in the article could be applied to other backtracking scenarios (and therefore to back references and lookaround). Since I’m just speculating, I’ll hazard a guess that it would result in a meaningfully more complex implementation to derive each state machine, and potentially much slower (while still “linear” complexity).
Regexes can be fast and simple, but sometimes it’s faster and simpler to use normal string search. I recently replaced a regex with two finds for 10x speedup.
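The commenter doesn't say what the regex was; a hypothetical version of the trick, replacing something like the regex `foo.*bar` with two substring searches, might look like:

```c
#include <stdbool.h>
#include <string.h>

/* Equivalent to matching the regex "foo.*bar": find "foo",
   then look for "bar" anywhere after it. */
bool has_foo_then_bar(const char *s) {
    const char *p = strstr(s, "foo");
    return p != NULL && strstr(p + 3, "bar") != NULL;
}
```

Each strstr call is a straight linear scan with no engine setup, capture bookkeeping, or per-character dispatch, which is where that kind of speedup tends to come from.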
There's a pretty wide variation in regular expression performance between languages, and efficient implementations will often recognize when your regular expression has a literal prefix or suffix and optimize accordingly.
That doesn't help you if you're stuck using a library that doesn't perform those optimizations, but it means you need to be careful about importing your assumptions about regex performance from one language to another.
Sounds like the regex library was not implemented as Russ describes... a linear find should be equivalent to a well-implemented regex search for a literal string of characters.
In most cases the parser will not be shorter (in code) than the regex, but it will most certainly be simpler (cognitively -- it's hard to understand everything regexes do; they're close to magic) and more testable (you can test the parts of the parser, like the number parser or the string parser, individually), and often faster too.
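A sketch of that testability point (names and grammar hypothetical): each small piece of a hand-rolled parser can be unit-tested on its own, unlike the middle of a regex.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Consume one or more digits; return the position after them,
   or NULL if there isn't at least one digit. Testable on its own. */
static const char *parse_digits(const char *s) {
    if (!isdigit((unsigned char)*s)) return NULL;
    while (isdigit((unsigned char)*s)) s++;
    return s;
}

/* A decimal number: digits, optionally '.' followed by digits.
   Built by composing the smaller, independently tested piece. */
bool is_decimal(const char *s) {
    s = parse_digits(s);
    if (s == NULL) return false;
    if (*s == '.') {
        s = parse_digits(s + 1);
        if (s == NULL) return false;
    }
    return *s == '\0';
}
```

A failing case points directly at the helper responsible, whereas a failing regex match usually has to be debugged as one opaque unit.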
joe_guy | 1 year ago:
It's an in-depth look at where regex in .NET was before and after version 7.
bubblyworld | 1 year ago:
On the other hand, it's remarkable that so many algorithms _do_ have reasonable constants/exponents. So the concept works pretty well.
Aerbil313 | 1 year ago:
1: https://www.inf.puc-rio.br/~roberto/lpeg/
2: https://www.inf.puc-rio.br/~roberto/lpeg/re.html
hyperpape | 1 year ago:
See also: https://github.com/burntSushi/rebar
secondcoming | 1 year ago:
But yes, you're right. Sometimes simple finds are sufficient.
ChrisArchitect | 1 year ago:
https://news.ycombinator.com/item?id=20310669
dang | 1 year ago:
Regular Expression Matching Can Be Simple and Fast (2007) - https://news.ycombinator.com/item?id=20310669 - June 2019 (33 comments)
Regular Expression Matching Can Be Simple and Fast (2007) - https://news.ycombinator.com/item?id=16341519 - Feb 2018 (20 comments)
Regular Expression Matching Can Be Simple and Fast (2007) - https://news.ycombinator.com/item?id=9374858 - April 2015 (16 comments)
Regular Expression Matching Can Be Simple And Fast [2007] - https://news.ycombinator.com/item?id=6593331 - Oct 2013 (1 comment)
Regular Expression Matching Can Be Simple And Fast - https://news.ycombinator.com/item?id=820201 - Sept 2009 (15 comments)
Regular Expression Matching Can Be Simple And Fast - https://news.ycombinator.com/item?id=466845 - Feb 2009 (17 comments)