This regular expression is due to Abigail of Perl fame. Who is still very active, see http://search.cpan.org/~abigail/ for more.
However while cute, it is very slow. It tries every possible factorization as a pattern match. When it succeeds, on a string of length n that means that n times it tries to match a string of length n against a specific pattern. This is O(n^2). Try it on primes like 35509, 195341, 526049 and 1030793 and you can observe the slowdown. (Some regular expression engines may employ tricks that speed this up over the naive analysis, but none will be super speedy about it.)
It's even worse, because the speed of factoring algorithms is measured against the number of digits of the binary expansion, and this is a unary expansion. So it's really O(2^n^2)
For some this will be obvious, but for others it may still be interesting:
If this was a "true" regular expression, in the sense that it describes a regular language (http://en.wikipedia.org/wiki/Regular_language), then checking whether the number is prime or not, by checking if the expression matches, would run in linear time. This would be worth a few nobel prizes at least.
Unfortunately, it is an "extended" regular expression, i.e. not regular at all in the strictest sense, and (as the article says) requires backtracking.
It's also worth noting that what this expression matches are not primes given in decimal notation; rather, the language it accepts is "111...111" with any prime number of 1s. And we know straight-up that no "true" regular expression matches this language, by the pumping lemma (http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_langu...).
But more importantly, when we talk about the running time of algorithms, we talk about the time in relation to how many bits it takes to describe the number. So linear in the size of the number is exponential in the number of bits, and that is not very interesting.
And a better primality test would not rock the world of cryptography. At the moment http://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_... is demonstrably good enough as a prime test that nobody really is hurting for better ones. The thing that you want is a better factoring algorithm. And the General Number Sieve is already subexponential, and so better than any regular expression could be.
It has been posted before [2], so I'll repost my comment from back then [3]:
It's cool, but it's regexp, not regular expression -- regular expressions cannot test for primeness. The proof is a simple application of pumping lemma[1]: we want to show that language { a^p: p prime } is not regular. Suppose it is. By pumping lemma we have numbers N, n >= 1, such that for every k >= 0, we have a^(N+nk) in our language. But that means that for every k, N+nk is prime, and it fails for k = N, since N+nN = N(1+n) is not a prime, since both N and 1+n are greater than 1.
This is indeed beautiful. But the author limits itself to a low-level unpacking of the regex and running a few tests cases, all the while remaining mystified as to why this works.
A high-level description helps makes obvious what's happening:
To check the primality of a number N with string matching,
repeat a string (say, "1") N times
and declare N prime:
UNLESS the repetitions match the string 0 or 1 times
OR the repetitions can be EVENLY matched into groups of MORE than one string.
I wouldn't say he remained mystified. His exploration of the 9 case basically explained how it worked, just the author chose to leave it a bit implied for the reader to figure out. I would agree for a blog post it's usually better to hammer the point home though.
Uses back-references, so it doesn't describe a regular language in the strict sense. Still, it's pretty cool to use them to trick a string-processing engine to do integer factoring.
Also, it's easily provable (using pumping lemma) that the language: { 1^n | n is a prime number } is non-regular, so no "regular expression" (referring to strict mathematical definition here) can express it.
The thing that made this click for me was realizing the "1" in the regex was not special. The regex is just looping through integers (starting with 2) and checking if the target number is divisible by any of them. It can use any digit if the target string uses the same digit. Ruby example using 5 instead of 1:
def is_prime(n); str = "5"*n; str !~ /^5?$|^(55+?)\1+$/; end
The regex essentially says, for integer n, "does 2 ones fit evenly into n ones", then "does 3 ones fit into n ones", etc., which is the same as "does 2 fit into n", "does 3 fit into n", etc.
Thanks for sharing. This is a post I wrote way back in 2007 and, even though, as many have pointed out, the algorithm is very inefficient, it is still a good illustration of how powerful regular expressions are.
When I am explaining the basics of regular expressions to people, I always refer them to this post to make them realize that regular expressions are not only glorified search terms. In general, they are amazed :-)
An aspect of the syntax is confusing me here. I would have assumed that the expression 'x+?' would be parsed as '((x)+)?', ie. 'x*'. However, it seems the parser special-cases this as (x)(+?) instead, and interprets that as a non-greedy operator. Isn't this horribly inconsistent?
If there were, there would be finitely many repunit primes (via the pumping lemma and the notion that there are arbitrary large non-prime repunits).
Edit: that's not quite correct. The above would imply some simple structure in repunit primes. If you add the known asymptotical behaviour of the 'number of primes < N)' function, I think it would imply it.
Also, in binary, it would either imply there are finitely many Mersenne primes, or that there is a N such that all Mersenne numbers larger than N are prime (edit: Mersenne numbers are repunits in binary)
[+] [-] btilly|11 years ago|reply
However while cute, it is very slow. It tries every possible factorization as a pattern match. When it succeeds, on a string of length n that means that n times it tries to match a string of length n against a specific pattern. This is O(n^2). Try it on primes like 35509, 195341, 526049 and 1030793 and you can observe the slowdown. (Some regular expression engines may employ tricks that speed this up over the naive analysis, but none will be super speedy about it.)
Usual arithmetic is much more efficient. :-)
[+] [-] j2kun|11 years ago|reply
[+] [-] mathattack|11 years ago|reply
[+] [-] davidw|11 years ago|reply
[+] [-] anyfoo|11 years ago|reply
If this was a "true" regular expression, in the sense that it describes a regular language (http://en.wikipedia.org/wiki/Regular_language), then checking whether the number is prime or not, by checking if the expression matches, would run in linear time. This would be worth a few nobel prizes at least.
Unfortunately, it is an "extended" regular expression, i.e. not regular at all in the strictest sense, and (as the article says) requires backtracking.
[+] [-] Chinjut|11 years ago|reply
[+] [-] btilly|11 years ago|reply
First of all there is no Nobel in math/CS.
But more importantly, when we talk about the running time of algorithms, we talk about the time in relation to how many bits it takes to describe the number. So linear in the size of the number is exponential in the number of bits, and that is not very interesting.
And a better primality test would not rock the world of cryptography. At the moment http://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_... is demonstrably good enough as a prime test that nobody really is hurting for better ones. The thing that you want is a better factoring algorithm. And the General Number Sieve is already subexponential, and so better than any regular expression could be.
[+] [-] wyager|11 years ago|reply
Proving this is a common assignment in automata theory classes.
[+] [-] marvin|11 years ago|reply
[+] [-] xyzzyz|11 years ago|reply
It's cool, but it's regexp, not regular expression -- regular expressions cannot test for primeness. The proof is a simple application of pumping lemma[1]: we want to show that language { a^p: p prime } is not regular. Suppose it is. By pumping lemma we have numbers N, n >= 1, such that for every k >= 0, we have a^(N+nk) in our language. But that means that for every k, N+nk is prime, and it fails for k = N, since N+nN = N(1+n) is not a prime, since both N and 1+n are greater than 1.
[1] - http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_langu... [2] - https://news.ycombinator.com/item?id=3391547 [3] - https://news.ycombinator.com/item?id=3391572
[+] [-] elzr|11 years ago|reply
A high-level description helps makes obvious what's happening:
[+] [-] city41|11 years ago|reply
[+] [-] andrewflnr|11 years ago|reply
[+] [-] anuragbiyani|11 years ago|reply
[+] [-] ubernostrum|11 years ago|reply
It's probably time to give up on that terminology nit-pick.
[+] [-] tom-lord|11 years ago|reply
[+] [-] arielpts|11 years ago|reply
Very usefull!
[+] [-] charlesacknin|11 years ago|reply
AFAICT the way the string is going to be parsed is equivalent to this:
A parser implementation may optimize and do "while (n - i) >= 1" instead, which cuts in half the number of iterations.[+] [-] jdmichal|11 years ago|reply
[+] [-] fxn|11 years ago|reply
https://github.com/fxn/math-with-regexps/blob/master/one-lin...
[+] [-] nsajko|11 years ago|reply
[+] [-] logicallee|11 years ago|reply
As it stands that 1 x shift is super confusing. who would have thought 1 would become a literal! we wouldn't even write it that way in english.
[+] [-] bantic|11 years ago|reply
def is_prime(n); str = "5"*n; str !~ /^5?$|^(55+?)\1+$/; end
The regex essentially says, for integer n, "does 2 ones fit evenly into n ones", then "does 3 ones fit into n ones", etc., which is the same as "does 2 fit into n", "does 3 fit into n", etc.
[+] [-] avinash|11 years ago|reply
When I am explaining the basics of regular expressions to people, I always refer them to this post to make them realize that regular expressions are not only glorified search terms. In general, they are amazed :-)
[+] [-] ekimekim|11 years ago|reply
[+] [-] _lce0|11 years ago|reply
here a JS implementation
[+] [-] maemilius|11 years ago|reply
[+] [-] Amorymeltzer|11 years ago|reply
[+] [-] TheLoneWolfling|11 years ago|reply
[+] [-] Someone|11 years ago|reply
Edit: that's not quite correct. The above would imply some simple structure in repunit primes. If you add the known asymptotical behaviour of the 'number of primes < N)' function, I think it would imply it.
Also, in binary, it would either imply there are finitely many Mersenne primes, or that there is a N such that all Mersenne numbers larger than N are prime (edit: Mersenne numbers are repunits in binary)
[+] [-] geertj|11 years ago|reply
[+] [-] okey|11 years ago|reply