Email Regex that works 99.99%

[+] tom-lord|11 years ago|reply

I believe this falls under the category of "things that may be fun to play around with but should never be used in a real system".

Unfortunately, I bet there are thousands of "real systems" employing regexes like this... How many problems does this solve? Probably zero. How many does (/will) it cause? Probably much more than zero.

[+] lucaspiller|11 years ago|reply

Not sure why this has been downvoted, but yes, these sorts of things do cause more problems than they solve.

Case in point, at work the other day I found a bug in a service I manage. It consists of a front end form (built by one team), which submits data to another system (built by my team), which then passes the data to a third party. The third party was rejecting the data we were trying to send them as the email addresses were apparently invalid. The validation they were doing didn't match the validation the front end form did, so to the user everything seemed fine.

[+] falcolas|11 years ago|reply

After extensive research, I have finally come up with a way to improve upon 99.99%. I have come up with a regex which will work for 100% of email addresses, and a significant number of regex engines, as an added bonus! Behold:

    .*@.*

/s

[+] moe|11 years ago|reply

  $ nc localhost 25
  220 uhura.z ESMTP
  MAIL FROM:me
  250 2.1.0 Ok
  RCPT TO:root
  250 2.1.5 Ok
  DATA
  354 End data with <CR><LF>.<CR><LF>
  Look ma, no @!
  .
  250 2.0.0 Ok: queued as 8DB653E0065
  quit
  221 2.0.0 Bye

[+] madaxe_again|11 years ago|reply

Your verbosity is outlandish.

    .*

will match all email addresses, and as an additional feature, all other strings too!

As we all know, more functionality with less code is better, therefore, this is clearly superior to all other regexes.

[+] fphilipe|11 years ago|reply

    .+@.+

Should do the trick

[+] arthurfm|11 years ago|reply

What regex would you suggest to match all e-mail addresses contained within an arbitrary body of text (as opposed to a single text field where you don't have to worry about other text strings)?

[+] peteretep|11 years ago|reply

You know the different languages match different sets of email addresses, right? The reason the Perl ones are so much longer is that they work for _all_ RFC5322 addresses, whereas the JS ones match a subset.

[+] michaelmcmillan|11 years ago|reply

I simply don't validate emails up front anymore. The only thing I check is whether the string contains an @ character, and I only do that to be nice if it's left out by accident. Instead of having a monstrous regex pattern in my code, I simply email a confirmation link that the user must click.
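A minimal sketch of that approach, assuming Python. The function names `has_at_sign` and `new_confirmation_token` are hypothetical, and sending the mail and storing the token are left out:

```python
import secrets


def has_at_sign(address: str) -> bool:
    """Courtesy check only: catches a name typed into the email field."""
    return "@" in address


def new_confirmation_token() -> str:
    """Random URL-safe token to embed in the confirmation link (hypothetical helper)."""
    return secrets.token_urlsafe(32)
```

The address counts as verified only once the user follows the link carrying the token; the `@` check merely catches obvious slips early.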

[+] hobs|11 years ago|reply

That's the thing, the only way to validate an email address anyway is to actually have the user take some action to do so. Otherwise, I would question how important setting up the email thing is in the first place.

[+] jimsmart|11 years ago|reply

Jeez, that's 100 email addresses in a million that it won't work on. Plus it's a pain in the butt. [Edit: Though I suspect that 99.99% figure was made up]

Just get people to enter their email twice (which filters out most mistakes where people are entering their names or somesuch), and don't validate it with a regex. During the signup process, tell them to expect an email which they must confirm before they are added or before their account is activated. Send a confirmation email with a clickable link. If people don't get it, and the service is important to them, they'll try again or contact you through another means.

(I was involved with the running of a mailing list with well over 1m double-opt-in subscribers. Fewer than 100 of these turned out to be invalid [Edit: yeah, that's a guess, like the OP's 99.99%], and we dealt with it easily at our end by properly handling any bounces.)

[+] bshimmin|11 years ago|reply

Utterly pointless. An email regex tells you that the email address (probably) conforms to a pattern that means it might be a valid email address (for now, until new weird TLDs emerge and the patterns have to change...), but it has no way of telling you whether that address can actually receive mail. `foo@bar` fails these regular expressions and `[email protected]` passes them, but neither will receive mail.

As I have told people for many years: if you must do this, check at most if there's an @ and (perhaps) a dot somewhere after the @, which is enough to stop someone who has accidentally put their name in the email address field, or a similar user error. Anything else is a waste of brainwidth and will result in more problems than it solves.

[+] erglkjahlkh|11 years ago|reply

The dot isn't mandatory either... There are also local (for example company internal) email services.

[+] PeterWhittaker|11 years ago|reply

Completely useful, as part of a two-step process:

1. Filter with the regex - what's left has a valid format, making step 2 much saner.

2. Extract and validate the domain name - super simple now, because the domain component is known to be sane.

(Optional but good idea 3: Handle exceptions....)

Step 1 is almost always the hardest part, now it's mostly done.

[+] wtbob|11 years ago|reply

> `foo@bar` fails these regular expressions and `[email protected]` passes them, but neither will receive mail.

foo@bar could still receive email, if you had a host in your DNS domain named bar, with a user named foo.

[+] Tepix|11 years ago|reply

It's a very good offline check. If that's not enough, you have to do online checking (DNS, RCPT TO, actual mail with confirmation link, etc.)

[+] tokenizerrr|11 years ago|reply

It's excellent to extract email addresses from a text.
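For the extraction use case, a rough Python sketch. The pattern below is a deliberately simple assumption, not one of the regexes from the linked site, and it will miss valid addresses such as quoted local parts or dotless domains:

```python
import re

# Simple illustrative pattern: word-ish local part, '@', dotted domain.
SIMPLE_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_addresses(text: str) -> list:
    """Return all substrings of text that look like email addresses."""
    return SIMPLE_EMAIL.findall(text)


print(extract_addresses("Contact alice@example.com, or bob+x@mail.example.org."))
```

Trailing sentence punctuation is left out because every dot in the domain must be followed by at least one more character.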

[+] babo|11 years ago|reply

This would be valuable only with a proper test suite: nothing fancy, just two files with valid and invalid addresses. I don't trust these, and it's very hard to debug a complex regex; it's way easier to argue about test cases.
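A sketch of such a harness in Python. The helper `check_pattern` and the sample addresses are illustrative assumptions; a real suite would load the two lists from files:

```python
import re


def check_pattern(pattern, valid, invalid):
    """Return (false_rejects, false_accepts) for a full-match pattern."""
    rx = re.compile(pattern)
    false_rejects = [a for a in valid if not rx.fullmatch(a)]
    false_accepts = [a for a in invalid if rx.fullmatch(a)]
    return false_rejects, false_accepts


# Illustrative cases only, not an authoritative corpus.
VALID = ["user@example.com", "user+tag@example.com", "root@localhost"]
INVALID = ["John Smith", "", "@"]

# A stricter-looking pattern that turns out to reject two valid addresses.
print(check_pattern(r"[\w.]+@[\w.]+\.\w+", VALID, INVALID))
```

Arguing about which list an address belongs in is then a discussion about data rather than about regex syntax.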

[+] chrisfarms|11 years ago|reply

Yeah this is not great practice.

We built an app where sending a validation email upfront was not a practical option some time ago, and the best strategy I found for ensuring the email was valid was to lookup the MX records and ask the mailserver for the given domain, by issuing RCPT commands. Many mailservers will just drop connections when RCPT is for someone who doesn't exist or can't be routed to which was a good indicator of a typo or invalid address. And of course if the MX lookup fails the domain is incorrect.

Still wouldn't recommend this method either though really.

[+] IgorPartola|11 years ago|reply

Did you remember to fall back on an A and AAAA lookup if there is no MX record? That is what you are supposed to do.
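The fallback logic can be sketched as follows in Python. `lookup_mx` and `lookup_addr` are hypothetical callables standing in for whatever DNS library is in use; they are assumed to return a list of MX target hosts and a list of A/AAAA addresses respectively, or an empty list when nothing is found:

```python
from typing import Callable, List


def mail_hosts(domain: str,
               lookup_mx: Callable[[str], List[str]],
               lookup_addr: Callable[[str], List[str]]) -> List[str]:
    """Hosts to try for delivery: MX targets, else the domain itself."""
    mx = lookup_mx(domain)
    if mx:
        return mx
    # No MX record: treat the domain as its own mail host (implicit MX),
    # but only if it resolves to an A or AAAA address.
    return [domain] if lookup_addr(domain) else []
```

An empty result means there is neither an MX record nor an address record, so there is nowhere to deliver.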

[+] h1fra|11 years ago|reply

Only way to check email.

Look for an '@' and parse the response log from the provider to know if the address works.

[+] imron|11 years ago|reply

So in 1 million emails, there are 10,000 valid emails that will be rejected. I guess if you use this regex then you'd better hope your service doesn't become popular.

[+] ThePadawan|11 years ago|reply

Your math is off by a factor of 100.
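The arithmetic, worked in Python with integer basis points (hundredths of a percent) to avoid float rounding. The function name `rejected` is just for illustration:

```python
def rejected(total: int, success_basis_points: int) -> int:
    """Number of addresses that fail, given a success rate in basis points."""
    return total * (10_000 - success_basis_points) // 10_000


print(rejected(1_000_000, 9_999))  # 99.99% success -> 100 rejected
print(rejected(1_000_000, 9_900))  # 99% success -> 10000 rejected
```

A figure of 10,000 rejections per million would correspond to a 99% success rate, not 99.99%.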

[+] shdon|11 years ago|reply

Title here has an extra 9, compared to what's on the site. In either case, how would one back up those percentages anyway?

[+] steventhedev|11 years ago|reply

Nice idea. But a blind implementation of the grammar set out in the RFC is not performant. It's better to drop the obsolete syntax rules and folding white space. I have yet to see a user legitimately try to input an email with a comment mid-domain.

I implemented this in Ruby a while back, but I also went the next step and added a DNS check for an MX record. That way you can ensure there's a mail server to receive an email. Heck, I even wrote a blog post about it.

http://stevenkaras.github.io/blog/verifying-email-addresses

We've had pretty good feedback so far, but I've also spotted a few emails that people enter that are clearly fake (e.g. asdf at test.com).

[+] lutoma|11 years ago|reply

Meh. I wish people would just give up on trying to validate email addresses altogether (except for maybe basic stuff like checking for an @). They'll almost always forget about some edge case.

I use a pretty unusual TLD (.su, the old TLD for the Soviet Union, which still remains in the root zone), and from time to time, I come across a site that won't accept my email address because of that. Most of those sites turn out to be generally crappy though, so not much of a loss…

Also, many sites don't accept + in email addresses, which is annoying as hell if you want to use the address extension feature of Postfix et al.

[+] exratione|11 years ago|reply

Doesn't work on the dreaded "I am a terrible but nonetheless valid email address"@example.com.

I echo the advice of everyone else - validate with something very simple like .+@.+ and then by sending an email. Trying to recapture the complexity of the email system via a tool like regular expressions is tilting at windmills. It's like trying to develop a regular expression to determine whether a name is real or not.

https://www.exratione.com/2012/09/what-constitutes-an-accept...

[+] quailman|11 years ago|reply

Note that these regexes aren't even matching the same thing, so who knows what these things are matching. Whatever it is, it's probably not 99.99% of the world's emails, and either way nobody is going to check that. Especially nobody is going to parse that Perl beast.

As a tangential anecdote, I always thought it would be interesting to drop a backdoor into some canonical piece of code like this that noobs are bound to copy-paste. It might be the most efficient way to worm your way into the largest number of computers worldwide.

[+] dfhoughton|11 years ago|reply

There are two Perl regexes. The beast is for 5.8 (check your Perl version; it's probably above 5.12). The other is basically a BNF grammar and is trivial to parse. The only easier ones are those that throw up their hands and just look for an @ with characters before and after.

[+] ricardobeat|11 years ago|reply

These have wildly different behaviour. The .NET and Javascript ones are even exchangeable (both are valid js). They will also not match internationalized domain names unless converted to punycode before validation.
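A sketch of the punycode conversion step in Python, assuming the standard library's `idna` codec (which implements the older IDNA 2003 rules) is acceptable. The helper name `to_ascii_address` is hypothetical, and malformed domains may raise `UnicodeError`:

```python
def to_ascii_address(address: str) -> str:
    """Convert the domain part to ASCII (punycode); leave the local part alone."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        return address  # no '@': nothing to convert
    return local + "@" + domain.encode("idna").decode("ascii")


print(to_ascii_address("user@bücher.example"))  # user@xn--bcher-kva.example
```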

IMO you either implement the RFC, or use the absolute dumbest validation possible: 1) there is one '@' character present 2) there is at least one dot on the right side. Anything else will exclude some valid addresses, and you're unlikely to ever hear feedback/complaints from someone who had their sign-up email rejected.
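The "absolute dumbest" check described above might look like this in Python. The function name is made up, and by construction it rejects dotless domains and quoted local parts containing an `@`:

```python
def passes_dumb_check(address: str) -> bool:
    """True if there is exactly one '@' and at least one dot to its right."""
    if address.count("@") != 1:
        return False
    _, domain = address.split("@")
    return "." in domain
```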

[+] rascul|11 years ago|reply

I haven't bothered with validating an email address for a while, but I have used https://isemail.info successfully in the past.

[+] nailer|11 years ago|reply

In JS the regex is built into the browser: you can leverage HTML5 .checkValidity() on any type="email" input.

https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Form...

Note this is super liberal, so user@domainwithnodots (which is RFC valid, but probably also a user error) is still considered valid.

[+] comeonnow|11 years ago|reply

Any idea on who is responsible for this micro-site?

I find it strange that there's no information on the author, sources, references, attribution, or credits on the page at all (other than the WordPress theme attribution).

[+] Piskvorrr|11 years ago|reply

Three nines is now the new four nines? No data to back it up? Bah humbug!
