top | item 1387418

Perfect email regex finally found

310 points | mildweed | 16 years ago | fightingforalostcause.net

112 comments

[+] percept|16 years ago|reply
I've accepted that it's best to treat people like grown-ups and if there's '@' and '.' and it's retyped then it passes. Someone can easily submit a fake name or phone number or street address, and e-mail's no different.

If they get it wrong, intentionally or not, then they don't get their receipt, confirmation, validation link, etc. and I believe in most cases the incentive is there for them to get it right.

In the rare case where there's some incentive to circumvent the system and this has some measurable impact on a site, then more validation may be warranted. Otherwise, why worry about it?

Also: HTML5. ;)

[+] fhars|16 years ago|reply
Your recommendation is almost as wrong as the regexp in the article, as both will reject the perfectly valid postmaster@ai address (check it out, "dig -t mx ai" will return a result, so that address must exist). The only thing you can be sure about in a non-local email address is that it contains at least one @ sign. More are allowed (for source routing), but hopefully no server is still configured to honor that, so you might get away with requiring exactly one @. Everything else is evil.

Oh, and don't forget to make sure that one component in your spam^W email processing chain correctly encodes Unicode characters in the domain part into punycode.
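
To make the difference concrete, here is a minimal Ruby sketch. The helper names `at_and_dot?` and `exactly_one_at?` are made up for illustration and appear nowhere in the thread; the first encodes the "@ and ." rule from the parent comment, the second the "exactly one @" rule.

```ruby
# Hypothetical helpers, for illustration only.

# The "has an @ and a ." rule from the parent comment.
def at_and_dot?(address)
  address.include?("@") && address.include?(".")
end

# The looser rule: the address contains exactly one @.
def exactly_one_at?(address)
  address.count("@") == 1
end

puts at_and_dot?("postmaster@ai")      # false
puts exactly_one_at?("postmaster@ai")  # true
```

The looser rule accepts a dotless domain such as `ai`, which the "@ and ." rule turns away.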

[+] sgk284|16 years ago|reply
I also do simple email verification:

1) Send an email to the address.

2) If it succeeds, the address was valid.

[+] pornel|16 years ago|reply
Retyped!? Grown-ups can read what they write.

Retyping only makes sense for a password field, which is obfuscated and doesn't allow copy&paste.

[+] jrockway|16 years ago|reply
My thoughts exactly. There are so many websites that tell me my valid email address is invalid that it's not even funny. These people then have to deal with phone calls and lost business, because their form won't even submit without a valid email address (and why should I change if they're the ones that suck?).

BTW, the email address that doesn't work is "[email protected]". The .us confuses people and the - confuses people. WTF!?

[+] Confusion|16 years ago|reply

  If they get it wrong, intentionally or not, then they
  don't get their receipt, confirmation, validation link,
  etc. and I believe in most cases the incentive is there
  for them to get it right.
Well, I hope you have a large enough support team, because with any reasonable amount of customer growth, you'll soon be swamped with support emails that say "I didn't get my ..., where is it? You suck!". Validating the email address (which includes checking for common typos in domains) is a service that reduces frustration and the amount of customer support needed.
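
A rough Ruby sketch of what such a typo check might look like. The lookup table and the `suggest_domain` name are assumptions made up for this example; a real list would have to come from your own bounce or support data.

```ruby
# Hypothetical table of misspelled domains and their likely corrections.
COMMON_TYPOS = {
  "gmial.com"   => "gmail.com",
  "gmail.con"   => "gmail.com",
  "hotmial.com" => "hotmail.com",
  "yaho.com"    => "yahoo.com"
}.freeze

# Returns a corrected address to offer the user, or nil if nothing looks off.
def suggest_domain(address)
  local, at, domain = address.rpartition("@")
  return nil if at.empty?              # no @ at all, nothing to suggest
  fixed = COMMON_TYPOS[domain.downcase]
  fixed && "#{local}@#{fixed}"
end

puts suggest_domain("alice@gmial.com")  # alice@gmail.com
p suggest_domain("alice@gmail.com")     # nil
```

The suggestion is meant to be offered as a "did you mean ...?" prompt rather than applied silently, since the unusual spelling may be the real domain.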
[+] jbr|16 years ago|reply
Along those lines, I've settled on the following overly permissive regex: /^[^\s@]+@[^\s@]+\.[^\s@]{2,}$/ -- it makes sure it looks something like an email address ([email protected])
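
The same pattern tried out in Ruby, as a rough sketch. The only change is swapping `^` and `$` for `\A` and `\z`, because in Ruby `^` and `$` also match at line breaks inside the string; the constant name and the sample addresses are assumptions of this example.

```ruby
# Same pattern as above, anchored to the whole string for Ruby.
LOOSE_EMAIL = /\A[^\s@]+@[^\s@]+\.[^\s@]{2,}\z/

[
  "user@example.com",              # ordinary address
  "first.last@mail.example.co.uk", # several dots
  "postmaster@ai",                 # no dot after the @
  "two words@example.com",         # whitespace
  "a@b@example.com"                # second @
].each do |candidate|
  puts "#{candidate.inspect}: #{LOOSE_EMAIL.match?(candidate)}"
end
```

The first two samples match and the last three do not, so the dotless-domain objection raised earlier in the thread applies to this pattern as well.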
[+] richardw|16 years ago|reply
There are many other uses for this regex than registration systems. We have at least two situations where a staff member would type in a customer mail address, and some get it wrong. Confirming won't help, and getting it wrong costs lots of time to fix. Using a better regex costs very little.
[+] Zev|16 years ago|reply
There are more reasons to check for an email than simply validation. Perhaps you want to detect emails in a field and let people click/tap on them to send that person an email? Don't want to let people send email off to [email protected] now, do we?
[+] Chilijoe|16 years ago|reply
You assume an email regex is only used for validating addresses. It can also be used to parse text for email addresses (for formatting or mining purposes maybe).
[+] gojomo|16 years ago|reply
Nothing is finished, nothing is permanent, and nothing is perfect.

In particular, one of the evaluation tests used here is wrong: it requires failure-to-match on a TLD with a digit in it:

[email protected]

In fact, IDN TLDs will have digits in them. An internet-draft is in the works to replace RFC1123's IDN-unfriendly implication that digits in TLDs are illegal:

http://tools.ietf.org/html/draft-liman-tld-names-02

[+] sigzero|16 years ago|reply
Saying "will have" and "in the works" means it isn't the standard yet, so the current test is valid.
[+] pornel|16 years ago|reply
Not perfect: doesn't support IDN without punycode. Doesn't support IDN TLDs at all.

Users of http://موقع.وزارة-الاتصالات.مصر won't be pleased :)

[+] maw|16 years ago|reply
At the risk of sounding flip, not supporting punycode sounds like a feature. Getting internationalization right is clearly important, but punycode as a means of doing so? It's one of the many things that make me weep for my industry.

In fact, I often wonder if punycode is a prank that got out of hand.

UTF-8, on the other hand, would have been excellent for this purpose.

[+] Periodic|16 years ago|reply
It does not look like any of these can detect or were tested against quoted-local-part addresses. As I understand it, the local part can be quoted to allow illegal characters to be used, e.g. "John Doe"@example.com

I fully understand that these are not in common use, but they are part of the RFC and may be in use somewhere.

[+] Terretta|16 years ago|reply
His test also considers the % sign invalid, yet I've both sent and received email that required the % for the mail to be routed/relayed correctly (think "in care of" or c/o).

Granted, that was 17 years ago, but who's to say it's not in use somewhere?

[+] eli|16 years ago|reply
Uh, what happens if a new TLD has more than 6 characters?

I don't get this preoccupation with making sure addresses look valid. The ONLY way to validate an email address is to send it a message.

[+] generalk|16 years ago|reply
I never understood the purpose of the "perfectly valid by the RFC" email regex. You may be able to say with 100% accuracy that something should be an email address, but you'll never be able to tell if it's a valid account on the server, or if the server even exists.
[+] augustl|16 years ago|reply
Email validation is user input validation; you are protecting the user from some cases of erroneous input, and yourself from stuff that doesn't even look like an email.

I find these large catch-all email regexps silly for two reasons:

1. They are hard to write, hard to understand, and hard to maintain.

2. Most importantly, they are difficult to understand for users. "You entered an invalid email". Now what? the user asks.

This is why e-mail validation should be done in steps. Here is some Rails pseudocode:

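  # \A and \z anchor the whole string. In Ruby, ^ and $ match at every
  # line break, and Rails 4 and later raise an ArgumentError for them
  # in format validators unless :multiline => true is set.

  # Step 1: there is an @ somewhere.
  validates_format_of :email,
    :with => /@/,
    :message => "Needs to contain an @."
  
  # Step 2: the address ends in a dot followed by a dot-free suffix.
  validates_format_of :email,
    :with => /\.[^.]+\z/,
    :message => "Has to end with .com, .org, .net, etc."
  
  # Step 3: something comes before the @.
  validates_format_of :email,
    :with => /\A.+@/,
    :message => "Must have an address before the @"
  
  # Step 4: exactly one @, with text on both sides.
  validates_format_of :email,
    :with => /\A[^@]+@[^@]+\z/,
    :message => "Must be of the format '[email protected]'"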
Much easier to write, much easier to maintain, and much better error messages to the users.
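
Outside Rails, the same staged idea might look like the sketch below. `EMAIL_CHECKS` and `email_errors` are illustrative names, not Rails API, and the last message is reworded for this example.

```ruby
# Each step pairs a small pattern with the message shown when it fails.
EMAIL_CHECKS = [
  [/@/,               "Needs to contain an @."],
  [/\.[^.]+\z/,       "Has to end with .com, .org, .net, etc."],
  [/\A.+@/,           "Must have an address before the @"],
  [/\A[^@]+@[^@]+\z/, "Must contain exactly one @ with text on both sides"]
].freeze

# Returns the messages for every failed step (empty when all steps pass).
def email_errors(address)
  EMAIL_CHECKS.reject { |pattern, _| pattern.match?(address) }
              .map { |_, message| message }
end

p email_errors("user@example.com")  # []
p email_errors("example.com")
p email_errors("@example.com")
```

Because every step carries its own message, the user is told which part of the address to fix instead of getting one generic "invalid email" error.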
[+] snprbob86|16 years ago|reply
So many web services do this wrong, that it isn't even worth doing it right: no one is going to complain that your service doesn't accept their "wacky! quoted"@email.address

Sometimes, you aren't validating a whole string, you are searching for email addresses in a sea of text, or an arbitrarily delimited, user-entered list of contacts.

Support usernames with alphanumerics, dashes, underscores, periods, and plus signs; require a single @; support domains with alphanumerics, dashes, and at least one period. Screw anyone with something more complex than that. Done deal.
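
Those rules translate into a short pattern for pulling addresses out of free text. This is a Ruby sketch under those stated rules only; the `SIMPLE_EMAIL` name and the sample text are invented for the example.

```ruby
# Local part: alphanumerics, dash, underscore, period, plus.
# Domain: alphanumeric/dash labels joined by at least one period.
SIMPLE_EMAIL = /[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/

text = "Contacts: alice@example.com; bob.smith+news@mail.example.org, " \
       "and carol (no address yet)."

# String#scan returns every non-overlapping match in order.
p text.scan(SIMPLE_EMAIL)
# => ["alice@example.com", "bob.smith+news@mail.example.org"]
```

The pattern is left unanchored on purpose so that it can find addresses inside a delimited list; for validating a single field it would need `\A` and `\z` around it.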

[+] huherto|16 years ago|reply
I see your point. If somebody is using something weird, they probably have problems everywhere.
[+] andrewcooke|16 years ago|reply
I have an implementation of RFC3696 http://www.faqs.org/rfcs/rfc3696.html (which is the spec for validating emails) in Python here - http://www.acooke.org/lepl/api/lepl.apps.rfc3696-module.html

That is part of Lepl - http://www.acooke.org/lepl/ - and although it's implemented in a recursive descent parser, much is compiled to regular expressions for efficiency. So you get the best of all worlds: regexp efficiency; parser accuracy; standards based.

A blog post on the compilation to regexps is here - http://www.acooke.org/cute/LEPLOptimi0.html

[+] aphyr|16 years ago|reply
It's certainly more concise than my previous favorite,

      qtext = '[^\\x0d\\x22\\x5c\\x80-\\xff]'
      dtext = '[^\\x0d\\x5b-\\x5d\\x80-\\xff]'
      atom = '[^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-' +
        '\\x3c\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+'
      quoted_pair = '\\x5c[\\x00-\\x7f]'
      domain_literal = "\\x5b(?:#{dtext}|#{quoted_pair})*\\x5d"
      quoted_string = "\\x22(?:#{qtext}|#{quoted_pair})*\\x22"
      domain_ref = atom
      sub_domain = "(?:#{domain_ref}|#{domain_literal})"
      word = "(?:#{atom}|#{quoted_string})"
      domain = "#{sub_domain}(?:\\x2e#{sub_domain})*"
      local_part = "#{word}(?:\\x2e#{word})*"
      addr_spec = "#{local_part}\\x40#{domain}"
      pattern = Regexp.new "\\A#{addr_spec}\\z", nil, 'n'
[+] billturner|16 years ago|reply
This is the same one I've been using the last couple of years, and I wish I could remember where I first came across it.
[+] skoob|16 years ago|reply
"Now you have two problems."

Seriously, I can't think of a single good reason why you would want to check whether an email address is "valid". What you should be concerned about is whether or not the address works (and usually, can/does the person who just signed up actually read and reply to email sent to that address).

Hypothetically, if an invalid address works (due to bugs in mail systems) -- then it works, and the only problem with accepting such an address is that the bugs might get fixed. If an address is valid, there's no reason to assume that it will work or that it belongs to the person who signed up. It isn't even a good way of detecting typos; transpose two characters in an email address and it will most likely still pass your validation.

[+] mvalle|16 years ago|reply
>Seriously, I can't think of a single good reason why you would want to check whether an email address is "valid"!

Because it's fun.

[+] angelbob|16 years ago|reply
It makes me happy that a contest of this kind is hosted at the domain "fighting for a lost cause" :-)
[+] jemfinch|16 years ago|reply
How in the world did this get 172 points?

After this I almost stopped paying attention: "It's my philosophy that it's better to accept a few invalid addresses than reject any valid ones, so I'm shooting for 0 false-positives and as few false-negatives as possible."

But then I looked at the regexps and they miss an absolutely trivial fact: valid email addresses can end in a dot. "[email protected]." is just as valid as (more so, in fact, than) "[email protected]".

[+] pbhjpbhj|16 years ago|reply
>How in the world did this get 172 points?

Things don't have to be well done, nor do you have to agree with them for them to be worth consideration/stimulating.

[+] DanBlake|16 years ago|reply
&*=?^+{}'[email protected]:2000 really looks strange, even though it's a valid email.

Good luck trying to register on any site with it though :)

[+] petercooper|16 years ago|reply
What's the deal with using a port number in an e-mail address? I can't imagine many systems supporting that. Anyone got any more info on it that doesn't involve me digging through 101 pages of RFCs? :-)
[+] fragmede|16 years ago|reply
It thinks [email protected] is valid, and thinks decimaldomain@2130706433 (localhost) is not valid. It also rejected user@3ffe:1900:4545:3:200:f8ff:fe21:67cf as invalid.
[+] perplexes|16 years ago|reply
Um, UTF-8? Unicode domains? Regexing email is a waste of time.
[+] mey|16 years ago|reply
I assume this breaks with the new non-Latin TLDs or IPv6.
[+] edw519|16 years ago|reply
As a typical optimizer, I was always trying to reduce my source code a few more bytes or speed up my processes a few nanoseconds here and there. I was so proud of myself. Until I tried to maintain my own slickness.

There's a fine line between clever and practical. Why do I have a gut feeling that this approach is way over that line?

[+] ryan-allen|16 years ago|reply
I think a basic token based parser might be more appropriate for thorough validation of email addresses. I see it as similar to trying to match XML or HTML with regexps, very common but so very easy to break. You can't go too wrong parsing an address token by token, can you?
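
A token-by-token check could be sketched in Ruby roughly as follows. It covers only a simplified subset of the grammar (atoms, dots, quoted strings, a single @) and leaves out comments, domain literals, and folding whitespace; all function names are made up for this example.

```ruby
require "strscan"

# Splits an address into [type, text] tokens, or returns nil when it
# reaches a character that none of the token rules accept.
def tokenize_address(address)
  scanner = StringScanner.new(address)
  tokens = []
  until scanner.eos?
    if (text = scanner.scan(/"(?:[^"\\]|\\.)*"/))
      tokens << [:quoted, text]
    elsif (text = scanner.scan(/@/))
      tokens << [:at, text]
    elsif (text = scanner.scan(/\./))
      tokens << [:dot, text]
    elsif (text = scanner.scan(/[^\s@."()<>\[\]\\,;:]+/))
      tokens << [:atom, text]
    else
      return nil
    end
  end
  tokens
end

# True when the types alternate word, dot, word, ... and begin and end
# with one of the allowed word types.
def dotted?(types, word_types)
  return false if types.empty? || types.size.even?
  types.each_with_index.all? do |type, i|
    i.even? ? word_types.include?(type) : type == :dot
  end
end

# Simplified grammar: word ("." word)* "@" atom ("." atom)*
def plausible_address?(address)
  tokens = tokenize_address(address)
  return false if tokens.nil?
  types = tokens.map(&:first)
  return false unless types.count(:at) == 1
  at = types.index(:at)
  local = types[0...at]
  domain = types[(at + 1)..-1]
  dotted?(local, [:atom, :quoted]) && dotted?(domain, [:atom])
end

puts plausible_address?("user@example.com")          # true
puts plausible_address?('"John Doe"@example.com')    # true
puts plausible_address?("user.@example.com")         # false
```

Unlike the single-regex approaches, the quoted local part mentioned earlier in the thread falls out of the tokenizer naturally, and each rejected token type could be mapped to its own error message.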