It's Impossible to Validate an Email Address

[+] gaur|10 years ago|reply

What I'm about to say is more general than regex, but can online services please stop trying to validate my email address?

If I gave you an email address that you think is invalid, rest assured I did it for a reason. I'm not an imbecile: I know how to type my address correctly (especially when you make me type it twice). For all the imbeciles who don't know how to type their address correctly, the phone system still works fine.

I may have given you my real email address with a plus-sign for a filter. Don't tell me it's invalid.

I may have given you a fake email address, because I know you're just going to spam me. If you tell me it's invalid, I'll either spend an extra few minutes cooking up a better fake email address, or I'll leave your site.

[+] Kequc|10 years ago|reply

https://medium.com/i-m-h-o/please-stop-verifying-my-email-ad...

I wrote something about this a while ago. I'm no wordsmith but it fits here.

[+] T-hawk|10 years ago|reply

But, statistically speaking, it probably is an imbecile.

These systems aren't optimizing for someone like you trying to be clever. They're trying to handle the more common cases of someone typing a street address in the wrong field or forgetting the ".com" or putting in an HTTP address or some such.

They're trying to catch that, not every possible pedantically legal string you could throw at it. Maybe they could improve the validator, or maybe they'd rather be spending programmers' time on things with actual business value.

The sum total of imbeciles may well represent a bigger market than the sum total of people like you.

[+] makecheck|10 years ago|reply

Agreed. I find things like "[email protected]" usually work, and then if they ever bother to read what’s in their database they may get a hint.

[+] astrocat|10 years ago|reply

I know it can be frustrating, it's happened to me too, but the reality is that email validation generally isn't for you. It's for the 99% of other people who would greatly appreciate the heads up that they've typed "something@yahoocom" or "Boys@[email protected]" and it's probably not what they meant to type.

Now we can talk about HOW validation is implemented, and I think it would be completely fair to raise a warning: "Hey, did you mean to enter this?" instead of "sorry, nope." when non-trivial addresses are encountered.

[+] tyingq|10 years ago|reply

If you happen to control the web page where the user is entering the email, this little piece of code has been a godsend for us:

https://github.com/mailcheck/mailcheck

I agree with the idea that it's impossible to validate. But, mailcheck takes the approach of seeing if the email is potentially wrong, then prompting the user with what it thinks they meant. It's usually right, but if not, it allows whatever the user wants.

For example, if your user types in "[email protected]", it will suggest "[email protected]".
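
A minimal sketch of the "suggest, don't block" idea, using only Python's standard library. This is not mailcheck's actual algorithm; the domain list, the `suggest_address` name, and the similarity cutoff are all assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical list of common mail domains; a real deployment would tune this.
COMMON_DOMAINS = ["gmail.com", "yahoo.com", "hotmail.com", "outlook.com", "aol.com"]


def suggest_address(address, cutoff=0.8):
    """Return a suggested correction for a likely domain typo, or None.

    This only suggests; the caller should still accept the original
    address if the user confirms it.
    """
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return None  # no usable "@": nothing sensible to suggest
    domain = domain.lower()
    if domain in COMMON_DOMAINS:
        return None  # already a known domain
    # Pick the single most similar known domain, if any is similar enough.
    matches = get_close_matches(domain, COMMON_DOMAINS, n=1, cutoff=cutoff)
    if not matches:
        return None
    return f"{local}@{matches[0]}"
```

The important design choice is that a `None` result and a rejected suggestion both lead to the same place: the address the user typed is accepted as-is.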

[+] thedufer|10 years ago|reply

This kind of stuff is great - make suggestions, but allow it to go through even if you think it's wrong. For the last email validation I worked on, there were only 2 absolute blockers - there must be an @ sign, and the domain must have MX records (emails that are technically valid remain useless if we can't send them anything).

There were a number of other checks (being close to yahoo.com or gmail.com or other common email hosts, containing surprising characters, etc) that would trigger warnings, but still allow the check to pass if the user assured us it was correct.
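
The two tiers described above (hard blockers versus overridable warnings) might be sketched like this. The `Verdict` type, the `check_address` name, and the specific warning rules are assumptions, and the MX lookup is passed in as a hypothetical `has_mx` callable so the sketch does not depend on any particular DNS library. The first check is slightly stricter than "there must be an @ sign": it also requires text on both sides.

```python
from typing import Callable, List, NamedTuple


class Verdict(NamedTuple):
    blocked: bool        # True means the address cannot be accepted at all
    messages: List[str]  # reason for blocking, or warnings the user may override


def check_address(address: str, has_mx: Callable[[str], bool]) -> Verdict:
    """Apply two hard checks, then soft checks that only produce warnings.

    `has_mx` is a caller-supplied function (hypothetical here) that reports
    whether a domain has MX records, e.g. backed by a DNS library.
    """
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return Verdict(True, ["address must contain an @ with text on both sides"])
    if not has_mx(domain):
        return Verdict(True, [f"domain {domain!r} has no MX records"])

    # Soft checks: surprising, but allowed if the user confirms.
    warnings = []
    if "." not in domain:
        warnings.append("domain has no dot; is that intended?")
    if any(ch.isspace() for ch in address):
        warnings.append("address contains whitespace; is that intended?")
    return Verdict(False, warnings)
```

A form would refuse to submit only when `blocked` is true; for warnings it would show the messages and a "yes, this is correct" option.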

[+] makecheck|10 years ago|reply

This advice should probably extend to a variety of form elements.

Like DRM and a lot of other efforts to “control” things, aggressive validators invariably punish people who are just trying to do legitimate things. Don’t piss off your real customers.

I used to live in a town with a 12-letter name, and more than once a form decided that it knew the Universal Sensible Maximum Length of Town Names and wouldn’t let me type the last couple characters. And it usually doesn’t stop there, because once a site is incapable of storing things sensibly it invariably starts to have trouble matching things, giving errors that are just plain stupid (e.g. “this other thing doesn’t match what you entered”, well no shit...).

There is also too much thought put into what constitutes a person’s “name”. Generally, to work across all possible cultures, use a single, very long text field that can contain whatever the person decides to type. After accepting their input as-is, feel free to internally perform parsing logic to try to allow additional database queries but under no circumstances should your page make any assumptions.

The real “you should be fired as database administrator” mistake though is to store modified data without telling the user. This usually happens with passwords; I use a site for months and then one day accidentally hit Return too soon and my password works anyway, meaning they just CLIPPED whatever strong password I entered and stored whatever they felt like (usually 8 characters). NEVER do things like that without telling the user.

[+] Freak_NL|10 years ago|reply

It is 2016. We should know better by now than to clip passwords at all. Sure, place a limit of 512 characters on it to prevent abuse, and for security reasons there can be a number of minimal requirements for length and complexity, but please let me be the one to decide if 32 characters is sensible or not.
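
One way to read the two comments above is "reject or accept, but never modify". A small sketch of that rule follows; the 512-character ceiling comes from the comment, while the minimum length and the `check_password` name are assumptions.

```python
MAX_PASSWORD_CHARS = 512  # generous abuse limit, as suggested in the comment above
MIN_PASSWORD_CHARS = 8    # hypothetical minimum; pick your own policy


def check_password(password: str) -> str:
    """Return the password unchanged, or raise ValueError with a clear message.

    The password is never trimmed or clipped: it is either accepted exactly
    as typed or rejected with an explanation the user can act on.
    """
    if len(password) < MIN_PASSWORD_CHARS:
        raise ValueError(f"password must be at least {MIN_PASSWORD_CHARS} characters")
    if len(password) > MAX_PASSWORD_CHARS:
        raise ValueError(f"password must be at most {MAX_PASSWORD_CHARS} characters")
    return password
```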

[+] educar|10 years ago|reply

> One more interesting tidbit is if you use unique sub-addresses for each of the sites you sign up to you will be able to see when someone, or rather who, sells your email to someone else... Busted!

Can't the spammers simply strip the subaddress/label after '+' ?

[+] 51Cards|10 years ago|reply

Exactly. This technique, while noble in intent, is very easy to subvert. I tried subaddresses for over a year and still found spam coming in without the tag. I don't know if others have had more success with it but I haven't noticed a difference. I'm pretty sure it's just being stripped.

Perhaps the reverse is more effective. Use a subaddress for all your mail and ignore anything that arrives without a label.

[+] kps|10 years ago|reply

No, they can't. Or rather, they can, but they'll be wrong. The ‘+’ is not a feature of email addresses; it doesn't mean anything different from any other character. It's just a feature of some destination systems that they ignore everything from ‘+’ when assigning incoming mail to accounts.

Generally, that's configurable; e.g. in postfix it's set by the optional |recipient_delimiter|. Suppose it's set to ‘-’, and you, John Public, sign up for SuspiciousService using the email address <[email protected]>. Normally, the mail gets delivered to your 'john+public' mailbox. But if Mr Spammer strips everything from ‘+’, he sends to <[email protected]>. And your system knows that that is not only not a valid local mailbox, but also that it matches a plus-stripped address, and therefore consigns the message and its sender to the fiery pits of hell.
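
The scenario above can be walked through in a few lines. This is a deliberately simplified model, not how Postfix is implemented: the `mailbox_for` helper, the mailbox set, and the sample local parts are all hypothetical.

```python
def mailbox_for(local_part: str, delimiter: str) -> str:
    """Return the mailbox name: everything before the first delimiter."""
    mailbox, _, _extension = local_part.partition(delimiter)
    return mailbox


# Hypothetical setup: the only real mailbox is "john+public", and the
# server's delimiter is "-" rather than "+".
MAILBOXES = {"john+public"}
DELIMITER = "-"

signup_local = "john+public-suspiciousservice"  # local part given to the service
stripped_local = signup_local.split("+", 1)[0]  # what a "+"-stripping spammer sends to

print(mailbox_for(signup_local, DELIMITER) in MAILBOXES)    # True: delivered
print(mailbox_for(stripped_local, DELIMITER) in MAILBOXES)  # False: "john" is not a mailbox
```

The sender cannot tell from the address alone which character, if any, the destination treats as a delimiter, which is why blind stripping can produce an undeliverable address.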

[+] voidz|10 years ago|reply

It doesn't have to be +. It can be any character you configure. Recent postfix versions even support multiple characters.

[+] ortusdux|10 years ago|reply

I went a step further and bought a .com that I set up with a catchall email forwarder. Most signups get a custom email at that domain. I still get Russian bride emails sent to oilchangeplace@

[+] MatthaeusHarris|10 years ago|reply

They can.

They don't.

It's extra effort for them for nearly zero marginal gain.

[+] herge|10 years ago|reply

I once heard the story of a man who helped Aruba set up their DNS (.aw) in the late 90's. In exchange, as part of his compensation, he asked for an email address at the top-level domain, and received something like js@aw, which is a perfectly functional email address, but trips up a lot of validators.

[+] marcosdumay|10 years ago|reply

It isn't a valid address. TLDs must not resolve, so it should be impossible to make a server handle it (yet, it is mostly possible, because most DNS servers do not completely implement the RFCs - still, there's no guarantee it will work on every network).

[+] vostok|10 years ago|reply

I heard this story about Ian Goldberg, Anguilla, and the obvious email address.

[+] hugi|10 years ago|reply

Dick move, though.

[+] mcv|10 years ago|reply

If you were to ask me for a regex, I'd say /.+@.+/.

That's the easiest and most accurate way to do it by regex. Sure, some invalid addresses may still get accepted, but that is unavoidable. Even the most thorough validation[0] is going to accept nonexistent addresses.

[0] Except those that validate by sending a mail to it. Sending an email is the only way to be sure.

[+] Freak_NL|10 years ago|reply

That is the only correct answer.

    .+@.+

In human language: an email address consists of something, an @, and something.

That's it. The name part can be anything; so don't validate it beyond having a character.

We don't have one-character domains yet, but there is no reason to exclude them. It is already possible to arrange your own top-level domain if your pockets are deep enough (.google), so don't be surprised if some megacorp starts using a top-level domain without any subdomains. Don't validate the domain part beyond having a character.
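
For reference, here is the "something, an @, and something" check as a small Python sketch. The `looks_like_email` name is an assumption; the pattern is the one quoted above, anchored to the whole string with `fullmatch`.

```python
import re

# "Something, an @, and something", applied to the whole string.
SANITY = re.compile(r".+@.+")


def looks_like_email(text: str) -> bool:
    """Cheap sanity check only; it says nothing about deliverability."""
    return SANITY.fullmatch(text) is not None


print(looks_like_email("js@aw"))          # True: dotless domains pass
print(looks_like_email("one@two@three"))  # True: extra @ signs pass too
print(looks_like_email("@example.com"))   # False: nothing before the @
print(looks_like_email("nobody"))         # False: no @ at all
```

Accepting strings with several `@` signs is a consequence of keeping the check this loose; whether the address works is left to the confirmation email.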

[+] userbinator|10 years ago|reply

Sending an email is the only way to be sure.

Absolutely this. The check for an '@' and something before/after it is for sanity, and anything beyond that would involve actually trying to use the address.

[+] s_kilk|10 years ago|reply

> If you were to ask me for a regex, I'd say /.+@.+/

How about "one@two@three@[email protected]"?

[+] LinuxBender|10 years ago|reply

    # get email addresses
    grep -Eio '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'

    # censor email addresses
    sed -r 's/(<)[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,4}(>)/\1--removed--\2/g'

If email doesn't meet those, I drop them on the floor. Then again, I drop email on the floor for lesser reasons.

[+] haddr|10 years ago|reply

Let's recall the famous quote from Jamie Zawinski: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

I ignored this quote for a long time, until I started using regular expressions in real projects...

[+] drivingmenuts|10 years ago|reply

I never really learned how to write regex. I can write simple ones with the use of a tool to help me figure out what I need to write (and a cheatsheet to explain the terms).

I realize they could save me some future potential grief, but I'm usually more concerned with the present actual grief they cause me.

I feel like I'm letting down the side.

[+] pbreit|10 years ago|reply

So many routing solutions use regexes. Is that really necessary?

[+] jukenim|10 years ago|reply

This checks whether or not an email address follows RFC 5322 via parsing vs via regex: https://github.com/jackbowman/email-addresses

[+] kps|10 years ago|reply

Yes! The title is misleading; it's not at all difficult to syntactically validate an email address; it's just not possible using regular expressions (HE COMES).

[+] jjp|10 years ago|reply

The article's title would be more complete if it said "by regex".

[+] ketralnis|10 years ago|reply

That's what the article describes, but it's also hard to validate an email address by sending to it if you want time bounds. My mail server implements greylisting, and it's frequently difficult for me to verify my email address on services that send me tokens that expire. Greylisting typically delays mail by only 10 minutes or so, but there are plenty of times that mail servers can be down for extended periods, or a quota/disk is full, or an intermediate mail router is down, or any number of other problems.
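
If delivery can be delayed by hours or days, one mitigation is a confirmation token whose lifetime is measured in days rather than minutes. The following is a rough sketch under stated assumptions: the secret, the three-day lifetime, the token layout, and both function names are hypothetical, and real deployments would also want rate limiting and one-time use.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"replace-with-a-real-secret"  # hypothetical server-side key
TOKEN_TTL_SECONDS = 3 * 24 * 60 * 60    # hypothetical: three days, not minutes


def _sign(address: str, issued_text: str) -> str:
    message = f"{address}|{issued_text}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def make_token(address: str, now: Optional[float] = None) -> str:
    """Return "<issued_at>.<signature>" for the given address."""
    issued_text = str(int(time.time() if now is None else now))
    return f"{issued_text}.{_sign(address, issued_text)}"


def token_is_valid(address: str, token: str, now: Optional[float] = None) -> bool:
    """Check the signature, then check that the token is within its lifetime."""
    issued_text, _, signature = token.partition(".")
    try:
        issued_at = int(issued_text)
    except ValueError:
        return False
    expected = _sign(address, issued_text)
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    current = time.time() if now is None else now
    return current - issued_at <= TOKEN_TTL_SECONDS
```

Because the token carries its own signed timestamp, the server needs no stored state to accept a confirmation that arrives a day late.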

[+] Kequc|10 years ago|reply

It seems there are weird things you can use in an email address that nobody does; as a result, what is used and considered to be an email address has matured. If you create an email address that is weird, in practice you'll be less capable of using it.

The weirder it is the fewer web forms or software you'll successfully put it into.

I think we can just say no, functionally, you cannot put comments or additional @ symbols into your email address. It hasn't worked for long enough that people know you just aren't supposed to do it. I'd be surprised if you were allowed to create such a thing signing up for Bing, for example. You probably need to be the administrator of some chaotic UNIX server with full DNS in order to force it to happen at this point.

Even Google Chrome's built-in email field validation doesn't allow you to do it.

I shouldn't be expected to jump through the hoops necessary in order to allow "technically valid" email addresses that someone went out of their way to make, when I could more easily suggest they use a normal one.

[+] Spivak|10 years ago|reply

> I shouldn't be expected to jump through the hoops necessary in order to allow "technically valid" email addresses that someone went out of their way to make, when I could more easily suggest they use a normal one.

I really hope you don't work on anything important if your stance is, "I shouldn't be expected to implement specifications correctly because it's easier to only implement part of it."

Why are we even having this discussion? Implement it correctly once, put it in a library and never worry about it again. You don't have to jump through any hoops; you're only making more work for yourself by implementing the standard incorrectly and then having to deal with customers that think that the IETF standard is more valid than your personal definition of what an email address should be.

If your code is passed down the line and eventually hits someone who writes unit tests for actual valid email addresses then your name is going to come up on the git blame when it fails.

[+] steventhedev|10 years ago|reply

It's easy to validate that the syntax is correct. The problem lies in what you're trying to do with those addresses. If you're importing a mailing list archive, chances are a syntactical check is the only one you can do, because half the domains for older lists don't exist anymore, and most of the mailboxes won't.

If you want to send email to that address, you're probably going to want something that can suggest gmail as a replacement for gmial. You can also check that the domain exists and has a MX record. If you run your own mail server you can probably even check that the mailbox exists...

If you want emails to be unique, you'll need to apply per-site logic like Gmail's optional dots and strip the + segments. That's important if you're combining multiple lists of emails, or importing an existing mailing list for a user.

The gist is the real world is complicated, but you can pretty easily set up something that handles 90% of it.
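
The per-site uniqueness logic mentioned above might look like the sketch below. The `dedup_key` name is an assumption, and the rule is applied only to gmail.com: for other domains the local part is left alone, since the receiving system alone decides what dots and `+` mean there.

```python
def dedup_key(address: str) -> str:
    """Return a key for comparing addresses; keep sending to the original."""
    local, _, domain = address.rpartition("@")
    domain = domain.lower()  # domain names are case-insensitive
    if domain == "gmail.com":
        local = local.split("+", 1)[0]  # drop the +tag
        local = local.replace(".", "")  # Gmail ignores dots in the local part
        local = local.lower()           # and ignores case
    return f"{local}@{domain}"


print(dedup_key("John.Public+news@Gmail.com"))  # johnpublic@gmail.com
print(dedup_key("john+public@example.com"))     # john+public@example.com
```

The key is only for detecting duplicates; mail should still go to the address exactly as the user entered it.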

[+] vostok|10 years ago|reply

Wouldn't it be reasonable to have a sanity check that can be bypassed by the user? It is very likely to be a mistake if there's no full stop in the address, but there are exceptions [0]. I would like to see a warning if I accidentally type vostok@examplecom instead of [email protected].

[0] https://mail.gnome.org/archives/evolution-list/2002-January/...

[+] walterstucco|10 years ago|reply

Your first example is a valid email.

[+] drdeadringer|10 years ago|reply

I remember reading the opinion that one need only verify that an email address is an email address by answering, "Does it have an '@'?". Yes? Email address. No? Try again.

Perhaps it is nice to ask a question one step above the email verification: How much responsibility does/should the user have, and how much responsibility does/should the designer have, in ensuring that the user's email address is valid?

[+] coldtea|10 years ago|reply

With a regex/parser maybe -- but it's very easy to require "activation" from said email address as a verification.

[+] dhoerl|10 years ago|reply

I added a comment to the author's article. You can construct a regex to process every valid email address except those with nested comments (a feature no one in the real world ever used): https://github.com/dhoerl/EmailAddressFinder

[+] leesalminen|10 years ago|reply

I've been using Mailgun's validator [0] for a while now and have been quite pleased. It catches common typos and validates DNS on the host name.

[0] https://documentation.mailgun.com/api-email-validation.html

[+] nradov|10 years ago|reply

For anyone who would like to test their email address validation code, I wrote a fuzzer which can generate syntactically valid addresses (among other things).

https://github.com/nradov/abnffuzzer

[+] kristianp|10 years ago|reply

Is there ABNF for email addresses in an RFC?

[+] X86BSD|10 years ago|reply

This brought me back to how Postfix has been handling this all these years.

http://www.postfix.org/ADDRESS_VERIFICATION_README.html

[+] voidz|10 years ago|reply

I think that spaces are also valid in email addresses. So, even <bilbo [email protected]> would be a valid email address in that case...

[+] marcosdumay|10 years ago|reply

They are valid within quotes, so that <"b b"@example.com> is valid, but <b [email protected]> isn't.

Almost anything is valid within quotes, but quotes can not appear everywhere.
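
The quoted versus unquoted distinction can be illustrated with a simplified check of the local part only. This is a sketch, not a full RFC 5322 parser: it ignores comments, folding whitespace, and the obsolete forms, and the `local_part_ok` name is an assumption.

```python
import re

# Characters allowed in an unquoted ("dot-atom") local part, per RFC 5322 atext.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")
# Simplified quoted-string: printable ASCII or space between the quotes,
# where a backslash or double quote may appear only as a backslash escape.
QUOTED = re.compile(r'"(?:[ !#-\[\]-~]|\\[ -~])*"')


def local_part_ok(local: str) -> bool:
    """True if the local part is a plain dot-atom or a whole quoted string."""
    return bool(DOT_ATOM.fullmatch(local) or QUOTED.fullmatch(local))


print(local_part_ok('"b b"'))  # True: the space is inside quotes
print(local_part_ok("b b"))    # False: a bare space is not allowed
print(local_part_ok('jo"hn'))  # False: a quote in the middle of an unquoted part
```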

67 comments