I believe it has something to do with security: when browsers first added Unicode URL support, there were issues with hackers and spammers using blank and lookalike Unicode characters to trick people into visiting shady domains.
It looks like line noise. Funny, though: I can read regexps better than I can read formal semantics. Having tried earlier this evening to read http://matt.might.net/papers/might2007diss.pdf, regexps are refreshing :)
What is wrong with URI.parse in the stdlib? The article's regex goes beyond URL validation to pick up URL-ish things in free text (e.g., in "look at http://goo.com/bat, lovely" the comma would actually be part of a valid URL, but the regex tries to detect and drop it). For plain URL validation, though, Ruby's stdlib is enough, I believe.
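The trailing-comma problem is easy to reproduce; a minimal sketch (these are hypothetical patterns, not the article's actual regex):

```python
import re

text = 'look at http://goo.com/bat, lovely'

# A naive extractor grabs the trailing comma along with the URL...
naive = re.search(r"https?://\S+", text)

# ...while a Gruber-style pattern requires the match to end on a
# non-punctuation character, so the comma is left out.
polite = re.search(r"https?://\S+[^\s,.;!?]", text)
```

Here `naive.group()` ends in the comma, while `polite.group()` is the bare URL.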
An RFC-822-compliant regex is listed at http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html, but if anything, it's a strong argument for using real parsing tools. Regexes don't handle recursion and balanced delimiters very well.
For validating emails, I've settled on /.@./, or if you really want to push for valid emails, /.@[^.]+\../. (Note the lack of anchoring to the beginning or end.) (That, and some limit on length.)
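A quick sketch of those two checks in Python (unanchored, as noted):

```python
import re

# Minimal email sanity check: some character before and after the "@"...
loose = re.compile(r".@.")
# ...and, stricter, a dot (with something around it) after the "@" part.
stricter = re.compile(r".@[^.]+\..")

assert loose.search("a@b")
assert stricter.search("user@example.com")
assert stricter.search("user@com") is None  # no dot after the "@"
```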
What would be the Python equivalent of the [:punct:] character class? I don't think the re module supports POSIX classes. I guess they'd have to be spelled out, pretty much?
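Correct: Python's re module has no POSIX character classes, so spelling it out is the usual workaround, e.g. via string.punctuation (note this covers ASCII only, unlike PCRE's Unicode-aware [:punct:]):

```python
import re
import string

# Build an ASCII [:punct:]-equivalent character class from string.punctuation.
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")

assert PUNCT.match("!")
assert PUNCT.match("a") is None
```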
It doesn't work with standard permalinks that feature hyphens in the URL, and none of his examples show links with hyphens. Most blogs out there (WordPress) use hyphens in their permalinks.
bootload | 16 years ago
There are problems I've seen with using Regex strings and expecting them to work in all cases on all Regex engines which is why I tend to stick with PCRE ~ http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Express... a point in favour of the Gruber example.
"... The pattern is also liberal about Unicode glyphs within the URL ..."
PCRE supports Unicode but it's not switched on by default ~ http://www.pcre.org/pcre.txt
nanotone | 16 years ago
On the penultimate paragraph, and somewhat of a tangent:
Wow, I'd completely forgotten that you could have Unicode in domain names, and I suspect a lot of people don't think about it very much either. In my limited experience, even Chinese-only websites rarely stray from normal alphanumeric domains, even though the people visiting those sites could easily type out URLs with Chinese glyphs.
Perhaps I'm missing something here, but it seems that with good alphanumeric domains becoming less available, cool/clever/classy Unicode domains could be a viable alternative, given an appropriate purpose -- Google would probably not want one -- and a techie enough audience. When [for which sites?] and how often do people actually type URLs?
Example: a friend of mine did a cheeky web branding project a while ago named "Heart Star Heart"... ♥★♥.com would have been perfect.
EDIT: I should probably do more research on this myself, but it looks like there's some mysterious isomorphism between Unicode domains and "normal" domains. Firefox renders U+272A in http://✪df.ws/e7m correctly but changes its text to http://xn--df-oiy.ws/e7m and when I access ♥★♥.com my ISP complains that xn--p3hxmb.com doesn't exist. Anybody know what the isomorphism actually is?
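The "isomorphism" asked about above is Punycode (RFC 3492): IDNA encodes each Unicode label to ASCII and prefixes it with "xn--". A small sketch using Python's built-in codec, checked against the exact conversions quoted above:

```python
# IDNA converts each Unicode domain label to Punycode (RFC 3492) and
# prefixes the result with "xn--"; the mapping is fully reversible.
assert "✪df".encode("punycode") == b"df-oiy"    # -> xn--df-oiy.ws
assert "♥★♥".encode("punycode") == b"p3hxmb"    # -> xn--p3hxmb.com
assert b"df-oiy".decode("punycode") == "✪df"
```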
natrius | 16 years ago
showerst | 16 years ago
Related: http://www.mozilla.org/security/announce/2009/mfsa2009-50.ht...
I'm not 100% sure of this though, hopefully someone more knowledgeable can chime in.
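That's the gist of it: homograph attacks exploit distinct code points that render identically. A tiny illustration:

```python
import unicodedata

# Latin "a" (U+0061) and Cyrillic "а" (U+0430) usually look identical,
# but are different characters, hence different (punycoded) domains.
latin, lookalike = "a", "\u0430"

assert latin != lookalike
assert unicodedata.name(lookalike) == "CYRILLIC SMALL LETTER A"
```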
sorbits | 16 years ago
That said, non-ASCII URLs suck because not everyone can type them. Imagine being a tourist in Tokyo who has to lookup a restaurant on your laptop or having to lookup the product page for this gadget you bought in China…
ehsanul | 16 years ago
/^(https?):\/\/((?:[a-z0-9.\-]|%[0-9A-F]{2}){3,})(?::(\d+))?((?:\/(?:[a-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)(?:\?((?:[a-z0-9\-._~!$&'()*+,;=:\/?@]|%[0-9A-F]{2})*))?(?:#((?:[a-z0-9\-._~!$&'()*+,;=:\/?@]|%[0-9A-F]{2})*))?$/i
Yeah, more involved. Though it parses the URL into its component parts, and it does work.
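For reference, a transcription of that pattern into Python. The * quantifiers appear to have been eaten by comment markup, so restoring them is an assumption:

```python
import re

# ehsanul's URL pattern, transcribed to Python syntax; the "*" quantifiers
# lost to markup are restored here (an assumption about the original).
URL_RE = re.compile(
    r"^(https?)://"                                              # 1: scheme
    r"((?:[a-z0-9.\-]|%[0-9A-F]{2}){3,})"                        # 2: host
    r"(?::(\d+))?"                                               # 3: port
    r"((?:/(?:[a-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)"      # 4: path
    r"(?:\?((?:[a-z0-9\-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*))?"   # 5: query
    r"(?:#((?:[a-z0-9\-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*))?$",  # 6: fragment
    re.IGNORECASE,
)

m = URL_RE.match("http://example.com:8080/a/b?q=1#frag")
```

Matching that example yields scheme, host, port, path, query, and fragment as separate groups.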
wingo | 16 years ago
riffraff | 16 years ago
durin42 | 16 years ago
http://hg.adium.im/adium-1.4/file/542aa252713b/Frameworks/Au...
is the lexing part, and then there are other files in the same directory that do other little bits. The whole hyperlinks framework is under a BSD license.
philfreo | 16 years ago
HN does it right, but Gruber's example seems to put the period in the URL.
tsetse-fly | 16 years ago
http://en.wikipedia.org/wiki/O3b_Networks,_Ltd.
HN does it wrong. There are exceptions either way, I wouldn't say that one is more correct.
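The ambiguity is easy to demonstrate: any heuristic that refuses to end a match on punctuation mangles the Wikipedia link above (a sketch with a hypothetical pattern, not HN's actual code):

```python
import re

text = "see http://en.wikipedia.org/wiki/O3b_Networks,_Ltd. for details"

# A pattern that refuses to end on punctuation drops the final period,
# even though here it is part of the article title.
m = re.search(r"https?://\S+[^\s,.;!?]", text)
```

The match loses the trailing "." and so points at a different page title.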
notauser | 16 years ago
For e-mails I use:
And for twitter user names I use: (which incorrectly matches @@username as @username, but I assume that kind of thing is a typo - the important thing is not to match e-mail addresses)
silentbicycle | 16 years ago
jerf | 16 years ago
The rules are so flipping complicated and so easy to get wrong that you're better off just trying to send a mail and seeing what happens, and asking the recipient to validate reception if you care about the address. Is it really that important to exclude bad emails, at the cost of, say, blocking email addresses from the UK, as your regex seems to do? Even "validating" for sheer user error is only useful if you get it right.
blasdel | 16 years ago
Why bother trying to validate the domain lexically, when you can just try resolving it?
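A minimal sketch of that suggestion using the stdlib (the function name is mine):

```python
import socket

def domain_resolves(domain: str) -> bool:
    """Return True if the domain currently has a DNS entry."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        return False

assert domain_resolves("localhost")
assert not domain_resolves("no-such-host.invalid")  # reserved TLD, never resolves
```

The trade-off: this costs a network round-trip and rejects valid domains whose DNS is merely flaky at the moment.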
kprobst | 16 years ago
kingkilr | 16 years ago
techiferous | 16 years ago
(Note: you can't plug-n-play this middleware yet--still a coupla bugs. Will fix soon.)
whalesalad | 16 years ago
doulaimi | 16 years ago
[deleted]
DanBlake | 16 years ago