I believe it has something to do with security: when browsers first added Unicode URL support, there were issues with hackers and spammers using blank and lookalike Unicode characters to trick people into visiting shady domains.
It looks like line noise. Funny, though: I can read regexps better than I can read formal semantics. Having tried earlier this evening to read http://matt.might.net/papers/might2007diss.pdf, regexps are refreshing :)
What is wrong with URI.parse in the stdlib? The article's regex goes beyond URL validation to pick up URL-ish things in free text (e.g., in "look at http://goo.com/bat, lovely" the comma would actually be part of a valid URL, but the regex tries to detect and drop it). For plain URL validation, though, Ruby's stdlib is enough, I believe.
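The trailing-comma problem is easy to reproduce; a minimal sketch (these are hypothetical patterns, not the article's actual regex):

```python
import re

text = 'look at http://goo.com/bat, lovely'

# A naive extractor grabs the trailing comma along with the URL...
naive = re.search(r"https?://\S+", text)

# ...while a Gruber-style pattern requires the match to end on a
# non-punctuation character, so the comma is left out.
polite = re.search(r"https?://\S+[^\s,.;!?]", text)
```

Here `naive.group()` ends in the comma, while `polite.group()` is the bare URL.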
An RFC-822-compliant regex is listed at http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html, but if anything, it's a strong argument for using real parsing tools. Regexes don't handle recursion and balanced delimiters very well.
For validating emails, I've settled on /.@./, or if you really want to push for valid emails, /.@[^.]+\../. (Note the lack of anchoring to the beginning or end.) (That, and some limit on length.)
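A quick sketch of those two checks in Python (unanchored, as noted):

```python
import re

# Minimal email sanity check: some character before and after the "@"...
loose = re.compile(r".@.")
# ...and, stricter, a dot (with something around it) after the "@" part.
stricter = re.compile(r".@[^.]+\..")

assert loose.search("a@b")
assert stricter.search("user@example.com")
assert stricter.search("user@com") is None  # no dot after the "@"
```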
What would be the Python equivalent of the [:punct:] character class? I don't think the re module supports POSIX classes. I guess they'd have to be spelled out, pretty much?
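Correct: Python's re module has no POSIX character classes, so spelling it out is the usual workaround, e.g. via string.punctuation (note this covers ASCII only, unlike PCRE's Unicode-aware [:punct:]):

```python
import re
import string

# Build an ASCII [:punct:]-equivalent character class from string.punctuation.
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")

assert PUNCT.match("!")
assert PUNCT.match("a") is None
```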
It doesn't work with standard permalinks that feature hyphens in the URL, and none of his examples show links with hyphens. Most blogs out there (WordPress) use hyphens in their permalinks.
bootload | 16 years ago
There are problems I've seen with using Regex strings and expecting them to work in all cases on all Regex engines which is why I tend to stick with PCRE ~ http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Express... a point in favour of the Gruber example.
"... The pattern is also liberal about Unicode glyphs within the URL ..."
PCRE supports Unicode but it's not switched on by default ~ http://www.pcre.org/pcre.txt
nanotone | 16 years ago
On the penultimate paragraph, and somewhat of a tangent:
Wow, I'd completely forgotten that you could have Unicode in domain names, and I suspect a lot of people don't think about it very much either. In my limited experience, even Chinese-only websites rarely stray from normal alphanumeric domains, even though the people visiting those sites could easily type out URLs with Chinese glyphs.
Perhaps I'm missing something here, but it seems that with good alphanumeric domains becoming less available, cool/clever/classy Unicode domains could be a viable alternative, given an appropriate purpose -- Google would probably not want one -- and a techie enough audience. When [for which sites?] and how often do people actually type URLs?
Example: a friend of mine did a cheeky web branding project a while ago named "Heart Star Heart"... ♥★♥.com would have been perfect.
EDIT: I should probably do more research on this myself, but it looks like there's some mysterious isomorphism between Unicode domains and "normal" domains. Firefox renders U+272A in http://✪df.ws/e7m correctly but changes its text to http://xn--df-oiy.ws/e7m and when I access ♥★♥.com my ISP complains that xn--p3hxmb.com doesn't exist. Anybody know what the isomorphism actually is?
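The "isomorphism" asked about above is Punycode (RFC 3492): IDNA encodes each Unicode label to ASCII and prefixes it with "xn--". A small sketch using Python's built-in codec, checked against the exact conversions quoted above:

```python
# IDNA converts each Unicode domain label to Punycode (RFC 3492) and
# prefixes the result with "xn--"; the mapping is fully reversible.
assert "✪df".encode("punycode") == b"df-oiy"    # -> xn--df-oiy.ws
assert "♥★♥".encode("punycode") == b"p3hxmb"    # -> xn--p3hxmb.com
assert b"df-oiy".decode("punycode") == "✪df"
```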
natrius | 16 years ago
showerst | 16 years ago
Related: http://www.mozilla.org/security/announce/2009/mfsa2009-50.ht...
I'm not 100% sure of this though, hopefully someone more knowledgeable can chime in.
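That's the gist of it: homograph attacks exploit distinct code points that render identically. A tiny illustration:

```python
import unicodedata

# Latin "a" (U+0061) and Cyrillic "а" (U+0430) usually look identical,
# but are different characters, hence different (punycoded) domains.
latin, lookalike = "a", "\u0430"

assert latin != lookalike
assert unicodedata.name(lookalike) == "CYRILLIC SMALL LETTER A"
```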
sorbits | 16 years ago
That said, non-ASCII URLs suck because not everyone can type them. Imagine being a tourist in Tokyo who has to lookup a restaurant on your laptop or having to lookup the product page for this gadget you bought in China…
ehsanul | 16 years ago
/^(https?):\/\/((?:[a-z0-9.\-]|%[0-9A-F]{2}){3,})(?::(\d+))?((?:\/(?:[a-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)(?:\?((?:[a-z0-9\-._~!$&'()*+,;=:\/?@]|%[0-9A-F]{2})*))?(?:#((?:[a-z0-9\-._~!$&'()*+,;=:\/?@]|%[0-9A-F]{2})*))?$/i
Yeah, more involved. Though it parses the URL into its component parts, and it does work.
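For reference, a transcription of that pattern into Python. The * quantifiers appear to have been eaten by comment markup, so restoring them is an assumption:

```python
import re

# ehsanul's URL pattern, transcribed to Python syntax; the "*" quantifiers
# lost to markup are restored here (an assumption about the original).
URL_RE = re.compile(
    r"^(https?)://"                                              # 1: scheme
    r"((?:[a-z0-9.\-]|%[0-9A-F]{2}){3,})"                        # 2: host
    r"(?::(\d+))?"                                               # 3: port
    r"((?:/(?:[a-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)"      # 4: path
    r"(?:\?((?:[a-z0-9\-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*))?"   # 5: query
    r"(?:#((?:[a-z0-9\-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*))?$",  # 6: fragment
    re.IGNORECASE,
)

m = URL_RE.match("http://example.com:8080/a/b?q=1#frag")
```

Matching that example yields scheme, host, port, path, query, and fragment as separate groups.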
wingo | 16 years ago
riffraff | 16 years ago
durin42 | 16 years ago
http://hg.adium.im/adium-1.4/file/542aa252713b/Frameworks/Au...
is the lexing part, and then there are other files in the same directory that do other little bits. The whole hyperlinks framework is under a BSD license.
philfreo | 16 years ago
HN does it right, but Gruber's example seems to put the period in the URL.
tsetse-fly | 16 years ago
http://en.wikipedia.org/wiki/O3b_Networks,_Ltd.
HN does it wrong. There are exceptions either way, I wouldn't say that one is more correct.
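The ambiguity is easy to demonstrate: any heuristic that refuses to end a match on punctuation mangles the Wikipedia link above (a sketch with a hypothetical pattern, not HN's actual code):

```python
import re

text = "see http://en.wikipedia.org/wiki/O3b_Networks,_Ltd. for details"

# A pattern that refuses to end on punctuation drops the final period,
# even though here it is part of the article title.
m = re.search(r"https?://\S+[^\s,.;!?]", text)
```

The match loses the trailing "." and so points at a different page title.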
notauser | 16 years ago
For e-mails I use:
And for twitter user names I use: (which incorrectly matches @@username as @username, but I assume that kind of thing is a typo - the important thing is not to match e-mail addresses)
silentbicycle | 16 years ago
jerf | 16 years ago
The rules are so flipping complicated and so easy to get wrong that you're better off just trying to send a mail and seeing what happens, and asking the recipient to validate reception if you care about the address. Is it really that important to exclude bad emails, at the cost of, say, blocking email addresses from the UK, as your regex seems to do? Even "validating" for sheer user error is only useful if you get it right.
blasdel | 16 years ago
Why bother trying to validate the domain lexically, when you can just try resolving it?
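A minimal sketch of that suggestion using the stdlib (the function name is mine):

```python
import socket

def domain_resolves(domain: str) -> bool:
    """Return True if the domain currently has a DNS entry."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        return False

assert domain_resolves("localhost")
assert not domain_resolves("no-such-host.invalid")  # reserved TLD, never resolves
```

The trade-off: this costs a network round-trip and rejects valid domains whose DNS is merely flaky at the moment.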
kprobst | 16 years ago
kingkilr | 16 years ago
techiferous | 16 years ago
(Note: you can't plug-n-play this middleware yet--still a coupla bugs. Will fix soon.)
whalesalad | 16 years ago
doulaimi | 16 years ago
[deleted]
DanBlake | 16 years ago