top | item 1487695

Stackoverflow, HTML by Regex, topmost answer

105 points| s2r2 | 15 years ago |stackoverflow.com | reply

37 comments

order
[+] jgrahamc|15 years ago|reply
First: please don't editorialize in the title. The "Got to see this" smacks of other places I will not mention.

Second: this is not at all interesting. The person asks a sensible question and then gets some ridiculous replies.

Third: it made me remember my spat with ESR about HTML parsing: http://news.ycombinator.com/item?id=923775 and now I feel sad.

[+] pclark|15 years ago|reply
I really dislike all this arm chair moderation of hacker news, I like the occasional commentary in submission titles, and I also like the fact the community decides what is or isn't interesting. Like the ~60 people who found this article interesting.
[+] euroclydon|15 years ago|reply
Some person explained that HTML is a Chomsky type 2 grammar and regular expressions are a Chomsky type 3 grammar, and provided this link: http://en.wikipedia.org/wiki/Chomsky_hierarchy

Can anyone here provide a link that makes the discussion of these typed grammars available to laymen?

[+] bad_user|15 years ago|reply
Oh, I don't know if reducing the grammars to Chomsky is really necessary.

Regular expressions, in their original version, are equivalent to Finite-stage Machines (i.e. ... regular grammars, no recursion, no stack, no memory further than keeping the current state). You can't describe the rules of HTML with a FSM.

Perl's regular expressions contain various enhancements. Newer versions of Perl's regexes also contain direct support for recursion (but frankly, you can't call those "regular expressions" anymore).

So ... if your regex library has recursion support, then you can parse HTML (since with recursion you can parse context-free / Chomsky type-2 grammars). If it doesn't support recursion, then you can't.

Btw ... the equivalent for a context-free grammar would be a Push-down Automaton ... http://en.wikipedia.org/wiki/Pushdown_automaton , which is a FSM + a stack.

[+] man1sh|15 years ago|reply
I think this is pretty old and has been discussed everywhere many times. Check the date of the question/answer too.
[+] albertzeyer|15 years ago|reply
Ehm, I wonder a bit, the discussions always goes that HTML is not regular. The poster though asked to just match any open tags. The language of HTML tags clearly is regular, isn't it?
[+] jauco|15 years ago|reply
For one thing: <br> is valid html, and this too:

    <!DOCTYPE html>
    <html>
    <head>
       <title>I AM YOUR DOCUMENT TITLE REPLACE ME</title>
    </head>
    <body>
       <div>
    <br id="<bl>">
       </div>
    </body>
    </html>
[+] megaduck|15 years ago|reply
The language of individual HTML tags is certainly regular, and trivially easy. However, the language of "matched HTML tags with junk between them" is NOT regular.

Anything that requires balanced matching is NOT parseable with standard Regular Expressions, and by not parseable I mean that you will literally have an infinite amount of bugs. Shoot me an email and I can show you the math.

Even with Perl's whiz-bang recursive not-really-regexes-regexes, it's strongly not recommended to tackle balanced matching problems like HTML or XML. It might be theoretically possible (I haven't actually checked), but your brain will leak from your ears and you probably won't get it right, no matter how smart you are.

[+] vog|15 years ago|reply
That's what I'm wondering, too.

If it was XML, one might get in trouble with "<[CDATA[" sections, but regarding HTML, I don't see a real issue here.

... especially not from the pragmatic point of view. Depending on the use case and the quality of the HTML source, a "dirty" regex hack might be a far better solution and using a DOM parser.

[+] jvdh|15 years ago|reply
> HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain

Just reading that makes me wince.

[+] arethuza|15 years ago|reply
I'm tempted to change our bug tracker to have a bug status of "it is too late it is too late we cannot be saved".
[+] gvb|15 years ago|reply
The terse term is "overcome by events" (OBE), applied literally.
[+] buro9|15 years ago|reply
The revisions for it are pretty cool too: http://stackoverflow.com/posts/1732454/revisions

Someone slightly not getting the joke edited it out on the basis of it being troll/rambling, then someone put it all back. The nice bit... the actual point is emphasised as a result.

[+] ars|15 years ago|reply
What is necessary to make regex turing complete?
[+] earcar|15 years ago|reply
Got to read that in a Cylon Hybrid or GLaDOS voice.
[+] ilkhd2|15 years ago|reply
Anybody can suggests fiction authors who write in this style - sane text morphing into gibberish and probably back to sanity?
[+] bigiain|15 years ago|reply
Try a book called "Random Acts of Senseless Violence" by Jack Womack. The text doesn't morph into gibberish, but the lead character transforms from entitled middle class to borderline psychopath street kid, and the way the language changes as the story progresses is _wonderful!_
[+] ivank|15 years ago|reply
Mark Z. Danielewski's House of Leaves
[+] wlievens|15 years ago|reply
It reminded me mostly of the gibberish from the Cylon hybrids in Battlestar Galatica.
[+] kurumo|15 years ago|reply
Robert Anton Wilson

Schrödinger's Cat trilogy especially

[+] nuxi|15 years ago|reply
Hmm, Joyce's Finnegans Wake? You did say "probably back".