Stackoverflow, HTML by Regex, topmost answer

[+] jgrahamc|15 years ago|reply

First: please don't editorialize in the title. The "Got to see this" smacks of other places I will not mention.

Second: this is not at all interesting. The person asks a sensible question and then gets some ridiculous replies.

Third: it made me remember my spat with ESR about HTML parsing: http://news.ycombinator.com/item?id=923775 and now I feel sad.

[+] pclark|15 years ago|reply

I really dislike all this arm chair moderation of hacker news, I like the occasional commentary in submission titles, and I also like the fact the community decides what is or isn't interesting. Like the ~60 people who found this article interesting.

[+] unknown|15 years ago|reply

[deleted]

[+] viraptor|15 years ago|reply

[deleted]

[+] euroclydon|15 years ago|reply

Some person explained that HTML is a Chomsky type 2 grammar and regular expressions are a Chomsky type 3 grammar, and provided this link: http://en.wikipedia.org/wiki/Chomsky_hierarchy

Can anyone here provide a link that makes the discussion of these typed grammars available to laymen?

[+] bad_user|15 years ago|reply

Oh, I don't know if reducing the grammars to Chomsky is really necessary.

Regular expressions, in their original version, are equivalent to Finite-stage Machines (i.e. ... regular grammars, no recursion, no stack, no memory further than keeping the current state). You can't describe the rules of HTML with a FSM.

Perl's regular expressions contain various enhancements. Newer versions of Perl's regexes also contain direct support for recursion (but frankly, you can't call those "regular expressions" anymore).

So ... if your regex library has recursion support, then you can parse HTML (since with recursion you can parse context-free / Chomsky type-2 grammars). If it doesn't support recursion, then you can't.

Btw ... the equivalent for a context-free grammar would be a Push-down Automaton ... http://en.wikipedia.org/wiki/Pushdown_automaton , which is a FSM + a stack.

[+] steveklabnik|15 years ago|reply

This answer on Reddit is pretty solid: http://www.reddit.com/r/programming/comments/cm02a/you_cant_...

Depends on your definition of 'laymen,' I guess.

Also note that PCREs actually are recursively enumerable (I think).

[+] friism|15 years ago|reply

For more discussion (of a blog-post about the answer by Atwood), see here: http://news.ycombinator.com/item?id=944673

[+] man1sh|15 years ago|reply

I think this is pretty old and has been discussed everywhere many times. Check the date of the question/answer too.

[+] unknown|15 years ago|reply

[deleted]

[+] albertzeyer|15 years ago|reply

Ehm, I wonder a bit, the discussions always goes that HTML is not regular. The poster though asked to just match any open tags. The language of HTML tags clearly is regular, isn't it?

[+] jauco|15 years ago|reply

For one thing: <br> is valid html, and this too:

    <!DOCTYPE html>
    <html>
    <head>
       <title>I AM YOUR DOCUMENT TITLE REPLACE ME</title>
    </head>
    <body>
       <div>
    <br id="<bl>">
       </div>
    </body>
    </html>

[+] megaduck|15 years ago|reply

The language of individual HTML tags is certainly regular, and trivially easy. However, the language of "matched HTML tags with junk between them" is NOT regular.

Anything that requires balanced matching is NOT parseable with standard Regular Expressions, and by not parseable I mean that you will literally have an infinite amount of bugs. Shoot me an email and I can show you the math.

Even with Perl's whiz-bang recursive not-really-regexes-regexes, it's strongly not recommended to tackle balanced matching problems like HTML or XML. It might be theoretically possible (I haven't actually checked), but your brain will leak from your ears and you probably won't get it right, no matter how smart you are.

[+] vog|15 years ago|reply

That's what I'm wondering, too.

If it was XML, one might get in trouble with "<[CDATA[" sections, but regarding HTML, I don't see a real issue here.

... especially not from the pragmatic point of view. Depending on the use case and the quality of the HTML source, a "dirty" regex hack might be a far better solution and using a DOM parser.

[+] jvdh|15 years ago|reply

> HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo͟ur eye͢s̸ ̛l̕ik͏e liquid pain

Just reading that makes me wince.

[+] arethuza|15 years ago|reply

I'm tempted to change our bug tracker to have a bug status of "it is too late it is too late we cannot be saved".

[+] gvb|15 years ago|reply

The terse term is "overcome by events" (OBE), applied literally.

[+] buro9|15 years ago|reply

The revisions for it are pretty cool too: http://stackoverflow.com/posts/1732454/revisions

Someone slightly not getting the joke edited it out on the basis of it being troll/rambling, then someone put it all back. The nice bit... the actual point is emphasised as a result.

[+] ars|15 years ago|reply

What is necessary to make regex turing complete?

[+] steveklabnik|15 years ago|reply

You need to be able to provide 'context.' I linked this above, but I'll link it to you, too: http://www.reddit.com/r/programming/comments/cm02a/you_cant_...

[+] Estragon|15 years ago|reply

I had heard that perl regular expressions are already Turing complete. Never verified it, though. E.g.

http://books.slashdot.org/comments.pl?sid=143298&cid=120...

[+] unknown|15 years ago|reply

[deleted]

[+] earcar|15 years ago|reply

Got to read that in a Cylon Hybrid or GLaDOS voice.

[+] ilkhd2|15 years ago|reply

Anybody can suggests fiction authors who write in this style - sane text morphing into gibberish and probably back to sanity?

[+] friism|15 years ago|reply

The prose is almost certainly inspired by H. P. Lovecraft. "At the Mountains of Madness" may be a good start, if you like this sort of stuff: http://www.dustylibrary.com/horror/5-at-the-mountains-of-mad...

[+] bigiain|15 years ago|reply

Try a book called "Random Acts of Senseless Violence" by Jack Womack. The text doesn't morph into gibberish, but the lead character transforms from entitled middle class to borderline psychopath street kid, and the way the language changes as the story progresses is _wonderful!_

[+] ivank|15 years ago|reply

Mark Z. Danielewski's House of Leaves

[+] Russelldb|15 years ago|reply

William S. Burroughs

[+] wlievens|15 years ago|reply

It reminded me mostly of the gibberish from the Cylon hybrids in Battlestar Galatica.

37 comments