top | item 26355006

You can't parse [X]HTML with regex (2009)

127 points| BerislavLopac | 5 years ago |stackoverflow.com | reply

130 comments

[+] taylodl|5 years ago|reply
2nd answer: "While parsing arbitrary HTML with only a regex is impossible, it's sometimes appropriate to use them for parsing a limited, known set of HTML."

This. So much this. Yes, you can't parse arbitrary, unknown XML with regex. But I don't find myself parsing arbitrary, unknown XML very often. Usually I know exactly what I'm expecting and if I can't find the information I need then it's a problem. Regex parsing is perfect for this scenario - and much, much faster. I created a regex parser for Java that even handles namespaces and relative paths. Can it parse every XML file? No - you can't parse XML with regex. But I can parse everything I need to parse - and if I can't? I can always fall back on full-featured XML parsers.
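A minimal sketch of the "limited, known HTML" case being described, in Python (the pattern and sample markup are illustrative, not this commenter's actual parser):

```python
import re

# When you control the shape of the markup, a regex extraction is
# trivial: here we pull titles out of a feed whose format we know.
snippet = '<item><title>First</title></item><item><title>Second</title></item>'

titles = re.findall(r'<title>(.*?)</title>', snippet)
print(titles)  # ['First', 'Second']
```

This falls over on nesting, comments, or CDATA, which is exactly the point: it only has to handle the input you know you will get.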

[+] dragontamer|5 years ago|reply
I found a one-pass top-down "XML parser". Not like a proper SAX parser, no... the XML had to be specially formatted, almost like TOML.

    <whatever> <--- Parser parsed this as a new section
    <foo attrib=bar> <--- All "XML" had to be one-per-line
    </whatever> <---- parser ignored this
It wasn't an "XML parser" per se; it was really just a linear one-pass parser that tricked me into thinking it was XML.

So really, it was more like TOML (or .INI files) than like XML. But I guess the advantage of making it "bastard XML" instead of TOML is that maybe this worked with XML-editors or something. I dunno...
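A hypothetical reconstruction of that line-oriented parser (the names and return structure are guesses, not the actual code): each stripped line is either a bare `<name>` that opens a section, a tag with attributes that adds an entry, or a closer that gets ignored.

```python
import re

def parse_lines(text):
    """One-pass parser for the 'bastard XML' described above: one tag
    per line, <name> opens a section, </name> is ignored, and
    <tag attr=value> lines add entries to the current section."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('</'):
            continue                        # closers are ignored
        m = re.match(r'<(\w+)\s+(.*?)>$', line)
        if m:                               # tag with attributes -> entry
            tag, attrs = m.groups()
            pairs = dict(re.findall(r'(\w+)=(\S+)', attrs))
            sections.setdefault(current, {})[tag] = pairs
            continue
        m = re.match(r'<(\w+)>$', line)
        if m:                               # bare tag -> new section
            current = m.group(1)
            sections.setdefault(current, {})
    return sections

doc = """<whatever>
<foo attrib=bar>
</whatever>"""
print(parse_lines(doc))  # {'whatever': {'foo': {'attrib': 'bar'}}}
```

Much like an .INI reader: no recursion, no tree, just a cursor for "which section am I in".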

[+] polote|5 years ago|reply
When I was working on a link parser in Python for a crawler, I had two choices:

1. use some form of regex

2. use libxml and find links

1. was faster than 2. by a factor of two

Would a link-only parser have been faster? Yeah, probably, but it's much more complex to build.
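A hedged sketch of choice 1, pulling `href` values out of raw HTML with a regex instead of building a tree with libxml (the pattern is illustrative and tolerates only the common attribute layouts):

```python
import re

# Good enough for a crawler that can shrug off the occasional odd
# attribute order; handles single or double quotes, case-insensitively.
HREF_RE = re.compile(r'<a\s[^>]*?href=["\']([^"\']+)["\']', re.IGNORECASE)

html = '<p>See <a href="https://example.com">here</a> and <A HREF=\'/about\'>about</A>.</p>'
links = HREF_RE.findall(html)
print(links)  # ['https://example.com', '/about']
```

It misses unquoted hrefs and matches inside comments, which a crawler that just needs candidate URLs to fetch can usually live with.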

[+] xurukefi|5 years ago|reply
In my experience, a regex can be a much more robust solution than an XML parser depending on the use case. Back in the days when I did some webscraping I often had parsers throwing exceptions because of invalid HTML. More often than not I switched to regular expressions, which always worked out flawlessly.
[+] chubot|5 years ago|reply
All you need to parse HTML is regular expressions (to recognize tags) and a stack (to match tags).

Your programming language has a stack -- a call stack.

So in practice all you really need is regular expressions. (Which I tend to call "regular languages" to make a distinction with Perl-style regexes [1], although they work fine too in practice for this case)

Using the call stack in a more functional style is nicer than using the OOP style that's in the Python standard library, which is probably inherited from Java, etc.

I have done this with a bunch of HTML processors for the Oil blog and doc toolchain:

https://github.com/oilshell/oil/tree/master/doctools

It works well in practice and is correct and fast.

Big caveat: this style is only for HTML that I generate myself, e.g. the blog and docs. There are a bunch of rules around matching tags in HTML5 that are subtle. Although one of the points here is that you don't have to do a full DOM-style parse and can ignore those rules for many useful cases.

The other caveat is that HTML has a bunch of rules for what happens when you see a stray < or > that isn't part of a tag. This style makes it a hard syntax error, so it's really a subset of HTML (which has no syntax errors). For my purposes that is a feature rather than a bug, basically following Postel's law.

I meant to write a blog post titled "why/when you can parse HTML with regexes" about this but didn't get around to it.

There is also a caveat where parsing arbitrary name=value pairs with regexes isn't ergonomic, because it's hard to capture a variable number of pairs. However the point is that I wrote 5 or 6 useful and compact HTML processors that don't need that. In practice when you parse HTML you often have a "fixed schema".

Concrete examples are generating a TOC from <h1>, <h2>, etc. and syntax highlighting <pre><code> blocks. Those all work great with the regex + call stack style.

[1] http://www.oilshell.org/blog/2020/07/eggex-theory.html

edit: for completeness, another caveat is that the stack-based style is particularly bad for C++/Rust and arbitrary input because you could blow the stack, although we already limited the problem to "HTML generated ourselves"
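A minimal sketch of the regex-plus-stack style described above, using an explicit list rather than the call stack and ignoring void elements like `<br>` (illustrative only, not the actual Oil doctools code):

```python
import re

# The regex recognizes tags; the stack matches them. A stray close tag
# is a hard syntax error, per the "subset of HTML" approach above.
TAG_RE = re.compile(r'<(/?)(\w+)[^>]*>')

def check_balanced(html):
    stack = []
    for m in TAG_RE.finditer(html):
        closing, name = m.group(1), m.group(2).lower()
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            raise ValueError('mismatched </%s>' % name)
    return not stack

print(check_balanced('<div><p>hi</p></div>'))  # True
print(check_balanced('<div>'))                 # False: unclosed tag
```

The TOC use case is the same loop with a counter: push on `<h1>`/`<h2>`, record the heading text, and the stack depth gives you the nesting level.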

[+] imedadel|5 years ago|reply
I've seen many people complain about StackOverflow, but this is the best example I've ever encountered.

The question: How to match

  <p>
  <a href="foo">
The answers: rants about how RegEx is not suitable for parsing entire HTML.

Only the 5th answer starts to actually answer the question.

[+] omginternets|5 years ago|reply
In all fairness, he does provide hints as to why regex will not work (namely: HTML is not a regular grammar).

Sure, it's somewhat obscured by the humorous rant, but it's not that bad an answer, either.

More to the point: I'm not sure I want to suck the humor out of everything. I agree that SO has problems, but humor and poetry are worthwhile things in otherwise serious places. It's all about quantity.

[+] chrismorgan|5 years ago|reply
“You’re asking the wrong question” is a valid response.
[+] EamonnMR|5 years ago|reply
I find this type of answer infinitely more palatable than "your question is answered here" or "comments are not for extended discussion, this conversation has been moved to chat"
[+] ajanuary|5 years ago|reply
According to the post, the more important part of the question is "what do you think", to which "I think you shouldn't, because..." is a good answer.
[+] seiferteric|5 years ago|reply
Yes, I have been down-voted and scolded for answering a question as literally described simply because others would rather assume the ignorance of the questioner. Yes I know people will often ask a question due to not understanding what they are doing, but when 10 other people have already responded with "don't do it that way!" I think it can be useful to actually answer the question as stated (if possible).
[+] armada651|5 years ago|reply
Actually, if you read his rant all the way to the end he does offer a helpful suggestion:

> Have you tried using an XML parser instead?

[+] mumblemumble|5 years ago|reply
This pops up every so often, and it sort of irritates me every time. Partially because it's overly simplistic, but, even more so, because, while it's cute and humorous, it's not actually very good advice and it doesn't actually answer the question. No, you can't parse html with regex. But go look at the question. The author is just trying to detect some tags. That's not exactly parsing.

It's true that there are some complications around things like "What if > appears in an attribute's value?" If you know your input well enough, or you don't need perfection, that might be a problem you can ignore. Alternatively, you can still use regex, if you use a sufficiently powerful regular expression tool. .NET's regular expressions, for example, have a concept of balancing groups that will let you do this.

I would also point out that a lot of open source HTML parsing libraries are even more dangerous than regular expressions for parsing unknown HTML, because they use recursive descent. Where you have recursion, you have the potential for a stack overflow. With a regex library, you do have to be careful about catastrophic backtracking, but that's at least something you can usually handle in your own code, or, in the worst case, defend against with timeouts.

A parser that's capable of blowing the call stack, and has been exposed to input from the Internet, though, is capable of taking down your process in a way you can't defend against in most languages. And it's difficult to patch up a parser like that without more-or-less rewriting it. I absolutely have had to deal with html handling code getting into situations like that in the past. Malicious input is real. So is plain old bad input. Reading the code before you use it is often a good idea.
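The call-stack failure mode is easy to demonstrate with a toy recursive-descent parser (nested brackets standing in for nested tags; the example is mine, not from the comment):

```python
def parse_nested(s, i=0):
    """Toy recursive-descent parser for nested brackets, standing in
    for a DOM builder: one Python stack frame per nesting level."""
    if i < len(s) and s[i] == '(':
        i = parse_nested(s, i + 1)          # recurse into the child node
        assert i < len(s) and s[i] == ')'
        return i + 1
    return i

parse_nested('(' * 50 + ')' * 50)           # shallow input: fine

deep = '(' * 100_000 + ')' * 100_000        # hostile depth from the network
blew_stack = False
try:
    parse_nested(deep)
except RecursionError:
    blew_stack = True
print('blew the call stack' if blew_stack else 'survived')
```

Python at least turns this into a catchable `RecursionError`; in C++ or Rust the same shape of parser simply crashes the process, which is the scenario the comment is warning about.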

[+] lifthrasiir|5 years ago|reply
I once tweeted some related quizzes. Can you guess the parse trees (or reserializations) for the following HTML fragments without invoking browsers? Assume that everything gets pasted right after the document body.

    1. <a b="42>c">d
    2. <a/b/c=d/e>f
    3. <a/="42>b
    4. <a x=&amp0>&amp0</a>
    5. a<!--->b<!--+->c<!-->d
Really, don't try to answer and just use compliant HTML parsers.
[+] bhaak|5 years ago|reply
What kind of HTML parser? A SGML one or a HTML5 one?

I'm really sad that they didn't go with an XML base for HTML5.

[+] gostsamo|5 years ago|reply
I get why there are all these complaints against the top answer. At the same time, one should appreciate its literary qualities in regard to structure and style.
[+] vbezhenar|5 years ago|reply
You can't in general case. But you can in lots of typical cases.

Actually, real-world HTML usually can't be parsed by any strict parser, as it's not valid. It's just machine-generated text that pretends to be similar to HTML. So extracting some bits of information with regexes often works.

[+] lifthrasiir|5 years ago|reply
I believe you really meant that you are frequently dealing with HTML whose structure is already known in advance, not general HTML. Because...

> [...] real world HTML usually can't be parsed by any strict parser [...]

There is a literal standard for parsing HTML [1]. Any conformant implementation (and there are plenty) can of course parse real-world HTML, by definition. It's just that you don't always need the full HTML parser to do your job.

[1] https://html.spec.whatwg.org/multipage/parsing.html

[+] syrrim|5 years ago|reply
The html parser spec defines what every sequence of bytes should parse into. It defines certain such sequences as containing "errors", but it still defines exactly how they should be parsed. There is no invalid html. Every browser follows the spec, so every browser will parse the same html to the same thing. This is true even if the html contains "errors". The only checking most html receives is to make sure it renders correctly in a browser. If you are writing your own parser, you likely want it to do the same thing as every browser. In that case, you should use a parser that conforms to the spec.
[+] retox|5 years ago|reply
Exactly my experience on this. In a past life I've had to parse valid HTML that was generated by a forum system; the user submitted something akin to bbcode [b]this sort of thing[/b] that was pre-parsed and converted to valid HTML, and then I had to parse that fragment again after the fact.

Given the constraints, it's entirely possible to parse (a subset of) an irregular grammar with regular expressions. Asking a question along those lines on SO would have only elicited responses that I/someone was doing it wrong.

I won't argue that it was or wasn't the wrong thing to do, but you don't always get to pick your client.

[+] indymike|5 years ago|reply
Funny thing. Email addresses need a rant like this too. Yes, you can parse 99% or so with a regex, but like HTML or XML, you really need an email address parser. RFC 2822 was designed to be parsed using string processing (in C no less) and requires some complexity that most regexes fail on. Here's a discussion about using the simpler, older RFC (822) and regex: https://stackoverflow.com/questions/20771794/mailrfc822addre...
[+] mumblemumble|5 years ago|reply
For most purposes, if you're trying to use parsing to achieve email address validation perfection, you've already lost the battle.

A valid email address typically isn't just a syntactically correct one; it's also one that can be used to get an email to the recipient. The only way to test that is to send an email and see if it gets to the recipient. Which is why it's much more common to see minimal client-side validation: a simple regex that will (ideally) match all valid email addresses but only catch gross syntactic errors like typing # where you meant @, more for the sake of decent UX than anything else. Asking the user to double-type their email address and sending an activation email then deals with finer-grained syntactic errors and the whole universe of non-syntactic errors.
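The "minimal client-side validation" being described is often as small as this (an illustrative pattern, not any particular site's; it deliberately rejects some exotic RFC-valid addresses in exchange for catching obvious typos):

```python
import re

# One @ with non-empty, space-free text on both sides and a dot in the
# domain. Real validation happens via the activation email, not here.
EMAIL_RE = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')

print(bool(EMAIL_RE.match('alice@example.com')))  # True
print(bool(EMAIL_RE.match('alice#example.com')))  # False: # where @ was meant
```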

[+] xg15|5 years ago|reply
Yes, it's hopeless to try and parse arbitrary HTML with regexes (or with anything, really; I believe the full HTML5 parsing algorithm is not even type-2, it's more or less Turing-complete. In-the-wild HTML can also interact with scripting in all kinds of entertaining ways, to the point that a conforming HTML5 parser has to be able to execute JavaScript while parsing, and the JavaScript can inject additional tokens into the parser's input stream. It's possible to create an HTML5 document that only validates on every second Thursday of the month.)

However, if we know the document is well-formed(!) XHTML, shouldn't it be possible? This would mean the document is valid XML and XML was specifically designed to be regex-friendly, I believe.

At least, off the top of my head, the only gotchas that have to be accounted for are comments and CDATA sections: those may contain arbitrary unescaped text, including angle brackets. However, they also have unambiguous start and end markers and can't be nested, so a regex could account for them.

Attribute values should not be a problem, as angle brackets must be escaped inside those to be valid XML.

I'm not sure about processing instructions and doctypes though.
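A hedged sketch of this idea as a tokenizer (my own illustration, assuming well-formed XHTML): comments and CDATA are matched first, precisely because their bodies may contain raw angle brackets; everything left is a tag or text.

```python
import re

TOKEN_RE = re.compile(
    r'(?s)(?P<comment><!--.*?-->)'        # (?s): let . span newlines
    r'|(?P<cdata><!\[CDATA\[.*?\]\]>)'
    r'|(?P<tag></?[^>]+>)'
    r'|(?P<text>[^<]+)'
)

doc = '<p>hi <!-- a > b --> there <![CDATA[ 1 < 2 ]]></p>'
tokens = [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(doc)]
print(tokens)
```

The stray `>` inside the comment and the `<` inside the CDATA section are swallowed by their respective alternatives instead of confusing the tag pattern. Processing instructions and doctypes would need two more alternatives of the same shape.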

[+] chubot|5 years ago|reply
Yes that's basically what I wrote about here:

https://news.ycombinator.com/item?id=26359556

I listed 3 or 4 caveats. CDATA might be another one since I'm not handling those... I've never used it so I left it out.

Actually I remember that regular languages weren't entirely enough; I think the .*? non-greedy matching was useful, e.g. for finding --> at the end of comments.

[+] theandrewbailey|5 years ago|reply
On my blog, I write my posts in markdown. After it's converted to HTML, I use regex to search and replace images (for high-res and alternative formats), and get the first paragraph (for a 'preview'). I've been doing this for years, so the 'never use regex over HTML' advice isn't holding up for me.
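The first-paragraph preview step of a workflow like that can be one non-greedy search (a hedged sketch under the assumption that the markdown converter always emits plain `<p>` tags, which is what makes this safe):

```python
import re

# Known generator output: the first <p>...</p> is the preview.
html = '<h1>Post</h1>\n<p>The opening paragraph.</p>\n<p>More text.</p>'

m = re.search(r'<p>(.*?)</p>', html, re.DOTALL)
preview = m.group(1) if m else ''
print(preview)  # The opening paragraph.
```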
[+] csours|5 years ago|reply
To me, this typifies working with technology and programming. Computer programs only ever look like they are working, because they have not encountered problem data or conditions.

Aka, it works on the happy path.

Software engineering is how we balance how much of the unhappy path and corner cases we take care of, and how we handle them, imo.

[+] Corazoor|5 years ago|reply
Well, as stated, that particular answer is both right and wrong...

Yes, you cannot use "true" regular expressions to parse recursive structures.

But the libraries that get used for regular expressions quite often include non-regular extensions (and confusingly call the resulting expressions still "regular").

Most notably, PCRE allows for recursive patterns via "(?R)". You can absolutely parse arbitrary HTML with it.

In fact you can parse anything with that, including binary formats. You just can't do it without recursively applying the same "regex" again and again...

And precise error handling is basically impossible without writing a proper lexer anyway, since your regex won't (can't, really) tell you where it was thrown off. It either works or doesn't, the "why" is left to the program to figure out...

[+] darkwater|5 years ago|reply
My 5yo daughter can already write more or less ok-ish, but she has big problems reading it back, especially when she makes spelling mistakes.

I feel more or less the same when I write regular expressions.

[+] JackeJR|5 years ago|reply
[+] lucideer|5 years ago|reply
Counter-counter-argument, one of the comments underneath that answer:

> what you have written is not really a regular expression (modern, regular, or otherwise), but rather a Perl program that uses regular expressions heavily. Does your post really support the claim that regular expressions can parse HTML correctly? Or is it more like evidence that Perl can parse HTML correctly?

[+] harveywi|5 years ago|reply
Do the Halting Problem next.
[+] ricardo81|5 years ago|reply
I think we've all (mostly?) tried it. It really is the Wild West of the web when you're trying to parse other people's HTML, though.

I've played around with this parser which is extremely quick. https://github.com/lexbor/lexbor

[+] a1369209993|5 years ago|reply
You can't parse fully-general HTML with regex, but unless you're writing a web browser or something, that's not what you're trying to do; you're trying to parse the particular subset of HTML that happens to be emitted by the particular website that you got the HTML from. And, much like the halting problem or integer factorization, despite the general case being difficult or impossible, the overwhelming majority of specific cases are easy.
[+] The_rationalist|5 years ago|reply
Aren't HTML code highlighters using regex? Isn't VS Code using TextMate regexes for syntax highlighting?
[+] chrismorgan|5 years ago|reply
Most code highlighters do. They're normally close enough to accurate for highlighting purposes (though they commonly get some uncommon constructs wrong), but they tend to fall apart when you try to use them for much more; for example, indentation based on regex-parsed HTML tends to start falling apart if you take what XML users might consider "shortcuts" (such as omitting optional end tags).
[+] layer8|5 years ago|reply
You can usually use regexes for tokenization, which is sufficient for syntax highlighting, but you generally can’t use regexes for parsing (nested structures).
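The tokenization-versus-parsing distinction can be made concrete with a scanner regex (token classes here are illustrative): it classifies flat spans for coloring but builds no tree, so nesting never enters the picture.

```python
import re

# Alternatives are tried in order; 'other' is the catch-all single char.
SCANNER = re.compile(
    r'(?P<comment>//[^\n]*)'
    r'|(?P<string>"[^"]*")'
    r'|(?P<keyword>\b(?:if|else|return)\b)'
    r'|(?P<other>.)'
)

code = 'if (x) return "ok"; // done'
kinds = {m.lastgroup for m in SCANNER.finditer(code)}
print(sorted(kinds))  # ['comment', 'keyword', 'other', 'string']
```

This is essentially what TextMate-style grammars do, which is why they work for highlighting yet can't tell you whether a tag is properly closed.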
[+] zihotki|5 years ago|reply
Should 2011 (when the answer was first provided) or 2009 (when the question was posted) be added to the title?
[+] usrusr|5 years ago|reply
Clearly it should be 20(?:09|11)

On the other hand, the inclusion of [X] in the title is more than enough to establish the historical setting.

[+] h2odragon|5 years ago|reply
This argument is common, and this is a good answer; but so often people aren't "parsing" XML but extracting a few bits of it and would have benefited from less cargo cult and more thought in the answer cited.

As it is, I've seen this article used to scare people away from "can i make a game behave differently?" efforts that would have been trivial to do and likely given these people a gateway "i can try to be a programmer" experience.

[+] zeroimpl|5 years ago|reply
This answer is right. The original question isn't asking anything about parsing, they are trying to search for specific tags, which is a great use for a regex. Yes, it will probably return some invalid matches unless you go out of your way to filter out comments and script tags, but odds are you don't need that.

Applying regexes only really counts as "parsing" when you are matching against an entire document. Searching is not parsing. The same argument applies if you apply grep to an HTML doc; I wonder why there are no posts about "you can't parse HTML with grep". I apply grep all the time to source code...

[+] goldsteinq|5 years ago|reply
You can't parse HTML with regex, but PCRE is not regex. I'm not sure if you can parse HTML with PCRE.
[+] legec|5 years ago|reply
Where have all the comments gone?

(note: I mean the comments on StackOverflow, not the comments here on Hacker News...)