(no title)
comex
|
26 days ago
If you already know where the start of the opening tag is, then I think a regex is capable of finding the end of that same opening tag, even in cases like yours. In that sense, it’s possible to use a regex to parse a single tag. What’s not possible is finding opening tags within a larger fragment of HTML.
kstrauser|26 days ago
You need a context-free grammar to correctly parse HTML with its quoting rules, and escaping, and embedded scripts and CDATA, etc. etc. etc. I don't think any common regex libraries are as powerful as CFGs.
Basically, you can get pretty far with regexes, but it's provably (like in a rigorous compsci kinda way) impossible to correctly parse all valid HTML with only regular expressions.
marcosdumay|26 days ago
DemocracyFTW2|25 days ago
This is what MDN (https://developer.mozilla.org/en-US/docs/Web/HTML/Guides/Com...) has to say:
Comments start with the string `<!--` and end with the string `-->`, generally with text in between. This text cannot start with the string `>` or `->`, cannot contain the strings `-->` or `--!>`, nor end with the string `<!-`, though `<!` is allowed. [...] The above is true for XML comments as well. In addition, in XML, such as in SVG or MathML markup, a comment cannot contain the character sequence `--`.
Meaning that you can recognize HTML comments with (one branch of) a RegEx—you start wherever you see `<!--` and consume everything up to one of the listed alternatives. No nesting required.
Be it said that I find the precise rules too convoluted for what they do. Especially XML's prohibition on `--` in comments is ridiculous taken on its own. First you tell me that a comment ends with three characters `-->`, and then you tell me I can't use the specific substring `--`, either? And why can't I use `--!>`?
An interesting bit here is that AFAIK the `<!` syntax was used in SGML as one of the alternatives to write a 'lone tag', so instead of `<hr></hr>` or `<hr/>` (XHTML) or `<hr>` (HTML) you could write `<!hr>` to denote a tag with no content. We should have kept this IMO.
*EDIT* On the quoted HTML source you see things like `-—` (hyphen-minus, em-dash). This is how the Vivaldi DevTools render it; my text editor and HN comment system did not alter these characters. I have no idea whether Chrome's rendering engine internally uses these em-dashes or whether it's just a quirk in DevTool text output.