top | item 41524506

(no title)

mjfisher | 1 year ago

Fascinating reading:

> The majority of developers are unacquainted with features such as processing instructions and entity expansions that XML inherited from SGML. At best they know about <!DOCTYPE> from experience with HTML but they are not aware that a document type definition (DTD) can generate an HTTP request or load a file from the file system.

I was one of them!

discuss

order

tannhaeuser|1 year ago

Developers are even less aware that SGML has (and always had) quantities in the SGML declaration, allowing among other things to restrict the nesting/expansion level of entities (and hence to counter EE attacks without resorting to heuristics).

Regarding DOCTYPE and DTDs, browsers at best made use of those to switch into or out of "quirks mode", on seeing special hardcoded public identifiers but ignored any declarations. WHATWG's cargo cult "<!DOCTYPE html>" is just telling an SGML parser that the "internal and external subset is empty", meaning there are no markup declarations necessary to parse HTML which is of course bogus when HTML makes abundant use of empty elements (aka void/self-closing elements in HTML parlance), tag omission, attribute shortforms, and other features that need per-element declarations for parsing. Btw that's what defines the XML subset of SGML: that XML can always be parsed without a DTD, unlike HTML or other vocabularies making use of above stated features.

Keep in mind SGML is a markup language for text authoring, and it would be pretty lame for a markup language to not have text macros (entities). In fact, the lack of such a basic feature is frequently complained about in browsers. The problems came when people misused XML for service payloads or other generic data exchange. Note SOAP did forbid DTDs, and stacks checked for presence of DTDs in payloads. That said, XML and XML Schema with extensive types for money/decimals, dates, hashes, etc. is heavily used in eg ISO 20022 payments and other financial messages, and to this date, there hasn't evolved a single competitor with the same coverage and scope (with the potential exception of ASN.1 which is even older and certainly more baroque).

bawolff|1 year ago

> Regarding DOCTYPE and DTDs, browsers at best made use of those to switch into or out of "quirks mode", on seeing special hardcoded public identifiers but ignored any declarations.

Not when processing XML mime types. In modern browsers that mostly means SVG files, but i think XHTML is still possible.

(Modern) HTML is neither SGML nor XML, so it doesn't follow the rules of either.

bawolff|1 year ago

Most of these exploits are so famous that common xml processors have disabled the underlying features.

So in practise you probably dont have to worry too much as long as you dont enable optional features in your xml library. (There are probably exceptions)

redbell|1 year ago

> I was one of them!

I still one of them!