The irony is that CDATA isn't even very useful: there's no way to escape the "]]>" closing delimiter, so you still have to invent some special escaping mechanism to use it.
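For what it's worth, the usual workaround isn't an escape at all but splitting the terminator across two adjacent CDATA sections ("]]" ends the first, ">" starts the second). A minimal sketch in Python (the helper name is mine):

```python
import xml.etree.ElementTree as ET

def cdata_wrap(text):
    # "]]>" cannot be escaped inside CDATA, so split it across two
    # sections: the first ends after "]]", the second starts with ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

payload = cdata_wrap("if (a[b]]>c) {}")
doc = "<code>" + payload + "</code>"
# Adjacent character data is merged back together by the parser.
assert ET.fromstring(doc).text == "if (a[b]]>c) {}"
```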
Nobody expects entity definitions in XML either, and yet about once a year some new service or software is found vulnerable to XXE attacks. (Summary: a lot of XML parsers can be made to open arbitrary files or network sockets and sometimes return the content.)
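To make the surprise concrete: many parsers expand DTD-defined entities by default, and that default is the mechanism XXE builds on (the external SYSTEM variant is the one that reads files). A small stdlib-only sketch of the default behavior, using an internal entity:

```python
import xml.etree.ElementTree as ET

# An internal entity defined in the DTD is expanded by default.
# Swap the definition for something like SYSTEM "file:///etc/passwd"
# and a parser configured to resolve external entities will inline
# file contents into the document in the same way.
doc = """<!DOCTYPE r [<!ENTITY greet "hello">]>
<r>&greet;</r>"""
assert ET.fromstring(doc).text == "hello"
```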
XML is a ridiculously complex document format designed for editing text documents. It is not a suitable data interchange format. Fortunately we have JSON now.
This sort of argument has never held water, and it constantly perplexes me that it gets a pass in tech discussions. Even if we accept the dubious assertion that it was "designed" for editing text documents, the origins of some invention say literally nothing about its utility for a purpose.
Further, as complex as XML may be, comparing the robust, rich, diverse ecosystem of XML support with the amateur hour, barely credible JSON world is quite a contrast. JSON doesn't even have a date type (and everyone seems to roll their own). It lacks any sort of robust validation or transformation system: XML schemas are really one of XML's greatest features, and the nascent, mostly broken facsimiles in the JSON world don't compare.
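The date point is easy to demonstrate: stdlib JSON encoders simply refuse dates, so every producer ends up picking its own string convention (ISO 8601 being the usual roll-your-own choice):

```python
import json
from datetime import datetime

try:
    json.dumps({"when": datetime(2015, 1, 1)})
except TypeError:
    pass  # JSON has no date type, so the encoder gives up

# So everyone rolls their own, typically an ISO 8601 string:
doc = json.dumps({"when": datetime(2015, 1, 1).isoformat()})
assert json.loads(doc)["when"] == "2015-01-01T00:00:00"
```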
JSON versus XML is a lot like NoSQL versus RDBMS -- the former is easier to pitch because its complete absence of a wide set of functionality seems like it makes implementations easier, when really it just pushes the complexity down the road.
> XML is a ridiculously complex document format designed for editing text documents. It is not a suitable data interchange format. Fortunately we have JSON now.
XML is about as simple as it gets for structured text documents. HTML is more complicated. Plain text is not expressive enough. Markdown, Asciidoc, reStructuredText, Wiki Creole, etc. all have pretty severe shortcomings by comparison to XML, and text processing systems will sometimes just convert those formats to XML. XML is easy to parse, easy to edit, and easy to emit.
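"Easy to parse, easy to edit, and easy to emit" is fairly literal; Python's stdlib does all three in a few lines (the document here is a made-up example):

```python
import xml.etree.ElementTree as ET

# Parse
doc = ET.fromstring("<note lang='en'><to>you</to><body>hi</body></note>")
assert doc.find("to").text == "you"

# Edit
doc.find("body").text = "hello"

# Emit
assert b"<body>hello</body>" in ET.tostring(doc)
```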
XML also gives us SVG, which is lovely.
Yeah, use JSON everywhere else. But XML is not ridiculously complicated. The 1.0 specification http://www.w3.org/TR/REC-xml/ is not very long.
One of the main problems isn't just encoding XML correctly, but the additive mistakes that arise when information is copied and reused, where along the way some part of the chain makes a mistake:
- Information scraped from a web page that was in ISO-8859-1
- Stored in a database that is Windows-1252
- Then emitted through an API in UTF-8 by someone who builds strings by concatenation ("<tag>" + string + "</tag>")
- Then stored in a new database as UTF-8 but not sanity-checked (i.e., MySQL rather than Postgres)
- Then emitted as an XML feed
...etc. Along the way someone forgets to encode the "&" and the data contains random spatterings of ISO-8859-1 characters, and you're screwed.
Most parsers I have encountered aren't lenient by default, and will barf on non-conformant input. So now the last link in the chain needs to sanitize and normalize, which is a pain.
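The failure mode in that chain is easy to reproduce: concatenate a raw "&" into markup and a conformant parser rejects the whole document. A sketch (the feed title is made up):

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

title = "Tom & Jerry"

# Naive concatenation: the bare "&" makes the document ill-formed.
bad = "<title>" + title + "</title>"
try:
    ET.fromstring(bad)
    parsed = True
except ET.ParseError:
    parsed = False  # strict parsers barf on the raw "&"
assert not parsed

# Escaping at emit time fixes it.
good = "<title>" + escape(title) + "</title>"
assert ET.fromstring(good).text == "Tom & Jerry"
```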
XML and JSON are different data formats with different properties and different uses.
Please don't blame one and praise the other; just use what's appropriate. It's a bit like data structures: you generally wouldn't use a graph when you actually need a set, right?
Trying to stash data into a semantically inappropriate format leads to kludges, and I'd say some JSON-based formats (for example, HAL-JSON) feel like that. Obviously, it's the same (or possibly even worse) for XMLish abominations like SOAP. Neither XML nor JSON is a silver bullet.
My point is, while sure, XML has its issues and arcane features, it's not universally terrible. I wonder if there's some standardized "XML Basic Profile": a subset of XML that's as minimal as possible yet still functional and expressive for the most typical use cases. XML itself was "extracted" from SGML in much the same manner.
> XML is [...] designed for editing text documents.
Was this ever a stated design goal? SGML, sure, probably; but I've never seen any evidence that XML had "document markup" as its sole intended application.
I've been following posts about this tool for a few weeks and it is really remarkable how many interesting results are already popping out, particularly since static analyzers have been around for years and years.
I'm assuming afl-fuzz is particularly CPU-bound, and it would be interesting to see some numbers about how many CPU years are being dedicated to it at the moment - and if we would see even more interesting stuff if a larger compute cluster was made available.
It's also super scary how "effortlessly" these bugs appear to be uncovered, even in "well-aged" software like "strings".
It would be pretty cool to have a public cluster that anyone can submit jobs to that are prioritized based on amount of donated CPU cycles. Instead of "Seti at home" it would be "fuzz at home".
No kidding. Security work aside, he finds time to take up time-intensive hobbies like CNC milling for robot parts, and has the time to write up comprehensive documents about the hobby?! (http://lcamtuf.coredump.cx/gcnc/)
Heads-up to the "comment without reading the article" crowd: the title is not bemoaning a lack of handling for CDATA in existing parsers. It's discussing an interesting behavior of the AFL fuzzer when used with formats that require fixed strings in particular places...
Related: NOBODY EXPECTS THE SPANISH INQUISITION, either. :)
This thread reminded me of a draft post I've been sitting on for a while, related to ENTITY tags in XML and XXE exploits.
Basically, it's really easy to leave default XML parsing settings (for things like consuming RSS feeds) and accidentally open yourself up to reading files off the filesystem.
I'm actually not so surprised, given what the fuzzer does - mutating input to make forward progress in the code. Incremental string comparisons definitely fall under this category since they have a very straightforward definition of "forward progress"; either the byte is correct and we can enter a previously unvisited state, or it's incorrect and execution flows down the unsuccessful path. It's somewhat like the infinite monkey theorem, except the random stream is being filtered such that only a correct subsequence is needed to advance.
On the other hand, I'd be astonished if it managed to fuzz its way through a hash-based comparison (especially one involving crypto like SHA1 or MD5.)
It's kind of like breaking a password if you only have to guess 1 letter at a time until you get it right. Reminds me of the Weasel program: https://en.wikipedia.org/wiki/Weasel_program
It's just the simplest possible demonstration of evolution, where characters of a string are randomly changed, and kept if more of the characters match. In a short amount of time you get Shakespeare quotes.
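Here's a toy version of that dynamic, with the fitness signal restricted to matched-prefix length, which is roughly what coverage feedback on a byte-at-a-time comparison provides (target, alphabet, and mutation strategy are all illustrative):

```python
import random
import string

TARGET = "CDATA"
ALPHABET = string.ascii_uppercase

def score(candidate):
    # Per-position feedback, like coverage on an incremental strcmp:
    # each correct prefix byte unlocks a previously unvisited state.
    n = 0
    for a, b in zip(candidate, TARGET):
        if a != b:
            break
        n += 1
    return n

random.seed(0)
best = "".join(random.choice(ALPHABET) for _ in TARGET)
while score(best) < len(TARGET):
    i = score(best)  # mutate the first non-matching byte
    cand = best[:i] + random.choice(ALPHABET) + best[i + 1:]
    if score(cand) > score(best):
        best = cand

# Roughly 26 guesses per byte instead of 26**5 for the whole string.
assert best == "CDATA"
```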
Obviously hashes are designed to be difficult to break. Although I've never heard of anyone trying a method like this before. I've heard of people using things like SAT solvers to try to reason backwards what the solution should be. But this is the reverse, it's trying random solutions and propagating forward to see how far they get.
I doubt it would work, I'm just curious to know if this has been tried before and how well it does.
Yeah, hashes or even CRC codes would be non-starters... unless the hash or CRC was stored in the input being fuzzed; then it's just a matter of iterating over the hash byte by byte.
Constant-time compares however would probably stump the fuzzer.
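The reason is the shape of the feedback: a C-style byte-wise loop exposes one new branch per matched byte, while a constant-time compare such as Python's hmac.compare_digest examines every byte with no data-dependent early exit, so coverage gives the fuzzer nothing to climb. A sketch of the contrast (the "leaky" version stands in for a C-style loop):

```python
import hmac

SECRET = b"CDATA"

def leaky_compare(guess):
    # Stand-in for a byte-wise memcmp loop: bails at the first
    # mismatch, so each correct prefix byte reaches a new branch.
    for g, s in zip(guess, SECRET):
        if g != s:
            return False
    return len(guess) == len(SECRET)

def constant_time_compare(guess):
    # Examines every byte regardless of mismatches: one uniform path,
    # no per-byte progress signal for a coverage-guided fuzzer.
    return hmac.compare_digest(guess, SECRET)

assert leaky_compare(b"CDATA") and constant_time_compare(b"CDATA")
assert not leaky_compare(b"CDXTA") and not constant_time_compare(b"CDXTA")
```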
It can't. If you download the package you'll see it includes an example of patching PNG, as otherwise the CRC at the end of each chunk prevents afl from doing much at all.
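The PNG case looks something like this: each chunk carries a CRC-32 over its type and data, checked in one shot, so any mutated payload fails with no intermediate signal unless the harness patches the check out (helper names are mine):

```python
import struct
import zlib

def png_chunk(ctype, data):
    # PNG chunk layout: 4-byte length, 4-byte type, data, 4-byte CRC
    # computed over the type and data bytes.
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)

def chunk_crc_ok(chunk):
    length, = struct.unpack(">I", chunk[:4])
    body = chunk[4:8 + length]  # type + data
    crc, = struct.unpack(">I", chunk[8 + length:12 + length])
    # All-or-nothing comparison: a one-bit mutation in the data flips
    # the result with no "progress" for a coverage-guided fuzzer.
    return zlib.crc32(body) & 0xFFFFFFFF == crc

good = png_chunk(b"tEXt", b"comment\x00hello")
assert chunk_crc_ok(good)

bad = good[:10] + bytes([good[10] ^ 1]) + good[11:]  # flip one data bit
assert not chunk_crc_ok(bad)
```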
The reasons why we are still relying a lot on software written in low-level languages have been discussed to death, and are quite orthogonal to the insight in the article, which is that seemingly lo-tech techniques can discover much about an opaque, potentially vulnerable piece of software. And even some seemingly insurmountable difficulties (“the algorithm wouldn't be able to get past atomic, large-search-space checks such as …”) may simply, with a bit of luck, fail to materialise.
Still, quoting from a sentence a few lines down in the article:
“this particular example is often used as a benchmark - and often the most significant accomplishment - for multiple remarkably complex static analysis or symbolic execution frameworks”
The author is thinking of backwards-propagation static analysis or symbolic execution frameworks, for which it is indeed a feat to reverse-engineer the condition that leads to exploring the possibility that there is a "CDATA" in the input. Forwards-propagation static analysis needs no special trick to assume that the complex condition must be taken some of the time and to visit the statements in that part of the code. The drawback of static analysis (especially with respect to fuzzing) is then with the false positives that can result from the fact that a condition was partially, or not at all, understood.
It's not very relevant to the article, as the author doesn't imply CDATA is poorly supported (and that's not the topic at hand), but CDATA sections are very common in RSS files, as a way to shoehorn text of any type into various elements, so I'd be surprised if any well-used parser lacked support. It's even more of a requirement than namespace support, IMHO.
TimWolla | 11 years ago
> CDATA sections may occur anywhere character data may occur;
(http://www.w3.org/TR/REC-xml/#sec-cdata-sect)
ScottBurson | 11 years ago
Of course, we had s-expressions long ago.
But I agree about XML.
cbsmith | 11 years ago
Even JSON is really a sin. Using simple binary data formats like protocol buffers makes so much more sense.
scott_karana | 11 years ago
Maybe he doesn't sleep.
mikeknoop | 11 years ago
I did a full write-up and POC here: http://mikeknoop.com/lxml-xxe-exploit
pjmlp | 11 years ago
Time to upgrade to more modern tools?