How to Avoid Being Called a Bozo When Producing XML (2005)

[+] bane|9 years ago|reply

My "favorite" XML formats are the one that are just some kind of weird meta-format and don't really use any of the XML features:

   <format>
      <record id="1">
         <field name="id" value="1"/>
         <field name="name" value="abc">blah blah</field>
         <field name="attribute">this is the attribute value</field>
         <field name="end_of_record" value="True"/>
      </record>
      <record id="2">
      ...
      </record>
   </format>

And yes, these types of abominations are everywhere.

The only way to avoid being called a Bozo when producing XML is to either

a) ensure that humans never had to see this craziness

b) don't use XML

XML as a config file format, in particular, is probably one of the worst ideas in computing.

[+] jroseattle|9 years ago|reply

> Don’t print

> Use an isolated serializer

Some old reference material (XML isn't as common as JSON anymore), but still worthwhile learning: don't output data formats directly. Directly = echo, print, printf,println...whatever your syntax suggests. I see this happen a lot with my junior engineers, and I have this same conversation with them.

Prefer to use data serializers that encapsulate all the syntactical rules that go along with XML, CSV, JSON, YAML, etc. Let the serializers do the grunt work of writing output in correct format.

Some serializers aren't always ideal - correctness and speed can be an issue. Nonetheless, prefer to use those mechanisms over writing your own output.

[+] zubat|9 years ago|reply

I use XML for a combination of features that I consider very important but are also perceived as "overkill": A source syntax that has already handled text escaping and encoding, lets me add some abstract structure, and lets me encode the text in a way that lets me nest different parsing modes for various kinds of structured data.

The first two are easy enough to get with your pick of JSON or S-Expressions. For a lot of things even CSV is enough, although CSV has the downside of being so simple that people opt to write an incorrect toolchain for it themselves instead of adding a dependency.

But it's the last feature that really produces the complexity. Once you get into "I want the inner structure to contain a different and unambiguous semantic meaning from the outer structure" you have a pretty substantial engineering problem. Less structured approaches like JSON or S-Expr's drop the problem on the floor by declaring one universal semantic, making the programmer deal with adding anything else on top. XML's compromises to achieve a more detailed representation of data involve the angle bracket tax, schema languages, etc.

If you want a guarantee that a rich data source can be processed correctly through an n-tier architecture that emits various radically different outputs, these compromises become compelling. I'm a big fan of DocBook, for example, and its canonical toolchain is an XSLT style sheet: The workflow I end up with is initial writing in a light syntax of choice, compile to DocBook XML, add additional formatting and styling in the XML, and then emit the final document in whatever forms needed - HTML, PDF, etc. It's extremely flexible, and you wouldn't get the same quality of result with a less extensive treatment.

For ordinary data serialization problems and one-offs, it is considerably less interesting.

[+] cptskippy|9 years ago|reply

XML is well regarded in the enterprise and languages like JAVA, C#, and VB.NET handle is spectacularly as an exchange format.

I think it's bad reputation comes from anyone not using an enterprise language because the support just isn't there.

I recall working with a partner who we were doing an identity federation with. Our system was using WS-Trust which is a SOAP/XML protocol. It wasn't ideal but everyone seemed to support it ok. These guys were cutting edge though and used Ruby on Rails.

No support for the protocol wasn't a huge deal, just means you have to craft your XML for your SOAP calls yourself. But at the time we were doing this, RoR didn't have SOAP or XML libraries. They had to write everything from the ground up. It sucked for me and I was just fielding rudimentary questions, I can't imagine how painful it must have been for them.

[+] wtbob|9 years ago|reply

> I think it's bad reputation comes from anyone not using an enterprise language because the support just isn't there.

On the contrary, I think that XML's bad reputation comes from the fact that it is <adverbial-particle modifies="#123">so</adverbial-particle> <adverb id="123">incredibly</adverb> <adjective>verbose</adjective>.

Also, the whole child/attribute dichotomy is a huge, huge mistake. I've been recently dealing with the XDG Menu Specification, and it contains a child/attribute design failure, one which would have been far less likely in a less-arcane format.

XML is not bad at making markup languages (and indeed, in those languages attributes make sense); it is poor at making data-transfer languages.

JSON has become popular because a lot of bad programmers saw nothing wrong with calling eval on untrusted input (before JSON.parse was available). It's still more verbose than a data transfer format should be, and people default to using unordered hashes instead of ordered key-value pairs, so it's not ideal.

The best human-readable data transfer format is probably canonical S-expressions; the best binary format would probably be ASN.1, were it not so incredibly arcane. As it is, maybe protobufs are a good binary compromise?

[+] einhverfr|9 years ago|reply

XML is very often the least bad format (compared with ASN.1, JSON, X12 EDI, CSV, and other interchange formats), particularly when dealing with statically typed languages. XML is a horrid chimera of SGML but at least it is both human readable, subject to machine validation, and gets the job done.

[+] qwertyuiop924|9 years ago|reply

Well, XML is complicated, so it's hard to build support for, and it's verbose, so it's heavy on the wire. Frankly, I think JSON is a better format in most contexts.

[+] chrisseaton|9 years ago|reply

What do you think JAVA stands for? It's not an abbreviation. It's the name of an island and it's just 'Java'.

[+] djur|9 years ago|reply

Ruby has had an included XML library since before Rails was released. soap4r is older than Rails too. I wrote my share of clients for SOAP services back then. soap4r wasn't fun to use but it mostly worked. If the service was really simple (a single call and response, for instance) it was sometimes more expedient to put together the request yourself.

When Savon came out 6-7 years ago it was a huge relief. Luckily, by that point, I was seeing a lot less SOAP. But even with Savon, the experience was only lifted to "not awful", never to "wow, I'm glad they used SOAP, this is so easy."

[+] SomeCallMeTim|9 years ago|reply

>well regarded in the enterprise

I think this alone should be enough to cast doubt on it, based on my (albeit limited) interactions with "enterprise" software.

>I think it's bad reputation comes from anyone not using an enterprise language because the support just isn't there.

What, like JavaScript? I've had to read and write XML packets from a Node app to work with (surprise!) an enterprise app. I had probably 20 choices of libraries with varying levels of features, and the one I chose worked fine.

I was lucky, compared to some of the others on this page: The "RPC"-style XML commands and responses I had to parse and generate were all well standardized, so I just wrote a wrapper that extracted the completely opaque tree of XML into a flatter JavaScript object/hash that was really easy to deal with, and similarly made a wrapper that would trivially generate the monstrous XML required to send commands and responses back to the server. My JSON-equivalent objects were easier to manipulate (and would also have been easier to deal with in Java or, in this case, C#), equally rich in the information they carried, but could have been serialized with 1/3 the number of bytes per message. Totally a win-win-win.

What I don't understand is why anyone thought using XML that way was a good idea, and why it still is popular in the enterprise. Bad habits are hard to break, I guess.

[+] int_handler|9 years ago|reply

My impression is that the reason why XML is so well-regarded in the enterprise is because these companies are not aware of better alternatives, such as Protocol Buffers [1]. The reason why XML has a bad reputation outside of the enterprise is because it is so incredibly verbose (both the language itself and the code used for working with it), and that all-in-all, it is a sub-optimal solution to a solved problem.

To illustrate: Protocol Buffers' wire format is much more compact. It removes the complexity of having to deal with XML parsers by providing classes generated from the message definition/schema. You can use it with GRPC to implement your service APIs. It is supported for many different languages, including Java and C#. It now even has a JSON mapping [2]. Overall, Protocol Buffers can do everything XML can do as both an exchange format and as a configuration language but better.

[1] https://developers.google.com/protocol-buffers/

[2] https://developers.google.com/protocol-buffers/docs/proto3#j...

[+] brightball|9 years ago|reply

Another part of it is that statically typed languages benefit much more from XML with a strictly defined schema like DTD or XSD because it makes it easier to generate the objects that you're going to have to map it into.

With a language like Ruby, PHP, etc that isn't strongly typed it's not nearly as big of a deal. Developers in those languages are used to assuming everything is a string and converting it to something useful without the need to premap every datatype.

That's probably the main reason that XML was so much more popular with the languages you mention compared to the parts of the ecosystem that didn't benefit from it's constructs much (if at all).

[+] ambrop7|9 years ago|reply

Some time back I needed to generate an XML file in a Java web application. I attempted to figure out how to do it "right". The only "special" requirement was that it is formatted in a readable way.

So I was figuring out the Java XML stuff (don't remember what that was exactly, probably standard). But at some point the timeout in my brain kicked in, and I just wrote a loop generating the XML by brute-force through PrintWriter or something. I even escaped strings right since some library I had available conveniently offered the escape method (Guava maybe?).

[+] chiph|9 years ago|reply

Back in the early days of XML, Internet Explorer would insert "+" characters to fold nested sections of XML. And was the default program to open .xml files. Guess what showed up in the documents I got from an integration partner?

[+] firewalkwithme|9 years ago|reply

It still does! and I get corrupted files like that mailed to weekly by integration partners. I may be wrong but I think FF also adds some crap to xml files when used as a viewer. I actually like xml, for some reason the structure of it makes alot of sense to me, while json is untidy and confusing

[+] Slartie|9 years ago|reply

Guess what caused a serious outage of a system at a customer that I know, with an estimated impact on his bottom line in the seven-digit area? Yeah, right: naive copying of some XML out of IE into the configuration of said system. Including those '+' characters, which resulted in it not exactly being XML anymore.

[+] kyllo|9 years ago|reply

I once got an XML file from an integration partner where the whole thing was XML escaped (all the tags looked like <node>value</node>) because they had embedded it within an outer "envelope" XML file. They saw nothing wrong with this and argued when I questioned it. I wonder how they were planning to express escape sequences within the inner XML document that was already escaped...

[+] cptskippy|9 years ago|reply

I think if you select/copied what was displayed you'd get the plus and minus signs. If you saved that wouldn't happen.

[+] Nuffinatall|9 years ago|reply

Compared to the problems when dealing with 'delimited text', XML is great.

Also it's flexible where you can specify properties as attributes or child nodes, depending on wildcard specifications.

So I have dealt with lots of edge-case XML situations, but the solutions are always straight forward. Also it helps to have a client vs. trying to parse out raw XML, which means programming and scripting sometimes relies on personal tool development. XML handles scope creep well.

[+] DougN7|9 years ago|reply

Handling scope creep is my favorite feature. With XML, it's easy to deserialize even if an expected element is not there, or if there is an extra one you're not expecting, at least that's been my experience. I haven't done much JSON but I'm not sure how that would work with it.

[+] 616c|9 years ago|reply

On the Cognicast there was an excellent tangent (all of them were good) in episode 106, where Michael Nygard bemoans with the fellow Cognitect Craig that, despite all the hate from the JSON generation, the failed promise of XML was the ability (again, that is part, not the whole) to have separated data and presentation with schemas, so you would not have to redesign endpoints all the damn time.

http://blog.cognitect.com/cognicast/106

This is just one view, and I am sure I will be mercilessly downvoted, as this is a gross simplification of that point, but it was one of many gems in that episode. I might finally review XSLT as this once again affirms things other devs told me when they said do not write off XML, in the complexity of it is something interesting.

[+] ams6110|9 years ago|reply

I loved XML and XSLT. And Internet Exporer, for all its faults, had great support for XSLT in the browser from version 5. It was quite easy to build "rich" single-page apps that get XML data from the server and build various user presentations by updating DOM with XSLT.

[+] erlehmann_|9 years ago|reply

The HTML for my blog is generated by applying an XSLT stylesheet to its feed.

You can see the stylesheet here: http://news.dieweltistgarnichtso.net/posts/atom2html.xsl

You can see the resulting web page here: http://news.dieweltistgarnichtso.net/posts/

[+] sethev|9 years ago|reply

I thought of that same exchange when I read this post but remembered it more as a lament that JSON doesn't support namespaces - so JSON is always context dependent.

[+] _greim_|9 years ago|reply

This article in some ways describes the delta from HTML development to XML development. In the early/mid 2000s, XML was cargo-culted through the tech world on a massive scale; typically being adopted by web developers who proceeded to apply the same habits and tools for XML as they'd been using for HTML. Which of course resulted in many of the issues mentioned.

[+] MichaelGG|9 years ago|reply

There's a popular piece of "newer" software that decided that XML rules were too difficult. So they URL encode all values. It also uses print style formatting for XML tag names, so if you manage to get a name value that has, say, a : in it, you'll get invalid tags. This is the default setup, in 2016, for a system that handles a lot of real-world telephone calls.

Even just a few years ago I've worked with companies that wrote their own "XML parser". They explained it was pretty easy but they had to "special case" for broken output in the real world. An example of this output? "<tag />".

HTML would have been far better off if it had the strictness of XML. Remove end tag names so you can't have invalid nesting. If browsers had refused to parse invalid docs from the start, invalid docs would not have been produced. (And like XML, they could provide decent error messages, so the difficulty would not be significantly raised.)

[+] beagle3|9 years ago|reply

I used to hate doing XML in Python - ElementTree was the nicest of them 10 years ago, but it still hurt.

But last year, I discovered xmltodict[0] and since then, I don't really care - it makes doing xml (both reading and writing) no more cumbersone than using dicts, while still supporting stuff like namespaces, CDATA and friends.

I still think XML is a horrible, misguided idea - from inception, but even more so in how it is used in practice - but I no longer feel any pain interfacing with it.

[0] https://github.com/martinblech/xmltodict

[+] Mikhail_Edoshin|9 years ago|reply

Python has a very good lxml module for advanced XML processing. You can define your own classes for XML elements, so you can read an XML file and get your own classes for the underlying elements. They're somewhat limited, you can easily define methods, but the data is locked to what's in XML. You can also define your own XPath functions and XSLT extensions. Comes very handy sometimes.

The API is still rather awkward though.

[+] zo1|9 years ago|reply

I think a big problem with XML in most languages is the tooling around it. The libraries to parse/create it are not very pleasant to work with because of the immense complexity they have to deal with. If they only had to conform to a very small subset of all of XML's features and quirks, you'd have a very sane ecosystem.

[+] legulere|9 years ago|reply

There's really no reason to use UTF-16 but compatibility with older software (which is usually broken when handling surrogate pairs). It's an atavism from times when all unicode codepoints fitted into 16 bit.

[+] yuhong|9 years ago|reply

I think that one boils to basically back in 1990, ISO 10646 wanted 32-bit characters but had no software folks on that committee, while the Unicode people was basically software folks but thought that 16-bit was enough (this dates back to the original Unicode proposal from 1988). UTF-8 was only created in 1992, after the software folks rejected the original DIS 10646 in mid-1991.

[+] Agentlien|9 years ago|reply

This reminds me of an interesting experience I had with XML at a pervious job a few years ago.

We had bought a product from another company which was to be integrated into our own main product. Theirs was horribly ugly, looking like a cross between a 90's website and an infomercial, predominately in vivid shades of pink and purple. And it was really buggy. I soon noticed that all the content (many hundred pages with text, video and interactive content) was specified in a giant XML file and that the application itself simply interpreted this file and presented it to the user. We quickly decided that the best course of action was for me to reverse-engineer this XML file and write our own code to generate an integrated version of it, presented in a visual style more in line with the rest of our own product. This meant we could also solve some of their bugs on the way.

I still feel this was the only reasonable option and it did work out within our given time frame. However, I will never forget the horrors I saw in that one file. A few gems included:

- The file was most certainly handwritten with lots of tag mismatches and spell errors in tag names.

- One of the main sections was missing in their own standalone version because of a syntax error which caused their program to skip over the entire main branch of the syntax tree in which it occurred.

- Exercises where you had to order a list of items were defined as dragging items into hit boxes on a static bitmap image of the numbers 1-10 on a purple background. The same image was used regardless of how many items had to be ordered. The hit boxes didn't align with those numbers at all and often overlapped. In their implementation, Items were stuck right where you dropped them, rather than snapping to a fixed position by the right number.

- We wrote a few tools to identify images and videos which were either present on disk but never referenced or vice versa. This was often a case of spelling errors, slight variations in word connotation or files placed in the wrong folder. In these cases, their original program would bail out and skip that page.

- Indices of chapters were written as plain text rather than inferred. They did not match how things were laid out in the XML and where it happened to align it was sooner or later broken by sections which were commented out or failed to parse.

There were many more issues, but these give some insight into the exciting challenge of getting their data to work in a consistent and logical manner. After the XML file had been thoroughly massaged into submission and uniformity, of course.

[+] jessaustin|9 years ago|reply

Please edit your post to eliminate the fixed-text:

- It will be easier to read.

- Reading won't require a lot of fiddly trackpadding.

- Maybe it would be nice if HN's simple markup system could handle the case in which the author wants a list of indented items, but it doesn't, and fixed-text is a poor substitute for that.

[EDIT:] Thanks!

[+] rwmj|9 years ago|reply

This is by no means totally bulletproof, but these C macros around libxml2 let us write nested well-formed XML expressions as code:

Example usage: https://github.com/libguestfs/libguestfs/blob/master/src/lau...

Macro definitions: https://github.com/libguestfs/libguestfs/blob/master/src/lau...

[+] oceanswave|9 years ago|reply

Totally, we took this a step further and created a subversion repository where xml documents describe classes. Each method is either inline, or is described by a xml element of a particular namespace that links to a subversion id and revision. ;)

[+] maze-le|9 years ago|reply

Some XML dialects become very confusing if features are added as an afterthought without consideration of syntax and sematics. Microsofts Wordprocessing XML for example has caveats like w:permStart:

    <w:permStart w:id="0" w:edGrp="editor"/>
    (...)
    <w:permEnd w:id="0"/>

permStart and permEnd define regions where special permissions are required to edit a document. It is encoded in a complete anti-XML syntax, where different tags (and a common ID) represent the start and end of a region.

[+] Mikhail_Edoshin|9 years ago|reply

Microsoft Wordprocessing XML is very quirky :) I think they use these markers because different areas can overlap and thus you cannot express this with a tree-like structure.

[+] coding123|9 years ago|reply

There are many flavors of XML and JSON out there now. I think for many developers JSON started to "look good" when the number of standards that started stacking up against XML (and XML-ish/SGML-ish/HTML-ish based formats) started to make people go insane. In the healthcare world we typically had to deal with a never ending set of "format standards" that kept integrating themselves together. I guess originally that may have been the beauty of XML... we started with XML RPC, moving on to SOAP 1.0, SOAP 1.1 introduced new ways to send headers. At some point however it just went crazy.. I think kinda when the enterprise-level people got their hands on things, they started porting all of their non-standard wack-job features into XML.

WS-Addressing - ok seems simple, but now your SOAP stack has to support async processing. WS-Trust - OK Let's add a simple feature that lets you put "some tokens" in the request and response for security, auditing, non-repudiation - good ideas sure. WS-Eventing - Let's add enterprise queuing to XML and soap and require stacks to support that, let the users of the stack figure out a way to connect that to the queues.

Anyway the list goes on, and you can read about it here: https://en.wikipedia.org/wiki/List_of_web_service_specificat...

Suffice it to say, but XML died because the developer now had to learn all of these, how they worked because one tiny industry body starts to adopt 1% of each, requiring implementors to learn 99% of all. It basically just made JSON attractive, a reset if you will.

XML won't go away. HTML will continue forever (it crosses a developer-designer "human line" that makes it kinda permanent) Developers adapt to future technologies a lot faster than designers and other's dabbling in HTML.

Now all this being said, you can see the list of standards piling up against JSON. There's really no critical mass ready replacement though, so JSON will be safe for quite a while longer. JSON will only be replaced in various "areas" like YAML for config, binary JSON-compatible representations for wire and/or storage.

I'm not biased against XML for data transfer, but if someone asked me to create a SOAP 1.1 service with WS-Trust, SAML tokens, etc... I'd also argue for a more industry accepted REST service with OAuth tokens, simply because it would be like introducing the Hummer all over again in age where Tesla's are everywhere. - everyone would hate us.

[+] Marazan|9 years ago|reply

XML is a perfectly fine format that was (ab)used dreadfully by many, many people to such an extent that many people only have examples of completely dreadful XML as their reference.

So many XML-as-interpreted-programming-language monstrosities out there (I know I wrote one as I had the perfect problem domain to use LISP but didn't have the environment capability to use LISP but did have a Database XML field to store 'data' in so I did XML-as-S-Expression with a SAX based interpreter - it was surprisingly nice).

[+] erlehmann_|9 years ago|reply

Discussions about XML and JSON often remind me of this comment on HN: https://news.ycombinator.com/item?id=5702868

Partial quote:

> XML can certainly be shorter than JSON and often is, and repeated tags are the best showcase for it:

     <user id="abc">
        <phoneNo type="home">123456789</phoneNo>
        <phoneNo type="work">321654987</phoneNo>
    </user>

> This turns into this beautiful JSON:

     [
      "users": [
	{
	  "id": "abc",
	  "phoneNos": [
	    { "type": "home", "value": "123456789" }, 
	    { "type": "work", "value": "321654987" }
	  ]
	}
      ]
    ]

[+] lmm|9 years ago|reply

Not a fair comparison since the JSON case includes the outer list as well. And whenever I've seen the equivalent of this in a real-world XML format it would use a <phoneNos> tag to group the phone numbers together.

[+] stesch|9 years ago|reply

Had to post this old article because I encountered some bozo code again. Reading more about some CMS and planning on using it for my blogs when I saw the code of the RSS feed. It was written by the lead developer of the CMS and used text templates.

[+] oceanswave|9 years ago|reply

The way your comment comes across is a bit irritating. Not understanding the underlying codebase and classifying based on an attenuated knowledge of a topic promotes one to the 'bozo' status more quickly than not. Many systems use text-template based feeds, examples are Shopify, Salesforce, Wordpress, and more. Are these systems fundamentally broken purely because of this approach? Probably not. In your case, are the text templates escaping their values when outputting? Are they validating for correct XML once generated? Have more questions than having pre-defined answers.

[+] oceanswave|9 years ago|reply

The author of this post is a bozo, doing any (or not doing any) of the suggested things does not guarantee well formed XML. Disregarding whole sections of the XML spec, prescribing a certain way to generate xml are more harmful than not. Can text templates generate well formed xml, absolutely. Can tools generate non-well formed xml, absolutely.

[+] wbkang|9 years ago|reply

> Making mistakes with them is extremely easy and taking all cases into account is hard.

He states why right there. He doesn't say anywhere whether templates can or cannot generate well-formed xml.

[+] Cozumel|9 years ago|reply

The sheer amount of sites that produce badly formed RSS feeds is staggering, the whole point of a feed is to make your content accessible to everyone, a bit like meta-tags. Why have it if you're not going to at least implement it properly?

[+] nialo|9 years ago|reply

I recently wrote a first pass at an RSS feed parser for podcasts, but couldn't find examples of interestingly malformed podcast feeds to test against. Do you have examples of sites with badly formed RSS feeds?

[+] 0x0|9 years ago|reply

I had no idea there's such a beast called "XML 1.1". That sounds fun!

244 comments