Don't Invent XML Languages (2006)

[+] account-5|2 years ago|reply

This is less about hating on XML and more about not reinventing the wheel.

I quite like XML. Things like xpath make working with it, or getting data from it much easier than JSON; though I love jq syntax and can't wait until it starts being incorporated into languages. I don't even mind xslt provided it's not being over used.

[+] unwind|2 years ago|reply

Tim Bray [1] is a co-author of the original XML spec so he is probably not hating on XML so much, no.

[1]: https://en.m.wikipedia.org/wiki/Tim_Bray

[+] marcosdumay|2 years ago|reply

The article makes no point about XML at all, it's really "don't invent languages, unless it's your last hope".

[+] masklinn|2 years ago|reply

> I quite like XML. Things like xpath make working with it

XPath is great. 1.0 anyway, after that everything but the new functions was nonsense.

But do note: XPath is not an xml dialect, it’s a non-XML DSL being applied to documents.

[+] userbinator|2 years ago|reply

2006, well into the era of XML being the trendy fad that every piece of Serious Business software was supposed to use.

Now 18 years later, JSON seems to have displaced it.

Personally, I've never found text-based formats to be a good choice for data that humans will rarely need to read or write; I much prefer simple and efficient binary formats, which can be just as extensible without the additional inefficiency and needlessly-introduced-failures of string handling.

[+] mrweasel|2 years ago|reply

> I've never found text-based formats to be a good choice for data that humans will rarely need to read or write;

A computer may need to read the data millions of times, but humans only need to read it when things goes wrong. If you take the route of using a binary format, then you need to provide a robust set of tools to view, edit, debug that format. It can absolutely work, journald is binary, but has the tooling to help you get the plain text, when you need it. Protobuf could work well, but if you design your own binary data format you have to build some tooling as well.

For many (most) it doesn't matter much if you use XML or JSON, the computer will process it fast enough. If the concern is parsing speed, then yes, absolutely binary formats might be the best option. For things like HFT or anything else that processes vast volumes of data, its probably the only choice.

[+] devjab|2 years ago|reply

In my experience people read the JSON output of things more often than anything else in both development and in maintenance. You can certainly make a point about that being stupid, and it certainly also is stupid. It just seems to be what happens when you have a gazillion APIs and even more business processes and a lot of people who should know what they are doing, who don’t.

I recently had the joy of reverse engineering a terrible piece of JavaScript which created a vast and ridiculously complicated JSON delivery for one of our most important frontends. It was so full of any assumptions that the original author had several functions which were still doing various things with “unknown” because the data which had once been there wasn’t. Anyway, since I had to reverse engineer it from terrible code and no documentation I figured I’d just talk with the people who worked the data inputs and outputs and maybe learn something about the process from their business processes. Turns out they couldn’t help, what was even worse was how the frontend partner we were delivering the JSON to couldn’t even explain why they needed it the way it arrived. Those were some “fun” sessions trying to Sherlock Holmes out of the mess. This was a bad story, but it’s just one out of a thousand such stories in enterprise and non-tech IT.

I think the reason JSON “won” over XML was the ”same” human reason. I’ve seen a lot of XML, but I don’t think I’ve ever seen some that didn’t break its own standards. Like you’d get all XML with data between ><‘s and then suddenly a single field would have the data inside the <… value=“”>… because duck you, or whatever else. Often they wouldn’t bother supplying you with schemas, or if they did, their auto-generated schemas wouldn’t have been updated in five years because apparently someone forgot that. Then once you “progress” beyond the technical funzies you’ll still have to deal with using the XML to find out what the hell is going on, and since it’s much harder to read (not entirely sure why) you’ll have to first convert it into JSON anyway, before whatever business process person has to read it.

[+] tannhaeuser|2 years ago|reply

> Now 18 years later, JSON seems to have displaced it.

More like, the people who didn't understand XML was designed to simplify HTML's SGML serialization moved elsewhere to misuse JSON (and YAML, and other things), the use cases and limitations of which they also failed to understand.

From the XML spec (edited by Tim Bray, the author of the linked blog):

> XML is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. [1]

[1]: https://www.w3.org/TR/1998/REC-xml-19980210.html

[+] nocombination|2 years ago|reply

Generally there is misunderstanding of these markups. JSON came along because the JavaScript people found it convenient and more network-efficient (the irony being it's not really). But a million years later followed the schema-validation logic to back-fill deficiencies.

In my opinion, one should go for XML when writing a portable document format of any variety to allow a vast array of schema validators, linters, and so-forth. This makes writing plugins which target a given schema much more portable and easier to write.

It's kind of ridiculous how many times web folks reinvent the wheel. Seriously, get rid of YAML, TOML, JSON. We already had INI, XML/XPATH/XSD.

There's nothing wrong with XML, people are just lazy and so is JSON. It's the lazy, sloppy cousin of XML which was a well-thought-out standard to fill a deficiency in HTML for data.

Honestly though, we should be using something like this: http://openddl.org/

[+] DrDroop|2 years ago|reply

The funny thing about this discussing is that no one seems to care about the actual problem that is being solved by XML rather then just serialising and deserialising some data used by one piece of software. XML is mainly useful for exchanging data between organisations in a pre agreed format and schema where everyone is doing something else with the information. I'm implementing XBRL right now, and I can't say that I enjoy it, but I also wouldn't know how to do it better except maybe in some superficial way. Sure XML is bloated and maybe something like RDF/TTL would be better because of build in support for links, multiple languages and datatypes but that still leaves the problem of having to deal with whatever is expressed in a taxonomy that can be GB big.

[+] joshlk|2 years ago|reply

I’m sure it’s down to use case and circumstances but I frequently find it useful to be able to read and edit text based data formats (JSON/YAML/TOML). If you are concerned about data size you can easily use on-the-fly compression at little cost nowadays.

[+] rlv-dan|2 years ago|reply

A problem with binary formats is not now, it's tomorrow when tools have changed and documentation is lost (if there ever was any).

[+] eru|2 years ago|reply

> Personally, I've never found text-based formats to be a good choice for data that humans will rarely need to read or write; I much prefer simple and efficient binary formats, which can be just as extensible without the additional inefficiency and needlessly-introduced-failures of string handling.

Many modern serialisation / deserialisation libraries (like Rust's serde) let you serialise your data structures to a variety of formats, including both JSON and various binary formats of your choice.

But it should also be relatively easy to take any binary format, and define an alternate text based format for it that allows one-to-one conversion in both directions. A bit like assembly vs binary machine language.

[+] tootie|2 years ago|reply

On the flip side, XML is by far the superior choice for documents people will have to read. Paired with XSD it makes it a lot easier to created valid, well-formed documents that conform to expected input. Ideal for anything that isn't required to be transmitted over the wire frequently. Like config files. I feel like we could save the world thousands of hours of head-scratching if things like web server config or deployment descriptors were done in rigorously-defined XML formats instead of made-up formats like Dockerfile.

[+] eggdaft|2 years ago|reply

The Pragmatic Programmer explains why plain text is the way to go.

Biggest issue with binary is the difficulty reading / manipulating it long-term.

[+] _a_a_a_|2 years ago|reply

Easy human readability is a massive win when you have to diagnose problems.

[+] dfgdfg34545456|2 years ago|reply

Configuration files as text are nicer in source control.

[+] lpapez|2 years ago|reply

Thankfully we've moved on and no longer invent XML languages.

We use YAML now which is obviously much better.

[+] ansgri|2 years ago|reply

There are probably as many people that recognize the sarcasm as those that honestly think so.

[+] hun3|2 years ago|reply

Norway.

https://hitchdev.com/strictyaml/why/implicit-typing-removed/

[+] unknown|2 years ago|reply

[deleted]

[+] simpaticoder|2 years ago|reply

If you squint at XML, JSON, or YAML you see a kind of lispy data-structure shape, an n-arry tree. The reader has a context stack that they are pushing and popping from as they read. The real problem is that every problem space is isomorphic to one that has successively tighter context. And a format that is applicable to every problem is one that is applicable to no problem. I believe that computer languages must get worse at some things to get better at others in a zero-sum way. Any attempt to avoid this trade-off leaves you with a very powerful mush.

[+] wodenokoto|2 years ago|reply

No, xml is inherently different.

It’s a mark up language unlike the others which are data languages.

In xml you can nest data inside of data. In json you can have data structures that can hold data.

Eg., `<tag>lorem <tag2>ipsum</tag2>.</tag>` is not a thing in the others.

[+] 435345345|2 years ago|reply

I don't even know why we have any *ML. Everything they can do can be done with Lisp-syntax better.

[+] sanxiyn|2 years ago|reply

S-expression still needs a schema. Just as you can't say "use XML", neither you can say "use S-expression". Most of work of defining DocBook is about defining DocBook schema, not about defining DocBook XML syntax.

[+] phlakaton|2 years ago|reply

This argument has been made in one form or another for decades. There are good reasons it hasn't prevailed. S-expressions are clumsier at expressing semistructured data that is primarily strings, which is what markup languages were designed for.

Whether you think S-expressions are superior probably depends on the use case you have in mind. And there's a good chance it's not the use case you should be optimizing for.

[+] otabdeveloper4|2 years ago|reply

XML is a document markup language. As in, articles and books n stuff.

How in hell is that comparable to anything "Lisp-syntax"?

[+] dkersten|2 years ago|reply

I still see people, in 2024, writing new software and using XML as the data format. I don't have an example offhand, but I recently saw a hobby game engine using XML to store its engine-specific game object/scene data.

Personally, I like to use TOML for anything that is likely to also be edited by humans and JSON or binary for something that will only ever be used by machines.

[+] CharlesW|2 years ago|reply

> I still see people, in 2024, writing new software and using XML as the data format.

Because it's a pragmatic choice with a large, mature, proven ecosystem. (Not that there's anything wrong with TOML for INI-type use cases.)

I think some readers are missing the point of Tim's article. He's not saying "don't use XML" — he's saying "don't invent XML languages…unless you have to".

> "The smartest thing to do would be to find a way to use one of the perfectly good markup languages that have been designed and debugged and have validators and authoring software and parsers and generators and all that other good stuff. Here’s a radical idea: don’t even think of making your own language until you’re sure that you can’t do the job using one of the Big Five: XHTML, DocBook, ODF, UBL, and Atom."

Presumably, he'd recommend using RSS and OPML as well if those fit your use case.

[+] freedomben|2 years ago|reply

in the Java world? I've seen a couple XMLs on newer software but all from Java ecosystem, where there are substantial libraries in existence and the authors are already super familiar with it to the point where it's basically "free"

Everything else is YAML, JSON, or TOML (especially in the rust world)

[+] novagameco|2 years ago|reply

I like to use XML for UI markup, similar to html

[+] greenyoda|2 years ago|reply

Another standardized XML language that the article doesn't mention is RDF: https://en.wikipedia.org/wiki/Resource_Description_Framework

[+] oliviergg|2 years ago|reply

I still remember working briefly in 2008 with a thing called Magic (now uniPaas I think) XML all way down. Even the code : each line was a XML . You were supposed to code with a weird GUI. Still have nightmares

[+] joshlk|2 years ago|reply

Side question: when did XML start to loose favour to JSON? Did this happen because of something in particular or was it a gradual transition?

[+] lolive|2 years ago|reply

JSON can be translated into a structure of arrays, maps and primitive types. It can be traversed with functional programming.

That matches 100% the skillset/mindset of developpers.

Whereas XML has too many quirks and tricks, that it requires a dedicated API [aka DOM API] that you must learn and master.

XPath is the tool of choice for XML traversal. VERY versatile and super readable, but it cannot beat the ultimate flexibility of the functional programming.

[+] rosywoozlechan|2 years ago|reply

JSON is basically a subset of JavaScript. Parsing XML sucks, everyone rolled their own RSS and it was basically a nightmare having to deal with invalid XML. JSON has always been really strict and simple. If JavaScript in a browser couldn't parse it it just didn't work. libxml was also a nightmare, XSLT pages looked terrible and honestly there where just a lot of bad XML ideas out there despite gems like SVG. SOAP was horrible, and still is for people having to interface with a SOAP API like Netsuite.

But what was really the end of XML was the end of XHMTL and the HTML5 working group that said they weren't going to do XML HTML and it was funded by Google's large sums of money and involved talented people work at a hot new browser called Firefox and the web dev community went their way and everyone followed.

[+] chubot|2 years ago|reply

I don't know exactly when, but I think it had a lot to do with the fact that in 2004, Python, Ruby, and yes JavaScript were all fairly niche.

(This might be hard to believe for some, but it's true. GMail had just come out, and it was one of the first web APPS in JavaScript, certainly the first really good one. Paul Graham wrote "the Python paradox" https://paulgraham.com/pypar.html , and Ruby on Rails didn't exist yet.)

JSON maps well to core data structures of Python, Ruby, and (obviously) JavaScript, but XML doesn't.

XML seemed to be associated with Java. There was a lot more ceremony around the APIs. You didn't just manipulate data; you had to create classes and call methods.

So basically neither JSON nor XML are really convenient in Java. But JSON is very convenient in the dynamic languages. You don't need extra abstractions / layers / code generators -- it's just "there".

[+] unknown|2 years ago|reply

[deleted]

[+] jszymborski|2 years ago|reply

I could be wrong, but I recall that during the early AJAX days, there was a bit of a XML v JSON war that ended up with JSON winning. Some stated cases for JSON was that it was less verbose and therefore cheaper to send over the wire, among other stated advantages.

I'm sure there were plenty of other sources of fatigue like XHTML Schemas etc...

[+] aktenlage|2 years ago|reply

You may enjoy this podcast episode, which tells the story of the inventor of JSON: https://corecursive.com/json-vs-xml-douglas-crockford/

[+] scoofy|2 years ago|reply

When I was writing an xbrl-to-json library I suddenly realized that they are not perfectly transferable.

XML comes from spreadsheets, and so it was the first mover. Whereas json came from key-value pairs. I think key-value pairs are much easier to picture in your mind, and for the vast majority of work people are doing, it's just simpler.

I think XML is wildly complicated for simply APIs.

[+] gravpuse|2 years ago|reply

I had nightmare debugging sessions with Xml and xslt. Xml is big and ugly. Json was so much lighter and easier and so it won.

[+] unknown|2 years ago|reply

[deleted]

[+] bryanrasmussen|2 years ago|reply

several things helped to damage XML as the format of choice.

1. mismanagement by w3c of associated standards:

The awful bloated XML Schema spec that can't even validate many types of common XML design patterns, obviously as shown by Schematron it would be better if they had leveraged XPath in making the validation format they pushed on everyone and made people think wow, XML is this big complicated bloated beast, we need something else.

The awful bloated SOAP spec which Don Box once said if only XML schema had existed we wouldn't have had to make SOAP - let me tell you that was the best darn laugh I had in that year! Which of course with the whole Rest movement - very loosely based on a member of the W3C's PhD dissertation - and W3C committed to SOAP that everyone resented it made everyone resent the W3C and XML in turn.

The tying of second versions of successful standards in to the questionable XML Schema spec made members of communities using these successful technologies - XSL-T, XPath - feel that maybe they weren't enjoying the new versions of the tech so much, there was some drop-off and complaints.

The creation of XHTML as an XML dialect did not suit very many people.

2. Continued increase of the Web as platform of choice.

The successful XML technologies were not well suited to making web sites that were not document based. If your site was say a thin navigation structure to allow you to get around a bunch of documents a top level programming language to handle serving documents, transforming documents to XHTML with XSLT, and then a thin layer of JavaScript on top was quite a decent solution.

But XSLT is not really suitable to making all sorts of sites with lots of different data sources being pulled in to build a frontend. So when you have a lot of languages in use what do you do? You drop the language that is least suited to most of your tasks and use some of the other languages to take care of the dropped functionality, thus easing cognitive load.

I'm serious about this, I was very good with XSLT and associated technologies and built many high quality document based websites for large organization, but the XML stack of technologies is sub-par for building most modern websites that often contain multiple app-like functionalities on every page.

I suppose that the programmers and technologists at the W3C did not realize this because they did not build websites, they were more Enterprise applications, data pipelines and many coming from the publishing world.

3. JSON was being pushed by Douglas Crockford. As the E programming language was shutting down https://www.crockford.com/ec/etut.html he started to focus more of his time on JavaScript and arguing for JSON which essentially he identified as existing as a potential data interchange format. As REST started to take away from SOAP and JSON got pushed by someone who did understand web programming the increasing web focused software development environment moved away from SOAP and XML which were seen as being essentially the same to REST and JSON (or really REST-like and JSON-like) because these were seen as being simpler and quicker to iterate with - which is essentially correct.

On the Web, especially web frontend simple wins because frontend development is in many ways more complicated than other forms of development - why so? Because a frontend developer potentially has to handle very many more types of complexity than are generally handled in other programming disciplines - this often leads Frontend developers to cut corners that other disciplines wouldn't so as to cut cognitive load but here I'm definitely getting off the subject - at any rate for the expanding web market XML and its related technologies were a bundle of complexity that could be replaced with a simpler stack, even if it meant that some of the things that stack was good at might be made slightly harder it seemed and probably nearly always was still a significant win.

[+] jiggawatts|2 years ago|reply

I love listening to young developers guess at the history of XML, and why it was "complex" (it wasn't), and then turn around an reinvent that wheel, with every bit of complexity that they just said they didn't like... because it's necessary.

So a bit of history from someone who was already developing for over a decade when XML was the new hotness:

The before times were bad. Really bad. Everybody and everything had their own text-based formats.[1] I don't just mean a few minor variants of INI files. I mean wildly different formats in different character encodings, which were literally never provided. Niceties like UTF-8 weren't even dreamt of yet.

Literally every application interpreted their config files differently, generated output logs differently, and spoke "text" over the network or the pipeline differently.

If you need to read, write, send, or receive N different text formats, you needed at least N parsers and N serializers.

Those parsers and serializers didn't exist.

They just didn't. The formats were not formally specified, they were just "whatever some program does"... "on some machine". Yup. They output different text encodings on different machines. Or the same machine even! Seriously, if two users had different regional options, they might not be able to share files generated by the same application on the same box.

Basically, you either had a programming "library" available so that you could completely sidestep the issue and avoid the text, or you'd have to write your own parser, personally, by hand. I loooved the early versions of ANTLR because they made this at least tolerable. Either way, good luck handling all the corner-cases of escaping control characters inside a quoted string that also supports macro escapes, embedded sub-expressions, or whatever. Fun times.

Then XML came along.

It precisely specified the syntax, and there were off-the-shelf parsers and generators for it in multiple programming languages! You could generate an XML file on one platform and read it in a different language on another by including a standardised library that you could just download instead of typing in a parser by hand like an animal. It even specified the text encoding so you wouldn't have to guess.

It was glorious.

Microsoft especially embraced it and to this day you can see a lot of that history in Visual Studio project files, ASP.NET web config files, and the like.

The reason JSON slowly overtook XML is many-fold, but the key reason is simple: It was easier to parse JSON into JavaScript objects in the browser, and the browser was taking off as an application developer platform exponentially. JavaScript programmers outnumbered everyone else combined.

Notably, the early versions of JSON were typically read using just the "eval()" function.[2] It wasn't an encoding per-se, but just a subset of JavaScript. Compared to having to have an XML parser in JavaScript, it was very lightweight. In fact, zero weight, because if JavaScript was available, then by definition, JSON was available.

The timeline is important here. An in-browser XML parser was available before JSON was a thing, but only for IE 5 on Windows. JSON was invented in 2001, and XMLHttpRequest become consistently available in other browsers after 2005 and was only a standard in 2006. Truly universal adoption took a few more years after that.

XML was only "complex" because it's not an object-notation like JSON is. It's a document markup language, much like HTML. Both trace their roots back to SGML, which dates back to 1986. These types of languages were used in places like Boeing for records keeping, such as tracking complex structured and semi-structured information about aircraft parts over decades. That kind of problem has an essential complexity that can't be wished away.

JSON is simpler for data exchange because it maps nicely to how object oriented languages store pure data, but it can't be readily used to represent human-readable documents the way XML can.

The other simplification was that JSON did away with schemas and the like, and was commonly used with dynamic languages. Developers got into the habit of reading JSON by shoving it into an object, and then interpreting it directly without any kind of parsing or decoding layer. This works kinda-sorta in languages like Python or JavaScript, but is horrific when used at scale.

I'm a developer used to simply clicking a button in Visual Studio to have it instantly bulk-generate entire API client libraries from a WSDL XML API schema, documentation and all. So when I hear REST people talk about how much simpler JSON is, I have no idea what they're talking about.

So now, slowly, the wheel is being reinvented to avoid the manual labour of RETS and return to machine automation we had with WS-*. There are JSON API schemas (multiple!), written in JSON (of course), so documentation can't be expressed in-line (because JSON is not a markup language). I'm seeing declarative languages like workflow engines and API management expression written in JSON gibberish now, same as we did with XML twenty years ago.

Mark my words, it's just a matter of time until someone invents JSON namespaces...

[1] Most of the older Linux applications still do, which makes it ever so much fun to robustly modify config files programatically.

[2] Sure, these days JSON is "parsed" even by browsers instead of sent to eval(), for security reasons, but that's not how things started out.

[+] maxrecursion|2 years ago|reply

I used the OVAL "Open source Vulnerability Assessment Language", written in XML, daily to automate STIGs. Finding documentation for it was awful, but once I knew the syntax development was a breeze. Most chill job I ever had. A job like that is my retirement plan once I have enough money that salary no longer matters.

[+] megaperplex|2 years ago|reply

The author makes an argument against designing new XML languages. I think his arguments are weak. This does not mean I think we should design more XML languages, but that the arguments this particular author brings against it are weak. That having be said, the mid section with the tooling suggestions by use case is neat.

One thing he condemns such endeavors for is that it is unpleasant and somehow "political". I can see what he means, but this has nothing to do with "overdoing the extensibility" of XML. As Aaron Schwartz put it

"Instead of the "let's just build something that works" attitude that made the Web (and the Internet) such a roaring success, they brought the formalizing mindset of mathematicians and the institutional structures of academics and defense contractors. They formed committees to form working groups to write drafts of ontologies that carefully listed (in 100-page Word documents) all possible things in the universe and the various properties they could have, and they spent ours in Talmudic debates over whether a washing machine was a kitchen appliance or a household cleaning device. [https://www.cs.rpi.edu/~hendler/ProgrammableWebSwartz2009.ht...]"

It is true that similar endeavors are prone to looking for an Absolute Cosmic Eternal Perfect Ontological Structure (credit: Lion Kimbro). If you drop that idea in any office, you will get as many proposals for entities as there are anuses, as if anyone is entitled to an ontology.

Don't get me wrong, anyone might be entitled to submit an entity or criticize a hierarchy, but I think this is meaningful mostly in the context of targeted audience research and agile development practices. All in all, I think that the problem here is not with the 'X' in XML, but with poor organization-level practices.

Furthermore, I did follow the link and surveyed the XML languages. I did not see the apparently self-evident truth the writer sees in there. Sure, there are many of them, but how is this even an argument? Some of the listed languages seem quite cool to me, especially the science ones. And the next person might dig the legal ones. If the argument here is that "there are so many of them languages, they just can all be important" (or "real") does not sit well with me. There are tons of different programming languages, web frameworks, linux distributions, not to mention the incomprehensible multitude in other domains, such as car maker models or, well, birds.

It is just simplistic to disparage any number of things because they are too many to make readily sense of, and this is a cognitive stance I can't endorse. Look at Medical Subject Headings, or the Dewey Decimal or the Library of Congress cataloging systems. There is just a ton of things out there and for each one of those, there is a person that has more expertise on than yourself. These taxonomies might be important to them, what are you gonna do? Stop them?

A bird's view exasperation of the sheer number of things is the hallmark of a small town mentality that is untenable for the hacktivist mindset. The response here is, I guess, reusability of existing standards, and agile practices involving the user in the development process. But the author did not bring up any of these.

94 comments