Parsing JSON is a Minefield

[+] s_q_b|9 years ago|reply

Well, first and most obviously, if you are thinking of rolling your own JSON parser, stop and seek medical attention.

Secondly, assume that parsing your input will crash, so catch the error and have your application fail gracefully.

This is the number one security issue I encounter in "security audited" PHP. (The second being the "==" vs. "===" debacle that is PHP comparison.)

As one example, consider what happens when the code opens a session, sets the session username, then parses some input JSON before the password is evaluated. Crashing the script at the json_decode() fails with the session open, so the attacker can log in as anyone.

Third, parsing everything is a minefield, including HTML. We as a community invest a lot of collective effort in improving those parsers, but this article does serve as a useful reminder of a lot of the infrastructure we take for granted.

Takeaways: Don't parse JSON yourself, and don't let calls to the parsing functions fail silently.

[+] mikeash|9 years ago|reply

A parser should never crash on bad input. If it does, that's a serious bug that needs immediate attention, since that's at least a DoS vulnerability and quite likely something that could result in remote code execution. You definitely need to assume that the parser could fail, but that's different. Unless you're using "crash" in some way I'm not familiar with?

[+] inimino|9 years ago|reply

Parsing HTML is literally orders of magnitude more complex.

It is pretty rare to need to parse JSON yourself (what environment doesn't have that available?) but it isn't that difficult. It's a simple language.

[+] rurban|9 years ago|reply

Oh my, this is pure nonsense.

First, parsing JSON is trivial compared to other parsing tasks. There are no cycles as in YAML or other serializers, it's trivial forward scanning, without any need to tokenize or backtracking.

Second, JSON is one of the simplest formats out there, and due its simplicity also its most secure. It has some quirks and some edge cases are not well-defined. But with those problems you can always check against your local javascript implementation and the spec, just as OP did.

I know very few JSON parsers which actually crash on illegal input. There are some broken ones, but there are much more broken and insecure by default YAML or BSON parsers or language serializers, like pickle, serialize, Storable, ...

Parsing JSON is not a minefield, parsing JSON is trivial.

Takeaway: Favor JSON over any other serialization format, even if there are some ill-defined edgecases, comments are disallowed and the specs are not completely sound. The YAML and XML specs are much worse, their libraries horrible and bloated.

JSON is the only secure by default serializer. It doesn't allow objects nor code, it doesn't allow cyclic data, no external data, it's trivial, it's fast.

Having summarized that, I'm wondering why OP didn't include my JSON parser in his list, Cpanel::JSON::XS, which is the default fast JSON serializer for perl, is the fastest of all those parsers overall, and is the only one which does pass all these tests. Even more than the new one which OP wrote for this overview STJSON. The only remaining Cpanel::JSON::XS problem is to decode BOM of UTF16 and UTF32. Currently it throws an error. But there are not even tests for that. I added some.

Regarding security: https://metacpan.org/pod/Cpanel::JSON::XS#SECURITY-CONSIDERA...

[+] moron4hire|9 years ago|reply

>> Well, first and most obviously, if you are thinking of rolling your own JSON parser, stop and seek medical attention

This attitude really pisses me off. Get off your high horse.

[+] TeeWEE|9 years ago|reply

Takeaway: Dont parse json. Every json parser is wrong. Json is not a standard.

Parsing everything is NOT a minefield. We have BNF's parser theory etc for that. Lots of languages have clear unambigious definitions... Json clearly not. ITs a disgrace for the software engineering community.

[+] lucb1e|9 years ago|reply

> Well, first and most obviously, if you are thinking of rolling your own JSON parser, stop and seek medical attention.

Been there done that. (The medical attention, I mean.) Worked just fine. The article makes it sound extremely difficult, but 100% of the article is about edge cases that rarely happen with normal encoders and can often be ignored (e.g. who cares if you escape your tab character?).

> consider what happens when the code opens a session, sets the session username, then parses some input JSON before the password is evaluated

Edit: I responded more elaborately on the unlikelihood of this, but honestly, I can't come up with a single conceivable scenario. How would you decode part of the JSON and only parse the password bit later?

[+] MaulingMonkey|9 years ago|reply

> Well, first and most obviously, if you are thinking of rolling your own JSON parser, stop and seek medical attention.

As someone who has written his own JSON parser, I must concur. Ahh - are there any doctors here...?

In my defense - I was porting a codebase to a new platform, and needed to replace the existing JSON 'parser'. You see, it was:

  - Single-platform
  - Proprietary
  - Little more than a tokenizer with idiosyncrasies and other warts

Why was it chosen in the first place? Well, it was available as part of the system on the original platform. Not that I would've made the same choice myself. We had wrappers around it - but they didn't really abstract it away in any meaningful manner. So all of it's idiosyncrasies had leaked into all the code that used the wrappers. In the interests of breaking as little existing code as possible, I wrote a bunch of unit tests, and rewrote the wrapper in terms of my own hand rolled tokenizer. Later - either after the port, or as a side project during the port to help out a coworker (I forget) - I added some saner, higher level, easier to use, less idiosyncratic interfaces - basically allowing us to deprecate the old interface and clean it up at our leisure. This basically left us with a full blown parser - and it was all my fault.

> Takeaways: Don't parse JSON yourself, and don't let calls to the parsing functions fail silently.

I'd add to this: Fuzz your formats. All of them. Even those that don't receive malicious data will receive corrupt data.

Many of the same problems also affect e.g. binary formats. And just because you've parsed valid JSON doesn't mean you're safe. I've spent a decent amount of time using e.g. SDL MiniFuzz - fixing invalid enum values, unchecked array indicies, huge array allocations causing OOMs, bad hashmap keys, the works. The OOM case is particularly nasty - you may successfully parse your entire input (because 1.9GB arrays weren't quite enough to OOM your program during parsing), and then later randomly crash anywhere else because you're not handling OOMs throughout the rest of your program. I cap the maximum my parser will allocate to some multiple of the original input, and cap the original input to "something reasonable" (1MB is massive overkill for most of my JSON API needs, for example, so I use it as a default.)

[+] RangerScience|9 years ago|reply

I had a need to write my own JSON parser for C#, although that was mostly because I hated the data structures the existing C# parsers produced.

I had the advantage that I only needed to use it for rapid prototype projects, and that I could count on all of the data from a single source being the same "shape" (only the scalar values changed, never the keys or objects).

Not following the RFC helped greatly, as I just dgaf about things like trailing commas.

The biggest "gotcha" for my first implementation was a) temporary object allocation and b) recursion depth. The third prototype to use the parser needed to parse, I think it was, a ten thousand line JSON file? The garbage collection on all the temporary objects (mostly strings) meant that version took ~30 seconds. I refactored to be non-recursive and re-use string variable wherever possible, and it dropped down to seconds.

There was a moment in writing it that I thought I would have an actually legitimate use for GOTO (a switch case that was mostly a duplicate of another case), but that turned out not to be the case :/

[+] edoceo|9 years ago|reply

The article is about how the parsers behave differently. No suggestions of writing from scratch.

The === issue is in JS as well.

[+] legulere|9 years ago|reply

There's nothing wrong with writing a JSON parser, but you shouldn't expect to do anything else for the next time if you want to do it correctly.

[+] Kiro|9 years ago|reply

Why would you set the session before the password has been evaluated?

[+] unknown|9 years ago|reply

[deleted]

[+] mi100hael|9 years ago|reply

> In conclusion, JSON is not a data format you can rely on blindly.

That was definitely not my take-away from the article. More like "JSON is not a data format you can rely on blindly if you are using an esoteric edge-case and/or an alpha-stage parsing library." I haven't ever run into a single JSON issue that wasn't due to my own fat fingers or trying to serialize data that would have been better suited to something like BSON.

[+] SloopJon|9 years ago|reply

Figures that something like this would be posted on my day off. I put this through a parser that I cover, and found that the only failures were for top-level scalars, which we don't support, and for things we accept that we shouldn't. I'll look through the latter tomorrow, as well as the optional "i_" tests.

Test suites are a huge value add for a standard, so thank you, Nicolas, for researching and creating this one. I was surprised that JSON_checker failed some of the tests. I use its test suite too.

[+] gcirino42|9 years ago|reply

The correct answer to parsing JSON is... don't. We experimented last hackday with building Netflix on TVs without using JSON serialization (Netflix is very heavy on JSON payloads) by packing the bytes by hand to get a sense of how much the "easy to read" abstraction was costing us, and the results were staggering. On low end hardware, performance was visibly better, and data access was lightening fast.

Michael Paulson, a member of the team, just gave a talk about how to use flatbuffers to accomplish the same sort of thing ("JSOFF: A World Without JSON"), linked in this thread: https://news.ycombinator.com/item?id=12799904

[+] DanielRibeiro|9 years ago|reply

Wow! This was a great practical analysis of existing implementations, besides a great technical overview of the spec(s). Thanks for open sourcing the analysis code[1], and for the extended results[2]

[1] https://github.com/nst/JSONTestSuite

[2] http://seriot.ch/json/parsing.html

[+] dgreensp|9 years ago|reply

An informative article. The point is not that parsing JSON is "hard" in any sense of the word. It's that it's underspecified, which leads to parsers disagreeing.

Although the syntax of JSON is simple and well-specced:

* The semantics are not fully specified

* There are multiple specs (which is a problem even if they are 99% equivalent)

* Some of the specs are needlessly ambiguous in edge cases

* Some parsers are needlessly lenient or support extensions

[+] kstenerud|9 years ago|reply

I did write my own parser, but for a reason: I need it to be able to recover as much data as possible from a damaged, malformed, or incomplete file.

Turns out that a good chunk of these tests are for somewhat malformed, but not impossible to reason about files. Extra commas, unescaped characters, leading zeroes... I'd rather just accept those kinds of things rather than throw an error in the user's face. It's a big bad world out there, and data is by definition corrupt.

And this is borne out when I plug my parser into this test suite: Many, many yellow results, which is exactly how I want it.

[+] paulddraper|9 years ago|reply

Lots of issues are trivially answered.

---

> Scalars..In practice, many popular parsers do still implement RFC 4627 and won't parse lonely values.

Right. RFC 7159 expanded the definition of a JSON text.

> A JSON text is a serialized value. Note that certain previous specifications of JSON constrained a JSON text to be an object or an array.

If RFC 7159 wasn't different from 4627, there'd be no reason for 7159. Same with RFC 1945 and 7230 for HTTP. (Of course, HTTP is versioned...maybe he just means to repeat the earlier versioning criticism.)

---

> it is unclear to me whether parsers are allowed to raise errors when they meet extreme values such 1e9999 or 0.0000000000000000000000000000001

And then quotes the relevant part of the RFC 7159 grammar with answers the question:

> This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.

Parsers may limit this however they like. And so may serializers. This includes yielding errors. (Though approximating the nearest possible 64-bit double is IMO the better choice.)

---

So yeah, in the end there is fair amount of flexibility in standard JSON.

To summarize:

> An implementation may set limits on the size of texts that it accepts.

> An implementation may set limits on the maximum depth of nesting. [this one was never mentioned though]

> An implementation may set limits on the range and precision of numbers.

> An implementation may set limits on the length and character contents of strings.

Most implementations on 32-bit platforms will not parse 5GB JSON texts.

[+] peatmoss|9 years ago|reply

What ever happened with EDN (pronounced "eden") from the Clojure people? https://clojure.github.io/clojure/clojure.edn-api.html https://github.com/edn-format/edn

I always thought that seemed like a nice alternative data format to JSON. Anyone using this it in the wild?

[+] grayrest|9 years ago|reply

It got supplanted by Transit [0] as an interchange format. Both are only really used within the Clojure community. I use Transit for internal-facing APIs.

[0] https://github.com/cognitect/transit-format

[+] arohner|9 years ago|reply

Clojure programmers use it everywhere. I suspect almost nobody else does though.

[+] Confusion|9 years ago|reply

There was a great article at some point that explained why 'be liberal in what you accept' is a very bad engineering practice in certain circumstances, such as setting a standard, because it causes users to be confused and annoyed when a value accepted by system A is subsequently not accepted by supposedly compatible system B. Leading to pointless discussions about what the spec 'intended' and subtle incompatibility. Anyone know what article I mean?

[+] _wiv7|9 years ago|reply

"NaN and Infinity"

Yeah. And I learned this the hard way with the Perl module JSON::XS. It successfully encodes a Perl NaN, but its decoder will choke on that JSON. (Reported it to the maintainer who insists that is consistent with the documentation and wouldn't fix it)

[+] realkitkat|9 years ago|reply

If JSON is comparable to minefield, then I guess XML and ASN.1 are nothing short of nuclear Armageddon in complexity and ones ability to shoot themselves into the leg ;-)

[+] kowdermeister|9 years ago|reply

I still love JSON regardless :) Client / server side languages have first class support for serialization and in most cases the data structures are rather easy.

I'd be very skeptical if one would suggest an alternative format for a web based project, however I can imagine such situations.

[+] indexerror|9 years ago|reply

> In conclusion, JSON is not a data format you can rely on blindly.

What does HN suggest for configuration files (to be written by a human essentially)?

I am looking at YAML and TOML. My experience with JSON based config files was horrible.

[+] robert_tweed|9 years ago|reply

Funnily enough, as I've been experimenting with Chef and trying to stick to JSON config files where allowed, I was again struck that (a) it's not a good choice for config files (b) it's an OK choice though (c) lots of people are using it anyway (d) nearly everyone that does so (including Chef) allows comments, so in reality are not actually using JSON at all.

Point (d) is the important one. I really think we need a standard for json-with-comments. JSONC or whatever, but it should have a different standard filename and it should have an RFC dictating what is and isn't allowed. Personally I would allow only // comments because there are too many subtle issues with C-style comments, but it may be too late to agree on that.

Half the point of JSON is that if application A stores its data as JSON then application B can parse that without any nasty surprises. Except, there are now probably thousands of noncompliant implementations in the wild that only exist because the standard doesn't allow comments. Each one of those standards adds subtle differences (in addition to the comments themselves) depending largely on how they remove the comments before passing to the standards-compliant JSON parser (assuming they do that, which being DC's recommended approach, is as close to a standard as currently exists).

[+] petre|9 years ago|reply

YAML is equally horrible and the spec is an order of magnitude more complex. I wasted half an hour trying to spot an error in the ejabberd yaml config, only to find out something trivial was missing. At least JSON has braces even though it's not suitable for configuration files. By all means choose TOML or something else (even ini or java properties files) instead.

[+] steveklabnik|9 years ago|reply

I've used JSON and YAML for a very long time, but whenever I have the option, I'll be using TOML every time.

YAML is _huge_ (did you know all JSON is valid YAML?) and those features can come back to bite you... http://blog.codeclimate.com/blog/2013/01/10/rails-remote-cod...

JSON has a number of annoyances, mostly no comments, no trailing commas, all the stuff this article gets into.

Formats like INI or CSV don't really have a spec, or if they do, most implementations don't seem to follow them.

TOML is a bit weird at first, but it's grown on me quite a bit.

[+] marios|9 years ago|reply

I don't have a specific recommendation, but when I see a project uses a JSON file as configuration, I wonder: "hasn't the author ever needed to include a comment in the configuration ?".

[+] kozhevnikov|9 years ago|reply

https://github.com/typesafehub/config/blob/master/HOCON.md

[+] isidor3|9 years ago|reply

Lua was written for exactly this purpose, and I personally enjoy writing with it, so it would be my first choice in most cases.

[+] tantalor|9 years ago|reply

Protocol Buffers text format: https://developers.google.com/protocol-buffers/docs/overview...

[+] FreeFull|9 years ago|reply

In my personal experience TOML works really well. It's a little reminiscent of .ini files, but definitely is better.

[+] Ketak|9 years ago|reply

I have actually used Lua before with good success. It was on a smaller scale, so I can't speak to edge cases, but I would certainly recommend considering it at the least.

[+] sirn|9 years ago|reply

If you decide to use YAML, make sure to check out Strict YAML[1] and its FAQ[2].

[1]: https://github.com/crdoconnor/strictyaml

[2]: https://github.com/crdoconnor/strictyaml/blob/master/FAQ.rst...

[+] mamcx|9 years ago|reply

INI files.

Of course, some lunatics try to embed big amounts of text and that is where INI files not look ok.

[+] empath75|9 years ago|reply

i actually like hcl: https://www.terraform.io/docs/configuration/syntax.html

[+] lucb1e|9 years ago|reply

> What does HN suggest for configuration files (to be written by a human essentially)?

JSON.

YAML confuses many people by being whitespace-sensitive; ini files I find too limited.

[+] jzwinck|9 years ago|reply

YAML or TSV depending on whether your configuration looks like a rectangular table.

If you want extreme flexibility using C++ as the main language, take a look at my project: https://github.com/jzwinck/pccl

It lets you configure your C++ apps using Python. Config items can even be Python functions.

[+] slasaus|9 years ago|reply

After trying json, yaml, json5, java properties, ini and toml, I finally choose hjson* as the configuration file format for the software I'm building. It's the easiest format to read and write IMHO, a bit like nginx config files.

* http://hjson.org

[+] uiri|9 years ago|reply

I would recommend TOML, but I am a bit biased as the author of the toml python package.

[+] metafunctor|9 years ago|reply

The page has been taken down for some reason (getting a 403).

Google cache: http://webcache.googleusercontent.com/search?q=cache:8jVuBmx...

[+] dep_b|9 years ago|reply

Now the mess that is called JavaScript dates has crept into any system imaginable in the world. I can understand we needed to go for the lowest denominator but Crockford's card really could cram in another line with a date time string format.

[+] eridius|9 years ago|reply

Speaking as someone who wrote a JSON parser, this article and the accompanying test suite looks to be very valuable, and I will be adding this test suite to my parser's tests shortly.

That said, since my parser is a pure-Swift parser, I'm kind of bummed that the author didn't include it already, but instead chose to include an apparently buggy parser by Big Nerd Ranch instead. My parser is https://github.com/postmates/PMJSON

[+] jayd16|9 years ago|reply

tl;dr JSON with a bunch of shitty extensions is awful. The error handling among JSON parsers is inconsistent.

[+] ohstopitu|9 years ago|reply

When I didn't know better, I wrote my own JSON parser for Java (it was years back and I didn't know about java libraries). From experience: DON'T. DO. IT.

That said, if you have decided to do it....

1) know fully well that it'll fail and build it with that assumption.

2) Please, please, please...give useful error messages when it does fail or you'd be spending way too much time over something simple.

[+] SFJulie|9 years ago|reply

By sheer randomness I was having the thought about it today: I made some code to highlight where the stdlib json module sees the mistakes in JSON decoding in the python stdlib.

I used the exception with string "blabal at line x, col y, char(c - d)" to actually highlight (ANSI colors) WHERE the mistake were.

https://gist.github.com/jul/406da833d99e545085dac2f368a3b850...

I played a tad with it, and the highlighted area for missing separators, unfinished string, lack of identifier were making no sense. I thought I was having a bug. Checked and re-checked. But, No.

I made this tool because, whatever the linters are I was always wondering why I was not able to edit or validate json (especially one generated by tools coded by idiots) easily.

I thought I was stupid thinking json were complex.

Thanks.

292 comments