The Norway Problem

[+] keeperofdakeys|5 years ago|reply

This is part of a more general problem: they had to rename a gene to stop excel auto-completing it into a date.

https://www.theverge.com/2020/8/6/21355674/human-genes-renam...

Edit: Apparently Excel has its own Norway Problem ... https://answers.microsoft.com/en-us/msoffice/forum/msoffice_...

[+] masklinn|5 years ago|reply

> This is part of a more general problem

The more general problem basically being sentinel values (which these sorts of inferences can be treated as) in stringly-typed contexts: if everything is a string and you match some of those for special consideration, you will eventually match them in a context where that's wholly incorrect, and break something.

[+] dalbasal|5 years ago|reply

I suppose this is a cliched thought, but the more general problem is kind of emblematic of current "smart" features... and their expected successors.

OOH, this is a typically human problem. We have a system. It's partly designed, partly evolved^. It's true enough to serve well in the contexts we use it in on most days. There are bugs in places (like norway, lol) that we didn't think of initially, and haven't encountered often enough to evolve around.

In code, we call it bugs. In bureaucracy, we just call it bureaucracy. Agency A needs institution B's document X, in a way that has bugs.

Obviously, it's also a typical machine problem. @hitchdev wants to tell pyyaml that Norway exists, and pyyaml doesn't understand. A user wants to enter "MARCH1" as text (or the name of a gene), and excel doesn't understand.

Even the most rigid bureaucracy is made of people and has fairly advanced comprehension ability though. If Agency A, institution B or document X are so rigid that "NO" or "MARCH1" break them... it probably means that there's a machine bug behind the human one.

Meanwhile... a human reading this blog (even if they don't program) can understand just fine from context and assumptions of intent.

IDK... maybe I'm losing my edge, but natural language programming is starting to seem like a possibility to me.

^I feel like we need a new word for these: versioned, maybe?

[+] bilalq|5 years ago|reply

I don't understand why those support agents for Microsoft just threw their hands up in the air and asked customers to go through some special process for reporting the bug in Excel. Why are they not empowered/able to report the issue on behalf of customers? It's so clearly a bug in Excel that even they are able to reproduce with 100% reliability.

[+] andrepd|5 years ago|reply

I'd say the more general problem is a bad type system! In any language with a half decent type system where you can define `type country = Argentina | ... | Zambia` this would be correctly handled at compile-time, instead of having strange dynamic weak typing rules (?) which throw runtime errors in production (???).
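
A rough sketch of that idea in Python, assuming a hypothetical `Country` enum and `parse_country` helper (neither is from the article, and the member list is abbreviated). Python only checks this at runtime, so it is weaker than the compile-time guarantee described above, but decoding becomes an explicit step that can never produce a boolean:

```python
from enum import Enum


class Country(Enum):
    """Hypothetical closed set of country codes (abbreviated for illustration)."""
    ARGENTINA = "AR"
    NORWAY = "NO"
    ZAMBIA = "ZM"


def parse_country(raw):
    """Convert a raw string to a Country, or raise ValueError."""
    # Enum lookup by value raises ValueError for anything outside the set.
    return Country(raw)


print(parse_country("NO"))  # Country.NORWAY
```

Anything outside the declared set, such as `"False"`, is rejected with a `ValueError` instead of being coerced into some other type.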

[+] zoward|5 years ago|reply

An even more general problem is that we as humans use pattern-matching as a cerebral tool to navigate our environment, and sometimes the patterns aren't what they appear to be. The Norway problem is the programming equivalent of an optical illusion.

[+] WalterBright|5 years ago|reply

Good language design involves deliberately adding redundancy which acts like a parity bit in that errors are more likely to be detected.

[+] mcv|5 years ago|reply

The real problem here is that people use Excel to maintain data. Excel is terrible at that. But the fact that it may change data without the user being aware of it, is absolutely the biggest failing here.

[+] wayoutthere|5 years ago|reply

The one I’ve seen was a client who wanted to store credit card numbers in an Excel sheet (yes I know this is a bad idea, but it was 15 years ago and they were a scummy debt collection call center). Signed integers have a size limit, which a 16 digit credit card number significantly exceeds.

Now, you and I know this problem is solved by prepending ‘ to the number and it will be treated as a string, but your average Excel user has no understanding of types or why they might matter. Many engineers will also look past this when generating Excel reports.
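
For what it's worth, the limit usually cited for Excel is 15 significant digits in numeric cells, with digits past the fifteenth replaced by zeros, rather than an integer size limit. A minimal sketch of that effect, where `as_numeric_cell` is a hypothetical stand-in for the spreadsheet's behavior and the 16-digit number is made up:

```python
def as_numeric_cell(digits, precision=15):
    """Mimic a numeric cell that keeps only `precision` significant digits.

    Hypothetical helper for illustration: digits past the limit become zeros.
    """
    extra = len(digits) - precision
    if extra <= 0:
        return digits
    return digits[:precision] + "0" * extra


card = "1234567890123456"     # made-up 16-digit number
print(as_numeric_cell(card))  # 1234567890123450
print(card)                   # kept as text, every digit survives
```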

[+] jgalt212|5 years ago|reply

and cusips, which are strings, get converted to scientific notation.

https://social.msdn.microsoft.com/Forums/vstudio/en-US/92e0a...

[+] nullsense|5 years ago|reply

Easiest solution is just to rename Norway.

[+] imtringued|5 years ago|reply

So basically they renamed a gene because they had employees who were too stupid to use excel?

[+] qwertox|5 years ago|reply

Regarding Excel: It also happens with Somalia, which makes this issue even stranger. Apparently because of "SOM".

[+] commandlinefan|5 years ago|reply

There’s a really simple solution to this problem, which has been around since the 70’s: schemas.

[+] afturkrull|5 years ago|reply

> they had to rename a gene to stop excel auto-completing it into a date.

No one in their right mind uses a spreadsheet for data analysis. Good for working out your ideas but not in a production environment. I figure excel was chosen as this is the utility the scientists were most familiar with.

The proper tool for the job would be a database. I recall reading about a utility, a highly customized database with an interface that looks just like a spreadsheet.

[+] helsinkiandrew|5 years ago|reply

> they had to rename a gene to stop excel auto-completing

I can just about understand that "No" might cause a problem, but "Membrane Associated Ring-CH-Type Finger 1" being converted to MAR-1 defeats me.

[+] atombender|5 years ago|reply

The world desperately needs a replacement for YAML.

TOML is fine for configuration, but not an adequate solution for representing arbitrary data.

JSON is a fine data exchange format, but is not particularly human-friendly, and is especially poor for editable content: Lacks comments, multi-line strings, is far too strict about unimportant syntax, etc.

Jsonnet (a derivative of Google's internal configuration language) is very good, but has failed to reach widespread adoption.

Cue is a newer Jsonnet-inspired language that ticks a lot of boxes for me (strict, schema support, human-readable, compact), but has not seen wide adoption.

Protobuf has a JSON-like text format that's friendlier, but I don't think it's widely adopted, and as I recall, it inherits a lot of Protobufisms.

Dhall is interesting, but a bit too complex to replace YAML.

Starlark is a neat language, but has the same problem as Dhall. It's essentially a stripped-down Python.

Amazon Ion [1] is neat, but I've not seen any adoption outside of AWS.

NestedText [2] looks promising, but it's just a Python library.

StrictYAML [3] is a nice attempt at cleaning up YAML. But we need a new language with wide adoption across many popular languages, and this is Python only.

Any others?

[1] https://amzn.github.io/ion-docs/

[2] https://nestedtext.org/

[3] https://github.com/crdoconnor/strictyaml/

[+] diggan|5 years ago|reply

Seems you're missing my personal favorite, extensible data notation - EDN (https://github.com/edn-format/edn). Probably I'm a bit biased coming from Clojure as it's widely used there but haven't really found a format that comes close to EDN when it comes to succinctness and features.

Some of the neat features: Custom literals / tagged elements that can have their support added for them at runtime/compile time (dates can be represented, parsed and turned into proper dates in your language). Also being able to namespace data inside of it makes things a bit easier to manage without having to resort to nesting or other hacks. Very human friendly, plus machine friendly.

Biggest drawback so far seems to be performance of parsing, although I'm not sure if that's actually about the format itself, or about the small adoption of the format and therefore not many parsers focusing on speed have been written.

[+] rubyn00bie|5 years ago|reply

Your list is like a graveyard of my dreams and hopes. Anything that doesn't validate the format of the underlying data is pretty much dead to me...

The problem with most of these is they're useless to describe the data. Honestly, it is completely not useful to have the following to describe data:

email => string

name => string

dob => string

IMHO, it is akin to having a dictionary (like Oxford English) read like:

email - noun

name - noun

birthday - noun

It says next to nothing except, yes, they are nouns. All too often I waste time fighting nils and bullshit in fields or duplicating validation logic all over the place.

"Oh wow, this field... is a string..? That's great... smiles gently except... THERE SHOULD NOT BE EMOJI IN MY FUCKING UUID, SCHEMA-CHUD. GET THE FUCK OFF MY LAWN!"

[+] djedr|5 years ago|reply

Still early, but here's my baby I hope can improve things:

website with grammar spec: https://tree-annotation.org/

prototype of a JSON/YAML alternative for JS: https://github.com/tree-annotation/tao-data-js

same thing, even less finished for C#: https://github.com/tree-annotation/tao-data-csharp

working on it constantly, more to come soon

[+] fmakunbound|5 years ago|reply

XML and XML Schema solved this more than 20 years ago. It had to be replaced with JSON by the web developers though, so they could just “eval() it” to get their data.

[+] dragonwriter|5 years ago|reply

> The world desperately needs a replacement for YAML.

The world desperately needs support for YAML 1.2, which solves the problems the article addresses fairly completely (largely in the “default” Core schema[0], but more completely with the support for schemas in general), plus a bunch of others, and has for more than a decade. But YAML 1.2 libraries aren’t available for most languages.

[0] not actually an official default, but reflects a cleanup of the YAML 1.1 behavior without optional types, so it's defaultish. Back when it looked like YAML 1.3 might happen in some reasonably-near future, it was actually indicated by team members that the JSON Schema for YAML (not to be confused with the JSON Schema spec) would be the explicit default YAML Schema in 1.3, which has a lot to recommend it.

[+] svnpenn|5 years ago|reply

You seem pretty quick to disregard TOML. I switched all my JSON and YAML for TOML. Do you care to detail what is missing?

[+] azernik|5 years ago|reply

YAML had a worse example, once.

For the ease of entering time units YAML 1.1 parsed any set of two digits, separated by colons, as a number in sexagesimal (base 60). So 1:11:00 would parse to the integer 4260, as in 1 hour and 11 minutes equals 4260 seconds.

Now try plugging MAC addresses into that parser.

The most annoying part is that the MAC addresses would only be mis-parsed if there were no hex digits in the string. Like the bug in this post, it could only be reproduced with specific values.

Generally, if you're doing implicit typing, you need to keep the number of cases as low as possible, and preferably error out in case of ambiguity.
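
A minimal sketch of that implicit conversion, assuming a simplified version of the YAML 1.1 base-60 integer pattern (sign and underscores left out). The function name `implicit_value` is made up for illustration and is not any library's API:

```python
import re

# Simplified base-60 integer pattern: a leading decimal group, then one or
# more colon-separated groups in the range 0-59.
SEXAGESIMAL = re.compile(r"[1-9][0-9]*(?::[0-5]?[0-9])+")


def implicit_value(scalar):
    """Return an int for base-60 lookalikes, otherwise the original string."""
    if SEXAGESIMAL.fullmatch(scalar) is None:
        return scalar
    total = 0
    for group in scalar.split(":"):
        total = total * 60 + int(group)
    return total


print(implicit_value("1:11:00"))            # 4260
print(implicit_value("52:54:00:12:34:56"))  # a large integer, not a MAC address
print(implicit_value("52:54:00:ab:cd:ef"))  # unchanged: hex digits break the match
```

Only the all-decimal address is converted, which is why the bug shows up for some values and not others.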

[+] adwn|5 years ago|reply

> For the ease of entering time units YAML 1.1 parsed any set of two digits, separated by colons, as a number in sexagesimal (base 60).

This is a mind-boggling level of idiocy. Even leaving aside the MAC address problem, this conversion treats "11:15" (= 675) different from "11:15:00" (= 40500), even though those denote the same time, while treating "00:15:00" (15 minutes past midnight) and "15:00" (3 in the afternoon) the same.

[+] dragonwriter|5 years ago|reply

> YAML had a worse example, once.

It had it literally at the same time as it had the problem in the article (the article refers to YAML 2.0, a nonexistent spec, and to PyYAML, a real parser which supports only YAML 1.1.)

Both the unquoted-YES/NO-as-boolean and sexagesimal literals were removed in YAML 1.2. (As was the 0-prefixed-number-as-octal mentioned in a sibling comment.)

[+] whytookay|5 years ago|reply

One that really surprised/confused me was that pyaml (and the yaml spec) attempts to interpret any 0-prefixed string into an octal number.

There was a list of AWS Account IDs that parsed just fine until someone added one that started with a 0 and had no numbers greater than 7 in it, after which our parser started spitting out decidedly different values than we were expecting. Fixing it was easy, but figuring out what in the heck was going on took some digging.
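
A small sketch of the rule as described here, assuming a leading `0` followed only by the digits 0-7 is read as base 8. `resolve_scalar` is a hypothetical name, and real parsers handle more cases than this:

```python
import re

# A leading 0 followed only by digits 0-7 is treated as an octal literal.
OCTAL = re.compile(r"0[0-7]+")


def resolve_scalar(scalar):
    """Return an int for octal lookalikes, otherwise the original string."""
    if OCTAL.fullmatch(scalar):
        return int(scalar, 8)
    return scalar


print(resolve_scalar("0777"))        # 511
print(repr(resolve_scalar("0897")))  # '0897' (an 8 or 9 prevents the match)
```

Quoting the IDs in the source file, so they are explicitly strings, sidesteps the guess entirely.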

[+] lainga|5 years ago|reply

We had a Grafana dashboard where one of the columns was a short Git hash. One day, a commit got the hash `89e2520`, which Grafana's frontend helpfully decided to display as "+infinity". Presumably it was parsing 89E+2520.
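
Python's `float()` happens to misread this input in the same way, so the effect is easy to reproduce outside Grafana. A minimal sketch, with `guess_number` as a made-up helper that mimics a "parse as a number if possible" display rule:

```python
import math


def guess_number(text):
    """Return a float if the text parses as one, otherwise the text itself."""
    try:
        return float(text)
    except ValueError:
        return text


print(guess_number("89e2520"))  # inf (read as 89 x 10^2520, which overflows)
print(guess_number("89e252a"))  # 89e252a (not a valid number, left alone)
```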

[+] m463|5 years ago|reply

slightly related, on my microwave 99 > 100, even 61 > 100

[+] kstenerud|5 years ago|reply

The worst tragedy of this is the security implications of subtly different parsers. As your application surface increases, you're likely to mix languages (and thus different parsers), which means that the same input data will produce different output data depending on whether your parser replaces, truncates, ignores, or otherwise attempts to automatically "fix up" the data. A carefully crafted document could exploit this to trick your data storage layer into storing truncated data that elevates privileges or sets zero cost, while your access control layer that ignores or replaces the data is perfectly happy to let the bad document pass by.
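
One concrete instance of such a differential is duplicate keys in JSON. A minimal sketch using Python's standard `json` module: the default decoder silently keeps the last value, while a strict `object_pairs_hook` (the `reject_duplicates` function here is illustrative) refuses the document. The `price` field is a made-up example:

```python
import json


def reject_duplicates(pairs):
    """object_pairs_hook that refuses objects with repeated keys."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


document = '{"price": 100, "price": 0}'

# The default decoder silently keeps the last value...
print(json.loads(document))  # {'price': 0}

# ...while a stricter decoder can refuse the ambiguous document outright.
try:
    json.loads(document, object_pairs_hook=reject_duplicates)
except ValueError as error:
    print(error)  # duplicate key: 'price'
```

A second parser that keeps the first value instead would see a price of 100 for the same bytes, which is exactly the kind of disagreement described above.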

And here's something else to keep you up at night: Just think of how many unintentional land mines lurk in your serialized data, waiting to blow up spectacularly (or even worse, silently) as soon as you attempt to change implementation technologies!

This is why I've been so anal about consistent decoder behavior in Concise Encoding https://github.com/kstenerud/concise-encoding/blob/master/ce...

https://concise-encoding.org/

[+] yellowapple|5 years ago|reply

This is exactly why configuration/serialization formats should make as few assumptions about value types as possible. Once parsing's done, everything should be a string (or possibly a symbol/atom, if the program ingesting such a file supports those), and it should be up to the application to convert values to the types it expects. This is Tcl's approach, and it's about as sensible as it gets.
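
A minimal sketch of that split, assuming a toy `key: value` format rather than any real config language. `parse_pairs`, `to_bool`, and `CONVERTERS` are hypothetical names: the parser returns only strings, and the application states which type each field should have.

```python
def parse_pairs(text):
    """Parse 'key: value' lines into a dict whose values are all strings."""
    pairs = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        pairs[key.strip()] = value.strip()
    return pairs


def to_bool(raw):
    """Accept only the two spellings this application chooses to allow."""
    if raw not in ("true", "false"):
        raise ValueError(f"not a boolean: {raw!r}")
    return raw == "true"


# The application, not the parser, decides the type of each field.
CONVERTERS = {"country": str, "port": int, "debug": to_bool}

raw = parse_pairs("country: NO\nport: 8080\ndebug: false")
config = {key: CONVERTERS[key](value) for key, value in raw.items()}
print(config)  # {'country': 'NO', 'port': 8080, 'debug': False}
```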

...which is why it pains me to admit that in my own project for a Tcl-like scripting/config language[1] I missed the float v. string issue, so it'll currently "cleverly" return different types for 1.2 (float) v. 1.2.3 (atom). Coincidentally, I started work on a "stringy" alternative interpreter that hews closer to Tcl's philosophy (to fix a separate issue - namely, to avoid dynamically generating atoms, and therefore avoid crashing the Erlang VM when given potentially-adversarial input), so I'm gonna fix that case for at least the "stringy" mode (by emitting strings instead of numbers, too), knocking out two birds with one stone for the upcoming 0.3.0 release :)

----

[1]: https://otpcl.github.io, for those curious

[+] WalterBright|5 years ago|reply

> The most tragic aspect of this bug, however, is that it is intended behavior according to the YAML 2.0 specification.

This is one of those great ideas that sadly one needs experience to realize are really bad ideas. Every new generation of programmers has to relearn it.

Other bad ideas that resurface constantly:

1. implicit declaration of variables

2. don't really need a ; as a statement terminator

3. assert should not abort because one can recover from assert failures

[+] mcv|5 years ago|reply

If it ignores part of the spec, I don't think "strictyaml" is the correct name here. Instead, if it interprets everything as string, perhaps "stringyaml" would have been more accurate, though I'm sure that's not as good PR.

I'm reminded of the discussion we had a few days ago about environment variables; one problem there is that env variables are always strings, and sometimes you do want different types in your config. But clearly having the system automatically interpret whether it's a string or something else is a major source of bugs. Maybe having an explicit definition of which field should be which type would help, but then you end up with the heavy-handed XML with its XSD schema.

Or you just use JSON, which is light-weight, easy to read, but unambiguous about its types. I guess there's a good reason it's so popular.

Maybe other systems like yaml and environment variables should only ever be used for strings, and not for anything else, and I suppose replacing regular yaml with 'strictyaml' could play a role there. Or cause unending confusion, because it does violate the spec.

[+] grenoire|5 years ago|reply

I was helping out a friend of mine in the risk department of a Big 4; he was parsing CSV data from a client's portfolio. Once he started parsing it, he was getting random NaNs (pandas' nan type, to be more accurate).

I couldn't get access to the original dataset but the column gave it away. Namibia's 2-letter ISO country code is NA—which happens to be in pandas' default list of NaN equivalent strings.

It was a headache and a half...
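
The mechanism can be sketched with the standard library alone. `NA_STRINGS` below is an illustrative subset of "missing value" spellings (pandas' real default list is longer) and `read_column` is a made-up helper, not pandas code:

```python
import csv
import io

# Illustrative subset of spellings treated as "missing".
NA_STRINGS = frozenset({"", "NA", "N/A", "NaN", "null"})


def read_column(text, column, na_strings=NA_STRINGS):
    """Read one CSV column, replacing sentinel strings with None."""
    rows = csv.DictReader(io.StringIO(text))
    return [None if row[column] in na_strings else row[column] for row in rows]


data = "country,exposure\nNO,10\nNA,20\nZA,30\n"
print(read_column(data, "country"))                 # ['NO', None, 'ZA']
print(read_column(data, "country", na_strings=()))  # ['NO', 'NA', 'ZA']
```

In pandas itself, `read_csv` takes `keep_default_na` and `na_values` parameters, which can be used to switch off or narrow the default list for columns like this one.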

[+] sdfhbdf|5 years ago|reply

What I am most baffled by with Yaml is the fact that it’s a superset of JSON.

Whenever an input accepts YAML, you can actually pass in JSON there and it’ll be valid.

It really surprised me when I found out, and I’ve used JSON whenever possible since then because it’s much stricter.

https://en.m.wikipedia.org/wiki/JSON#YAML

[+] yakshaving_jgt|5 years ago|reply

> it’s equally true that extremely strict type systems require a lot more upfront and the law of diminishing returns applies to type strictness - a cogent answer to the question “why is so little software written in haskell?“

I was with the article up until that point. I don't agree that diminishing returns with regards to type strictness applies linearly. Term-level Haskell is not massively harder than writing most equivalent code in JavaScript — in fact I'd say it's easier and you reap greater benefit. Perhaps it's a different story when you go all-in on type-level programming, but I'm not sure that's what the author was getting at. This smells of the Middle Ground logical fallacy to me. Or of course the comment was tongue-in-cheek and I'm overreacting.

[+] 7952|5 years ago|reply

I had to rewrite some JavaScript code in Postgres recently that measured the overlap between different elevation ranges. In JS I had to write it myself and deal with the edge cases and bugs. In Postgres I just use the range type and some operators. It was brilliant in comparison. The tiny effort of learning it was worth it. The list of data types I use all the time is bigger than just string, numbers and booleans. Serialisation formats should support them. Particularly as there are often text format standards that already exist for a lot of them. Give me wkt geometry and iso formatted dates. It's not that difficult and totally worth it.

[+] choeger|5 years ago|reply

That law of diminishing returns might actually apply, I am not 100% sure. But more powerful type systems allow for the more complex composition of more complex interfaces in a safe manner. Think of higher-level modules and data structures. Or dependent types and input handling. Or linear types and resource handling.

[+] samvher|5 years ago|reply

I agree. I would say that Erlang goes ~80% of the way compared to Haskell's type system and the last 20% really matter, to the point that in many cases I find myself not really using Erlang's (optional) type system at all. Better type coverage and more descriptive types allow the compiler to infer more and I'd say this is the opposite of diminishing returns.

[+] abujazar|5 years ago|reply

Norwegian here. I’d say the problem is YAML, not Norway :D

[+] jasode|5 years ago|reply

That author's blog post sent me down a rabbit hole of insanity with YAML and the PyYAML parser idiosyncrasies.

First, he mentions "YAML 2.0" but there's no such reference about "2.0" from yaml.org or Google/Bing searches. Yaml.org and Wikipedia say YAML is at 1.2. As the other commenters in this thread have clarified, the older "YAML 1.1" is what the author is referring to.

Ok, if we look at the official YAML 1.1 spec[1], it has this excerpt for implicit bool conversions:

   y|Y|yes|Yes|YES|n|N|no|No|NO
  |true|True|TRUE|false|False|FALSE
  |on|On|ON|off|Off|OFF

But the pyyaml code excerpts[2][3] from resolver.py has this:

  u'tag:yaml.org,2002:bool',
  re.compile(ur'''^(?:yes|Yes|YES|n|N|no|No|NO
              |true|True|TRUE|false|False|FALSE
              |on|On|ON|off|Off|OFF)$''', re.X),

The programmer omitted the single character options of 'y' and 'Y' but it still has 'n' and 'N' ?!? The lack of symmetry makes the parser inconsistent.

And btw for trivia... PyYAML also converts strings with leading zeros to numbers like MS Excel: https://stackoverflow.com/questions/54820256/how-to-read-loa...

[1] https://yaml.org/type/bool.html

[2] 2020 latest: https://github.com/yaml/pyyaml/blob/ee37f4653c08fc07aecff69c...

[3] 2006 original : https://github.com/yaml/pyyaml/blob/4c570faa8bc4608609f0e531...

[+] ancarda|5 years ago|reply

You can catch this with yamllint (https://github.com/adrienverge/yamllint):

    % cat countries.yml 
    ---
    countries:
      - US
      - GB
      - NO
      - FR

    % yamllint countries.yml 
    countries.yml
      5:4       warning  truthy value should be one of [false, true]  (truthy)
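
To see why that one entry gets flagged, here is a minimal sketch that checks the same list against the implicit boolean spellings listed for YAML 1.1. This is only a regular-expression check, not a YAML parser, and `YAML11_BOOL` is a name made up for the example:

```python
import re

# Implicit boolean spellings from the YAML 1.1 bool type description.
YAML11_BOOL = re.compile(
    r"y|Y|yes|Yes|YES|n|N|no|No|NO"
    r"|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF"
)

countries = ["US", "GB", "NO", "FR"]
flagged = [code for code in countries if YAML11_BOOL.fullmatch(code)]
print(flagged)  # ['NO']
```

Quoting the entry (`- "NO"`) makes it an explicit string, so the guess never happens.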

[+] RcouF1uZ4gsC|5 years ago|reply

YAML seems like a really neat idea, but over time, I have come to regard it as being too complicated for me to use for configuration.

My personal favorite is TOML, but I would even prefer plain JSON over YAML

The last thing I want at 2 AM when trying to figure out if an outage is due to a configuration change is having to think if each line of my configuration is doing the thing I want.

YAML prizes making data look nicely formatted over simplicity or precision. That, for me, is not a tradeoff I am willing to make.

[+] ravanave|5 years ago|reply

Btw, the reason Haskell isn’t used more isn’t the type system per se, as all types can be inferred at compile time. People sometimes even use this feature to see if GHCi guesses the type correctly the first time (by correctly I mean exactly how the user wants; technically it’s always correct), which saves them some time writing it out, either with an extension or by copy-pasting from the interpreter window.

Where it gets hairy is that most programming languages have a low barrier to entry. To write Haskell effectively you’ve got to unlearn a lot of deep-rooted bad habits and dive into the “mathematical” aspect of the language. Not only have you got monads, but there’s a plethora of other types you need to get comfortable with, plus a whole branch of mathematics dealing with types (though you don’t even need to know that a field like category theory exists to use it).

However, since most people just want to write X, or just want to hire a dev team at a price they can afford, Haskell is rarely the first-choice language.

[+] lolinder|5 years ago|reply

This comment was buried in a thread, but I'm bringing it out because it's very relevant to the conversation:

https://news.ycombinator.com/item?id=26679728

> the article refers to YAML 2.0, a nonexistent spec, and to PyYAML, a real parser which supports only YAML 1.1.

> Both the unquoted-YES/NO-as-boolean and sexagesimal literals were removed in YAML 1.2.

[+] paxys|5 years ago|reply

I will never understand why YAML didn't just require quoted strings. Did the creator not anticipate how many problems the ambiguity would cause?

[+] blunte|5 years ago|reply

If you want no misunderstandings, be explicit. This applies to YAML and life in general. There's an annoying but fairly accurate saying about assumptions that applies.

If you want something to be a specific type, you better have an explicit way of indicating that. If you say quotes will always indicate a string, great. Of course we know it's not that simple, since there are character sets to consider.

The safest answer is to do something like XML with DTDs. But that imposes a LOT of overhead. Naturally we hate that, so we make some "convention over configuration" choices. But eventually, we hit a point where the invisible magic bites us.

This is one case where tests would catch the problem, if those tests are thorough enough - explicitly testing every possibility or better yet, generative testing.

[+] dpratt71|5 years ago|reply

I don't understand why Haskell gets brought up in the middle of an otherwise interesting and useful article. This sort of thing cannot happen in Haskell. And while Haskell is not universally admired, I can't recall seeing Haskell's flavor of type inference being a reason why someone claimed to dislike Haskell.

[+] Waterluvian|5 years ago|reply

I have never gotten far into a project and thought, "my config files are too verbose. I wish there were clever shorthands."

Does Yaml have any sort of strict mode?

I imagine I could find a linter that disallows implicit strings.

[+] exyi|5 years ago|reply

Not YAML by itself, but there are libraries that parse a YAML-like format that is typed. For example this one: https://hitchdev.com/strictyaml/. Technically, it is not compatible with the YAML spec.

325 comments