Why numbering should start at zero (1982)

[+] TOGoS|10 years ago|reply

You use zero and one-based counting for different things. Mixing them up is where fencepost errors come from.

Say you're counting the length of a fence, so you are numbering the posts. Of course it's preferable to number the posts starting at zero, because after having numbered N posts, you know the length of the fence is N.

On the other hand, if you are counting the posts themselves, you would want to say "one!" after the first post, as you have just counted one post. In this case, the number you assign the post is the number of posts counted after encountering it. Otherwise you end up having to +1 at the end.

Which is all to say: I've always felt that the disagreement between whether to start at one or zero is due to people not making clear what it is they're counting.

If you're numbering the points between spans (as you tend to be when giving names to dates, memory locations, and fence posts), always start at zero. If you're giving numbers to the spans themselves, it's often convenient to use a physically-based number such as the location of the center of the span (when writing graphics applications I will often refer to the pixel in the upper-left corner as 0.5,0.5, rounding down to get its address).

[+] Stratoscope|10 years ago|reply

Those are great insights, thanks for laying it out like that.

This is also why rulers start with 0 (even if you can't see the 0) - because a ruler isn't about counting the marks, it's about measuring the space between the marks.

There's an excellent discussion and illustration of this in the original Inside Macintosh - a few us us were kicking it around a couple of years ago here:

https://news.ycombinator.com/item?id=6601515

[+] qnaal|10 years ago|reply

I would argue for one-based counting [a] and zero-based indexing.

Illustrated example of an array- indexes along the top (where the data can be found), and counting along the bottom (how many values have been read):

  0   1   2   3
  | A | B | C | D |
      ^   ^   ^   ^
      "1" "2" "3" "4"

[a] https://en.wikipedia.org/wiki/Ordinal_number_%28linguistics%...

[+] kmill|10 years ago|reply

You made me realize that we use 0-indexing because we confused the torsor[1] for the group. What I mean is that we may take differences between labels (numbers on the posts) to see how many posts to the left or the right a post is relative to another, and then we do this all relative to the first post. It is convenient in, say, C to do 0-based indexing because then you can add the index (the difference) to the pointer to the first element to move over that many posts, but that is different from the concept of which ordinal the post is labeled with.

[1] Quickly stated, a torsor over a group is a collection of things which support taking differences, and the differences are elements of the group. One example is pointers vs integers, since the difference between two pointers (to things in the same array in C) is an integer, and the sum of a pointer and an integer is another pointer. Another example is energy vs work, or Euclidean space vs vectors.

[+] johnlbevan2|10 years ago|reply

Regarding the initial argument, I'd opt for C or D; since here we're being consistent with our signs and therefore the statement's more readable. To me this is more apparent than the number of elements being the upper bound minus the lower, or natural numbers shrinking to unnatural (which I confess I don't understand given here i is by definition an integer and thus natural). Of these two I'd likely opt for C since this lends itself to natural language (e.g. if I ask someone to choose a number between 1 and 10 I mean the choice is inclusive of 1 and 10), so again is likely more readable.

Also "pernicious: having a harmful effect, especially in a gradual or subtle way". I don't understand this statement either. A language not mentioned, PowerShell, uses two dots to great effect (ellipses vs two dots is another debate; but I can't see how one would be considered harmful and the other not). `1..10 | %{"we're at number {0}" -f $_}`; simple, clear and obvious.

The whole article reads as if someone's trying to show off their intellect, rather than present an honest analysis of the benefits of the merits of the different indexing/numbering systems. @TOGoS's comment on this thread, on the other hand gives a compelling and well reasoned argument of when to use each system.

[+] mayoff|10 years ago|reply

I've always been charmed by EWD's handwriting: http://www.cs.utexas.edu/users/EWD/ewd08xx/EWD831.PDF

[+] jonahx|10 years ago|reply

That literally looks like it was typed in a hand drawn font -- remarkable.

[+] fdej|10 years ago|reply

One big advantage of zero-based indexing is that it's consistent with division with remainder. This is especially nice when working with multidimensional arrays.

[+] TheLoneWolfling|10 years ago|reply

I'd argue that in the presence of one-based indexing, remainders being defined as 1..n, inclusive, makes more sense anyways.

[+] domador|10 years ago|reply

I've long been a zero-based-index-loving geek, but I'm now coming around to preferring one-based indices in programming. Hopefully in the near future I'll write a blog post detailing why.

Regarding the article itself... sorry, Dijkstra, but I think conventions C and D are more beautiful, as I prefer discrete ranges where the signs on both ends are the same. Convention C is my favorite, as it only makes mention of members of the range, not numbers that lie outside of it.

[+] eblume|10 years ago|reply

I have an example of a similar journey I followed.

The Lua language has a 1-based indexing scheme. Furthermore, the only thing that differentiates a hash-map from an ordered array is the existence of only keys in the range [1, ∞) (and they must be entirely contiguous starting from 1 until the last present key).

The interface for arrays and hash-map data types is uniform, called a 'table', which has your standard `mapper[key] = value` syntax (and also, equivalently, `mapper.key = value` for well-formatted string-literal keys).

This results in the very surprising and confounding fact that if a naive programmer doesn't realize that Lua has this system, they will very likely end up starting from 0 and build an unordered hash map rather than an ordered array - complete with all the resulting performance penalties and unordered traversals.

For a long time this struck me as being absolutely absurd and very damning. After working with Lua for a while, though, there began to be a certain elegance to it that I could admire. There is a SO post that goes in to some of the benefits[0][1].

I still much prefer 0-based indexing though.

0: http://stackoverflow.com/questions/2785704/why-do-lua-arrays...

1: Irony intended.

[+] learnstats2|10 years ago|reply

Reading Dijkstra's own argument convinced me that 1-indexing is better. Convention C (consistently non-strict) is the one that we normally use to describe ranges in English.

[+] aaronchall|10 years ago|reply

"When dealing with a sequence of length N, the elements of which we wish to distinguish by subscript, the next vexing question is what subscript value to assign to its starting element. Adhering to convention a) yields, when starting with subscript 1, the subscript range 1 ≤ i < N+1; starting with 0, however, gives the nicer range 0 ≤ i < N. So let us let our ordinals start at zero: an element's ordinal (subscript) equals the number of elements preceding it in the sequence. And the moral of the story is that we had better regard —after all those centuries!— zero as a most natural number."

[+] mjevans|10 years ago|reply

To put this another way; as mentioned by the fence post example another poster above mentioned.

If you label the fence posts with numbers, from 1 to N, the value at a given address along the length of the fence is the distance from the start of the fence.

In this case 0 posts from the start is the value 1. What makes this confusing is the arbitrary choice of the value of the object at location 0. Dijkstra does a reasonable job of explaining why logically beginning numbering at 0 makes sense for the index/address, but does not reasonably explain that the fault in the logic of those whom would count an item is failing to use that logically derived index/address as the identity of the object.

That is, the numeric id of the object at address 0 should also be 0, not 1, as many assume it should be based on the conventions that were instilled in basic education.

[+] jimhefferon|10 years ago|reply

When he's right, he's right.

Hard to convince my students of it, though (they are Americans; is it different elsewhere?).

[+] stared|10 years ago|reply

..and why shouldn't:

if (someStr.indexOf("John") !== -1) { ... }

vs

if (someStr.indexOf("John")) { ... }

Or when _teaching_ it's kind of funny saying "So, we take the first element, that is the zeroth, then the third (I mean the real third, not the one with index three)...".

OK, but to be entirely honest, I both sides have pros and cons. And as long as a language/program/person is reasonably consistent (vide https://xkcd.com/163/), there is not a big difference.

[+] icebraining|10 years ago|reply

Your example should really be (using a polyfill if necessary)

  if(someStr.includes('John')) {...}

While I don't mind it for containers, I've never liked using Boolean coercion for numbers. Zero is obviously a special case, but I don't think it maps well with the True/False dichotomy.

[+] ufo|10 years ago|reply

Option #2 assumes that 0 is falsy. Its an OK assumption for a language like C but for languages that let lots of different things go into an if condition I prefer for zero to be truthy and just false / null / etc to be falsy.

[+] DawkinsGawd|10 years ago|reply

Can someone explain this sentence to me:

"Consider now the subsequences starting at the smallest natural number: inclusion of the upper bound would then force the latter to be unnatural by the time the sequence has shrunk to the empty one."

The previous statement asserts that inclusion of the lower bound is preferred because the sequence then starts with a natural number (so for a set 1<=x<12 the sequence starts at 1 instead of 1.000000 .... 1 in 1<x<12). How is inclusion of the upper bound forcing the "latter" (what is he referring to by latter) to be unnatural by the time the sequence has shrunk to the empty one.

[+] to3m|10 years ago|reply

"The latter" here refers to the upper bound. By "unnatural" he means negative, as he is discussing the natural numbers (positive integers and zero). By convention these are the numbers used to index arrays. (So, for your examples: for the set 1<=x<12, the sequence is 1,2,3,4,5,6,7,8,9,10,11; for the set 1<x<12, the sequence is 2,3,4,5,6,7,8,9,10,11.)

As for what he's saying: suppose you decide your array indexing starts at the lowest natural number, i.e., 0. And imagine an array with 1 element. And assume you've adopted the notation a<=i<=b. This array's indexes are then 0<=i<=0.

But what if you have an array of 0 elements? What then? You can't say 0<=i<=-1, because -1 is not natural and therefore not a valid array index.

But if you adopt the convention that a<=i<b, then your 1-element array's indexes would be 0<=i<1, and your empty array would be 0<=i<0. (Or, indeed, 1<=i<1, or n<=i<n for any n you choose - the first section deals with how to denote the range, so the choice of n isn't the key thing.)

[+] nfoz|10 years ago|reply

I love one-based indexing. I prefer array[1..n] to array[0..n-1]. The reason is because with one-based indexing, I don't have names for things that are invalid. begin, end: these are valid places to be. Your "end" token can mean "the last element", and you don't have to think about "being behind the thing that represents the thing past your last element".

I acknowledge it's domain-specific, but I find there's something mentally relaxing about one-based indexing. And I grew up on 0-based.

[+] gnufx|10 years ago|reply

The Fortran (not FORTRAN) standard current when Dijkstra wrote that allowed arbitrary lower bounds in array declarations, and an array could be equivalenced to one with different bounds. The language doesn't impose the choice.

[+] vacri|10 years ago|reply

> That is ugly

This is the defining reason? Seems an awful lot like personal preference, to me.

[+] qnaal|10 years ago|reply

ugly in that "it's 'ugly' to have to use a signed int when referring to a range of unsigned ints"

that's pretty ugly

[+] conexions|10 years ago|reply

Think of defining an empty range.

    a) 1 <= x < 1
    c) 1 <= x <= 0

c seems quite a bit more confusing than a.

[+] 0xdeadbeefbabe|10 years ago|reply

Starting with one considered harmful.

[+] Gibbon1|10 years ago|reply

I always had this feeling that with one based indexing, 70s era compilers ended up subtracting one from the index. For each and every element access made. Because memory decoding circuits don't start at one, they 'start' with 0000...000. Course it would be possible to just waste the first element, but 70's era memory... Waste a 16 bit word here, and there and suddenly you're talking real money.

47 comments