The format of strings in early (pre-C) Unix

[+] Thomas_Lord|10 years ago|reply

Slightly off topic. The article doesn't call it out but there's a lovely assembly hack here. In:

bec 1f / branch if no error

   jsr r5,error / error in file name

       <Input not found\n\0>; .even

   sys exit

jsr calls a subroutine passing the return address in register 5. The routine error interprets the return address as a pointer to the string.

r5 is incremented in a loop, outputing one character at a time. When the null is found, it's time to return.

The instructions used to return from "error:" aren't shown but there is a subtlety here, I think.

".even" after the string constant assures that the next instruction, "sys exit", to which "error:" is supposed to return, is aligned on an even address.

By implication, the return sequence in "error:" just be sure to increment r5, if r5 is odd. I am guessing something like the pseudo-code:

inc r5

and r5, fffe

ret r5

[+] ksherlock|10 years ago|reply

Yep!

http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/sh.s

    error:
        ...
	inc	r5 / inc r5 to point to return
	bic	$1,r5 / make it even

[+] kazinator|10 years ago|reply

After skimming through this, I navigated around this Chris Siebelmann's site with the forward and back links, discovering something way more interesting than Unix strings and refreshingly relevant:

"How I do per-address blocklists with Exim"

https://utcc.utoronto.ca/~cks/space/blog/sysadmin/EximPerUse...

I run Exim, and I'm also a huge believer in blocking spam at the SMTP level, and also do some things globally that should perhaps be per-user. I'm eagerly interested in everything this fellow has to say.

[+] username223|10 years ago|reply

I've followed his site for awhile, and he seems like a thoughtful person who likes to document his reasoning. I'm not a sysadmin, so I don't care about a lot of what he writes, but he's worth reading for stuff I care about. For example, he's pretty astute about the "how" and "why" of spam.

[+] coupdejarnac|10 years ago|reply

I was hoping to read something juicy like null termination was created by a summer intern.

[+] noselasd|10 years ago|reply

This is the origination of null terminated strings in C:

"In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled ‘*e’. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator.

From https://www.bell-labs.com/usr/dmr/www/chist.pdf

[+] tomrod|10 years ago|reply

Nope! Only time-wasting and mind-engaging software like solitaire.

[+] holmak|10 years ago|reply

I have seen it claimed that null-terminated strings were encouraged by the instruction sets of the time -- that some instruction sets make null-terminated sequences easier to handle than length-prefixed ones. The article's error-message-printing code snippet is a good example. Does anyone think there is any truth to this?

[+] toast0|10 years ago|reply

Null terminated is going to be nice in most instruction sets, you don't need to keep track of a count, so you save a register, and you have one less thing to increment (or decrement). Loop condition is basically free too, loading the next byte into a register is going to set the status register, so you don't need a compare, you can just branch if the zero flag is set.

As long as you are handling good data, it's clearly more efficient.

[+] jamesbowman|10 years ago|reply

Yes, the PDP-11 instruction set makes null-terminated strings especially efficient. Strcpy() is 3 instructions per character, for example.

The PDP-11 features that make this possible are (1) post increment addressing and (2) MOVE instructions set the condition codes.

[+] HillRat|10 years ago|reply

The ASCIZ directive on the PDP-11 assembler took a string literal and blitted it to memory with a trailing NULL, so there was a convenience factor there. BCPL and B used null-terminated strings, so it made sense for C to inherit the practice.

Pascal strings were significantly size-limited due to memory constraints (an implementation detail), so there was a reason to prefer null termination, but in essence we're still slaves to an obsolete instruction set.

[+] SCHiM|10 years ago|reply

Yes, the architecture on which you're most likely reading this (x86 or x64) also has semi-dedicated string handling instructions, akin to higher level string functions.

For example, a string copy 'instruction' that copies a 0 terminated string from one location to another looks like this:

  lea esi, source_string
  lea edi, dest_string
  repnz movsb           ; copies esi -> edi until *esi = 0

[+] coliveira|10 years ago|reply

Maybe that's not the whole picture, but I think a big reason for null terminated strings is that C has pointers. A char pointer makes it practical to move through strings and a NULL terminator is easier to handle than a size test. In the end, sized strings make C code more complicated, unlike other languages such as Pascal.

[+] derefr|10 years ago|reply

I always felt like NUL-termination, newline-separation, and (eventually) UTF-8 were all sort of complementary ideas: they all take as an axiom that strings are fundamentally streams, not random-access buffers; and they all separate the space of single-byte coding units, by simple one-bitwise-instruction-differentiable means, into multiple lexical token types.

Taking all three together, you end up with the conception of a "string" as a serialized bitstring encoding a sequence of four lexical types: a NUL type (like the EOF "character" in a STL stream), an ASCII control-code type (or a set of individual control codes as types, if you like), a set of UTF-8 "beginning of rune" types for each possible rune length, a "byte continuing rune" type, and an ASCII-printable type. (You then feed this stream into another lexer to put the rune-segment-tokens together into rune-tokens.)

In the end, it's not a surprise that all of these components were effectively from a single coherent design, thought up by Ken Thompson. It's a bit annoying that each part ended up introduced as part of a separate project, though: NULs with Unix, gets() with C, and runes with Plan 9.

One of the pleasant things about Go's string support, I think, it that was an opportunity for Ken to express the entirety of his string semantics as a single ADT type. That part of the compiler is quite lovely.

[+] emmelaich|10 years ago|reply

How else would you implement them, seriously.

You have two choices, counted or terminated.

Counted places a complexity burden at the lowest level of coding.

With terminated you still have the option of implementing strings with structs or arrays with counts or anything.

And people did of course. Many many different implementations of safe strings exist in C; the fact that none have won out vindicates the decision to use sentinel termination.

[+] bitwize|10 years ago|reply

One of the worst programming ideas ever dates bavk even earlier than we thought.

If only Dennis had had the foresight to nip that one in the bud...

[+] marvy|10 years ago|reply

what's your better idea? (hint: this has to work in assembly language.)

[+] oso2k|10 years ago|reply

ken had written most of the original asm kernel.

[+] castell|10 years ago|reply

The predecessor of Unix, Multics was written in PL/1 and was very innovative (modern OS still borrow "new ideas"): https://en.wikipedia.org/wiki/Multics

[+] unknown|10 years ago|reply

[deleted]

[+] jamesfmilne|10 years ago|reply

That was anti-climatic.

64 comments