Slightly off topic. The article doesn't call it out but there's a lovely assembly hack here. In:
bec 1f / branch if no error
jsr r5,error / error in file name
<Input not found\n\0>; .even
sys exit
jsr calls a subroutine, passing the return address in register 5. The routine error interprets the return address as a pointer to the string.
r5 is incremented in a loop, outputting one character at a time. When the null is found, it's time to return.
The instructions used to return from "error:" aren't shown, but I think there is a subtlety here.
".even" after the string constant ensures that the next instruction, "sys exit", to which "error:" is supposed to return, is aligned on an even address.
By implication, the return sequence in "error:" need only round r5 up to the next even address if it ends up odd.
After skimming through this, I navigated around Chris Siebenmann's site with the forward and back links, discovering something way more interesting than Unix strings, and refreshingly relevant: "How I do per-address blocklists with Exim".
I run Exim, I'm a huge believer in blocking spam at the SMTP level, and I also do some things globally that should perhaps be per-user. I'm eagerly interested in everything this fellow has to say.
I've followed his site for a while, and he seems like a thoughtful person who likes to document his reasoning. I'm not a sysadmin, so a lot of what he writes doesn't apply to me, but he's well worth reading for the stuff that does. For example, he's pretty astute about the "how" and "why" of spam.
This is the origination of null-terminated strings in C:
"In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled ‘*e’. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator."
From https://www.bell-labs.com/usr/dmr/www/chist.pdf
I have seen it claimed that null-terminated strings were encouraged by the instruction sets of the time -- that some instruction sets make null-terminated sequences easier to handle than length-prefixed ones. The article's error-message-printing code snippet is a good example. Does anyone think there is any truth to this?
Null-terminated is going to be nice in most instruction sets: you don't need to keep track of a count, so you save a register and have one less thing to increment (or decrement). The loop condition is basically free, too; loading the next byte into a register sets the status register, so you don't need a compare, you can just branch if the zero flag is set.
As long as you are handling good data, it's clearly more efficient.
The .ASCIZ directive in the PDP-11 assembler took a string literal and blitted it to memory with a trailing NUL, so there was a convenience factor there. B used null-terminated strings (BCPL's were counted), so it made sense for C to inherit the practice.
Pascal strings were significantly size-limited due to memory constraints (an implementation detail), so there was a reason to prefer null termination, but in essence we're still slaves to an obsolete instruction set.
Yes, the architecture on which you're most likely reading this (x86 or x64) also has semi-dedicated string handling instructions, akin to higher level string functions.
For example, a loop that uses them to copy a 0-terminated string from one location to another looks like this:
lea esi, source_string
lea edi, dest_string
copy: lodsb ; al = [esi++]
stosb ; [edi++] = al
test al, al
jnz copy ; loop until the NUL itself has been copied
Maybe that's not the whole picture, but I think a big reason for null-terminated strings is that C has pointers. A char pointer makes it practical to move through a string, and a NUL terminator is easier to handle than carrying a size around. In the end, sized strings would make C code more complicated, in a way they don't in other languages such as Pascal.
I always felt like NUL-termination, newline-separation, and (eventually) UTF-8 were all sort of complementary ideas: they all take as an axiom that strings are fundamentally streams, not random-access buffers; and they all separate the space of single-byte coding units, by simple one-bitwise-instruction-differentiable means, into multiple lexical token types.
Taking all three together, you end up with the conception of a "string" as a serialized bitstring encoding a sequence of five lexical types: a NUL type (like the EOF "character" in a STL stream), an ASCII control-code type (or a set of individual control codes as types, if you like), a set of UTF-8 "beginning of rune" types for each possible rune length, a "byte continuing rune" type, and an ASCII-printable type. (You then feed this stream into another lexer to put the rune-segment-tokens together into rune-tokens.)
In the end, it's not a surprise that all of these components were effectively from a single coherent design, thought up by Ken Thompson. It's a bit annoying that each part ended up introduced as part of a separate project, though: NULs with Unix, gets() with C, and runes with Plan 9.
One of the pleasant things about Go's string support, I think, is that it was an opportunity for Ken to express the entirety of his string semantics as a single ADT. That part of the compiler is quite lovely.
The predecessor of Unix, Multics, was written in PL/I and was very innovative (modern OSes still borrow its "new ideas"): https://en.wikipedia.org/wiki/Multics
[+] [-] Thomas_Lord|10 years ago|reply
By implication, the return sequence in "error:" need only round r5 up to an even address if it is odd. I am guessing something like the pseudo-code:
inc r5
and r5, fffe
ret r5
[+] [-] ksherlock|10 years ago|reply
http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/sh.s
[+] [-] kazinator|10 years ago|reply
"How I do per-address blocklists with Exim"
https://utcc.utoronto.ca/~cks/space/blog/sysadmin/EximPerUse...
[+] [-] noselasd|10 years ago|reply
From https://www.bell-labs.com/usr/dmr/www/chist.pdf
[+] [-] jamesbowman|10 years ago|reply
The PDP-11 features that make this possible are (1) post-increment addressing and (2) that MOV instructions set the condition codes.
[+] [-] emmelaich|10 years ago|reply
You have two choices: counted or terminated.
Counted places a complexity burden at the lowest level of coding.
With terminated, you still have the option of implementing strings with structs, or arrays with counts, or anything else.
And people did, of course. Many different implementations of safe strings exist in C; the fact that none has won out vindicates the decision to use sentinel termination.
[+] [-] bitwize|10 years ago|reply
If only Dennis had had the foresight to nip that one in the bud...