Like NULL, confusion over EOF is a problem which can be eliminated via algebraic types.
What if instead of a char, getchar() returned an Option<char>? Then you can pattern match, something like this Rust/C mashup:
    match getchar() {
        Some(c) => putchar(c),
        None => break,
    }
Magical sentinels crammed into return values — like EOF returned by getchar() or -1 returned by ftell() or NULL returned by malloc() — are one of C's drawbacks.
What always annoyed me about C is that it has all the tools to simulate something approaching this, save for some purely syntactical last-mile shortcomings. We can already return structs; if only there were a way to neatly define a function returning an anonymous struct, and immediately destructure on the receiving end. Something like:
This is (semantically) perfectly possible today; you just have to jump through some syntactic hoops and explicitly name that return struct type (because, among other reasons, anonymous structs aren't equivalent types unless they're named, even when they are structurally equivalent). Compilers could easily do that for us! It would be such a simple extension to the standard with, imo, huge benefits.
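For illustration, here is a minimal sketch of the "named struct" workaround as it can be written in C today. The names `struct maybe_char`, `get_maybe_char` and `copy_stream` are made up for this example and are not part of any standard library.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical result type: the named struct that C makes us spell out. */
struct maybe_char {
    bool ok;              /* true if `value` holds a byte that was read */
    unsigned char value;
};

/* Hypothetical wrapper around fgetc() that keeps the EOF/error signal out of band. */
struct maybe_char get_maybe_char(FILE *stream)
{
    int c = fgetc(stream);
    if (c == EOF)
        return (struct maybe_char){ .ok = false, .value = 0 };
    return (struct maybe_char){ .ok = true, .value = (unsigned char)c };
}

/* Copies `in` to `out` one byte at a time; returns the number of bytes copied. */
long copy_stream(FILE *in, FILE *out)
{
    long n = 0;
    for (;;) {
        struct maybe_char r = get_maybe_char(in);
        if (!r.ok)
            break;
        fputc(r.value, out);
        n++;
    }
    return n;
}
```

The caller still has to remember to check `ok` before touching `value`; nothing in C enforces that, which is the part real algebraic types would add.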
Every time I have to check for in-band errors in C, or pass a pointer to a function as a "return value", I think of this and cringe.
> Magical sentinels crammed into return values — like EOF returned by getchar() or -1 returned by ftell() or NULL returned by malloc() — are one of C's drawbacks.
They're part of the C standard library. The POSIX I/O APIs don't have these problems. The Linux I/O system calls are even better because they don't have errno.
Honestly, the C standard library just isn't that good. Freestanding C is a better language precisely because it omits the library and allows the programmer to come up with something better.
This is very well explained in the classic book The UNIX Programming Environment, by Kernighan and Pike, on page 44:
Programs retrieve the data in a file by a system call ... called read. Each time read is called, it returns the next part of a file ... read also says how many bytes of the file were returned, so end of file is assumed when a read says "zero bytes are being returned" ... Actually, it makes sense not to represent end of file by a special byte value, because, as we said earlier, the meaning of the bytes depends on the interpretation of the file. But all files must end, and since all files must be accessed through read, returning zero is an interpretation-independent way to represent the end of a file without introducing a new special character.
Read what follows in the book if you want to understand Ctrl-D down cold.
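As a rough sketch of what the quoted passage describes, the loop below treats a zero return from POSIX `read()` as end of file. `count_bytes` is a made-up helper name, and for brevity it does not retry when a read is interrupted by a signal.

```c
#include <unistd.h>

/* Counts the bytes remaining on `fd`. Returns the count, or -1 on a read error. */
long count_bytes(int fd)
{
    char buf[4096];
    long total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n == 0)        /* zero bytes returned: end of file */
            return total;
        if (n < 0)         /* error, reported separately from the data */
            return -1;
        total += n;
    }
}
```

No byte value in `buf` is ever inspected to find the end; only the count matters.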
In the beginning, there was the int. In K&R C, before function prototypes, all functions returned "int". ("float" and "double" were kludged in, without checking, at some point.)
So the character I/O functions returned a 16-bit signed int. There was no way to return a byte, or a "char". That allowed room for out of band signals such as EOF.
It's an artifact of that era. Along with "BREAK", which isn't a character either.
Seems like the confusion arises because getchar() (or its equivalent in languages other than C) can produce an out-of-band result, EOF, which is not a character.
Procedural programmers don't generally have a problem with this -- getchar() returns an int, after all, so of course it can return non-characters, and did you know that IEEE-754 floating point can represent a "negative zero" that you can use for an error code in functions that return float or double?
Functional programmers worry about this much more, and I got a bit of an education a couple of years ago when I dabbled in Haskell, where I engaged with the issue of what to do when a nominally-pure function gets an error.
I'm not sure I really got it, but I started thinking a lot more clearly about some programming concepts.
The amusing thing about it is that C does not guarantee that EOF is out-of-band!
ISO C says that char must be at least 8 bits, and that int must be at least 16. It is entirely legal to have an implementation that has 16-bit signed char and sizeof(int)==1. In which case -1 is a valid char, and there's no way to distinguish between reading it and getting EOF from getchar().
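One commonly suggested way to cope with that corner case is to consult `feof()` and `ferror()` instead of trusting the returned value alone. A minimal sketch, where `read_char_checked` and `enum read_status` are hypothetical names:

```c
#include <stdio.h>

/* Return values for the hypothetical helper below. */
enum read_status { READ_OK, READ_EOF, READ_ERROR };

/* Reads one character into *out, using feof()/ferror() rather than the value of EOF alone. */
enum read_status read_char_checked(FILE *stream, int *out)
{
    int c = fgetc(stream);
    if (c == EOF) {
        if (feof(stream))
            return READ_EOF;
        if (ferror(stream))
            return READ_ERROR;
        /* Neither flag set: on an exotic implementation this was a real character. */
    }
    *out = c;
    return READ_OK;
}
```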
> and did you know that IEEE-754 floating point can represent a "negative zero" that you can use for an error code in functions that return float or double?
I am begging, please never ever do this. NaN literally exists for this reason. NaN even allows you to encode additional error context and details into the value.
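A small sketch of why NaN is the more workable signal, assuming IEEE-754 doubles: negative zero compares equal to zero, so a plain `==` check cannot see it, while `isnan()` detects NaN directly. `checked_ratio` and `negative_zero_equals_zero` are made-up names for this example.

```c
#include <math.h>

/* Hypothetical checked division: returns NaN when the denominator is zero. */
double checked_ratio(double num, double den)
{
    if (den == 0.0)
        return NAN;
    return num / den;
}

/* Negative zero compares equal to positive zero, so == cannot pick it out as an error code. */
int negative_zero_equals_zero(void)
{
    return -0.0 == 0.0;
}
```

A caller would test the result with `isnan()`; spotting a negative zero would instead need `signbit()`, which is easy to forget.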
CP/M and DOS use ^Z (0x1A) as an EOF indicator. More modern operating systems use the file length (if available). Unix/Linux will treat ^D (0x04) as EOF within a stream, but only if the source is "cooked" and not "raw". (^D is ASCII "End of Transmission" (EOT), so that seems appropriate, except in the world of Unicode.)
Strictly speaking, as discussed elsewhere in this thread, ^D can cause a terminal device to signal an EOF condition; other kinds of Unix byte streams don't make this association.
For example,
$ python3 -c 'print("".join(chr(c) for c in range(10)))' | python3 -c 'print(list(ord(c) for c in input()))'
will confirm that it doesn't happen in a pipe (the ASCII 4 character there is totally unrelated to EOF).
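The same point can be sketched in C with a POSIX pipe: the byte 0x04 arrives as ordinary data, and end of file only shows up once the write end is closed. `pipe_byte_then_eof` is a hypothetical helper written for this example.

```c
#include <unistd.h>

/*
 * Writes the single byte `b` into a fresh pipe, closes the write end, and
 * reads from the other end twice. Stores the first read's byte in *got.
 * Returns the result of the second read(), or -1 if any step fails.
 */
int pipe_byte_then_eof(unsigned char b, unsigned char *got)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], &b, 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    close(fds[1]);                          /* closing the writer is what signals end of file */

    unsigned char buf[1];
    ssize_t first = read(fds[0], buf, 1);   /* delivers the byte as ordinary data */
    if (first != 1) {
        close(fds[0]);
        return -1;
    }
    *got = buf[0];
    ssize_t second = read(fds[0], buf, 1);  /* 0: no data left and no writers */
    close(fds[0]);
    return (int)second;
}
```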
Using the "file length" as opposed to the "EOF indicator" is like how strings can be represented either as a pointer to a contiguous sequence of `char` ending with a NUL byte, or as a tuple of (length, pointer), without needing the NUL byte.
One gives a priori information, the other a posteriori.
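To make the two representations concrete, here is a minimal sketch; `struct str_view`, `cstr_length` and `view_length` are made-up names for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical (length, pointer) string; the bytes need no terminating NUL. */
struct str_view {
    size_t len;
    const char *ptr;
};

/* Length known only after scanning for the terminator (a posteriori). */
size_t cstr_length(const char *s)
{
    return strlen(s);
}

/* Length known up front (a priori); the data may even contain NUL bytes. */
size_t view_length(struct str_view v)
{
    return v.len;
}
```

Just as with files, the terminator approach reserves one value that the data itself can never contain.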
The kernel returns EOF "if k is the current file position and m is the size of a file, performing a read() when k >= m..."
So, is the length of each file stored as an integer, along with the other metadata? This reminds me of how in JavaScript the length of an array is a property, instead of a function that counts it right then, like say in PHP.
Apparently it works. I've never heard of a situation where the file size number did not match the actual file size, nor of a time when the JavaScript array length got messed up. But it seems fragile. File operations would need to be ACID-compliant, like database operations (and likewise do JavaScript array operations). It seems like you would have to guard against race conditions.
Does anyone have a favorite resource that explains how such things are implemented safely?
You are not thinking about it clearly. Ask yourself this: Filesystem formats use blocking and deblocking. How would a filesystem know the file size without having metadata for it?
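On POSIX systems that stored size is visible to programs through `fstat()`. A minimal sketch, with `file_size` as a hypothetical helper name:

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Returns the size in bytes recorded in the file's metadata, or -1 on error. */
long long file_size(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    return (long long)st.st_size;
}
```

For regular files this comes straight from the metadata; nothing is counted at call time.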
Perhaps a marginally better title would be "EOF is not a character [on Unix]". Some operating systems do have an explicit EOF character, but that seems to have been the less common approach historically. CP/M featured an explicit end-of-file marker because the file system didn't bother to handle files whose length was not a multiple of the block size, so the application layer needed to detect where the actual end of the file was located (lest it read the contents of the rest of the block). This is a pretty unusual thing to do, and was definitely a hassle for developers, so CP/M descendants like MS-DOS fixed it.
It's just a convention, it isn't enforced by the OS. The C runtime for example will check for character 26 if you're reading a file opened in text mode but not in binary mode. The underlying OS call to read a file makes no distinction between text and binary.
I'm just reading up on this now. But according to Wikipedia "CP/M used the 7-bit ASCII set", so then character 26 would be the "SUB (substitute)" character. No?
Of course it isn't, you couldn't have arbitrary binary files if one of the 256 possible bytes was reserved.
That's why getchar returns int and not char; one char wouldn't be enough for 257 possible values (256 possible char values + eof).
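A short sketch of what goes wrong when the result is narrowed, assuming a typical platform where `EOF` is -1 and `char` is 8 bits. Both function names are made up for this example.

```c
#include <stdio.h>

/* Counts bytes on `stream`, keeping fgetc()'s result in an int so EOF stays distinct. */
long count_with_int(FILE *stream)
{
    long n = 0;
    int c;                              /* wide enough for 256 byte values plus EOF */
    while ((c = fgetc(stream)) != EOF)
        n++;
    return n;
}

/* Same loop with the result narrowed to signed char: byte 0xFF can collide with EOF. */
long count_with_char(FILE *stream)
{
    long n = 0;
    signed char c;
    while ((c = (signed char)fgetc(stream)) != EOF)
        n++;
    return n;
}
```

On such a platform the second version stops early at the first 0xFF byte, because that byte narrows to -1 and looks exactly like EOF.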
Well then try explaining ctrl+c vs ctrl+d to someone who's never touched a terminal at all. Starts off so easily... "See, one tells the program to stop; the other, well, if you're in a shell... or some programs... oh god. IDK anymore, just assume it works. What was the question?"
I find it interesting that Rust's `Read` API for `read_to_end` [1] states that it "Read[s] all bytes until EOF in this source, placing them into buf", and stops on conditions of either `Ok(0)` or various kinds of `ErrorKind`s, including `UnexpectedEof`, which should probably never be the case.
The reason for that is that, for simplicity's sake, all of the I/O functions share the same error type. `UnexpectedEof` should never be returned from `read_to_end`, but it can be returned from `read_exact`.
That's because `UnexpectedEof` is never returned from `read()`, it's only ever returned from `read_exact()`. In fact, `UnexpectedEof` didn't exist originally, it was added together with `read_exact()` to represent its unique error case (which is: `read()` returned end-of-file, but we still needed more bytes to completely fill the buffer). It's an error to return `UnexpectedEof` from any of the other methods of the `Read` trait, and since it's an error, it makes sense for `read_to_end()` to stop and propagate that error.
(In fact, thinking better about it, there are some cases where `read()` could legitimately return `UnexpectedEof`, like when it's a wrapper for a compressed stream which has fixed-size fields, and that stream was truncated in the middle of one of these fields. It's clear that, in that case, `UnexpectedEof` is not an end-of-file for the wrapper; it should be treated as an I/O error.)
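For readers more at home in C, a rough analogue of that distinction can be sketched over POSIX `read()`. This is not Rust's implementation; `read_exact_fd` and `enum exact_status` are hypothetical names, and the sketch does not retry on interrupted reads.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical status codes, loosely mirroring the distinction described above. */
enum exact_status { EXACT_OK, EXACT_UNEXPECTED_EOF, EXACT_IO_ERROR };

/* Fills buf with exactly len bytes from fd, or reports why it could not. */
enum exact_status read_exact_fd(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n == 0)
            return EXACT_UNEXPECTED_EOF;   /* end of file before the buffer was full */
        if (n < 0)
            return EXACT_IO_ERROR;
        p += n;
        len -= (size_t)n;
    }
    return EXACT_OK;
}
```

A plain `read()` returning zero is just end of file; it only becomes "unexpected" when the caller had promised itself a fixed number of bytes.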
Banged my head against the wall once after trying to figure out why Ctrl+D generates some character in bash but I can't send that character in a pipe to simulate termination.
> Banged my head against the wall once after trying to figure out why Ctrl+D generates some character in bash but I can't send that character in a pipe to simulate termination.
Yes, you can. You just end your stream by closing the pipe.
For me EOF is a boolean state. Either I am at the end of file (stream / memory mapped etc) or not. That's how I was taught when I started programming. Never occurred to me to think of it like a character.
> All stdio functions now treat end-of-file as a sticky condition. If you read from a file until EOF, and then the file is enlarged by another process, you must call clearerr or another function with the same effect (e.g. fseek, rewind) before you can read the additional data. This corrects a longstanding C99 conformance bug. It is most likely to affect programs that use stdio to read interactive input from a terminal.
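A minimal sketch of the pattern that note asks for: clear the stream's end-of-file indicator before trying to read data that was appended later. `read_after_eof` is a made-up helper name.

```c
#include <stdio.h>

/* Clears the sticky end-of-file indicator and tries to read one more character. */
int read_after_eof(FILE *stream)
{
    clearerr(stream);
    return fgetc(stream);
}
```

If nothing new has been written, the call simply returns `EOF` again and the indicator is set once more.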
This strikes me as the sort of pedantic and "I'm witty" click bait that occasionally percolates upwards on HN, especially considering the specifics of "EOF" are very much contingent on operating context.
rectang|6 years ago
nothrabannosir|6 years ago
Someone|6 years ago
Getchar doesn’t return a char; it returns an int (https://en.cppreference.com/w/c/io/getchar).
⇒ if C didn’t do automatic conversions from int to char, we would have that (in a minimalistic sense)
That wouldn’t work for ftell and malloc (and, in general, most of the calls that set errno), though.
matheusmoreira|6 years ago
nixpulvis|6 years ago
pwdisswordfish2|6 years ago
sfifs|6 years ago
enriquto|6 years ago
That would be the textbook case of stupid over-engineering.
charlysl|6 years ago
Animats|6 years ago
bhaak|6 years ago
GCC only outputs a warning by default: "warning: return type defaults to ‘int’ [-Wimplicit-int]"
reidacdc|6 years ago
int_19h|6 years ago
snek|6 years ago
fennecfoxen|6 years ago
This is a supplementary source of confusion.
nixpulvis|6 years ago
If by procedural you mean, nonsense, then sure... I agree that a function named `getchar` returning an `int` is procedural. :P
anonymousiam|6 years ago
schoen|6 years ago
pwdisswordfish2|6 years ago
http://jdebp.info/FGA/dos-character-26-is-not-special.html
nixpulvis|6 years ago
combatentropy|6 years ago
JdeBP|6 years ago
chrisseaton|6 years ago
jcrawfordor|6 years ago
mark-r|6 years ago
nixpulvis|6 years ago
EDIT: Seems like 26 = EOF is a DOS thing.
EDIT 2: Some confusing comments: https://www.perlmonks.org/bare/?node_id=228760
EDIT 3: A pretty good thread (read NigelQ's reply): http://forums.codeguru.com/showthread.php?181171-End-of-File...
IndexPointer|6 years ago
schoen|6 years ago
nixpulvis|6 years ago
nixpulvis|6 years ago
[1]: https://doc.rust-lang.org/std/io/trait.Read.html#method.read...
comex|6 years ago
cesarb|6 years ago
badrabbit|6 years ago
kylek|6 years ago
nixpulvis|6 years ago
enriquto|6 years ago
jwilk|6 years ago
The exception even tells you that "chr() arg not in range(0x110000)" which has nothing to do with range of C's character types.
unnouinceput|6 years ago
Thorrez|6 years ago
jwilk|6 years ago
https://sourceware.org/bugzilla/show_bug.cgi?id=1190
https://sourceware.org/legacy-ml/libc-alpha/2018-08/msg00003...
agumonkey|6 years ago
cjohansson|6 years ago
jes5199|6 years ago
guerrilla|6 years ago
ineedasername|6 years ago
1996|6 years ago
^D (0x04) is EOT (End of Transmission) and 0x03 is ETX (End of Text): https://www.systutorials.com/ascii-table-and-ascii-code/
So, kinda, but somehow I'm happy it never got turned into weird combinations depending on the OS.
mark-r|6 years ago