How Duff's Device Works

[+] okmjuhb|15 years ago|reply

The history behind this actually goes the other way if I remember correctly; it was unclear for a time whether or not this sort of thing was allowed. The standard was modified so that Duff's device would be legal code after it was being used.

[+] ssp|15 years ago|reply

Tom Duff himself writes about it here:

http://doc.cat-v.org/bell_labs/duffs_device

In particular,

    The device is legal dpANS C. I cannot quote chapter and verse, but
    Larry Rosler, who was chairman of the language subcommittee (I think),
    has assured me that X3J11 considered it carefully and decided that
    it was legal. Somewhere I have a note from dmr certifying that all the
    compilers that he believes in accept it. Of course, the device is also
    legal C++, since Bjarne uses it in his book.

[+] unknown|15 years ago|reply

[deleted]

[+] Jach|15 years ago|reply

You can use this to implement coroutines in C when you're stuck without Python/Scheme/al.

http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

[+] yread|15 years ago|reply

Really interesting! But why wouldn't you copy the data in a simple while loop? What do you need the switch statement for?

[+] acqq|15 years ago|reply

Kids these days. I guess the OP didn't really understand everything or didn't bother to explain, especially my favorite bit about "computed goto" and the fact that if you code your compiler to emit assembly instructions with labels it's quite "natural" to implement it.

Duff's device is an example of the "loop unwinding." Say you have n divisible by 8, then you can loop n / 8 times and have eight assignments in the loop stated explicitly. Now instead executing n assignments and n looping instructions (there can be more instructions needed for one loop and the cost of jump can also be visible) you can have less, n / 8 looping instructions and still n assignments.

When such thing should be done? 1) If you measured your loop and it was too slow with a one assignment and one loop pass for every n. 2) you know that your compiler can't do the loop unwinding automatically -- modern compilers can do such things even if you write a normal loop if the compiler estimates that this can really help and when you give to it a specific direction to produce such code (e.g. aggressively optimize for speed).

Only now comes the reason for using Duff's device: how to code the case where n is not divisible by 8? Well you can have a loop for the remaining number R of elements, but then you have R assignments and R looping instructions. More clever thing is to have the sequence of the assignments in the code stream, one after another and to jump directly inside to execute exactly R assignments.

Turns out, C was made in seventies, but Fortran was made in fifties and already had the construct which allowed exactly that, the direct jump. So authors of C created something smarter -- a switch statement was implemented so that it compiles to the series of if ... goto or even to the "computed goto" sequence, not needing to do more comparisons if the values for jump conditions are one after another.

So now we're able to have one sequence for a direct jump (say 8 assignments in a switch) and one sequence for unwound loop (loop around 8 assignments). But why not trying combining them? That was Duff's insight, and the Duff's device is a result of it. The benefit: instead having 16 assignments in the generated code, you'll have just 8 which are going to be used both for R assignments and for unwound loop.

Just that last step (reducing the number of the assignments in the resulting code from 16 to 8) is something that you probably won't measure as a direct speed benefit, you'll only have a bit smaller resulting code. So I guess it was more a curiosity than something that must have been done exactly so even as it was written. Still it nicely demonstrates how people who know what is produced by the compiler are able to think, on one side, and how C can be used for "assembly level" optimizations, on another.

[+] owyn|15 years ago|reply

The original post explains why...

http://www.lysator.liu.se/c/duffs-device.html

The VAX C compiler compiles the loop into 2 instructions (a movw and a sobleq, I think.) As it turns out, this loop was the bottleneck in a real-time animation playback program which ran too slowly by about 50%.

[+] gojomo|15 years ago|reply

Fewer tests for the end-condition, and fewer jumps back to the start of the loop -- by a factor N, how many consecutive steps are 'unrolled'. (In these examples, N is 8 -- but it could be larger.)

But what if the number of steps isn't evenly divisible by N? That's where Duff's Device comes in; it lets one 'switch' at the beginning enter the loop at just the right place for the right number of remainder steps to be added to full iterations.

12 comments