toddkaufmann | 10 years ago
For completeness, the 1951 paper is here [1]. Apparently it has been covered in undergraduate algorithms at U. of I. [2] (alongside Bloom filters), so it may be getting more exposure.
Is the "needed improvement" of the 1976 paper due to better methods becoming available (e.g. proof rigor, or a better understanding of sorting algorithms), or to a better explanation of the methods (perhaps because of more widespread knowledge of information theory, and better-defined terminology)?
I thought edge-notched cards [3] had been used for a long time (and they have: since 1896). I remember reading about them being used for fingerprint card retrieval (indexed by feature, on each finger). This didn't use superimposed coding, but was instead a type of content-addressable memory: the time to retrieve all cards with, e.g., a whorl on the thumb is O(1).
Apparently the cards are still used in some places; see Kevin Kelly's site [4] for some images and interesting comments.
Finally found a fingerprint filing reference: "For example, as early as 1934 the FBI tried a punchcard and sorting system for searching fingerprints, but the technology at that time could not handle the large number of records in the Ident files." [5]
1. https://courses.engr.illinois.edu/cs473/fa2013/misc/zatocoding.pdf
(or buy from Wiley for $38)
2. https://courses.engr.illinois.edu/cs473/fa2013/lectures.html
3. http://en.wikipedia.org/wiki/Edge-notched_card
4. http://kk.org/thetechnium/one-dead-media/
5. Chapter 3: Evolution to Computerized Criminal History Records
https://www.princeton.edu/~ota/disk3/1982/8203/820306.PDF
dalke | 10 years ago
Nice find with the U. of I. course. Interestingly, if you listen to the presentation, the speaker says the topic won't be on the test -- perhaps the rigor behind the method isn't as good as desired for the class -- but it's something the students ought to know exists. Plus, the speaker "just found out about" the topic (at 08:19), like you, cites a 1950s date, and describes it as using a fixed number of bits per category ... which is why he says that Zatocoding "reappears" years later as a Bloom filter. They are both superimposed codes, but not the same thing.
I find the mention of the two problems with superimposed codes interesting: 1) researchers don't like false drops and instead expect perfect matches (in Mooers' chapter in 'Punched Cards' he spins them as serendipitous matches; Knuth does something similar in TAOCP with 'false drop cookies'), and 2) librarians love to make hierarchical categorizations, which aren't needed with Zatocoding.
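To make the false-drop issue concrete, here's a minimal sketch of superimposed coding in Python. The bit count, descriptor names, and seed-by-name "hashing" are all invented for illustration, not anything from Mooers' papers: each descriptor sets a few bit positions, a card ORs the positions of all its descriptors together, and a query matches any card whose mask covers the query's bits -- which can happen by accident.

```python
import random

NUM_BITS = 40            # hypothetical notch positions on an edge-notched card
BITS_PER_DESCRIPTOR = 4  # fixed k per descriptor, Bloom-filter style

def code_for(descriptor, num_bits=NUM_BITS, k=BITS_PER_DESCRIPTOR):
    """Derive a reproducible set of k bit positions for a descriptor.

    Seeding by the descriptor name stands in for a real hash function."""
    rng = random.Random(descriptor)
    return frozenset(rng.sample(range(num_bits), k))

def card_mask(descriptors):
    """Superimpose (OR together) the codes of all descriptors on one card."""
    mask = set()
    for d in descriptors:
        mask |= code_for(d)
    return mask

def matches(mask, query):
    """A card 'matches' if every bit of the query's code is set on the card.

    Because codes are superimposed, this can hold by coincidence even when
    the descriptor was never assigned to the card -- a false drop."""
    return code_for(query) <= mask
```

A query for a descriptor that is on the card always matches; the point of the Zatocoding analysis is to choose k so that the probability of an accidental match stays acceptably low.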
(BTW, from what I read, American Documentation was the journal for information theory in the 1950s and covered many topics. I interpret Mooers' paper more as advertisement to a wider audience, because it doesn't go into the technical details of his method. He was trying to drum up work for his consulting business.)
The "needed improvement" comes with chemical descriptors. Suppose one of your descriptors is "contains a carbon", another is "contains 3 carbons in a row", and a third is "contains 6 carbons in a row". I'll write these as 1C, 3C, and 6C. Zatocoding treats all descriptors as independent, though there's a paper where Mooers describes a correction for when there are correlations. But in this case, whenever 6C exists then both 3C and 1C also exist. These are hierarchical descriptors.
I was wrong about the date. The improvement paper is Feldman and Hodes, JCICS 1975 15 (3) pp 147-152, not 1976. In the original Zatocoding, the number of bits $k$ is given as -log2(descriptor frequency). In Feldman and Hodes, $k$ for the root descriptors are given the same way, but $k$ for a child descriptor is given as log2(parent frequency/child frequency). It's possible for a fragment to have multiple parents (consider that "CC" and "CO" are both parents of "CCO"). In that case, use the least frequent parent.
In addition, ensure that the bits selected for the child do not overlap with the bits set for the parent.
This ends up giving a first-order correction to Zatocoding for hierarchical data.
The "needed improvement" therefore is the ability to handle hierarchical descriptors, which are frequently found in chemical substructure screens.
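As a sketch of those two bit-allocation rules (the frequencies for 1C/3C/6C are invented, and rounding to the nearest integer with a one-bit minimum is my assumption, since the exact rounding rule isn't stated above):

```python
import math

def zato_bits(freq):
    """Original Zatocoding: k = -log2(descriptor frequency)."""
    return max(1, round(-math.log2(freq)))

def feldman_hodes_bits(freq, parent_freqs=()):
    """Feldman & Hodes (1975), as described above: a root descriptor is
    allocated -log2(freq) bits, while a child descriptor gets
    log2(parent_freq / freq), using the least frequent parent when a
    fragment has several parents."""
    if not parent_freqs:
        return zato_bits(freq)
    parent = min(parent_freqs)  # least frequent parent
    return max(1, round(math.log2(parent / freq)))

# Hypothetical frequencies for the 1C/3C/6C example:
f_1C, f_3C, f_6C = 0.95, 0.40, 0.05
```

With these made-up numbers, plain Zatocoding gives 6C about round(-log2(0.05)) = 4 bits, while the hierarchical correction gives it round(log2(0.40/0.05)) = 3 bits, since much of 6C's information is already carried by its parent 3C.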
While you say "still used some places", Kelly's site concerns dead media. I spent some time trying to find modern edge-notched cards, including, through a friend, asking on a mailing list of people interested in old computing tech. No success. There were a couple of used ones on eBay, but I wanted 500 unused ones so I could make a data set myself. Instead, this is the 21st century, and I think I can use a paper die-cut machine to make them for me, with pre-cut holes even.
You mention "sorting" several times. My use is for selection, not sorting.
For punched cards the selection time will be O(N). The only way to get O(1) is with an inverted index of just the descriptor you're looking for. Mooers in 1951 proposed 'Doken' (see http://www.historyofinformation.com/expanded.php?id=4243) as a way to search '100 million items in about 2 minutes'.
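The O(N) vs. O(1) distinction can be sketched like this (the card data and descriptor names are invented): a punched-card selection has to pass every card through the selector, while an inverted index precomputes, per descriptor, the set of matching cards, so a lookup touches only that one entry.

```python
from collections import defaultdict

# Hypothetical card collection: card id -> set of descriptors.
cards = {
    1: {"whorl:thumb", "loop:index"},
    2: {"arch:thumb"},
    3: {"whorl:thumb", "arch:ring"},
}

def scan_select(cards, descriptor):
    """Punched-card style selection: examine every card, O(N)."""
    return {cid for cid, descs in cards.items() if descriptor in descs}

def build_inverted_index(cards):
    """Precompute descriptor -> card ids, so each query is one lookup."""
    index = defaultdict(set)
    for cid, descs in cards.items():
        for d in descs:
            index[d].add(cid)
    return index

index = build_inverted_index(cards)
```

The trade-off, of course, is that the inverted index must be built and maintained ahead of time, for every descriptor you might query.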