grepherder | 14 years ago

The tip is of course valid and pro, and I'd recommend the same, but it's already being done, under machine translation. Also, in this area "big data" loses its meaning, as you don't really need traditional databases; you just process raw text. There are literally thousands of people researching how to intelligently select and process this data.

gliese1337|14 years ago

It's not just machine translation; it's image processing / cleanup (to handle huge amounts of data for multispectral imaging and figure out how to combine it into sets of false-color images that people can read), optical character recognition (for ancient handwriting in weird writing systems), system-level programming to run the scanners, etc. There's a big ol' book on this, "Rome Wasn't Digitized in a Day": http://www.clir.org/pubs/reports/pub150/pub150.pdf BYU (which I attend and work for) has done a huge amount of work in this field: http://maxwellinstitute.byu.edu/about/cpart.php

A few years ago I was writing web applications to support transcription of images of medieval documents in Old French, avoiding close-to-insurmountable OCR problems by using grad students, but that still requires segmenting images properly. The LDS church does similar stuff on a very large scale to digitize genealogical records. It makes research a whole lot easier, but there's still plenty of room for improvement; image maps don't always reliably match up with the fields that you're trying to read/transcribe on images of documents, and that's kind of a pain.

TheAmazingIdiot|14 years ago

What we need here are true eyeballs to read the scripts.

I do medieval and renaissance dance reconstruction and dance performance. Having just been to an event, I took a class on the Dances of the Gresley Manuscript.

Well, what is this manuscript? It isn't a dance treatise, or anything of the sort. Gresley was a law student in the 1530s-1550s (we know this from later court cases by a lawyer named Gresley). These dance instructions come from the margins of his law book.

He wrote in musical notation, dance notation and other descriptive words. He even left words that have no meaning in the dance community. We have to deduce what he meant by a multitude of methods, none of which we can guarantee.

But back to the topic of OCR... How do these document scanners and OCR systems plan to deduce this kind of source written in the margins?