killa_bee's comments

killa_bee | 14 years ago | on: Programming is hard, let's go scripting..

At the risk of being annoying: virtually all the research done in linguistics departments anywhere in the world is research in the cognitive science of language, which makes it a social science, not a humanity. That was certainly true of Berkeley around 1970 or so (assuming that's when Wall was there).

killa_bee | 14 years ago | on: XeTeX: could it be TeX's saviour?

I use xelatex in my work and it's still embarrassingly fragmented and outdated. We need to start over on a new TeX-like project (also so that it can be ported to mobile).

killa_bee | 14 years ago | on: There's no speed limit (2009)

People who think "going to college is for chumps" (and there are a lot of you on HN) should do this instead: it's not that hard to graduate from college in 2-3 years.

killa_bee | 14 years ago | on: You used Python for what?

I'm not an expert on this, just a linguist who happens to code a lot, but there is serious work on the complexity of frequency ordering. The self-organizing list heuristics analyzed by R. Rivest (1976, Communications of the ACM) produce near-optimal frequency orders online, so if you can settle for "near optimal," the problem is hardly traveling salesman, as this author claims.
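Rivest's paper concerns heuristics like move-to-front, which reorder a list on the fly as requests arrive. A minimal sketch (just an illustration of the idea, not Rivest's analysis; the function names are mine) comparing the online cost of move-to-front against the best static frequency order:

```python
from collections import Counter


def move_to_front_cost(requests):
    """Serve a request sequence with a move-to-front list.

    Each lookup costs its 1-indexed position in the list (or a full
    scan if absent); the requested key is then moved to the front.
    Returns the total cost.
    """
    order = []
    total = 0
    for key in requests:
        if key in order:
            pos = order.index(key)       # 0-based position
            total += pos + 1             # comparisons to find it
            order.pop(pos)
        else:
            total += len(order) + 1      # scanned the whole list
        order.insert(0, key)             # move (or insert) to front
    return total


def static_optimal_cost(requests):
    """Cost of the best static order: keys ranked by frequency."""
    counts = Counter(requests)
    ranked = [k for k, _ in counts.most_common()]
    pos = {k: i + 1 for i, k in enumerate(ranked)}
    return sum(pos[k] for k in requests)
```

For example, `move_to_front_cost(['a', 'b', 'a', 'a', 'c', 'a'])` is 11, while `static_optimal_cost` on the same sequence is 9: the online heuristic stays close to the best frequency ordering without knowing the frequencies in advance.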

killa_bee | 14 years ago | on: Ask HN: Why don't we use subtitled films/tv to train speech recognition?

I happen to know that they do this at the Linguistic Data Consortium (http://www.ldc.upenn.edu/), at least with cable news shows. They mostly do it to obtain data for lower-resource languages, though, and for the purposes of transcription, not for speech recognition qua engineering research.

The real issue is that the research community is interested in increasing the accuracy of recognizers on standard datasets by developing better models, not in increasing accuracy per se; simply having used more data isn't publishable. Further, in terms of real gains, the data is sparse (power-law distributed), so we need more than a constant increase in the amount of data. This issue is general to any machine-learning scenario but is particularly pronounced in anything built on language.
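The sparsity point can be made concrete with a toy simulation: when tokens are drawn from a Zipf distribution, the fraction of word types you have never observed shrinks only slowly as the corpus grows, so adding data yields diminishing coverage. The function names and parameter values below are illustrative, not taken from any of the papers cited:

```python
import random


def zipf_sample(n_types, n_tokens, s=1.0, seed=0):
    """Draw n_tokens word tokens from a Zipf(s) distribution
    over n_types word types (rank r has weight 1/r**s)."""
    rng = random.Random(seed)
    weights = [1.0 / (r ** s) for r in range(1, n_types + 1)]
    return rng.choices(range(n_types), weights=weights, k=n_tokens)


def unseen_fraction(corpus_size, n_types=50_000):
    """Fraction of word types never observed in a corpus of
    corpus_size tokens."""
    seen = set(zipf_sample(n_types, corpus_size))
    return 1 - len(seen) / n_types
```

Comparing `unseen_fraction(10_000)` with `unseen_fraction(100_000)` shows that a tenfold increase in data still leaves a large tail of unseen types: the long tail of the power law is exactly why "just add more data" runs out of steam.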

Some related papers:

Moore, R. K. 'There's no data like more data (but when will enough be enough?)', Proc. Institute of Acoustics Workshop on Innovation in Speech Processing, IoA Proceedings, vol. 23, pt. 3, pp. 19-26, Stratford-upon-Avon, 2-3 April 2001.

Yang, Charles. 'Who's afraid of George Kingsley Zipf?' Ms., University of Pennsylvania. http://www.ling.upenn.edu/~ycharles/papers/zipfnew.pdf
