(no title)
willcannings | 15 years ago
That said, I don't think tagging would be as simple for them as you say. For a start, they're dealing with multiple languages, probably including many with no human-annotated training corpora. Even for the languages with training data, web pages are difficult to tag and parse because they often contain very 'slack' grammar and domain-specific or slang words. The standard English training corpus is the Penn Treebank (Wall Street Journal text). Can you imagine trying to read and understand YouTube comments if all you'd ever read was the WSJ? Even tagging search queries would be difficult, because they're not even sentence fragments you could run Viterbi over. They're often just words strung together without any grammatical structure at all, so you can't rely on the tag order you know from your corpus to help you tag a query.
So I'm very impressed that they're doing any tagging at all, on the scale they're doing it at, and with presumably decent enough results for it to be useful.
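To make the tag-order point concrete, here is a rough sketch of Viterbi decoding over a toy HMM tagger. Everything in it is an assumption for illustration: the three-tag set, the tiny vocabulary, and all the probabilities are made up by hand, where a real tagger would estimate them from an annotated corpus. It is not a claim about how any production tagger works.

```python
# Toy HMM part-of-speech tagger decoded with Viterbi.
# All tags, words, and probabilities below are hypothetical, hand-picked values.

TAGS = ["DET", "NOUN", "VERB"]

# P(first tag in the sequence)
START_P = {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2}

# P(next tag | previous tag): this is the "tag order" learned from a corpus
TRANS_P = {
    "DET":  {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1,  "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5,  "NOUN": 0.3, "VERB": 0.2},
}

# P(word | tag); "run" is ambiguous and, on its own, looks more like a verb
EMIT_P = {
    "DET":  {"the": 0.9},
    "NOUN": {"run": 0.2, "dog": 0.5},
    "VERB": {"run": 0.6},
}


def viterbi(words, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    if not words:
        return []

    # best[t][tag] = (probability of the best path ending in `tag` at t, previous tag)
    best = [{}]
    for tag in tags:
        best[0][tag] = (start_p[tag] * emit_p[tag].get(words[0], unk), None)

    for t in range(1, len(words)):
        best.append({})
        for tag in tags:
            # Pick the previous tag that maximises path probability into `tag`
            prev = max(tags, key=lambda p: best[t - 1][p][0] * trans_p[p][tag])
            score = (best[t - 1][prev][0]
                     * trans_p[prev][tag]
                     * emit_p[tag].get(words[t], unk))
            best[t][tag] = (score, prev)

    # Backtrack from the best final tag
    path = [max(tags, key=lambda tag: best[-1][tag][0])]
    for t in range(len(words) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]


print(viterbi(["the", "run"], TAGS, START_P, TRANS_P, EMIT_P))  # ['DET', 'NOUN']
print(viterbi(["run"], TAGS, START_P, TRANS_P, EMIT_P))         # ['VERB']
```

With a determiner in front, the DET to NOUN transition pulls "run" towards NOUN even though its emission probability favours VERB. Given the bare word, as in a one-word query, there is no context for the transitions to use, so the tagger falls back on the start and emission probabilities alone.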