
Head to Head Comparison of Text Extraction Algorithms

40 points | dusano | 14 years ago | readwriteweb.com

4 comments

vannevar | 14 years ago

The metric used in the comparison seems significantly flawed. It compares against a reference set of tokens produced by applying the following rules:

1) Remove any remaining inline HTML tags
2) Remove all punctuation characters
3) Remove all control characters
4) Remove all non-ASCII characters (due to unreliable information about the document encoding)
5) Normalize to lowercase
6) Split on whitespace
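
For concreteness, here is a minimal Python sketch of what that normalization pipeline might look like; the regexes and the function name are assumptions for illustration, not the article's actual code:

    import re

    def normalize_tokens(text):
        # 1) strip any remaining inline HTML tags
        text = re.sub(r'<[^>]+>', ' ', text)
        # 2) remove punctuation characters
        text = re.sub(r'[^\w\s]', ' ', text)
        # 3) remove control characters
        text = re.sub(r'[\x00-\x1f\x7f]', ' ', text)
        # 4) drop non-ASCII characters (encoding treated as unreliable)
        text = text.encode('ascii', 'ignore').decode('ascii')
        # 5) normalize to lowercase
        text = text.lower()
        # 6) split on whitespace
        return text.split()

Presumably the same rules are applied to each extractor's output, and the resulting token lists are what get compared against the reference set.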

This seems to me a case of measuring what is easy to measure rather than measuring what is right. What would the author think about adding a rule to 'remove all vowels' or 'arbitrarily split words'? Yet he happily removes meaning and context in the form of punctuation and case. If the underlying text extraction algorithms are not similarly handicapped, then one or more of them might be a better standard of measurement than the one the author applies. Rather like measuring the accuracy of an atomic clock by using a rusty stopwatch.

jannes | 14 years ago

Just out of curiosity: how did the fragment #.TfMwNJgETxs;hackernews end up in the URL? How did ReadWriteWeb know that this URL would be posted on Hacker News?

Edit: Oh, I figured it out myself. There's a link on the page to post the story to Hacker News, so it seems the URL fragment was added by http://www.addthis.com