
diasks2 | 11 years ago

Good points. I'd love to test it on some of the typically used corpora. The issues I have are:

1) Most segmentation research papers come from universities that have access to the Penn Treebank data (WSJ and Brown corpus). However, that data costs $1,700: https://catalog.ldc.upenn.edu/LDC99T42

2) The Brown corpus is available for free in NLTK (http://www.nltk.org/nltk_data/). However, it is the tagged corpus. I've contacted the researchers behind all of the top segmentation libraries but never received an answer to any of the following questions:

a) I'm assuming you preprocessed the text by removing the tags. Is this correct? Or did you use an untagged version? If so, do you have a link to it, as I only found the tagged version in the NLTK data?

b) When removing the tags, did you also remove each carriage return and newline so the text was one long string, with each sentence separated by a single whitespace?

c) The download contains 100+ files. Did you analyze each file individually, or did you create one combined file? If you created a combined file, how did you space each individual file within the larger file, and in what order did you combine them?
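To make questions (a) and (b) concrete, here is a minimal sketch of what that preprocessing might look like. This is my guess at one reasonable pipeline, not what any of the papers actually did: it assumes the simple slash-delimited `word/TAG` format the tagged Brown corpus files use, strips the tags, and joins everything into one long string with single spaces.

```python
def strip_tags(tagged_line: str) -> str:
    """Remove the /TAG suffix from each token, keeping only the word.

    The tagged Brown corpus stores tokens as word/TAG pairs,
    e.g. "The/at Fulton/np-tl County/nn-tl". rsplit on the last
    slash so words containing "/" are handled sensibly.
    """
    tokens = tagged_line.split()
    return " ".join(tok.rsplit("/", 1)[0] for tok in tokens)


def combine(lines) -> str:
    """Join tagged lines into one long string: no carriage returns
    or newlines, each sentence separated by exactly one space
    (one possible answer to question b above).
    """
    return " ".join(strip_tags(line) for line in lines if line.strip())


sample = ["The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd"]
print(combine(sample))  # → The Fulton County Grand Jury said
```

Even a trivial sketch like this exposes the ambiguity: whether blank lines are dropped, whether files are concatenated with a space or a newline, and what order the files go in would all change the input the segmenter sees.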

So sure, all of these papers use the same data, but we have no idea whether they are actually using that data in the same way, as none of the papers release their code and tests or describe the steps they used to preprocess the corpus.

To test broader coverage of my library, I added the full text of Alice in Wonderland https://github.com/diasks2/pragmatic_segmenter/blob/master/s.... A grad student from Stanford kindly offered to test my library on the WSJ corpus a few months ago, but I'm still waiting to hear back on that.


vseloved | 11 years ago

Hi Kevin, thanks for the great comments. I wanted to share a hack with you: the Penn Treebank is included as part of OntoNotes, which is free of charge :)