top | item 13351361

atpaino | 9 years ago

Yep, it definitely has room to improve. The work thus far has primarily been a proof of concept for the methodology used to generate training samples (i.e., starting with grammatically correct text and introducing errors). The next step is to include more high-quality data, after which I may try out comment data from HN, etc. I think it would be interesting to see what effect somewhat noisier data like that would have on the model.
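
For anyone curious what "starting with grammatically correct text and introducing errors" can look like, here is a minimal sketch. The perturbations (`drop_article`, `swap_confusable`) and the word lists are my own illustrative assumptions, not the actual error types used in the project.

```python
import random

# Hypothetical word lists chosen for illustration only.
ARTICLES = {"a", "an", "the"}
CONFUSIONS = {"their": "there", "there": "their", "then": "than", "than": "then"}


def drop_article(tokens, rng):
    """Remove one randomly chosen article, if the sentence has any."""
    idx = [i for i, t in enumerate(tokens) if t.lower() in ARTICLES]
    if not idx:
        return tokens
    i = rng.choice(idx)
    return tokens[:i] + tokens[i + 1:]


def swap_confusable(tokens, rng):
    """Replace one commonly confused word with its counterpart."""
    idx = [i for i, t in enumerate(tokens) if t.lower() in CONFUSIONS]
    if not idx:
        return tokens
    i = rng.choice(idx)
    return tokens[:i] + [CONFUSIONS[tokens[i].lower()]] + tokens[i + 1:]


def make_pairs(sentences, seed=0):
    """Return (corrupted, correct) pairs; skip sentences left unchanged."""
    rng = random.Random(seed)
    pairs = []
    for sentence in sentences:
        perturb = rng.choice([drop_article, swap_confusable])
        corrupted = " ".join(perturb(sentence.split(), rng))
        if corrupted != sentence:
            pairs.append((corrupted, sentence))
    return pairs
```

Each pair keeps the original sentence as the target, so the model learns to map the corrupted input back to the correct text.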

walrus1066|9 years ago

Programmatically generating incorrect grammar from correct sentences must be really tough. There are so many more ways to incorrectly structure a sentence than there are correct ones.

Random idea: what happens if you use Google Translate to generate the incorrect sentences, i.e. translate each one into another language and then back again? If the resulting sentence doesn't match the original, add it to the dataset.
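
A rough sketch of that round-trip idea, assuming English source text. The `translate(text, source, target)` callable is hypothetical and supplied by the caller; it stands in for whatever translation service you have access to and is not a real Google Translate API signature. The pivot language codes are likewise just examples.

```python
from typing import Callable, Iterable, List, Tuple


def round_trip_pairs(
    sentences: Iterable[str],
    translate: Callable[[str, str, str], str],  # hypothetical: (text, source, target) -> text
    pivots: Tuple[str, ...] = ("de", "ja"),
) -> List[Tuple[str, str]]:
    """Translate each sentence to a pivot language and back to English.

    Keep (round_tripped, original) whenever the two differ.
    """
    pairs = []
    for sentence in sentences:
        for pivot in pivots:
            there = translate(sentence, "en", pivot)
            back = translate(there, pivot, "en")
            if back.strip() != sentence.strip():
                pairs.append((back, sentence))
    return pairs
```

One caveat: a mismatch only shows that the round trip changed the sentence. The result could be a perfectly grammatical paraphrase, so pairs collected this way would probably need some filtering.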