What I was saying is that because you need to go out of your way to make sure the input is tokenized properly, I wouldn't be surprised if the dataset contains a fair number of improperly tokenized examples.
If that were the case, it would make it harder for the model to generalize these concepts.
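To make the concern concrete, here's a toy sketch (not any real tokenizer — the vocabulary and greedy longest-match rule are invented for illustration) showing how the same digit string can be segmented differently depending on the surrounding context, which is the kind of inconsistency that could end up in a dataset:

```python
# Toy greedy longest-match tokenizer over a made-up vocabulary.
# Illustrates how identical content ("1234") can receive different
# token boundaries depending on a leading space.
VOCAB = {"12", "34", "23", "4", " 1"}

def tokenize(text):
    """Greedy longest-match; unknown characters fall back to single tokens."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            tokens.append(text[i])  # single-character fallback
            i += 1
    return tokens

print(tokenize("1234"))   # ['12', '34']
print(tokenize(" 1234"))  # [' 1', '23', '4']
```

Same number, two different digit groupings — a model seeing both segmentations has to learn that they mean the same thing, which is extra work compared to a consistently tokenized dataset.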