The actual tokenizer often does not matter, since you can add pre-processors/normalizers. I assume they did it this way because capitalization matters in a lot of contexts.
Similarly, pre-processing can be harmful. I think there are real predictive differences between predicting the next-word follow-up to a sentence that's properly capitalized and one that's all lowercase. Not only will the "all lowercase" convention likely prevail in forward predictions, it also indicates something about the context of the writing, the author, and their sense of style.
tel|2 years ago
It's hard to argue that this information isn't (a) being captured by GPTs and (b) important. If you just threw it away, GPTs would have less information available to absorb.
PeterisP|2 years ago
A good example is the initially released BERT-multilingual-uncased model from the first BERT paper, which (without mentioning it anywhere) not only collapsed case but also removed diacritic marks from Latin characters, thus killing its performance on languages that rely heavily on them.
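To see why stripping diacritics is lossy, here is a minimal sketch of that style of "uncased" normalization (an illustrative re-implementation using Python's `unicodedata`, not the actual BERT tokenizer code). In Czech, for example, "být" (to be) and "byt" (an apartment) are different words that collapse to the same string:

```python
import unicodedata

def uncased_normalize(text: str) -> str:
    """Lowercase + strip accents, roughly what an 'uncased' BERT
    normalizer does: NFD-decompose, drop combining marks, lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()

# Two distinct Czech words become indistinguishable after normalization.
print(uncased_normalize("být"))  # -> "byt"
print(uncased_normalize("Byt"))  # -> "byt"
```

Once that mapping happens at the tokenizer level, no amount of downstream model capacity can recover the distinction.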