(no title)
IshanMi | 1 year ago
One of the best freely available open-source datasets is The Pile by EleutherAI [1]. It's a few years old now (~2020), but they did really diligent work putting the dataset together and documenting it. A more recent and even larger dataset would be Falcon-RefinedWeb [2].
[1]: https://arxiv.org/abs/2101.00027
[2]: https://arxiv.org/abs/2306.01116