shnkr|1 year ago
>At Databricks, we believe that every enterprise should have the ability to control its data and its destiny in the emerging world of GenAI.
>The main process of building DBRX - including pretraining, post-training, evaluation, red-teaming, and refining - took place over the course of three months.
simonw|1 year ago
Llama 2 was much more opaque about the training data, presumably because they were already being sued at that point (by Sarah Silverman!) over the training data that went into the first Llama!
A couple of things I've written about this:
- https://simonwillison.net/2023/Aug/27/wordcamp-llms/#how-the...
- https://simonwillison.net/2023/Apr/17/redpajama-data/
tempusalaria|1 year ago
This is then cleaned up to remove nonsense, some technical files, and repeated files.
From this, they tend to weight some sources more - e.g. Wikipedia gets a pretty high weighting in the data mix. Overall these data mixes have multiple trillion token counts.
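The upweighting described above can be sketched as sampling documents from each source in proportion to its size times a quality weight. A minimal illustration (the source names, token counts, and weights below are made-up assumptions, not DBRX's actual mix):

```python
# Hypothetical data-mix weighting: smaller but higher-quality sources
# (e.g. Wikipedia) get a larger weight, so they are seen more often
# relative to their raw size. All numbers here are illustrative.
sources = {
    "web_crawl": {"tokens_b": 3000, "weight": 1.0},
    "wikipedia": {"tokens_b": 6, "weight": 5.0},   # upweighted despite small size
    "code":      {"tokens_b": 500, "weight": 2.0},
}

def sampling_probabilities(sources):
    """Probability of drawing the next training document from each source,
    proportional to (token count * quality weight)."""
    mass = {name: s["tokens_b"] * s["weight"] for name, s in sources.items()}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}

probs = sampling_probabilities(sources)
```

With these toy numbers the web crawl still dominates, but Wikipedia's effective share is 5x what its raw token count alone would give it.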
GPT-4 apparently trained on multiple epochs of the same data mix, so I would assume this one did too, as it has a similar token count.
IshanMi|1 year ago
One of the best openly available datasets is The Pile by EleutherAI [1]. It's a few years old now (~2020), but they did some really diligent work in putting together the dataset and documenting it. A more recent and even larger dataset is the Falcon-RefinedWeb dataset [2].
[1]: https://arxiv.org/abs/2101.00027
[2]: https://arxiv.org/abs/2306.01116