top | item 46592156

(no title)

HighFreqAsuka | 1 month ago

Take a look at The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text (https://arxiv.org/pdf/2506.05209). They build a reasonable 7B parameter model using only open-licensed data.

discuss

order

nickpsecurity|1 month ago

They mostly do that. They risked legal contamination by using Whisper-derived text and web text which might have gotchas. Other than that, it was a great collection for low-risk training.