item 46104356 (no title)
voiper1 | 3 months ago
Surely there's AI usage that's not morally reprehensible. Models that are trained only on public domain material. For value-add usage, not simply marketing or gamification gimmicks...
qingcharles|3 months ago
How many models are only trained on legal[0] data? Adobe's Firefly model is one commercial model I can think of.
[0] I think the data can be licensed, and not just public domain; e.g. if the creators are suitably compensated for their data to be ingested
Eisenstein|3 months ago
> How many models are only trained on legal[0] data?
None, since 'legal' for AI training is not yet defined, but OLMo is trained on the Dolma 3 dataset, which consists of:
1. Common Crawl
2. GitHub
3. Wikipedia, Wikibooks
4. Reddit (pre-2023)
5. Semantic Scholar
6. Project Gutenberg
* https://arxiv.org/pdf/2402.00159