voiper1 | 3 months ago

Surely there's AI usage that's not morally reprehensible.

For example, models that are trained only on public domain material, used for value-add purposes and not simply marketing or gamification gimmicks...

qingcharles|3 months ago

How many models are only trained on legal[0] data? Adobe's Firefly model is one commercial model I can think of.

[0] I think the data can be licensed, and not just public domain; e.g. if the creators are suitably compensated for their data being ingested.

Eisenstein|3 months ago

> How many models are only trained on legal[0] data?

None, since 'legal' for AI training is not yet defined, but OLMo is trained on the Dolma 3 dataset, which consists of:

1. Common Crawl

2. GitHub

3. Wikipedia, Wikibooks

4. Reddit (pre-2023)

5. Semantic Scholar

6. Project Gutenberg

* https://arxiv.org/pdf/2402.00159