Opening data is an invitation to lawsuits. That is why even the most die-hard open-source enthusiasts are reluctant to release it. It is also why people train a model and generate data with it, rather than sharing the original datasets.
I understand the reasoning, and I hope future legislation basically says: "If you can't produce the data, you can't charge more than this for it." LLM producers would then have to treat their product as a commodity that can only be priced on compute resources plus some overhead.
It is pirated material, or material that breaks various terms of service, but as I understand it, it is the stuff you can see in Anna's Archive plus a bunch of "artificial" training data from queries to OpenAI's ChatGPT and other LLMs.
echelon|1 year ago
We need:
1. Open datasets for pretraining, including the tooling used to label and maintain them.
2. Open model, training, and inference code. Ideally with the research paper that guides the understanding of the approach and results. (Typically we have the latter, but I've seen some cases where that's omitted.)
3. Open pretrained foundation model weights, fine-tunes, etc.
Open AI = Data + Code + Paper + Weights
buyucu|1 year ago
These datasets are huge, and it's practically impossible to make sure they are clean of illegal or embarrassing stuff.
johnla|1 year ago
sdesol|1 year ago
tway223|1 year ago
chvid|1 year ago