top | item 43171828

Well, technically even DeepSeek is not as OSS as OLMo or Open Euro, because they didn't open the data.

echelon|1 year ago

We're 2/3rds of the way there.

We need:

1. Open datasets for pretraining, including the tooling used to label and maintain them.

2. Open model, training, and inference code. Ideally with the research paper that guides the understanding of the approach and results. (Typically we have the latter, but I've seen some cases where that's omitted.)

3. Open pretrained foundation model weights, fine tunes, etc.

Open AI = Data + Code + Paper + Weights

buyucu|1 year ago

Opening data is an invitation to lawsuits. That is why even the most die-hard open source enthusiasts are reluctant. It is also why people train a model and generate data with it, rather than sharing the original datasets.

These datasets are huge, and it's practically impossible to make sure they are clean of illegal or embarrassing stuff.

johnla|1 year ago

Sounds like a job for AI.

sdesol|1 year ago

I understand the reasoning, and I hope future legislation says something like "If you can't produce the data, you can't charge more than this for it." In other words, LLM producers would have to treat their product as a commodity, priced only on compute resources plus some overhead.

tway223|1 year ago

For understandable reasons

chvid|1 year ago

It is pirated material, or material that breaks various terms of service, but as I understand it, it is the stuff you can see in Anna's Archive plus a bunch of "artificial" training data from queries to OpenAI's ChatGPT and other LLMs.