I'd like to point out that llama 3.1 is not open source[1] (I was recently made aware of that fact by [2], when it was on the HN front page).
While it's very nice to see a surge of interest in local, "open-weights" LLMs, "open source" is an unfortunate choice of words here, as it glosses over the quite important differences between llama's license model and actual open-source licensing.
The license question does not seem to be addressed at all in the article.
[1]: https://www.llama.com/llama3_1/license/
[2]: https://csvbase.com/blog/14
gptisms|1 year ago
yi: previously non-commercial but Apache 2.0 now?
deepseek: usage policy
larger gemma models: usage policy
databricks: similar to llama: no more than 700 million MAUs, usage policy, additional restrictions on using outputs to train other models
qwen: no more than 100 million MAUs
mistral: non-commercial
command-r: CC BY-NC
starling: CC BY-NC
There are a handful of niche models released under MIT/Apache but the norm is licences similar to or more restrictive than the Llama Community Licence, and I really doubt the situation would be better if Meta wasn't first.
>"open-weights" LLMs
I doubt this is the point you're making, but the training data really isn't useful even if it could be released under a permissive licence. Most models use similar datasets: reddit (no licence afaik, copyright belongs to comment authors), stackoverflow (CC BY-SA), wikipedia (CC BY-SA), Project Gutenberg (public domain?), previously books3 (books under copyright by publishers with more money and lawyers than reddit users), etc, with various degrees of filtering to remove harmful data. You can't do much with that much data unless you have millions of dollars worth of compute lying around, and you can't rebuild llama any more than any other company using the same data has 'rebuilt llama' - all models trained in a similar manner on the same data are going to converge in outputs eventually. Compare with Linux distributions: they all use the same packages, but you're not going to get the same results.
sergiotapia|1 year ago