I'd like to point out that llama 3.1 is not open source[1] (I was recently made aware of that fact by [2], when it was on the HN front page).
While it's very nice to see a surge of interest in local, "open-weights" LLMs, "open source" is an unfortunate choice of words here, as it glosses over the quite important differences between llama's license model and actual open-source licensing.
The license question does not seem to be addressed at all in the article.
[1]: https://www.llama.com/llama3_1/license/
[2]: https://csvbase.com/blog/14
gptisms|1 year ago
yi: previously non-commercial but Apache 2.0 now?
deepseek: usage policy
larger gemma models: usage policy
databricks: similar to llama: no more than 700 million MAUs, usage policy, additional restrictions on using outputs to train other models
qwen: no more than 100 million MAUs
mistral: non-commercial
command-r: CC BY-NC
starling: CC BY-NC
There are a handful of niche models released under MIT/Apache but the norm is licences similar to or more restrictive than the Llama Community Licence, and I really doubt the situation would be better if Meta wasn't first.
>"open-weights" LLMs
I doubt this is the point you're making, but the training data really isn't useful even if it could be released under a permissive licence. Most models use similar datasets: reddit (no licence afaik, copyright belongs to comment authors), stackoverflow (CC BY-SA), wikipedia (CC BY-SA), Project Gutenberg (public domain?), previously books3 (books under copyright by publishers with more money and lawyers than reddit users), etc, with various degrees of filtering to remove harmful data. You can't do much with that much data unless you have millions of dollars worth of compute lying around, and you can't rebuild llama any more than any other company using the same data has 'rebuilt llama' - all models trained in a similar manner on the same data are going to converge in outputs eventually. Compare with Linux distributions: they all use the same packages, but you're not going to get the same results.
sergiotapia|1 year ago