> Imagine if Linux published only a binary without the codebase. Or published the codebase without the compiler used to make the binary. This is where we are today.
This was such a helpful way to frame the problem! Something felt off about the "open source models" out there; this highlights the problem incredibly well.
I think a better analogy is firmware binary blobs in the Linux kernel, or VM bytecodes.
The LLM inference engine (architecture implementation) is like a kernel driver that loads a firmware binary blob, or a virtual machine that loads bytecode. The inference engine is open source. The problem is that the weights (firmware blobs, VM bytecodes) are opaque: you don't have the means to reproduce them.
The Linux community has long argued that drivers that load firmware blobs are cheating: they don't count as open source.
Still, the "open source" LLMs are more open than "API-gated" LLMs. It's a step in the right direction, but I hope we don't stop there.
>Or published the codebase without the compiler used to make the binary.
A slightly off-topic complaint, but too often I have seen tutorials for open source stuff (cough, OpenGL, cough) where they don't provide the proper commands to compile and link everything required to build it. Figuring it out makes the "getting started" portion even more tedious.
Open Source and Free Software weren't formulated to deal with the need for such gargantuan amounts of data and compute.
Can the public compete? What percentage of the technical public could we expect to participate, and how much data, compute, and data quality improvement could they bring to the table? I suspect that large corporations are at least an order of magnitude advantaged economically.
I think the process of data acquisition isn't so clear-cut. Take CERN as an example: they release loads of data from various experiments under the CC0 license [1]. This isn't just a few small datasets for classroom use; we're talking big-league data, like the entire first run data from LHCb [2].
On their portal, they don't just dump the data and leave you to it. They've got guides on analysis and the necessary tools (mostly open source stuff like ROOT [3] and even VMs). This means anyone can dive in. You could potentially discover something new or build on existing experiment analyses. This setup, with open data and tools, ticks the boxes for reproducibility. But does it mean people need to recreate the data themselves?
Ideally, yeah, but realistically, while you could theoretically rebuild the LHC (since most technical details are public), it would take an army of skilled people, billions of dollars, and years to do it.
This contrasts with open source models, where you can retrain models using the data to get the weights. But getting hold of the data, and the cost of reproducing the weights, is usually prohibitive. I get that CERN's approach might seem to counter this, but remember, they're not releasing the raw data (which is mostly noise) but a more refined version; if they did, good luck downloading several petabytes of it. For training something like an LLM, though, you might need the whole dataset, which in many cases has its own problems with copyright, etc.
You're right that most people have neither the need nor the ability to recreate the data themselves. But the same applies to using open-source software in the first place: most people who use OSS have neither the need nor the ability to compile the software from source themselves. But the whole point of OSS is that that source is available for those who want to use it, whether to study it, to diagnose a bug, or something else. I think the same is true for the LHC's technical details or a model's training data: most people won't recreate it at home, but it's important to make it available, and even someone who can't rebuild the whole thing themselves might spot an important bug or omission by going through the data collection details.
I think the biggest issue is with publishing the datasets.
Then people and companies would discover that it's full of their copyrighted content and sue.
I wouldn't be surprised if they slurped the whole of Z-Library et al. into their models, or if Google used its entire Google Books dataset.
Somewhat unrelated, but here is a thought experiment...
If a human knows a song "by heart" (imperfectly), it is not considered copyright infringement.
If a LLM knows a song as part of its training data, then it is copyright infringement.
But what if you developed a model with no prepared training data and forced it to learn from its own sensory inputs? Instead of shoveling it bits, you played it this particular song and it (imperfectly) recorded the song with its sensory input device, the same way humans listen to and experience music.
Is the latter learning model infringing on the copyright of the song?
The Open Source Initiative, who maintain the Open Source Definition, have been running a whole series over the past year to collect input from all sorts of stakeholders about what it means for an AI to be open source. I was lucky enough to participate in an afternoon long session with about a hundred other people last year at All Things Open.
Can you summarize? I'm reading https://deepdive.opensource.org/wp-content/uploads/2023/02/D... but it seems to tackle too many questions when I'm really only interested in what criteria to use when deciding whether (for example) Stable Diffusion is open source or not.
Anyway, to go on a tangent, some day maybe with zero knowledge proofs we will be able to prove that a given pretrained model was indeed the result of training using a given dataset, in a way that can be verified vastly cheaper than training the model itself from scratch. (This same technique could also be applied to other things like verifying if a binary was compiled from a given source with a given compiler, hopefully verified in a cheaper way than compiling and applying all optimizations from scratch).
If this ever materializes, then we can just demand proofs.
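Until such proofs exist, a much weaker building block is already available: hash commitments. A publisher can commit to a dataset by publishing its digest, and anyone who later obtains the data can check that it matches. This is only a sketch of the verification idea, not a zero-knowledge proof of training (the function and shard contents below are invented for illustration):

```python
import hashlib

def dataset_digest(shards):
    """Hash a sequence of data shards into one commitment digest."""
    h = hashlib.sha256()
    for shard in shards:
        # Hash each shard's own digest so shard boundaries matter.
        h.update(hashlib.sha256(shard).digest())
    return h.hexdigest()

# Publisher side: commit to the training data at release time.
shards = [b"shard-0: web crawl subset", b"shard-1: code corpus"]
published = dataset_digest(shards)

# Verifier side: recompute from the downloaded copy and compare.
assert dataset_digest(shards) == published          # same data: matches
assert dataset_digest([b"tampered"]) != published   # different data: fails
```

Note that this only proves the data you hold matches what was committed to; it says nothing about whether the weights were actually trained on that data, which is the part a ZK proof of training would have to cover.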
Applying the term "open source" to AI models is a bit more nuanced than to software. Many consider reproducibility the bar to get over to earn the label "open source."
For an AI model that means the model itself, the dataset, and the training recipe (e.g. process, hyperparameters) often also released as source code. With that (and a lot of compute) you can train the model to get the weights.
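As an illustration, a released "training recipe" need not be elaborate. Even a minimal, hypothetical config like the following (every name and value here is invented for the example, not taken from any real release) would let others rerun training:

```python
# A hypothetical minimal training recipe; all values are invented
# for illustration, not taken from any real model release.
recipe = {
    "data": {
        "dataset": "example-corpus-v1",   # which dataset snapshot to use
        "tokenizer": "example-bpe-32k",
        "shuffle_seed": 1234,
    },
    "model": {
        "n_layers": 12,
        "d_model": 768,
        "n_heads": 12,
    },
    "optimizer": {
        "name": "adamw",
        "lr": 3e-4,
        "warmup_steps": 2000,
        "batch_size_tokens": 1_048_576,
    },
}

# The point is that the recipe is data: it can be versioned, diffed,
# and audited alongside the code and the dataset.
print(recipe["optimizer"]["lr"])
```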
Many companies are using "open source" as marketing rather than actually releasing open source software and models. No data? Not open source. Special license cutting out self-hosting or competitive use? Not open source.
"the project does not benefit from the OSS feedback loop" It's not like you can submit PRs to training data that fix specific issues the way you can submit bug fixes, so I'm skeptical you would see much of a feedback loop.
"it’s hard to verify that the model has no backdoors (eg sleeper agents)" Again, given the size of the datasets and the opaque way training works, I am skeptical that anyone would be able to tell if there is a backdoor in the training data.
"impossible to verify the data and content filter and whether they match your company policy" I don't totally know what this means. For one, you can/probably should apply company policies to the model outputs, which you can do without access to training data. Is the idea that every company could/should filter input data and train their own models?
"you are dependent on the company to refresh the model" At the current cost, this is probably already true for most people.
"A true open-source LLM project — where everything is open from the codebase to the data pipeline — could unlock a lot of value, creativity, and improve security." I am overall skeptical that this is true in the case of LLMs. If anything, I think this creates a larger surface for bad actors to attack.
You can grep for bad words. What you can't do (unless hoops are jumped through) is verify that the weights came from the same dataset. You can set the same random seed and still get different results; the calculations are not that deterministic. (https://pytorch.org/docs/stable/notes/randomness.html#reprod...).
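One root cause of that nondeterminism is that floating-point addition is not associative, so parallel reductions that sum in a different order can give different results even with identical seeds. A pure-Python sketch of the effect:

```python
# Floating-point addition is not associative: summing the same numbers
# in a different order can change the result. GPU kernels that reduce
# in a nondeterministic order inherit exactly this behavior.
xs = [1e16, 1.0, -1e16]

left_to_right = sum(xs)                  # (1e16 + 1.0) rounds back to 1e16
reordered     = sum([1e16, -1e16, 1.0])  # the big terms cancel first

print(left_to_right)  # 0.0 -- the 1.0 was absorbed by rounding
print(reordered)      # 1.0
assert left_to_right != reordered
```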
>I am overall skeptical that this is true in the case of LLMs
This skepticism seems reasonable. EleutherAI have documentation to reproduce training (https://github.com/EleutherAI/pythia#reproducing-training). So far I haven't seen it leading to anything.
Lots of arxiv papers I've seen complain about time and budget constraints even regarding finetunes, forget pretraining.
The company policy/backdoors issues are possibly like the whole Getty Images debacle. If a company contracts with a provider or just uses a given model themselves, they may have no idea that it's taking from a ton of copyrighted work AND with enough of a trail where the infringed party could probably win a suit.
Backdoors I'd think of as some sneaky words (maybe not even English) that all of a sudden cause it to emit NSFW outputs. Microsoft's short-lived @TayandYou comes to mind (but I don't think anyone's making that mistake again, where multiple users' sessions are pooled).
I don't agree, and the analogy is poor. One can do the things he lists with a trained model. Having the data is basically a red herring. I wish this got more attention. Open/free software is about exercising freedoms, and they all can be exercised if you've got the model weights and code.
But one of the four freedoms is being able to modify/tweak things, including the model. If all you have is the model weights, then you can't easily tweak the model. The model weights are hardly the preferred form for making changes to the model.
The equivalent would be someone who gives you only the binary of LibreOffice. That's perfectly fine for editing documents and spreadsheets, but suppose you want to fix a bug in LibreOffice? Just having the binary is going to make it quite difficult to fix things.
Similarly, suppose you find that the model has a bias in terms of labeling African Americans as criminals, or women as lousy computer programmers. If all you have is the model weights of the trained model, how easily can you fix the model? And how does that compare with running emacs on the LibreOffice binary?
My main concern is that if all you have are weights you're stuck hoping for the benevolence of whatever organization is actually able to train the model with their secret dataset.
When they get bought by Oracle and progress slows to a crawl because it's not profitable enough to interest them, you can't exactly do a LibreOffice. Or they can turn around and say "license change, future versions may not be used for <market that controlling company would like to dominate>" and now you're stuck with whatever old version of the model while they steamroll your project with newer updates.
Open weights are worth nothing in terms of long term security of development, they're a toy that you can play with but you have no assurances of anything for the future.
> The “source code” for a work means the preferred form of the work for making modifications to it.
-- gplv3
These AI/ML models are interesting in that the weights are derived from something else (training set), but if you're modifying them you don't need that. Lots of "how to do fine-tuning" tutorials floating around, and they don't need access to the original training set.
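The mechanics bear that out: additive fine-tuning methods in the LoRA style need only the released weights, not the data that produced them. A toy pure-Python sketch of the idea (the matrices and rank are made up for illustration):

```python
# LoRA-style update: leave the released weight matrix W untouched and
# learn a low-rank correction B @ A from your own data. No access to
# the original training set is needed -- only W itself.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

W = [[1.0, 0.0],
     [0.0, 1.0]]          # "released" 2x2 weight matrix (identity, for the toy)

B = [[0.5], [0.0]]        # learned rank-1 factors standing in for the
A = [[0.0, 1.0]]          # adapter weights a fine-tune would produce

W_adapted = add(W, matmul(B, A))   # W + B @ A
print(W_adapted)  # [[1.0, 0.5], [0.0, 1.0]]
```

The adapted weights differ from the originals only by the low-rank term, which is why adapter files are small and shareable independently of the base model.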
Are there any true open-source LLM models, where all the training data is publicly-available (with a compatible license) and the training software can reproduce bit-identical models?
Is training nondeterministic? I know LLM outputs are purposely nondeterministic.
>Are there any true open-source LLM models, where all the training data is publicly-available (with a compatible license)
Mamba has a version, trained on the publicly available SlimPajama. RedPajama-INCITE was trained on the non-slimmed version of the dataset (it's only one dataset).
I'm not sure if training scripts are available.
Pythia definitely has scripts. However it was trained on the pile, so you have to find books3 on your own.
Also I believe LLM360 is an explicit attempt to do it with llama.
>Is training nondeterministic?
Correct. The Torch documentation has a section on the reproducibility of training.
I think the answer is in the name. The "source" has always been what you need to build the thing. In this context I think we can agree that the thing is the model. Based on that the model is no more open source than a binary program.
I'll venture to say the majority of these "open access models" are meant to serve as advertisements of capabilities (either of hardware, research, or techniques) and nothing more, MPT being one of the most obvious examples.
Many don't offer any information, some do offer information but provide no new techniques and just threw a bunch of compute and some data to make a sub-par model that shows up on a specific leaderboard.
Everyone is trying to save a card up their sleeve so they can sell it. And showing up on scoreboards is a great advertisement.
Publish your data and prepare to get vilified by professional complainers because the data doesn't conform to their sensibilities. Lots of downside with very little of the opposite.
> if you can’t reproduce the model then it’s not truly open-source.
Open-source means open source, it does not make reproducibility guarantees. You get the code and you can use the code. Pushed to the extreme this is like saying Chromium is not open-source because my 4GB laptop can't compile it.
Getting training code for GPT-4 under MIT would be mostly useless, but it would still be open source.
> Pushed to the extreme this is like saying Chromium is not open-source because my 4GB laptop can't compile it.
Not really, an analog would be if Chromium shipped LLVM IR as its source but no one could get any version of LLVM to output the exact same IR no matter what configurations they tried, and thus any "home grown" Chromium was a little off.
95% of the value comes from the model being freely downloadable and analyzable (i.e. not obfuscated/crippled post-hoc). Sure there is some difference, but as researchers I care far more about open access than making every "gnuight" on the internet happy that we used the right terminology.
I would argue that while technically correct, it is not what most people really care about. What they care about is the following:
1. Can I download it?
2. Can I run it on my hardware?
3. Can I modify it?
4. Can I share my modifications with others?
If the answers to those questions are in the affirmative, then I think most people consider it open enough, and it is a huge step for freedom compared to models such as OpenAI's.
It's a great observation. People simply want their free stuff.
The potential challenge arises in the future. Today's models will probably look weak compared to the models we'll have in 1, 3, or 10 years, which means that today's models will likely be irrelevant before long. Every competitive "open" model today is tied closely to a controlling organization, whether it's Meta, Mistral.AI, TII, 01.AI, etc.
If they simply choose not to publish the next iteration of their model and follow OpenAI's path that's the end of the line.
A truly open model could have some life beyond that of its original developer/organization. Of course it would still take great talent, updated datasets, and serious access to compute to keep a model moving forward and developing but if this is done in the "open" community then we'd have some guarantee for the future.
Imagine if Linux were actually owned by a for-profit corporation that could simply choose not to release a future version, AND it were not possible for another organization to fork and carry on an "open" Linux?
Some people want more than that, e.g. they want to fix their printer but the driver is closed source, so they start the GNU project and the broader free software movement, responsible for almost all software innovation for decades.
Point #3 is the issue. If I get a binary of some software with a permissive license, I technically could patch that binary to modify some functionality, but I'd really rather have the source code instead.
Similarly, if I have a LLM model with a permissive license, I technically could fine-tune it to modify its behavior, but for some kinds of modifications I'd really rather re-run (parts of) the training differently.
"Can it be trusted?" is the question many people will care about as awareness of the risks grows. If this question can be answered without publishing the source, fine, but that would probably mean that the publisher must be liable for damages from the model's output.
ssgodderidge|2 years ago
fzliu|2 years ago
As much as I appreciate Mis(x)tral, I would've loved it even more if they released code for gathering data.
FooBarWidget|2 years ago
nullc|2 years ago
Zuiii|2 years ago
Android is not open source.
jncfhnb|2 years ago
This analogy is bad. Models are unlike code bases in this way.
unknown|2 years ago
[deleted]
MiddleEndian|2 years ago
stcredzero|2 years ago
elashri|2 years ago
[1] https://opendata.cern.ch/docs/terms-of-use
[2] https://opendata.cern.ch/docs/lhcb-releases-entire-run1-data...
[3] https://root.cern/
lmm|2 years ago
albert180|2 years ago
zelon88|2 years ago
anticorporate|2 years ago
https://deepdive.opensource.org/
I encourage you to go check out what's already being done here. I promise it's way more nuanced than anything that is going to fit in a tweet.
nextaccountic|2 years ago
Here's a study on that
https://montrealethics.ai/experimenting-with-zero-knowledge-...
https://dl.acm.org/doi/10.1145/3576915.3623202
And here is another
https://eprint.iacr.org/2023/1174
mgreg|2 years ago
darrenBaldwin03|2 years ago
dbish|2 years ago
tqi|2 years ago
Trapais|2 years ago
nick238|2 years ago
andy99|2 years ago
https://www.marble.onl/posts/considerations_for_copyrighting...
tytso|2 years ago
wlesieutre|2 years ago
tbrownaw|2 years ago
cpeterso|2 years ago
Trapais|2 years ago
beardyw|2 years ago
declaredapple|2 years ago
pabs3|2 years ago
https://salsa.debian.org/deeplearning-team/ml-policy/
nathanasmith|2 years ago
ramesh31|2 years ago
belval|2 years ago
camgunz|2 years ago
stcredzero|2 years ago
emadm|2 years ago
Der_Einzige|2 years ago
edoardo-schnell|2 years ago
fragmede|2 years ago
robblbobbl|2 years ago
deadbabe|2 years ago
[deleted]
unknown|2 years ago
[deleted]
RcouF1uZ4gsC|2 years ago
mgreg|2 years ago
camgunz|2 years ago
PeterisP|2 years ago
ivan_gammel|2 years ago