I’ve been using the GPTQ 4-bit quantized 13B with the text-generation web UI and it’s been amazing. Probably the closest to ChatGPT I have used so far. I still get an issue where it keeps talking to itself, generating its own prompt and then answering it. Has anyone experienced the same thing?
I've been testing quite a few of these models lately.
For me, the absolute best is still the 65B 4-bit quantized LLaMA model with the correct prompt and parameters, for programming, language, and general questions alike.
I am actually getting about 2 tokens/second with the latest llama.cpp using 16 threads on a 5950x with 64 gb of ram. 16 threads seems to be the sweet spot, any higher and it slows down, any lower and it is less consistent in the time to produce a token.
I am 100% convinced that the AI "market" will be a local thing. Running this and having access to all the information stored in the weights easily and without internet is just so great I think :)
Edit: the responding-to-itself "bug" is most likely an issue with the prompt you issue. The recent llama.cpp has a good starting point in examples/chat-13b.sh.
I am using a modified version of that where I set the 65B model, change the Moscow stuff to Cairo, and the Node.js example to a small C program.
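For context, the `-r`/`--reverse-prompt` flag is what keeps llama.cpp from answering on the user's behalf: generation halts whenever the model emits the user's tag, which addresses the talking-to-itself issue. A minimal launcher in the spirit of examples/chat-13b.sh might look like this (the model path, prompt file, and sampling values are placeholders, not the commenter's exact script):

```shell
# Write a minimal chat launcher modeled on examples/chat-13b.sh.
# Paths and sampling values are placeholders, not the commenter's exact setup.
cat > chat-65b.sh <<'EOF'
#!/bin/sh
./main -m ./models/65B/ggml-model-q4_0.bin \
  -t 16 -c 2048 --temp 0.7 --repeat_penalty 1.2 \
  -i -r "User:" \
  -f prompts/chat.txt
EOF
chmod +x chat-65b.sh
```

The `-t 16` matches the thread sweet spot mentioned above; `-i` keeps the session interactive so control returns to you at each reverse-prompt match.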
I'm in the process of testing the various self-hosted LLMs. I have an M2 MBA laptop and a 5950X w/ 64GB RAM and an RTX 4090 (24GB VRAM).
I've used ChatGPT 3.5 and 4 quite a bit, and have done a bunch of comparisons w/ nat.dev's Playground between a variety of models (claude-instant provides gpt-3.5-turbo level output and is about 3-4X faster; gpt-3.5-turbo, text-davinci-003 to me are about equal and about the cutoff level of where they are generally useful for me - reliability as an end user for summarizations, Q&A, code assistance, etc).
I found all the raw LLaMA variants I could run (up to 30B) to not be very coherent or useful. Pythia, GPT-J, GPT-NeoX, ChatGLM and the other open raw models I found to be much worse than what the various eval scores (PIQA, HellaSwag, WinoGrande, ARC-e, etc.) would suggest. I did a fair amount of playing w/ inference hyper-parameters early on to no avail, but did not do much k-shot learning or proper prompting (like the prompts Scale AI uses for training).
I tried a bunch of other Alpaca/instruction-tuned models and they're better, but IMO still not very good. GPT4All w/ the unfiltered checkpoint was the only one that did OK until I tried Vicuna (13B load-8bit on GPU; I tried Baize but wasn't impressed, and have yet to try Koala, but don't have high expectations). Vicuna does a better job than GPT4All, but I did notice some of the going off the rails/not stopping. It also strongly leans on "as an AI language model..." responses; IMO, any fine-tune based on ChatGPT output really should filter that out, as it really kneecaps the responses.
One surprise is RWKV Raven: while it generally doesn't perform quite as well, it tends to be more lucid, and in some cases it does a significantly better job (ChatRWKV is pretty easy to get going; I can run the v7 14B w/ fp16int8 in about 16GB of VRAM).
The rate of advancement over just a few weeks is really impressive, and it's been really fun catching up on the state of the art in LLMs (I wasn't paying much attention before, despite playing around a bunch w/ SD image generation models previously). I'm still learning, but poking around w/ these "smaller" self-hosted models makes me wonder if there's some threshold (50B+ params?) or other secret sauce that captures the "magic" that gpt-3.5 seems to reach (from benchmarks, LLaMA 65B is supposed to outperform Chinchilla 70B and Gopher 280B, and even match PaLM 540B; gpt-3.5 is ~175-200B, gpt-4 is estimated at 1T parameters).
> I have not experienced that problem, but it sounds both annoying and funny. How often do you encounter it?
A few times a day, but it varies quite a bit.
> Have you been able to tell what causes it?
Shorter prompts sometimes cause it, but I have seen it on longer prompts as well.
> Is there a new version that fixes the issue?
Not that I've seen released, but it is worth checking.
Seems like LLaMA-derived models are flourishing. However, with LLaMA licensed as an academic-only, noncommercial model, what is the path for bringing this to production for for-profit purposes?
The methodology behind Alpaca has proven powerful and is being applied to models with better licensing. It's hard to track lineage, but I think the OpenAssistant models are the most permissive at the moment: they use an openly sourced dataset to build an instruct model on top of Pythia, which is itself a GPT-NeoX-style model trained on a deduplicated version of the famous The Pile dataset.
The problem is verifying the licensing claims for these composed solutions is becoming exceedingly hard.
The Silicon Valley ethos has always been: do it first, worry about legality later. If you go bust, nobody will care. If you stay small, you will be ignored. If you go big, lawyers will figure something out to cut a deal.
> llama is licensed as academic only and noncommercial model
Are weights even copyrightable? I was under the impression that they weren't (although it hasn't been tested, and there's a chance they may run afoul of database rights).
Apple's unified memory should allow running large models like 65B that will not fit on a consumer GPU, but mostly I see people talking about the smaller 7B sizes that can run anywhere.
Vicuna-13B loads and idles at ~26GB RAM usage on an M1 Max/64GB. When answering questions, that grows to around 75GB, and yes, you can feel it (and the machine) slow down significantly when it starts hitting swap. I think realistically you'd want to stick to the 7B model on a 32GB machine (even if you could get the weight deltas to apply correctly).
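As a rough sanity check on what fits where, weights-only size is just parameter count times bits per weight (activations, KV cache, and any fp16 intermediates come on top, which is why Vicuna-13B can balloon well past its raw weight size). A back-of-the-envelope shell helper, using integer GB:

```shell
# rough weights-only footprint in GB: params (billions) * bits per weight / 8
approx_gb() { echo $(( $1 * $2 / 8 )); }

approx_gb 65 4    # 65B at 4-bit -> 32 (fits 64GB unified memory, not a 24GB GPU)
approx_gb 13 16   # 13B at fp16  -> 26
approx_gb 7 4     # 7B at 4-bit  -> 3 (integer math; true value is ~3.5)
```

The 65B/4-bit number lines up with the unified-memory point above: ~32GB of weights clears a 64GB Mac but no consumer GPU.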
I think you have it backwards. The Python (i.e. Hugging Face, etc.) implementations of transformers are the complex ones, with dependency hell so bad there's even a layer of package-manager/env hell. This version of FastChat (there are 2) required a particular commit of the Hugging Face libs for quite a while, something that only changed recently. And it'll happen again in the future. Python just hides this complexity... until it doesn't. Like beautiful but rapidly rotting fruit.
llama.cpp will remain a two-command project (git clone https://github.com/ggerganov/llama.cpp; make -j) that will compile easily and run on anything. No external deps to pin to a particular commit (one that will only have a lifetime of some months) as things change rapidly.
That said, the changes to the ggml weights format over the last 2 weeks were annoying, but now that the mmap-style weights are settled on there should be less re-converting. In that sense Hugging Face wins: it has only two incompatible weights formats, while llama.cpp's ggml has had 3.
Did they release the merged weights, yet? I'd love to try this model.
Afaict from the docs, you still need to request the original Llama weights from Meta (or get ahold of them another way), then apply the diff-weights requiring 60GB RAM?
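The delta-application step looks roughly like the sketch below. FastChat ships an apply_delta module for this, but the exact flag names and the delta repo id here are from memory of the README and should be treated as assumptions; check the README for your FastChat version:

```shell
# Sketch of merging the Vicuna delta onto base LLaMA weights.
# Flag names, paths, and the delta id are assumptions, not verified against
# a specific FastChat release.
cat > apply-delta.sh <<'EOF'
#!/bin/sh
python3 -m fastchat.model.apply_delta \
  --base /path/to/llama-13b-hf \
  --target /path/to/output/vicuna-13b \
  --delta lmsys/vicuna-13b-delta-v0
EOF
chmod +x apply-delta.sh
```

The ~60GB RAM figure presumably comes from holding both the base weights and the delta in fp16 at once; on a smaller machine, swap (or a low-memory conversion path, if your FastChat version offers one) is the fallback.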
MacBook with M1 chip here. Python installed with Homebrew.
tried to install with:
pip install fschat
then tried to run it with:
python3 -m fastchat.serve.cli --model-name vicuna-7b --device mps --load-8bit
got this:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/opt/homebrew/lib/python3.11/site-packages/fastchat/serve/cli.py", line 9, in <module>
from transformers import AutoTokenizer, AutoModelForCausalLM, LlamaTokenizer
ModuleNotFoundError: No module named 'transformers'
so I installed transformers, then tried again with the same command and got:
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1126, in _get_module
return importlib.import_module("." + module_name, self.__name__)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.11/3.11.2_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
File "<frozen importlib._bootstrap>", line 1128, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 940, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/opt/homebrew/lib/python3.11/site-packages/transformers/models/__init__.py", line 15, in <module>
from . import (
File "/opt/homebrew/lib/python3.11/site-packages/transformers/models/mt5/__init__.py", line 29, in <module>
from ..t5.tokenization_t5 import T5Tokenizer
File "/opt/homebrew/lib/python3.11/site-packages/transformers/models/t5/tokenization_t5.py", line 26, in <module>
from ...tokenization_utils import PreTrainedTokenizer
File "/opt/homebrew/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 26, in <module>
from .tokenization_utils_base import (
File "/opt/homebrew/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 74, in <module>
from tokenizers import AddedToken
File "/opt/homebrew/lib/python3.11/site-packages/tokenizers/__init__.py", line 80, in <module>
from .tokenizers import (
ImportError: dlopen(/opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so, 2): no suitable image found. Did find:
/opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so: mach-o, but wrong architecture
/opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so: mach-o, but wrong architecture
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/opt/homebrew/lib/python3.11/site-packages/fastchat/serve/cli.py", line 9, in <module>
from transformers import AutoTokenizer, AutoModelForCausalLM, LlamaTokenizer
File "<frozen importlib._bootstrap>", line 1231, in _handle_fromlist
File "/opt/homebrew/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1116, in __getattr__
module = self._get_module(self._class_to_module[name])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1128, in _get_module
raise RuntimeError(
RuntimeError: Failed to import transformers.models.auto because of the following error (look up to see its traceback):
dlopen(/opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so, 2): no suitable image found. Did find:
/opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so: mach-o, but wrong architecture
/opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so: mach-o, but wrong architecture
You need to use transformers from the main branch instead of the PyPI version, because LLaMA support was only recently added. According to the repo's README, you need to install transformers with: pip3 install git+https://github.com/huggingface/transformers
My one takeaway after playing with both chat and text-completion modes is that gpt4all 7B 4-bit stays on the chat rails (doesn't start taking the role of the user, or spewing fine-tuning boilerplate) much better than Vicuna 7B 4-bit. In text completion they're about the same, but I'd still prefer the vanilla LLaMA 7B in that case.
wejick|2 years ago
I'm certainly interested in doing so.
messe|2 years ago
Because a MacBook with 96GB of RAM is cheaper than a GPU with anything close to that.
dchuk|2 years ago
"Vicuna-13B This conversion command needs around 60 GB of CPU RAM."
Does this mean I simply cannot run that model at all? Or will it spill into HD swap or something to build the model weights and just take forever?
UncleOxidant|2 years ago
I'm wondering if anyone is torrenting these Vicuna-13B weights?
youssefabdelm|2 years ago
E.g. try asking it "Who is Tyler Volk?"
Then try asking GPT-4 "Who is Tyler Volk?"
Then check who he is online.
psychphysic|2 years ago
Neat nonetheless, but hardly a standout in my opinion.
Everything is state of the art at the moment, I guess, so I can't criticise that too much.
superkuh|2 years ago
There are a couple of versions of the gpt4all fine-tuned LLaMA 7B, and my favorite is the unfiltered one (gpt4all-lora-unfiltered-quantized.bin). https://github.com/nomic-ai/gpt4all#try-it-yourself
weichiang|2 years ago
"What if the Internet had been invented during the Renaissance period?"
Check out their responses: https://imgur.com/a/mPrdZ1W
More questions here: https://vicuna.lmsys.org/eval/
Note: not an apples-to-apples comparison, but that's the model checkpoint I found on their git repo.