
State-of-the-Art Chatbot, Vicuna-7B, now runs on MacBook with GPU acceleration

126 points| weichiang | 2 years ago |twitter.com

84 comments


syntaxing|2 years ago

I’ve been using the GPTQ 4 bit quantized 13B with Text generation web UI and it’s been amazing. Probably the closest to ChatGPT I have used so far. I still get an issue where it keeps on talking to itself by generating its own prompt and then answering it. Has anyone experienced the same thing?

tyfon|2 years ago

I've been testing quite a few of these models lately. For me, the absolute best is still the 65B 4-bit quantized llama model with the correct prompt and parameters, for programming, language, and general questions alike.

I am actually getting about 2 tokens/second with the latest llama.cpp using 16 threads on a 5950X with 64 GB of RAM. 16 threads seems to be the sweet spot: any higher and it slows down, any lower and it is less consistent in the time to produce a token.
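The fact that a 65B model quantized to 4 bits fits in 64 GB checks out with back-of-the-envelope arithmetic. A quick sketch (it ignores the KV cache, quantization scales, and runtime buffers, so real usage runs somewhat higher):

```python
# Rough memory estimate for quantized model weights.
# Ignores KV cache, scales/zero-points, and runtime buffers.
def quantized_weight_gb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e9

print(quantized_weight_gb(65e9, 4))  # ~32.5 GB, fits in 64 GB of RAM
print(quantized_weight_gb(13e9, 4))  # ~6.5 GB
```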

I am 100% convinced that the AI "market" will be a local thing. Running this and having easy access to all the information stored in the weights, without internet, is just so great I think :)

Edit: the responding-to-itself "bug" is most likely an issue with the prompt you issue. The recent llama.cpp has a good starting point in examples/chat-13b.sh. I am using a modified version of that where I set the 65B model, change the moscow stuff to cairo, and swap the node.js question for a small C program.

lhl|2 years ago

I'm in the process of testing the various self-hosted LLMs. I have an M2 MBA laptop and a 5950X w/ 64GB RAM and an RTX 4090 (24GB VRAM).

I've used ChatGPT 3.5 and 4 quite a bit, and have done a bunch of comparisons w/ nat.dev's Playground between a variety of models (claude-instant provides gpt-3.5-turbo level output and is about 3-4X faster; gpt-3.5-turbo, text-davinci-003 to me are about equal and about the cutoff level of where they are generally useful for me - reliability as an end user for summarizations, Q&A, code assistance, etc).

I found all the raw LLaMA variants I could run (up to 30B) to not be very coherent or useful. pythia, gpt-j, gpt-neox, chatglm and the other open raw models I found to be much worse than what the various eval scores (PIQA, HellaSwag, WinoGrande, ARC-e, etc.) would suggest. I did a fair amount of playing w/ inference hyper-parameters early on to no avail, but did not do much k-shot learning or proper prompting (like the ones Scale AI uses for training).

I tried a bunch of other Alpaca/instruction-tuned models and they're better, but IMO still not very good. GPT4All w/ the unfiltered checkpoint was the only one that did OK until I tried Vicuna (13B, --load-8bit, on GPU; I tried Baize but wasn't impressed, and have yet to try Koala, but don't have high expectations). Vicuna does a better job than GPT4All, but I did notice some of the going off the rails/not stopping. It also leans strongly on "as an AI language model..." responses - IMO, any fine-tune based on ChatGPT output really should filter that out; it really kneecaps the responses.

One surprise is RWKV Raven: while it generally doesn't perform quite as well, it tends to be more lucid and in some cases does a significantly better job (ChatRWKV is pretty easy to get going; I can run the v7 14B w/ fp16i8 in about 16GB of VRAM).

The rate of advancement over just a few weeks is really impressive, and it's been really fun catching up on the state of the art in LLMs (I wasn't paying much attention before, despite playing around a bunch w/ SD image generation models previously). I'm still learning, but poking around w/ these "smaller" self-hosted models makes me wonder if there's some threshold (50B+ params?) or other secret sauce that captures the "magic" that gpt-3.5 seems to reach (from benchmarks, LLaMA 65B is supposed to outperform Chinchilla 70B and Gopher 280B, and even match PaLM 540B - gpt-3.5 is ~175-200B, gpt-4 is estimated at 1T parameters).

brookst|2 years ago

I have not experienced that problem, but it sounds both annoying and funny.

How often do you encounter it? A few times a day, but it varies quite a bit.

Have you been able to tell what causes it? Shorter prompts sometimes cause it, but I have seen it on longer prompts as well.

Is there a new version that fixes the issue? Not that I've seen released, but it is worth checking.

siraben|2 years ago

Haven't used the text generation web UI, but if you're using the CLI, use the "reverse prompt" option to hand control back to the user.

  ./bin/main -i --interactive-first -r '### Human:' -t 8 -n 512 --instruct -m ./models/vicuna-13B/ggml-vicuna-13b-4bit.bin --color
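If the front-end doesn't expose a reverse prompt, the same effect can be had in post-processing - a minimal, hypothetical Python sketch (the marker strings assume Vicuna's "### Human:"/"### Assistant:" template; adjust for your prompt format):

```python
# Hypothetical post-processing sketch: cut a completion off at the first
# role marker so the model can't keep "answering itself". Marker strings
# are an assumption based on Vicuna's prompt template.
def truncate_at_markers(text, markers=("### Human:", "### Assistant:")):
    cut = len(text)
    for m in markers:
        i = text.find(m)
        if i != -1:
            cut = min(cut, i)
    return text[:cut].rstrip()

reply = "Paris is the capital of France.\n### Human: What about Spain?"
print(truncate_at_markers(reply))  # drops the self-generated follow-up
```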

taf2|2 years ago

I got this, and after I redownloaded the model/ggml file it was fixed... could it be some corruption in the model file?

UncleOxidant|2 years ago

I've gotten that with alpaca.cpp. It'll start asking itself questions and then answer them.

wejick|2 years ago

Seems like llama-derived models are flourishing. However, with llama licensed as an academic-only, noncommercial model, what is the path to bringing this to production for for-profit purposes?

I'm certainly interested in doing so.

avereveard|2 years ago

The methodology for alpaca has proven powerful and is being applied to models with better licensing. It's hard to track lineage, but I think the openassistant models are the most permissive at the moment: they use an openly sourced set of data to build an instruct model on top of pythia, which is itself a gptneox model trained on a deduplicated version of the famous The Pile dataset.

The problem is verifying the licensing claims for these composed solutions is becoming exceedingly hard.

ReptileMan|2 years ago

The Silicon Valley ethos has always been: do it first, worry about legality later. If you go bust - nobody will care. If you stay small - you will be ignored. If you go big - lawyers will figure something out to cut a deal.

alwayslikethis|2 years ago

Copilot style. Train a distilled model based on it, and now it's a new model unencumbered by copyright.

messe|2 years ago

> llama is licensed as academic only and noncommercial model

Are weights even copyrightable? I was under the impression that they weren't (although it hasn't been tested, and there's a chance they may run afoul of database rights).

tric|2 years ago

Why is there so much focus on running GPT models on Mac OS? Is there something special about Apple's new chip, or Mac OS?

19h|2 years ago

Unified memory allows both CPU and GPU to use the same memory, effectively giving a MacBook with 96GB of memory 96GB of VRAM (minus OS overhead obv).

wmf|2 years ago

Apple's unified memory should allow running large models like 65B that will not fit on a consumer GPU, but mostly I see people talking about the smaller 7B sizes that can run anywhere.

matwood|2 years ago

The shared RAM and neural engine make for an interesting/powerful platform if people are willing to port to it.

messe|2 years ago

> Why is there so much focus on running GPT models on Mac OS?

Because a MacBook with 96GB of RAM is cheaper than a GPU with anything close to that.

nickthegreek|2 years ago

I can run the 30B 4-bit model on my M2 Air that has 24GB of RAM.

dchuk|2 years ago

So if I have a 32GB RAM Macbook Pro, and the instructions say this:

"Vicuna-13B This conversion command needs around 60 GB of CPU RAM."

Does this mean I simply cannot run that model at all? Or will it dig into disk swap or something to merge the model weights and just take forever?

acchow|2 years ago

Can someone explain why computing a delta needs to hold the entire model at once? Can't it just do one layer at a time?
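In principle it can be done tensor-by-tensor. A minimal sketch of the idea, with plain Python dicts standing in for checkpoint shards (the helper name and layout are illustrative, not FastChat's actual code - a real implementation would stream one tensor at a time from disk, e.g. per safetensors entry):

```python
# Sketch of applying a weight delta one tensor at a time instead of
# holding both full models in memory. Dicts stand in for on-disk
# checkpoint shards; the caller would write each merged tensor back
# to disk, so peak memory stays around one tensor, not one model.
def merge_streaming(base_tensors, delta_tensors):
    """Yield (name, merged_tensor) pairs, one tensor at a time."""
    for name, base in base_tensors.items():
        delta = delta_tensors[name]
        merged = [b + d for b, d in zip(base, delta)]
        yield name, merged

base = {"layer0.w": [1.0, 2.0], "layer1.w": [3.0, 4.0]}
delta = {"layer0.w": [0.5, -0.5], "layer1.w": [0.0, 1.0]}
merged = dict(merge_streaming(base, delta))
print(merged["layer0.w"])  # [1.5, 1.5]
```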

FLT8|2 years ago

Vicuna-13B loads and idles at ~26GB RAM usage on a M1Max/64GB. When answering questions, that grows to around 75GB, and yes, you can feel it (and the machine) slow down significantly when it starts hitting swap. I think realistically you'd be wanting to stick to the 7B model on a 32G machine (even if you could get the weight deltas to apply correctly).
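Those numbers line up with simple arithmetic: fp16 weights take 2 bytes per parameter. A quick sketch (ignoring activations, KV cache, and runtime overhead):

```python
# Rough sketch: fp16 weights take 2 bytes per parameter, so a 13B model
# idles around 26 GB before any activations or KV cache. Figures ignore
# runtime overhead.
def fp16_weight_gb(n_params):
    return n_params * 2 / 1e9

print(fp16_weight_gb(13e9))  # ~26 GB
print(fp16_weight_gb(7e9))   # ~14 GB, more comfortable on a 32GB machine
```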

UncleOxidant|2 years ago

I just reached that step on my Linux laptop which has 32GB of RAM. I'm about to give it a try anyway, but I'm not hopeful based on that comment.

I'm wondering if anyone is torrenting these Vicuna-13B weights?

GaggiX|2 years ago

Someone really needs to write a script that does this without loading both entire models into memory.

MMMercy2|2 years ago

You can try the smaller 7B version.

youssefabdelm|2 years ago

So far I think what these models lack is knowledge of people and other things, especially less-famous ones. And probably a ton more.

E.g. try asking it "Who is Tyler Volk?"

Then try asking GPT-4 "Who is Tyler Volk?"

Then check who he is online.

circuit10|2 years ago

Probably because the parameter count is way lower so it's less able to memorize things

psychphysic|2 years ago

The language also feels quite unnatural.

Neat nonetheless, but hardly a standout in my opinion.

Everything is state of the art at the moment, I guess, so I can't criticise that too much.

ThorsBane|2 years ago

This is cool, but how can we get Facebook’s fingers out of the pie with open source weights?

zhisbug|2 years ago

No llama.cpp nor any compilation complexity. Run with two Python commands!

superkuh|2 years ago

I think you have it backwards. The Python (i.e. huggingface, etc.) implementations of transformers are the complex ones, with dependency hell so bad there's even a layer of package-manager/env hell on top. This version of fastchat (there are 2) required a particular commit of the huggingface libs for quite a while - something that only changed recently. And it'll happen again in the future. Python just hides this complexity... until it doesn't. Like beautiful but rapidly rotting fruit.

llama.cpp will remain a simple two-line project (git clone https://github.com/ggerganov/llama.cpp, make -j) that will compile easily and run on anything. No external deps to pin to a particular commit (with a lifetime of only a few months) as things change rapidly.

That said, the changes in the ggml weights format over the last 2 weeks were annoying, but now that the mmap-style weights are settled on, there should be less re-converting. In that sense huggingface wins: it only has two incompatible weights formats; llama.cpp's ggml has had 3.

sottol|2 years ago

Did they release the merged weights, yet? I'd love to try this model.

Afaict from the docs, you still need to request the original Llama weights from Meta (or get ahold of them another way), then apply the diff-weights requiring 60GB RAM?

bawana|2 years ago

MacBook with M1 chip here. Python installed with Homebrew. Tried to install with:

  pip install fschat

then tried to run it with:

  python3 -m fastchat.serve.cli --model-name vicuna-7b --device mps --load-8bit

got this:

  Traceback (most recent call last):
    File "<frozen runpy>", line 198, in _run_module_as_main
    File "<frozen runpy>", line 88, in _run_code
    File "/opt/homebrew/lib/python3.11/site-packages/fastchat/serve/cli.py", line 9, in <module>
      from transformers import AutoTokenizer, AutoModelForCausalLM, LlamaTokenizer
  ModuleNotFoundError: No module named 'transformers'

so I did this:

  pip install transformers

tried again:

  python3 -m fastchat.serve.cli --model-name vicuna-7b --device mps --load-8bit

got:

  Traceback (most recent call last):
    File "/opt/homebrew/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1126, in _get_module
      return importlib.import_module("." + module_name, self.__name__)
    File "/opt/homebrew/Cellar/python@3.11/3.11.2_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/importlib/__init__.py", line 126, in import_module
      return _bootstrap._gcd_import(name[level:], package, level)
    File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
    File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
    File "<frozen importlib._bootstrap>", line 1128, in _find_and_load_unlocked
    File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
    File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
    File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
    File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
    File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
    File "<frozen importlib._bootstrap_external>", line 940, in exec_module
    File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
    File "/opt/homebrew/lib/python3.11/site-packages/transformers/models/__init__.py", line 15, in <module>
      from . import (
    File "/opt/homebrew/lib/python3.11/site-packages/transformers/models/mt5/__init__.py", line 29, in <module>
      from ..t5.tokenization_t5 import T5Tokenizer
    File "/opt/homebrew/lib/python3.11/site-packages/transformers/models/t5/tokenization_t5.py", line 26, in <module>
      from ...tokenization_utils import PreTrainedTokenizer
    File "/opt/homebrew/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 26, in <module>
      from .tokenization_utils_base import (
    File "/opt/homebrew/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 74, in <module>
      from tokenizers import AddedToken
    File "/opt/homebrew/lib/python3.11/site-packages/tokenizers/__init__.py", line 80, in <module>
      from .tokenizers import (
  ImportError: dlopen(/opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so, 2): no suitable image found. Did find:
    /opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so: mach-o, but wrong architecture
    /opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so: mach-o, but wrong architecture

The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "<frozen runpy>", line 198, in _run_module_as_main
    File "<frozen runpy>", line 88, in _run_code
    File "/opt/homebrew/lib/python3.11/site-packages/fastchat/serve/cli.py", line 9, in <module>
      from transformers import AutoTokenizer, AutoModelForCausalLM, LlamaTokenizer
    File "<frozen importlib._bootstrap>", line 1231, in _handle_fromlist
    File "/opt/homebrew/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1116, in __getattr__
      module = self._get_module(self._class_to_module[name])
    File "/opt/homebrew/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1128, in _get_module
      raise RuntimeError(
  RuntimeError: Failed to import transformers.models.auto because of the following error (look up to see its traceback):
  dlopen(/opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so, 2): no suitable image found. Did find:
    /opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so: mach-o, but wrong architecture
    /opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so: mach-o, but wrong architecture

zhwu|2 years ago

You need to use the transformers from the main branch instead of the PyPI version, because llama support was only recently added. According to the repo's readme, you need to install transformers with:

  pip3 install git+https://github.com/huggingface/transformers

rnosov|2 years ago

It looks like you're on python 3.11 which has some issues with Pytorch. Downgrade to python 3.10 and try running it again.

Casteil|2 years ago

Anyone here who's used both this and GPT4All? Any thoughts/input on how they compare?

superkuh|2 years ago

My one takeaway after playing with both chat mode and text completion mode is that gpt4all 7B 4-bit stays on the chat rails (doesn't start taking the role of the user, or spewing fine-tuning boilerplate) much better than vicuna 7B 4-bit. In text completion they're about the same, but I'd still prefer the vanilla llama 7B in that case.

There are a couple versions of gpt4all fine-tuned llama 7B and my favorite is the unfiltered one (gpt4all-lora-unfiltered-quantized.bin). https://github.com/nomic-ai/gpt4all#try-it-yourself

weichiang|2 years ago

I asked GPT4All one of Vicuna's benchmark questions:

"What if the Internet had been invented during the Renaissance period?"

Check out their responses: https://imgur.com/a/mPrdZ1W More questions here: https://vicuna.lmsys.org/eval/

Note: not an apples-to-apples comparison, but that's the model checkpoint I found on their git repo.