Note that what they released are delta weights relative to the original LLaMA model. To play around with it, you'll need to grab the original LLaMA 13B weights and apply the deltas.
> We release Vicuna weights as delta weights to comply with the LLaMA model
> license. You can add our delta to the original LLaMA weights to obtain
> the Vicuna weights.
That's what they say, but I just spent 10 minutes searching the Git repo, reading the relevant .py files, and looking at their homepage, and the vicuna-7b-delta and vicuna-13b-delta-v0 files are nowhere to be found. Am I blind, or did they announce a release without actually releasing?
Not a lawyer, but that still feels like dubious territory. I would still be on the hook for acquiring the original download, and Facebook has been issuing DMCA takedown requests against the llama-dl project.
One of my favorite queries for assessing an LLM is: "Is it legal for a man to marry his widow's sister?"
Vicuna-13B: The legality of a man marrying his widow's sister (widow marriage) varies depending on the country and cultural context. In some societies, widow marriage was traditionally practiced or tolerated.
GPT-4: It is not possible for a man to marry his widow's sister, as the term "widow" refers to a deceased man's wife. Therefore, if a man has a widow, he is deceased and therefore cannot marry anyone.
GPT 3.5: No, it is not possible for a man to marry his widow's sister because by definition, a widow is a woman whose spouse has died. If a man is deceased, then he cannot marry anyone, including his widow's sister. This is a logical impossibility and would not be recognized as a valid marriage under any legal system.
Koala: The legality of a man marrying his widow's sister varies by jurisdiction. In some countries, such as the United States, it is generally legal for a man to marry his widow's sister. In other countries, such as the United Kingdom, it is illegal for a man to marry his widow's sister because it is considered incest. It is always important to check the laws and regulations in the jurisdiction in which the marriage will take place to ensure that it is legal.
It would still possibly be legal on the basis that if it's not illegal then it's legal, in the British jurisprudence tradition at least (https://en.wikipedia.org/wiki/Everything_which_is_not_forbid...); namely, it's not the law that impedes it (also, in some places there's posthumous marriage).
Hi! Funnily enough, I couldn't find much on it either, so that's exactly what I've been working on for the past few months, just in case this kind of question got asked.
I've recently opened a GitHub repository which includes information on both AI model series[0] and the frontends you can use to run them[1]. I wrote a Reddit post beforehand that's messier, but a lot more technical[2].
I try to keep them as up to date as possible, but I might've missed something, or my info may not be completely accurate. It's mostly to help people get their feet wet.
The 4-bit quantized version of LLaMA 13B runs on my laptop without a dedicated GPU, and I'd guess the same applies to a quantized Vicuna 13B, but I haven't tried that yet (converted as in this link, but for 13B instead of 7B: https://github.com/ggerganov/llama.cpp#usage ).
The GPT4All LoRA also works; it's given perhaps the most compelling results I've gotten yet on my local machine. I have to try a quantized Vicuna to see how that one goes, but processing the files to get a 4-bit quantized version will take many hours, so I'm a bit hesitant.
PS: converting the 13B LLaMA took my laptop's i7 around 20 hours and required a large swap file on top of its 16GB of RAM.
Feel free to answer back if you're trying any of these things this week (later I might lose track).
That might not be surprising, considering these jailbreaks are written and tested specifically against ChatGPT and ChatGPT alone. This model probably has its own jailbreaks, which would in turn be refused by ChatGPT.
Just when you think Nvidia will go down, something happens that changes it. These days, unless you were into gaming or were a machine-learning dev, integrated graphics were good enough. But now, for the first time in a long time, I'm interested in getting a GPU to run some of these chatbots locally.
As a very occasional gamer who uses an iMac for work, I thought about getting a gaming PC for about 6 years.
Last fall it seemed that all the stars had aligned. The crypto winter and Ethereum's switch to proof of stake meant that GPU prices fell to a reasonable level, I knew I'd have a bit of time to play some games during the holidays, and as soon as Stable Diffusion was first posted on Hacker News I knew that was my excuse and my sign.
So far I think I've spent more time tinkering with the 20 Python environments I have[0] for all the ML projects than playing RDR2.
This model is also censored to the brim; it refuses to answer half of my questions, some of them perfectly legal. It's useless; we already have GPT-4 (and Vicuna is even more censored/guarded).
Alpaca-30B is much better; it will even tell you how to build a nuclear weapon (incorrectly, of course; it's not that smart).
I am waiting for the Coati13B weights; these should work great.
This looks really good for a run-it-on-your-own-hardware model, judging from the examples and sibling comments. I've been working on a pure-AVX2 Rust implementation of LLaMA but was starting to lose interest, waiting for whatever the next hot downloadable model would be; now I want to add this thing to it.
It's actually very impressive. I gave it the task of converting a query and an OpenAPI spec into an API call, and it worked! I've not been successful in getting GPT-3.5 to do this without it rambling on about the reasoning for its decision.
Usually if I want code from the GPT family, I always add "Just show me the code, no extra words or explanation" at the end of the prompt, and it works 99% of the time.
Edit: I just finished converting Vicuna myself and have been doing some light testing; it seems to work in ~80% of cases, not as high a success rate as with GPT, for sure. There's probably a better way of structuring the prompt for Vicuna.
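Roughly, the kind of prompt described above can be sketched like this (the wording and the tiny spec are illustrative, not a tested template):

```python
# Hypothetical sketch of the prompt shape described above: a query plus
# an OpenAPI spec, with the trailing instruction that suppresses the
# model's explanation. All wording here is illustrative.
def build_prompt(query: str, openapi_spec: str) -> str:
    return (
        "Given this OpenAPI spec:\n"
        f"{openapi_spec}\n\n"
        f"Convert this request into an API call: {query}\n"
        "Just show me the code, no extra words or explanation"
    )

prompt = build_prompt("list all users", '{"paths": {"/users": {"get": {}}}}')
print(prompt)
```

The same string can be sent to either model; only the success rate differs.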
I have a universal benchmark for judging how much knowledge a language model stores, and it's asking about the G-FOLD paper (https://www.lpi.usra.edu/meetings/marsconcepts2012/pdf/4193....), because I noticed GPT-3.5 hallucinates when asked about it, whereas GPT-4 is capable of providing a high-level overview.
Is there any way yet to train one of these on my entire online output and correspondence in order to create a hyper-personal “autocomplete” or a me-chatbot? lol
All of the llama derivatives are tainted by Meta's license, which makes commercial use illegal and even personal use dubious.
Is all the training material used for Llama available as open source? Maybe lots of folks can pool their resources and create fully open clean models / weights instead.
> All of the llama derivatives are tainted by Meta's license, which makes commercial use illegal and even personal use dubious.
This is not true if you never agreed to Meta's license. If you haven't, you either can't redistribute the weights or you're completely free to use them as you see fit depending on whether weights are copyrightable (very likely) or not. We'll have to wait for the llama-dl lawsuit to find out for sure.
Is it worth hosting this on an EC2 instance, which might cost ~$1.50 per hour (on demand), rather than using the GPT-3.5 API for this purpose?
What is the breakeven number of queries (~2,000 tokens/query) that would justify hosting such a model?
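A back-of-the-envelope answer, taking the ~$1.50/hour figure above and assuming GPT-3.5's then-current price of about $0.002 per 1K tokens (an assumption worth double-checking):

```python
# Breakeven between a self-hosted instance and the API.
# Assumptions: ~$1.50/hour on-demand EC2 (from the comment above) and
# ~$0.002 per 1K tokens for GPT-3.5 (assumed; verify current pricing).
ec2_cost_per_hour = 1.50
api_cost_per_1k_tokens = 0.002
tokens_per_query = 2000

api_cost_per_query = api_cost_per_1k_tokens * tokens_per_query / 1000
breakeven_queries_per_hour = ec2_cost_per_hour / api_cost_per_query

print(f"API cost per query: ${api_cost_per_query:.4f}")
print(f"Breakeven: ~{breakeven_queries_per_hour:.0f} queries/hour")
```

Under these assumed prices, self-hosting only pays off above roughly 375 queries per hour, ignoring setup time and idle hours.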
The default loader doesn't seem to let you load quantized models, but if you use something like https://github.com/oobabooga/text-generation-webui you can 1) use the model with `--load-in-8bit`, which halves the memory (it then runs on my 24GB consumer card without an issue, and would probably fit on a 16GB card), or 2) use one of the 4-bit quantized models, probably with `anon8231489123/vicuna-13b-GPTQ-4bit-128g --model_type LLaMA --wbits 4 --groupsize 128`, although there have been reports that bitsandbytes has problems with 4-bit performance on some cards: https://github.com/TimDettmers/bitsandbytes/issues/181
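For intuition on why 8-bit halves the footprint, the weight memory is just the parameter count times bytes per parameter (weights only; activations and cache overhead come on top):

```python
# Approximate weight-memory footprint of a 13B-parameter model at
# different precisions (weights only; runtime overhead not included).
n_params = 13e9

gb_fp16 = n_params * 2 / 1e9    # 2 bytes/param   -> ~26 GB
gb_int8 = n_params * 1 / 1e9    # 1 byte/param    -> ~13 GB, fits a 24GB card
gb_int4 = n_params * 0.5 / 1e9  # 0.5 bytes/param -> ~6.5 GB

print(gb_fp16, gb_int8, gb_int4)
```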
I got the 64GB MacBook Pro but am already realizing the 96GB laptop would have made sense. I got it in January, right before the AI craze really lit up; I distinctly remember thinking, who would ever need more than 64GB of RAM…
A large array of uniquely-set floating point values. (AKA "parameters".)
In a language model, a word is put in one end (as a numerical index into a wordlist), then it and the weights are multiplied together, and a new word comes out (again as an index).
Numbers in, numbers out, and a small bit of logic that maps words to numbers and back at either end. ("Encodings".)
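To make that concrete, here's a toy sketch with a made-up three-word vocabulary and hand-picked weights; nothing like a real model's scale, but the same plumbing:

```python
# Toy "numbers in, numbers out" sketch. The vocabulary and weights are
# made up; a real model has tens of thousands of tokens and billions of
# parameters, but the word -> number -> word mapping is the same idea.
vocab = ["the", "cat", "sat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

# One row of "weights" per input word: a score for each candidate next word.
weights = [
    [0.1, 0.8, 0.1],  # after "the": "cat" scores highest
    [0.1, 0.1, 0.8],  # after "cat": "sat" scores highest
    [0.8, 0.1, 0.1],  # after "sat": "the" scores highest
]

def next_word(word: str) -> str:
    scores = weights[word_to_id[word]]  # word in -> numbers
    best = scores.index(max(scores))    # pick the highest score
    return vocab[best]                  # numbers -> word out

print(next_word("the"))  # -> cat
```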
"Training" is the typically expensive process of feeding huge amounts of data into the model, to get it to choose the magic values for its weights that allow it to do useful stuff that looks and feels like that training data.
Something else that can be done with weights is that they can be "fine-tuned", or "tweaked" slightly, to give different overall results out of the model, tailored to some new use case. Often the model gets a new name afterwards.
In this case, what's been released is not actually the weights. It's a set of these tweaks ("deltas"), which are intended to be added to Meta's LLaMA model weights to end up with the final intended LLaMA-based model, called "Vicuna".
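That addition can be sketched in a few lines. Plain Python lists stand in for the real checkpoint tensors here, and the parameter names are made up; the project's own conversion command does this over the actual files:

```python
# Sketch of delta-weight application: both the base model and the delta
# map parameter names to arrays of numbers, and recovering the final
# model is elementwise addition. Lists stand in for real tensors, and
# the parameter names are illustrative only.
def apply_delta(base: dict, delta: dict) -> dict:
    assert base.keys() == delta.keys(), "checkpoints must have matching layers"
    return {
        name: [b + d for b, d in zip(base[name], delta[name])]
        for name in base
    }

llama_weights = {"layer0.weight": [0.10, -0.20, 0.30]}
vicuna_delta  = {"layer0.weight": [0.05,  0.07, -0.10]}

vicuna_weights = apply_delta(llama_weights, vicuna_delta)
print(vicuna_weights)
```

This is also why the deltas alone are useless without the original LLaMA weights.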
Essentially a computer neural network is just a lot of addition (and matrix multiplication) of floating-point numbers. The parameters are the "strength" or "weights" of the connections between neurons on different layers and the "bias" of each neuron. If neuron Alice is connected to neuron Bob, Alice has a value of 0.7, and the weight of Alice's connection to Bob is 0.5, then the value sent from Alice to Bob is 0.35. This value (and the values from all the other incoming connections) is summed and added to the neuron's bias.
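Putting the Alice and Bob numbers into code (the extra incoming values and the bias are made up for illustration):

```python
# The Alice -> Bob example in code: one contribution is value * weight,
# and Bob sums all incoming contributions plus his bias. The extra
# inputs and the bias value are invented for this illustration.
alice_value = 0.7
weight_alice_to_bob = 0.5
contribution = alice_value * weight_alice_to_bob  # 0.35, as in the text

other_contributions = [0.10, -0.20]  # from other (hypothetical) neurons
bias = 0.05

bob_input = contribution + sum(other_contributions) + bias
print(bob_input)
```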
I highly recommend checking out 3Blue1Brown's series on how neural nets, gradient descent, and the dot product (implemented as a matrix multiplication) all tie together: https://www.youtube.com/watch?v=aircAruvnKk
They basically encapsulate what a model has "learned." ML models without their weights are useless because the output is essentially random noise. You then train the model on data, and it changes the weights into numbers that cause the whole thing to work. Training data and processing power are usually very expensive so the resulting weights are valuable.
stcredzero|2 years ago
(I know a vicuna is a llama-like animal.)
andai|2 years ago
https://chat.lmsys.org/?model=koala-13b
stevenhuang|2 years ago
You'd probably need to come up with a new one now though, or confirm knowledge cutoff for the next evaluation :p
takantri|2 years ago
[0] - https://github.com/Crataco/ai-guide/blob/main/guide/models.m...
[1] - https://github.com/Crataco/ai-guide/blob/main/guide/frontend...
[2] - https://old.reddit.com/user/Crataco/comments/zuowi9/opensour...
muyuu|2 years ago
these could be useful:
https://nixified.ai
https://github.com/Crataco/ai-guide/blob/main/guide/models.m... -> https://old.reddit.com/user/Crataco/comments/zuowi9/opensour...
https://github.com/cocktailpeanut/dalai
sillysaurusx|2 years ago
I tried a few from https://www.jailbreakchat.com/ and it refused them all. Interesting.
[0] https://xkcd.com/1987/
adeon|2 years ago
I'll be busy next few days. Heck yeah.
mlboss|2 years ago
https://pastebin.com/urDUsEew
vlugorilla|2 years ago
> This conversion command needs around 60 GB of CPU RAM.
Ok. I don't have that. Has anyone released (or will someone release) the full weights with the deltas applied?
Animats|2 years ago
Everybody's server costs are about to go through the roof.
Jiocus|2 years ago
2 x 32GiB (SO-DIMM DDR4, 3200MHz) can be had for €170, and probably less with a bit of research. It took a bit of faith and a lot of impulse decision-making, as this device* was/is specified for up to 32GiB of RAM only, but it went through.
This is precisely the use case I had in mind
*Lenovo 16ACH6H
swsdsailor|2 years ago
The changes:
- Allow passing in --device="mps", i.e. choices=["cuda", "cpu", "mps"]
- Set kwargs: kwargs = {"torch_dtype": torch.float16}
- Add .to("mps") on line 98: model = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True, **kwargs).to("mps")
- Comment out: raise ValueError(f"Invalid device: {args.device}")
- Change cuda to mps on line 80: if args.device == "mps":
I'm not sure it's working correctly, but at least it's a step. It's told me how to catch a duck, but it often falls into some "renewable energy" sequence. :D