One thing to consider is that this version is a new architecture, so it'll take time for llama.cpp to get updated, similar to how it was with Qwen Next.
There are a bunch of 4-bit quants at the GGUF link, and 0xSero has some smaller ones too. They might still be too big, though, and you'll need to un-GPU-poor yourself.
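As a rough back-of-the-envelope for whether a quant fits your VRAM, you can estimate file size from parameter count and bits per weight. This is a sketch with hypothetical numbers (the 30B parameter count and the ~10% overhead factor are assumptions, not figures for this model); the bits-per-weight values are typical for common GGUF quant types.

```python
def quant_size_gb(n_params: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough size estimate for a quantized model: params * bits / 8,
    plus ~10% headroom (a guess) for embeddings, scales, and metadata."""
    return n_params * bits_per_weight / 8 / 1e9 * overhead

# Hypothetical 30B-parameter model at common GGUF quant levels:
for name, bpw in [("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"{name}: ~{quant_size_gb(30e9, bpw):.1f} GB")
```

Compare the estimate against your VRAM minus a few GB for KV cache and context; anything over that means CPU offload or a smaller quant.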
a_e_k|1 month ago
https://huggingface.co/models?other=base_model:quantized:zai...
Probably as:
https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
homarp|1 month ago
issue to follow: https://github.com/ggml-org/llama.cpp/issues/18931