item 46682454

omneity | 1 month ago
Why not? Run it with the latest vLLM and enable 4-bit quantization with bitsandbytes; it will quantize the original safetensors on the fly and fit your VRAM.

  disiplus | 1 month ago
  Because of how huge GLM 4.7 is: https://huggingface.co/zai-org/GLM-4.7

    omneity | 1 month ago
    Except this is GLM 4.7 Flash, which has 32B total parameters with only 3B active. It should fit in 20 GB of VRAM at 4-bit weight quantization with a decent context window of around 40k, and you can save even more by quantizing the activations and the KV cache to 8-bit.
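The on-the-fly quantization omneity describes maps to vLLM's bitsandbytes in-flight loading. A minimal sketch of the launch command, assuming a recent vLLM with bitsandbytes installed; the model id `zai-org/GLM-4.7-Flash` and the context length are assumptions, check the actual Hugging Face repo before running:

```shell
# Sketch: serve a model with in-flight 4-bit bitsandbytes quantization
# and an fp8 KV cache. Requires: pip install vllm bitsandbytes
# Model id is assumed, not confirmed by the thread.
vllm serve zai-org/GLM-4.7-Flash \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --kv-cache-dtype fp8 \
    --max-model-len 40000
```

`--kv-cache-dtype fp8` covers the 8-bit KV cache suggestion; activation quantization depends on the chosen quantization backend rather than a single flag.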
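The 20 GB claim checks out as back-of-the-envelope arithmetic. A sketch of the estimate; the layer/head numbers are illustrative placeholders, not the real GLM 4.7 Flash config:

```python
# Rough VRAM estimate for a 32B-parameter model at 4-bit weights
# with an 8-bit (1 byte/element) KV cache over a 40k context.
# Layer/head/dim values below are ASSUMED placeholders, not the
# actual GLM 4.7 Flash architecture.

def weight_gb(n_params: float, bits: int) -> float:
    """Memory taken by the quantized weights, in GB."""
    return n_params * bits / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int) -> float:
    """KV cache = 2 (K and V) * layers * kv_heads * head_dim * tokens."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem / 1e9)

weights = weight_gb(32e9, 4)  # 32B params at 4 bits -> 16.0 GB
kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128,
                 context_len=40_000, bytes_per_elem=1)
print(f"weights ~ {weights:.1f} GB, KV cache ~ {kv:.2f} GB")
```

With these placeholder numbers the weights take ~16 GB and the 8-bit KV cache a couple more, which is why the total lands near the 20 GB figure in the comment; only 3B parameters being active helps speed, not resident memory.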