We use WebLLM under the hood for text-to-text generation; the model compression is great and RAM usage is lower. But we're still running more experiments. One thing we've noticed is that some quantized models via MLC sometimes start producing gibberish, so we'll get back to you once we've tested more and know which option is better.
sauravpanda | 1 year ago