> Today, we are excited to release Mistral NeMo, a 12B model built in collaboration with NVIDIA. Mistral NeMo offers a large context window of up to 128k tokens. Its reasoning, world knowledge, and coding accuracy are state-of-the-art in its size category. As it relies on standard architecture, Mistral NeMo is easy to use and a drop-in replacement in any system using Mistral 7B.
> We have released pre-trained base and instruction-tuned checkpoints under the Apache 2.0 license to promote adoption for researchers and enterprises. Mistral NeMo was trained with quantisation awareness, enabling FP8 inference without any performance loss.
So that's... uniformly an improvement at just about everything, right? Large context, permissive license, should have good perf. The one thing I can't tell is how big 12B is going to be (read: how much VRAM/RAM is this thing going to need). Annoyingly and rather confusingly for a model under Apache 2.0, https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407 refuses to show me files unless I login and "You need to agree to share your contact information to access this model"... though if it's actually as good as it looks, I give it hours before it's reposted without that restriction, which Apache 2.0 allows.
You could consider the improvement in model performance a bit of a cheat - they beat other models "in the same size category" that have 30% fewer parameters.
I still welcome this approach. 7B models seem like a dead end in terms of reasoning and generalization. They are annoyingly close to statistical parrots, a world away from the moderate reasoning you get in 70B models. Any use case where that level is enough can increasingly be filled by even smaller models, so chasing slightly larger models to get a bit more "intelligence" might be the right move.
Easy head math: parameter count times parameter size, plus 20-40% for inference slop space. Anywhere from 8-40GB of VRAM required depending on the quantization level used.
If you want to be lazy: roughly a gigabyte per billion parameters at 8-bit, so 7B ≈ 7GB of VRAM and 12B ≈ 12GB, but with quantization you might be able to get by with ~6-8GB. So any 16GB MacBook could run it (but not much else).
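To make that head math concrete, here's a minimal sketch of the estimate for a 12B model across common precisions (the 20-40% overhead figure is the rule of thumb from the comment above, not a measured number):

```python
# Back-of-the-envelope VRAM estimate for a 12B model:
# weights = parameter count * bytes per parameter,
# plus the 20-40% "slop space" for KV cache, activations and runtime buffers.
PARAMS_BILLIONS = 12.0

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16/bf16": 2.0,
    "int8/fp8": 1.0,
    "4-bit": 0.5,
}

for fmt, bytes_per_param in BYTES_PER_PARAM.items():
    weights_gb = PARAMS_BILLIONS * bytes_per_param   # weights alone, in GB
    low, high = weights_gb * 1.2, weights_gb * 1.4   # +20-40% inference overhead
    print(f"{fmt:>9}: weights ~{weights_gb:4.1f} GB, total ~{low:4.1f}-{high:4.1f} GB")
```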
> Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken, that was trained on over more than 100 languages, and compresses natural language text and source code more efficiently than the SentencePiece tokenizer used in previous Mistral models.
Does anyone have a good answer why everyone went back to SentencePiece in the first place? Byte-pair encoding (which is what tiktoken uses: https://github.com/openai/tiktoken) was shown to be a more efficient encoding as far back as GPT-2 in 2019.
The SentencePiece library also implements byte-pair encoding. That's what the LLaMA models use, and the original Mistral models were essentially a copy of LLaMA 2.
SentencePiece is not a different algorithm from WordPiece or BPE, despite its naming.
One of the main pulls of the SentencePiece library was that its pre-tokenization is less reliant on whitespace and therefore more adaptable to non-Western languages.
SentencePiece is a tool and library for training and using tokenizers, and supports two algorithms: Byte-Pair Encoding (BPE) and Unigram. You could almost say it is the library for tokenizers, as it has been standard in research for years now.
Tiktoken is a library which only supports BPE. It has also become synonymous with the tokenizer used by GPT-3, ChatGPT and GPT-4, even though this is actually just a specific tokenizer included in tiktoken.
What Mistral is saying here (in marketing speak) is that they trained a new BPE model on data that is more balanced multilingually than their previous BPE model. It so happens that they trained one with SentencePiece and the other with tiktoken, but that really shouldn't make any difference in tokenization quality or compression efficiency. The switch to tiktoken probably had more to do with latency, or something similar.
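To make "compression efficiency" concrete: a tokenizer trained on a broader mix tends to need fewer tokens for the same text. Here's a minimal sketch using two publicly available encodings from tiktoken (Tekken itself isn't distributed through tiktoken, so GPT-2's and GPT-4's encodings stand in, and the sample strings are arbitrary):

```python
import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")            # ~50k vocab, English-heavy training mix
cl100k = tiktoken.get_encoding("cl100k_base")   # ~100k vocab, broader multilingual mix

samples = {
    "english": "Mistral NeMo offers a large context window of up to 128k tokens.",
    "code":    "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)",
    "korean":  "안녕하세요, 오늘 날씨가 정말 좋네요.",
}

# Fewer tokens for the same text = better compression = more text per context window.
for name, text in samples.items():
    print(f"{name:>7}: gpt2={len(gpt2.encode(text)):3d} tokens, "
          f"cl100k_base={len(cl100k.encode(text)):3d} tokens")
```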
> Mistral NeMo comes packaged as an NVIDIA NIM inference microservice, offering performance-optimized inference with NVIDIA TensorRT-LLM engines.
> *Designed to fit on the memory of a single NVIDIA L40S, NVIDIA GeForce RTX 4090 or NVIDIA RTX 4500 GPU*, the Mistral NeMo NIM offers high efficiency, low compute cost, and enhanced security and privacy.
> The model was trained using Megatron-LM, part of NVIDIA NeMo, with 3,072 H100 80GB Tensor Core GPUs on DGX Cloud, composed of NVIDIA AI architecture, including accelerated computing, network fabric and software to increase training efficiency.
These big models are getting pumped out like crazy; that is the business of these companies. But basically, it feels like private industry just figured out how to scale up a scalable process (deep learning), except it required not $M research grants but $BB "research grants"/funding. The scaling laws seem to be fun to play with, tweaking more interesting things out of these models and finding cool "emergent" behavior as billions of data points get correlated.
But pumping out models and putting artifacts on HuggingFace, is that a business? What are these models being used for? New ones appear at a decent clip.
There are a lot of models coming out, but in my view, most don't really matter or move the needle. There are the frontier models which aren't open (like GPT-4o) and then there are the small "elite" local LLMs like Llama3 8B. The rest seem like they are mostly about manipulating benchmarks. Whenever I try them, they are worse in actual practice than the Llama3 models.
I don’t see any indication this beats Llama3 70B, but it still requires a beefy GPU, so I’m not sure what the use case is. I have an A6000 which I use for a lot of things; Mixtral was my go-to until Llama3, then I switched over.
If you could run this on, say, a stock CPU, that would increase the use cases dramatically, but if you still need a 4090 I’m either missing something or this is useless.
I believe that if Mistral is serious about advancing open source, they should consider sharing the corpus used for training their models, at least the base models' pretraining data.
I’m AI stupid. Does anyone know if training on multiple languages provides “cross-over” — so training done in German can be utilized when answering a prompt in English? I once went through various Wikipedia articles in a couple languages and the differences were interesting. For some reason I thought they’d be almost verbatim (forgetting that’s not how Wikipedia works!) and while I can’t remember exactly I felt they were sometimes starkly different in tone and content.
I have to say, the experience of trying to sign up for Nvidia Enterprise so you can try the "NIM" packaged version of this model is just icky and awful now that I've gotten used to actually free and open models and software. It feels much nicer and more free to be able to clone llama.cpp and wget a .gguf model file from huggingface without any registration at all. Especially since it has now been several hours since I signed up for the Nvidia account and the website still says "Your License Should be Active Momentarily | We're setting up your credentials to download NIMs."
I really don't get Nvidia's thinking with this. They basically have a hardware monopoly. I shelled out the $4,000 or so to buy two of their 4090 GPUs. Why are they still insisting on torturing me with jumping through these awful hoops? They should just be glad that they're winning and embrace freedom.
Also, I don't think you can use NIM packages in production without a subscription, and I wasn't able to find the cost without signing up. The NIM package for Mistral NeMo isn't available yet anyway.
I still don’t understand the business model of releasing open source gen AI models. If this took 3072 H100s to train, why are they releasing it for free? I understand they charge people when renting from their platform, but why permit people to run it themselves?
Pardon me if this is a dumb question, but is it possible for me to download these models onto my computer (I have a 1080ti and a [2|3]070ti) and expose some sort of API interface? That way I can write programs that call this API, which I find appealing.
EDIT: This is a 1W light bulb moment for me, thank you!
Justine Tunney (of redbean fame) is actively working on getting LLMs to run well on CPUs, where RAM is cheap. If successful this would eliminate an enormous bottleneck to running local models. If anyone can do this, she can. (And thank you to Mozilla for financially supporting her work). See https://justine.lol/matmul/ and https://github.com/mozilla-Ocho/llamafile
I’d probably check https://ollama.com/library?q=Nemo in a couple of days. My guess is that by then ollama will have support for it. And you can then run the model locally on your machine with ollama.
If you're on a Mac, check out LM Studio. It's a UI that lets you load and interact with models locally. You can also wrap your model in an OpenAI-compatible API and interact with it programmatically.
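As a sketch of what "interact with it programmatically" looks like: tools like LM Studio and Ollama can serve a local OpenAI-compatible endpoint, so a program just POSTs to it. The port and model name below are assumptions (Ollama defaults to port 11434, LM Studio to 1234; use whatever your local server and model are actually called):

```python
import json
import urllib.request

# Assumed local OpenAI-compatible endpoint (Ollama's default shown;
# LM Studio typically serves at http://localhost:1234/v1 instead).
URL = "http://localhost:11434/v1/chat/completions"

payload = {
    "model": "mistral-nemo",  # placeholder name; use whatever model your server has loaded
    "messages": [
        {"role": "user", "content": "In one sentence, what does a 128k context window buy me?"}
    ],
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())

print(reply["choices"][0]["message"]["content"])
```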
I wonder why Mistral et al don't prepare GGUF versions of these for launch day?
If I were them I'd want to be the default source of the versions of my models that people use, rather than farming that out to whichever third party races to publish the GGUF (and other formats) first.
Some of the major vendors _do_ create the GGUFs for their models, but often they have the wrong parameter settings, need changes in the inference code, or don't include the correct prompt template. We (i.e. Ollama) have our own conversion scripts and we try to work with the model vendors to get everything working ahead of time, but unfortunately Mistral doesn't usually give us a heads up before they release.
llama.cpp is still under development and they sometimes come out with breaking changes or new quantization methods, and it can be a lot of work to keep up with these changes as you publish more models over time. It's easier to just publish a standard float32 safetensors that works with PyTorch, and let the community deal with other runtimes and file formats.
If it's a new architecture, then there's also additional work needed to add support in llama.cpp, which means more dev time, more testing, and potentially losing the surprise of the model release if the development work has to be done out in the open.
Interested in the new base model for fine-tuning. Despite Llama3 being a better instruct model overall, it’s been highly resistant to fine-tuning, either owing to some bugs or to being trained on so much data (there's ongoing debate about this in the community). Mistral’s base models are still best in class for a small model you can specialize.
I find it interesting how coding/software development still appears to be the one category that these most popular model providers release specialised models for. Where are the finance or legal models from Mistral or Meta or OpenAI?
Perhaps it's just confirmation bias, but programming really does seem to be the ideal use case for LLMs in a way that other professions just haven't been able to crack. Compared to other types of work, it's relatively straightforward to tell whether code is "correct" or not.
Interesting that the benchmarks they show have it outperforming Gemma 2 9B and Llama 3 8B, but it does a lot worse on my NYT Connections benchmark (5.1 vs 16.3 and 12.3). The new GPT-4o mini also does better at 14.3. It's just one benchmark though, so looking forward to additional scores.
For the benchmarks, it depends on how you interpret them. The other models are quite popular, so many people have a starting point. If you use them regularly you can assess: "just 3% better on some benchmark, 80% to 83, at the cost of almost twice the inference time and base RAM requirement, but a 16x context window, and for commercial usage..." and at the end, "for my use case, is it worth it?"
"It significantly outperforms existing models smaller or similar in size."
is a statement that goes in that direction and would allow the comparison of a 1.7T param model with a 7b one
> Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken, that was trained on over more than 100 languages, and compresses natural language text and source code more efficiently than the SentencePiece tokenizer used in previous Mistral models.
From Mistral's page about Tekken:
> Our newest tokenizer, tekken, uses the Byte-Pair Encoding (BPE) with Tiktoken.
Does that mean that Mistral found that BPE is more efficient than unigram models?
Because otherwise, I don't understand why AI companies keep using BPE for their token sets. Unigram methods lead to more legible tokens, fewer glitch tokens, fewer super-long outlier tokens, etc.
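If you want to poke at the BPE-vs-Unigram question yourself, SentencePiece can train both algorithms on the same corpus so you can compare the resulting tokens directly. A minimal sketch (corpus.txt is a placeholder for whatever plain-text file you train on):

```python
import sentencepiece as spm

# Train two small tokenizers on the same text, one per algorithm.
for algo in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",        # placeholder: any reasonably large plain-text file
        model_prefix=f"tok_{algo}",
        model_type=algo,
        vocab_size=8000,
    )

bpe = spm.SentencePieceProcessor(model_file="tok_bpe.model")
uni = spm.SentencePieceProcessor(model_file="tok_unigram.model")

text = "Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken."
print("BPE    :", bpe.encode(text, out_type=str))   # compare legibility and token count
print("Unigram:", uni.encode(text, out_type=str))
```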
I just managed to make Mistral NeMo 4bit QLoRA finetuning fit in under 12GB, so it fits in a free Google Colab with a Tesla T4 GPU! VRAM is shaved by 60% and finetuning is also 2x faster! Colab: https://colab.research.google.com/github/unslothai/studio/bl...
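For anyone curious what 4-bit QLoRA setup involves under the hood, here is a generic transformers/peft/bitsandbytes sketch (not the linked Colab, which uses Unsloth's own loader; the hyperparameters are illustrative only):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-Nemo-Instruct-2407"

# Load the frozen base weights in 4-bit NF4; compute runs in fp16 (T4-friendly).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Attach small trainable LoRA adapters; only these are updated during finetuning,
# which is what keeps VRAM usage in the single-consumer-GPU range.
lora = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```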
Does anyone know whether the 128K is input tokens only? There are a lot of models that have a large context window for input but a small output context. If this actually has 128k tokens shared between input and output, that would be a game changer.
I just checked Hugging Face and the model files download is about 25GB, but in a comment below someone mentioned it's an FP8-quantized model. I'm trying to understand how the quantization affects the model (and RAM) size. Can someone please enlighten me?
Sure. The talk about 8bit refers to quantization-aware training. Pretty common in image models these days to reduce the impact of quantization on accuracy.
Typically this might mean that you simulate an 8bit forward pass to ensure that the model is robust to quantization ‘noise’. You still use FP16/32 for backward pass & weight updates for numerical stability.
It’s just a way to optimize the model in anticipation of future quantization. The experience of using an 8-bit Nemo quant should more closely mirror that of using the full-fat bf16 model compared to if they hadn’t used QAT.
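A minimal sketch of that simulated 8-bit forward pass (generic fake quantization with a straight-through estimator; illustrative, not Mistral's actual training code):

```python
import torch

def fake_quant(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric integer quantization in the forward pass only."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    x_q = torch.round(x / scale).clamp(-qmax - 1, qmax) * scale
    # Straight-through estimator: the forward pass sees quantized values,
    # but gradients flow back as if the rounding never happened.
    return x + (x_q - x).detach()

w = torch.randn(4, 4, requires_grad=True)
loss = fake_quant(w).sum()
loss.backward()        # gradients exist despite the non-differentiable round()
print(w.grad)
```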
I think it should also run well on a 36GB MacBook Pro, or probably a 24GB MacBook Air.
https://lifearchitect.ai/models-table/
The same thing happened with gemma-27b, where they compared it to all the 7-9b models.
It seems like an easy way to boost benchmarks while coming off as "small" at first glance.
open-mistral-7b is 25¢/M tokens; open-mistral-nemo-2407 is 30¢/M tokens.
https://mistral.ai/technology/#pricing
7-9B never felt like an ideal size. The more useful tiers:
- 3B for CPU inference or running on edge devices.
- 20-30B for maximizing single consumer GPU potential.
- 70B+ for those who can afford it.