> Today, we are excited to release Mistral NeMo, a 12B model built in collaboration with NVIDIA. Mistral NeMo offers a large context window of up to 128k tokens. Its reasoning, world knowledge, and coding accuracy are state-of-the-art in its size category. As it relies on standard architecture, Mistral NeMo is easy to use and a drop-in replacement in any system using Mistral 7B.
> We have released pre-trained base and instruction-tuned checkpoints under the Apache 2.0 license to promote adoption for researchers and enterprises. Mistral NeMo was trained with quantisation awareness, enabling FP8 inference without any performance loss.
So that's... uniformly an improvement at just about everything, right? Large context, permissive license, should have good perf. The one thing I can't tell is how big 12B is going to be (read: how much VRAM/RAM is this thing going to need). Annoyingly and rather confusingly for a model under Apache 2.0, https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407 refuses to show me files unless I login and "You need to agree to share your contact information to access this model"... though if it's actually as good as it looks, I give it hours before it's reposted without that restriction, which Apache 2.0 allows.
You could consider the improvement in model performance a bit of a cheat - they beat other models "in the same size category" that have 30% fewer parameters.
I still welcome this approach. 7B models seem like a dead end in terms of reasoning and generalization. They are annoyingly close to statistical parrots, a world away from the moderate reasoning you get in 70B models. Any use case where that level is enough can increasingly be filled by even smaller models, so chasing slightly larger models to get a bit more "intelligence" might be the right move.
Easy head math: parameter count times parameter size, plus 20-40% for inference slop space. Anywhere from 8-40GB of VRAM required depending on the quantization level used.
If you want to be lazy: roughly a gigabyte per billion parameters at 8-bit, so 7B ≈ 7GB of VRAM and 12B ≈ 12GB, but with quantization you might be able to get by with ~6-8GB. So any 16GB MacBook could run it (but not much else).
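To make that head math concrete, here's a minimal sketch of the estimate for a 12B model across common precisions (the 20-40% overhead figure is the rule of thumb from the comment above, not a measured number):

```python
# Back-of-the-envelope VRAM estimate for a 12B model:
# weights = parameter count * bytes per parameter,
# plus the 20-40% "slop space" for KV cache, activations and runtime buffers.
PARAMS_BILLIONS = 12.0

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16/bf16": 2.0,
    "int8/fp8": 1.0,
    "4-bit": 0.5,
}

for fmt, bytes_per_param in BYTES_PER_PARAM.items():
    weights_gb = PARAMS_BILLIONS * bytes_per_param   # weights alone, in GB
    low, high = weights_gb * 1.2, weights_gb * 1.4   # +20-40% inference overhead
    print(f"{fmt:>9}: weights ~{weights_gb:4.1f} GB, total ~{low:4.1f}-{high:4.1f} GB")
```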
> Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken, that was trained on over more than 100 languages, and compresses natural language text and source code more efficiently than the SentencePiece tokenizer used in previous Mistral models.
Does anyone have a good answer why everyone went back to SentencePiece in the first place? Byte-pair encoding (which is what tiktoken uses: https://github.com/openai/tiktoken) was shown to be a more efficient encoding as far back as GPT-2 in 2019.
The SentencePiece library also implements byte-pair encoding. That's what the LLaMA models use, and the original Mistral models were essentially a copy of LLaMA 2.
SentencePiece is not a different algorithm from WordPiece or BPE, despite its naming.
One of the main pulls of the SentencePiece library was that its pre-tokenization is less reliant on whitespace and therefore more adaptable to non-Western languages.
SentencePiece is a tool and library for training and using tokenizers, and supports two algorithms: Byte-Pair Encoding (BPE) and Unigram. You could almost say it is the library for tokenizers, as it has been standard in research for years now.
Tiktoken is a library which only supports BPE. It has also become synonymous with the tokenizer used by GPT-3, ChatGPT and GPT-4, even though this is actually just a specific tokenizer included in tiktoken.
What Mistral is saying here (in marketing speak) is that they trained a new BPE model on data that is more balanced multilingually than their previous BPE model. It so happens that they trained one with SentencePiece and the other with tiktoken, but that really shouldn't make any difference in tokenization quality or compression efficiency. The switch to tiktoken probably had more to do with latency, or something similar.
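To make "compression efficiency" concrete: a tokenizer trained on a broader mix tends to need fewer tokens for the same text. Here's a minimal sketch using two publicly available encodings from tiktoken (Tekken itself isn't distributed through tiktoken, so GPT-2's and GPT-4's encodings stand in, and the sample strings are arbitrary):

```python
import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")            # ~50k vocab, English-heavy training mix
cl100k = tiktoken.get_encoding("cl100k_base")   # ~100k vocab, broader multilingual mix

samples = {
    "english": "Mistral NeMo offers a large context window of up to 128k tokens.",
    "code":    "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)",
    "korean":  "안녕하세요, 오늘 날씨가 정말 좋네요.",
}

# Fewer tokens for the same text = better compression = more text per context window.
for name, text in samples.items():
    print(f"{name:>7}: gpt2={len(gpt2.encode(text)):3d} tokens, "
          f"cl100k_base={len(cl100k.encode(text)):3d} tokens")
```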
> Mistral NeMo comes packaged as an NVIDIA NIM inference microservice, offering performance-optimized inference with NVIDIA TensorRT-LLM engines.
> *Designed to fit on the memory of a single NVIDIA L40S, NVIDIA GeForce RTX 4090 or NVIDIA RTX 4500 GPU*, the Mistral NeMo NIM offers high efficiency, low compute cost, and enhanced security and privacy.
> The model was trained using Megatron-LM, part of NVIDIA NeMo, with 3,072 H100 80GB Tensor Core GPUs on DGX Cloud, composed of NVIDIA AI architecture, including accelerated computing, network fabric and software to increase training efficiency.
These big models are getting pumped out like crazy; that is the business of these companies. But basically, it feels like private industry just figured out how to scale up a scalable process (deep learning), except it required not $M research grants but $BB "research grants"/funding. The scaling laws seem to be fun to play with, tweaking more interesting things out of these models and finding cool "emergent" behavior as billions of data points get correlated.
But pumping out models and putting artifacts on HuggingFace, is that a business? What are these models being used for? New ones appear at a decent clip.
There are a lot of models coming out, but in my view, most don't really matter or move the needle. There are the frontier models which aren't open (like GPT-4o) and then there are the small "elite" local LLMs like Llama3 8B. The rest seem like they are mostly about manipulating benchmarks. Whenever I try them, they are worse in actual practice than the Llama3 models.
I don’t see any indication this beats Llama3 70B, but it still requires a beefy GPU, so I’m not sure what the use case is. I have an A6000 which I use for a lot of things; Mixtral was my go-to until Llama3, then I switched over.
If you could run this on, say, a stock CPU, that would increase the use cases dramatically, but if you still need a 4090 I’m either missing something or this is useless.
I believe that if Mistral is serious about advancing open source, they should consider sharing the corpus used for training their models, at least the base models' pretraining data.
I’m AI stupid. Does anyone know if training on multiple languages provides “cross-over” — so training done in German can be utilized when answering a prompt in English? I once went through various Wikipedia articles in a couple languages and the differences were interesting. For some reason I thought they’d be almost verbatim (forgetting that’s not how Wikipedia works!) and while I can’t remember exactly I felt they were sometimes starkly different in tone and content.
I have to say, the experience of trying to sign up for Nvidia Enterprise so you can try the "NIM" packaged version of this model is just icky and awful now that I've gotten used to actually free and open models and software. It feels much nicer and more free to be able to clone llama.cpp and wget a .gguf model file from huggingface without any registration at all. Especially since it has now been several hours since I signed up for the Nvidia account and the website still says "Your License Should be Active Momentarily | We're setting up your credentials to download NIMs."
I really don't get Nvidia's thinking with this. They basically have a hardware monopoly. I shelled out the $4,000 or so to buy two of their 4090 GPUs. Why are they still insisting on torturing me with jumping through these awful hoops? They should just be glad that they're winning and embrace freedom.
Also, I don't think you can use NIM packages in production without a subscription, and I wasn't able to find the cost without signing up. The NIM package for Mistral NeMo isn't available yet anyway.
I still don’t understand the business model of releasing open source gen AI models. If this took 3072 H100s to train, why are they releasing it for free? I understand they charge people when renting from their platform, but why permit people to run it themselves?
Pardon me if this is a dumb question, but is it possible for me to download these models onto my computer (I have a 1080ti and a [2|3]070ti) and expose some sort of API interface? That way I can write programs that call this API, which I find appealing.
EDIT: This is a 1W light bulb moment for me, thank you!
Justine Tunney (of redbean fame) is actively working on getting LLMs to run well on CPUs, where RAM is cheap. If successful this would eliminate an enormous bottleneck to running local models. If anyone can do this, she can. (And thank you to Mozilla for financially supporting her work). See https://justine.lol/matmul/ and https://github.com/mozilla-Ocho/llamafile
I’d probably check https://ollama.com/library?q=Nemo in a couple of days. My guess is that by then ollama will have support for it. And you can then run the model locally on your machine with ollama.
If you're on a Mac, check out LM Studio. It's a UI that lets you load and interact with models locally. You can also wrap your model in an OpenAI-compatible API and interact with it programmatically.
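As a sketch of what "interact with it programmatically" looks like: tools like LM Studio and Ollama can serve a local OpenAI-compatible endpoint, so a program just POSTs to it. The port and model name below are assumptions (Ollama defaults to port 11434, LM Studio to 1234; use whatever your local server and model are actually called):

```python
import json
import urllib.request

# Assumed local OpenAI-compatible endpoint (Ollama's default shown;
# LM Studio typically serves at http://localhost:1234/v1 instead).
URL = "http://localhost:11434/v1/chat/completions"

payload = {
    "model": "mistral-nemo",  # placeholder name; use whatever model your server has loaded
    "messages": [
        {"role": "user", "content": "In one sentence, what does a 128k context window buy me?"}
    ],
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())

print(reply["choices"][0]["message"]["content"])
```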
I wonder why Mistral et al don't prepare GGUF versions of these for launch day?
If I were them I'd want to be the default source of the versions of my models that people use, rather than farming that out to whichever third party races to publish the GGUF (and other formats) first.
Some of the major vendors _do_ create the GGUFs for their models, but often they have the wrong parameter settings, need changes in the inference code, or don't include the correct prompt template. We (i.e. Ollama) have our own conversion scripts and we try to work with the model vendors to get everything working ahead of time, but unfortunately Mistral doesn't usually give us a heads up before they release.
llama.cpp is still under development and they sometimes come out with breaking changes or new quantization methods, and it can be a lot of work to keep up with these changes as you publish more models over time. It's easier to just publish a standard float32 safetensors that works with PyTorch, and let the community deal with other runtimes and file formats.
If it's a new architecture, then there's also additional work needed to add support in llama.cpp, which means more dev time, more testing, and potentially losing the surprise of the model release if the development work has to be done out in the open.
Interested in the new base model for fine-tuning. Despite Llama3 being a better instruct model overall, it’s been highly resistant to fine-tuning, either owing to some bugs or to being trained on so much data (there's ongoing debate about this in the community). Mistral’s base models are still best in class for a small model you can specialize.
I find it interesting how coding/software development still appears to be the one category that these most popular model providers release specialised models for. Where are the finance or legal models from Mistral or Meta or OpenAI?
Perhaps it's just confirmation bias, but programming really does seem to be the ideal use case for LLMs in a way that other professions just haven't been able to crack. Compared to other types of work, it's relatively straightforward to tell whether code is "correct" or not.
Interesting that the benchmarks they show have it outperforming Gemma 2 9B and Llama 3 8B, but it does a lot worse on my NYT Connections benchmark (5.1 vs 16.3 and 12.3). The new GPT-4o mini also does better at 14.3. It's just one benchmark though, so looking forward to additional scores.
For the benchmarks, it depends on how you interpret them. The other models are quite popular, so many people have a starting point. If you use them regularly you can assess: "just 3% better on some benchmark, 80% to 83, at the cost of almost twice the inference time and base RAM requirement, but a 16x context window, and for commercial usage..." and at the end, "for my use case, is it worth it?"
"It significantly outperforms existing models smaller or similar in size."
is a statement that goes in that direction and would allow the comparison of a 1.7T param model with a 7b one
> Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken, that was trained on over more than 100 languages, and compresses natural language text and source code more efficiently than the SentencePiece tokenizer used in previous Mistral models.
From Mistral's page about Tekken:
> Our newest tokenizer, tekken, uses the Byte-Pair Encoding (BPE) with Tiktoken.
Does that mean that Mistral found that BPE is more efficient than unigram models?
Because otherwise, I don't understand why AI companies keep using BPE for their token sets. Unigram methods lead to more legible tokens, fewer glitch tokens, fewer super-long outlier tokens, etc.
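If you want to poke at the BPE-vs-Unigram question yourself, SentencePiece can train both algorithms on the same corpus so you can compare the resulting tokens directly. A minimal sketch (corpus.txt is a placeholder for whatever plain-text file you train on):

```python
import sentencepiece as spm

# Train two small tokenizers on the same text, one per algorithm.
for algo in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",        # placeholder: any reasonably large plain-text file
        model_prefix=f"tok_{algo}",
        model_type=algo,
        vocab_size=8000,
    )

bpe = spm.SentencePieceProcessor(model_file="tok_bpe.model")
uni = spm.SentencePieceProcessor(model_file="tok_unigram.model")

text = "Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken."
print("BPE    :", bpe.encode(text, out_type=str))   # compare legibility and token count
print("Unigram:", uni.encode(text, out_type=str))
```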
I just managed to make Mistral NeMo 4bit QLoRA finetuning fit in under 12GB, so it fits in a free Google Colab with a Tesla T4 GPU! VRAM is shaved by 60% and finetuning is also 2x faster! Colab: https://colab.research.google.com/github/unslothai/studio/bl...
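For anyone curious what 4-bit QLoRA setup involves under the hood, here is a generic transformers/peft/bitsandbytes sketch (not the linked Colab, which uses Unsloth's own loader; the hyperparameters are illustrative only):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-Nemo-Instruct-2407"

# Load the frozen base weights in 4-bit NF4; compute runs in fp16 (T4-friendly).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Attach small trainable LoRA adapters; only these are updated during finetuning,
# which is what keeps VRAM usage in the single-consumer-GPU range.
lora = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```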
Does anyone know whether the 128K is input tokens only? There are a lot of models that have a large context window for input but a small output context. If this actually has 128k tokens shared between input and output, that would be a game changer.
I just checked Hugging Face and the model files download is about 25GB, but in a comment below someone mentioned it's an FP8-quantized model. I'm trying to understand how the quantization affects the model (and RAM) size. Can someone please enlighten me?
Sure. The talk about 8bit refers to quantization-aware training. Pretty common in image models these days to reduce the impact of quantization on accuracy.
Typically this might mean that you simulate an 8bit forward pass to ensure that the model is robust to quantization ‘noise’. You still use FP16/32 for backward pass & weight updates for numerical stability.
It’s just a way to optimize the model in anticipation of future quantization. The experience of using an 8-bit Nemo quant should more closely mirror that of using the full-fat bf16 model compared to if they hadn’t used QAT.
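A minimal sketch of that simulated 8-bit forward pass (generic fake quantization with a straight-through estimator; illustrative, not Mistral's actual training code):

```python
import torch

def fake_quant(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric integer quantization in the forward pass only."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    x_q = torch.round(x / scale).clamp(-qmax - 1, qmax) * scale
    # Straight-through estimator: the forward pass sees quantized values,
    # but gradients flow back as if the rounding never happened.
    return x + (x_q - x).detach()

w = torch.randn(4, 4, requires_grad=True)
loss = fake_quant(w).sum()
loss.backward()        # gradients exist despite the non-differentiable round()
print(w.grad)
```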
I think it should also run well on a 36GB MacBook Pro, or probably a 24GB MacBook Air.
https://lifearchitect.ai/models-table/
The same thing happened with gemma-27b, where they compared it to all the 7-9b models.
It seems like an easy way to boost benchmarks while coming off as "small" at first glance.
open-mistral-7b is 25¢/M tokens; open-mistral-nemo-2407 is 30¢/M tokens.
https://mistral.ai/technology/#pricing
7-9B never felt like an ideal size. The more useful tiers:
- 3B for CPU inference or running on edge devices.
- 20-30B for maximizing single consumer GPU potential.
- 70B+ for those who can afford it.