Mixtral of experts

[+] BryanLegend|2 years ago|reply

Andrej Karpathy's Take:

Official post on Mixtral 8x7B: https://mistral.ai/news/mixtral-of-experts/

Official PR into vLLM shows the inference code: https://github.com/vllm-project/vllm/commit/b5f882cc98e2c9c6...

New HuggingFace explainer on MoE very nice: https://huggingface.co/blog/moe

In naive decoding, performance of a bit above 70B (Llama 2), at inference speed of ~12.9B dense model (out of total 46.7B params).

Notes: - Glad they refer to it as "open weights" release instead of "open source", which would imo require the training code, dataset and docs. - "8x7B" name is a bit misleading because it is not all 7B params that are being 8x'd, only the FeedForward blocks in the Transformer are 8x'd, everything else stays the same. Hence also why total number of params is not 56B but only 46.7B. - More confusion I see is around expert choice, note that each token and also each layer selects 2 different experts (out of 8). - Mistral-medium

Source: https://twitter.com/karpathy/status/1734251375163511203

[+] nojvek|2 years ago|reply

Anyone have a feeling karpathy may leave openAI to join an actual Open AI startup where he can openly speak about training tweaks, the datasets architecture etc.

It seems recently OpenAI is the least open startup. Even Gemini talks more about their architecture.

OpenAI still doesn’t openly mention GPT4 is a mixture of experts model.

[+] korvalds|2 years ago|reply

More models available at Huggingface now: https://huggingface.co/search/full-text?q=mixtral

Already available from both Mistralai and TheBloke https://huggingface.co/mistralai/Mixtral-8x7B-v0.1 https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF

[+] loudmax|2 years ago|reply

Note that running these GGUF models currently requires a forked version of llama.cpp: https://github.com/ggerganov/llama.cpp/pull/4406

The GGUF handling for Mistral's mixture of experts hasn't been finalized yet. TheBloke and ggerganov and friends are still figuring out what works best.

The Q5_K_M gguf model is about 32GB. That's not going to fit into any consumer grade GPU, but it should be possible to run on a reasonably powerful workstation or gaming rig. Maybe not fast enough to be useful for everyday productivity, but it should run well enough to get a sense of what's possible. Sort of a glimpse into the future.

[+] xena|2 years ago|reply

The GGUF variant looks promising because then I can run it on my MacBook (barely)

[+] pugio|2 years ago|reply

> We’re currently using Mixtral 8x7B behind our endpoint mistral-small...

So 45 billion parameters is what they consider their "small" model? I'm excited to see what/if their larger models will be.

[+] seydor|2 years ago|reply

> A proper preference tuning can also serve this purpose. Bear in mind that without such a prompt, the model will just follow whatever instructions are given.

Mistral does not censor its models and is committed to a hands free approach, according to their CEO https://www.youtube.com/watch?v=EMOFRDOMIiU

> Mixtral 8x7B masters French, German, Spanish, Italian, and English.

EU budget cut by half

[+] trash_cat|2 years ago|reply

This is very exciting and I think this is the future of AI until we get another, next-gen architecture beyond transformers. I don't think we will get a lot better until that happens and the effort will go into making the models a lot cheaper to run without sacrificing too much accuracy. MoD is a viable solution.

[+] cuuupid|2 years ago|reply

Mamba has shown SSMs are highly likely to be contenders for the replacement to transformers in a really similar fashion to when transformers were first introduced as enc-dec models. I’m personally very excited for those models as they’re also built for even faster inference (a major feature of transformers being wildly faster inference than with LSTMs)

[+] reqo|2 years ago|reply

Can someone explain why MoE works? Is there any downside to MoE compared to a regular model?

[+] gorbypark|2 years ago|reply

I'm still new to most of this (so please take this with a grain of salt/correct me), but it seems that in this specific model, there's eight separate 7B models. There is also a ~2B "model" that acts as a router, in a way, where it picks the best two 7B models for the next token. Those two models then generate the next token and somehow they are added together.

Upside: much more efficient to run versus a single larger model. The press release states 45B total parameters across the 8x7B models, but it only takes 12B parameters worth of RAM to run.

Downside: since the models are still "only" 7B, the output in theory would be not as good as a single 45B param model. However, how much less so is probably open for discussion/testing.

No one knows (outside of OpenAI) for sure the size/architecture of GPT-4, but it's rumoured to have a similar architecture, but much larger. 1.8 trillion total params, but split up into 16 experts at around 111B params each is what some are guessing/was leaked.

[+] dwrodri|2 years ago|reply

MoE is all about tradeoffs. You get the "intelligence" of a 45B model but only pay the operational cost of multiplying against 12B of those params per token. The cost is that it's now up to the feedforward block to decide early which portions of those 45B params matter, whereas a non-MoE 45B model doesn't encode that decision explicitly into the architecture, it would only arise from (near) zero activations in the attention heads across layers found through gradient descent, instead of just siloing the "experts" entirely. From a quick look at the benchmark results, it looks like in particular it suffers in pronoun resolution vs larger models.

Richard Sutton's Bitter Lesson[1] has served as a guiding mantra for this generation of AI research: the less structure that the researcher explicitly imposes upon the computer in order to learn from the data the better. As humans, we're inclined to want to impose some structure based on our domain knowledge that should guide the model towards making the right choice from the data. It's unintuitive, but it turns out we're much better off imposing as little structure as possible, and the structure that we do place should only exist to effectively enable some form of computation to capture relationships in the data. Gradient descent over next token-prediction isn't very energy efficient, but it leverages compute quite well and it turns out it has scaled up to the limits of just about every research cluster we've been able to build to date. If you're trying to push the envelope and build something which advances the state of the Art in a reasoning task, you're better off leaning as heavily as you can on compute-first approaches unless the nature of the problem involves a lack of data/compute.

Professor Sutton does a much better job than I discussing this concept, so I do encourage you to read the blog post.

1: https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...

[+] bjornsing|2 years ago|reply

I haven’t worked on LLMs/transformers specifically, but I’ve “independently invented” MoE and experimented with it on simple feedforward convolutional networks for vision. The basic idea is pretty simple: The “router” outputs a categorical distribution over the “experts”, essentially mixing the experts probabilistically (e.g. 10% of expert A, 60% of expert B and 30% of expert C). Training time you compute the expected value of the loss over this “mixture” (or use the Gumbel-Softmax trick), so you need to backprop through all the “experts”. But inference time you just sample from the categorical distribution (or pick highest probability), so you pick a single “expert” that is executed. A mathematically sound way of making inference cheaper, basically.

Mixtral seems to use a much more elaborate scheme (e.g. picking two “experts” and additively combining them, at every layer), but the basic math behind it is probably the same.

[+] osanseviero|2 years ago|reply

This blog post might be interesting - https://huggingface.co/blog/moe

MoEs are especially useful for much faster pre-training. During inference, the model will be fast but still require a very high amount of VRAM. MoEs don't do great in fine-tuning but recent work shows promising instruction-tuning results. There's also quite a bit of ongoing work around MoEs quantization.

In general, MoEs are interesting for high throughput cases with high number of machines, so this is not so so exciting for a local setup, but the recent work in quantization makes it more appealing.

[+] hexaga|2 years ago|reply

LLM scaling laws tell us that more parameters make models better, in general.

The key intuition behind why MoE works is that as long as those parameters are available during training, they count toward this scaling effect (to some extent).

Empirically, we can see that even if the model architecture is such that you only have to consult some subset of those parameters at inference time - the optimizer finds a way.

The inductive bias in this style of MoE model is to specialize (as there is a gating effect between 'experts'), which does not seem to present much of an impediment to gradient flow.

[+] tarruda|2 years ago|reply

Disclaimer: I'm a ML newbie, so this might be all incorrect.

My intuition is that there are 8 7b models trained on knowledge domains. For example, one of those 7b models might be good at coding, while another one might be good at storytelling.

And there's the router model, which is trained to select which of the 8 experts are good at completing the text in the context. So for every new token added to the context, the router selects a model and the context is forwarded to that expert which will generate the next token.

The common wisdom is that a even single 7B fine tuned model might surpass much bigger models at the specific task that they're trained on, so it is easy to see how having 8x 7B models might create a bigger model that is very good at many tasks. In the article you can see that even though this is only 45B base model, it surpassed GPT 3.5 (which is instruction fine tuned) on most benchmarks.

Another upside is that the model will be fast at inference, since only a small subset of those 45B weights are activated when doing inference, so the performance should be similar to a 12B model.

I can't think of any downsides except the bigger VRAM requirements when compared to a Non-MoE model of the same size as the experts.

[+] unknown|2 years ago|reply

[deleted]

[+] brucethemoose2|2 years ago|reply

Short version:

You trade off increased VRAM usage for better training/runtime speed and better splittability.

The balance of this tradeoff is an open question.

[+] esafak|2 years ago|reply

It is an application of specialization.

[+] mijoharas|2 years ago|reply

> It is the strongest open-weight model with a permissive license and the best model overall regarding cost/performance trade-offs.

Is there any link to the model and weights? I don't see it if so.

[+] chandureddyvari|2 years ago|reply

Sorry if this is a dumb question. Can someone explain why it’s called 8x7B(56B) but it has only 46.7B params? and it uses 12.9B params per token generation but there are 2 experts(2x7B) chosen by a 2B model? I’m finding it difficult to wrap my head around this.

[+] dizzydes|2 years ago|reply

Honest question: if they're only beating GPT 3.5 with their latest model (not GPT 4) and OpenAI/Google have infrastructure on tap and a huge distribution advantage via existing products - what chance do they stand?

How do people see things going in the future?

[+] smcleod|2 years ago|reply

Mistral and its hybrids are a lot better than GPT3.5, and while not as good as GPT4 in general tasks - they’re extremely fast and powerful with specific tasks. In the time it takes GPT4 to apologise that it’s not allowed to do something I can be three iterations deep getting highly targeted responses from mistral - and best yet - I can run it 100% offline, locally and on my laptop.

[+] antirez|2 years ago|reply

1. This is an open source model that can run on people's hardware at a fraction of the cost of GPT. No cloud services in the middle.

2. This model is not censored as GPT.

3. This model has a lot less latency than GPT.

4. In their endpoint this model is called mistral-small. Probably they are training something much larger than can compete with GPT4.

5. This model can be fine tuned.

[+] Palmik|2 years ago|reply

Beyond what others said, I think this is an extremely impressive showing. Consider that their efforts started years behind Google's, and yet their relatively small model (they call is mistral-small, and also offer mistral-medium) is beating or on par with Gemini Pro on many benchmarks (Google's best currently available model).

On top of that Mixtral is truly open source (Apache 2.0), and extremely easy to self host or run on a cloud provider of your choice -- this unlocks many possibilities, and will definitely attract some business customers.

EDIT: The just announced mistral-medium (larger version of the just open sourced mixtral 8x7b) is beating GPT3.5 with significant margin, and also Gemini Pro (on available benchmarks).

[+] jillesvangurp|2 years ago|reply

The demand for using AI models for whatever is going through the roof. Right now it's mostly people typing things manually in chat gpt, bard, or wherever. But that's not going to stay like that. Models being queried as part of all sorts of services is going to be a thing. The problem with this is that running these models at scale is still really expensive.

So, instead of using the best possible model at any cost for absolutely everything, the game is actually good enough models that can run cheaply at scale that do a particular job. Not everything is going to require models trained on the accumulated volume of human knowledge on the internet. It's overkill for a lot of use cases.

Model runtime cost is a showstopper for a lot of use cases. I saw a nice demo of a big ecommerce company in Berlin that had built a nice integration with openai's APIs to provide a shopping assistent. Great demo. Then somebody asked them when this was launching. And the answer was that token cost was prohibitively expensive. It just doesn't make any sense until that comes down a few orders of magnitudes. Companies this size already have quite sizable budgets that they use on AI model training and inference.

[+] intellectronica|2 years ago|reply

If you're purely looking for capabilities and not especially interested in running an open model, this might not be that interesting. But even so, this positions Mistral as currently the most promising company in the open models camp, having released the first thing that not only competes well with GPT-3.5 but also competes with other open models like Llama-2 on cost/performance and presents the most technological innovation in the open models space so far. Now that they raised $400MM the question to ask is - what happens if they continue innovating and scale their next model sufficiently to compete with GPT-4 / Gemini? The prospects have never seemed better than they do today after this release.

[+] Shrezzing|2 years ago|reply

>How do people see things going in the future?

The EU and other European governments will throw absolute boatloads of money at Mistral, even if that only keeps them at a level on par with the last generation. AI is too big of a technological leap for the bloc to ride America's coattails on.

Mistral doesn't just exist to make competitive AI products, it's an existential issue for Europe that someone on the continent is near the vanguard on this tech, and as such, they'll get enormous support.

[+] pbmonster|2 years ago|reply

They are focusing hard on small models. Sooner or later, you'll be able to run their product offline, even on mobile devices.

Google was criticized [0] for offloading pretty much all generative AI tasks onto the cloud - instead of running it on the Tensor G3 built into its Pixel Phones specifically for that purpose. The reason being, of course, that the Tensor G3 is much to small for almost all modern generative models.

So Mistral is focusing specifically on an area the big players are failing right now.

[0] https://news.ycombinator.com/item?id=37966569

[+] dataking|2 years ago|reply

Microsoft, Apple, and Google also have more resources at their disposal yet Linux is doing just fine (to put it mildly). As long as Mistral delivers something unique, they'll be fine.

[+] sgt101|2 years ago|reply

As I read it they are doing this with 8 * 7Bn parameter models. So, their model should run pretty well as fast as a 7Bn model and at the cost of a 56bn parameter model.

That a lot quicker and cheaper than GPT-4

Also this is kinda a promissory note, they've been able to do this in a few months and create a service on top of it. Does this intimate that they have the capability to create and run SoA models? Possibly. If I were a VC I could see a few ways for this bet to go well.

The big killer is moat - maybe this just demonstrates that there is no LLM moat.

[+] masa331|2 years ago|reply

Another advantage over Google or OpenAI for me would be that it is not from Google or OpenAI

[+] spacebanana7|2 years ago|reply

Perhaps they’re hoping some enterprises will be willing to pay extra for a 3.5 grade model that can run on prem?

A niche market but I can imagine some demand there.

Biggest challenge would be Llama models.

[+] yodsanklai|2 years ago|reply

Also, considering mistral is open source, what will prevent their competitor to integrate any innovation they make?

Another thing I don't understand, how a 20 people company can provide a similar system as OpenAI (1000 employees)? what do they do themselves, and what do they re-use?

[+] raincole|2 years ago|reply

> How do people see things going in the future?

A niche thing that thrives in its own niche. Just like most open source apps without big corperations behind them.

[+] HarHarVeryFunny|2 years ago|reply

Google started late with any serious LLM effort. It takes time to iterate on something so complex and slow to train. I expect Google will match OpenAI in next iteration or two, or at worst stay one step behind, but it takes time.

OTOH Google seem to be the Xerox Parc of our time (who were famous for state of the art research and failure to productize). Microsoft, and hence Microsoft-OpenAI, seem much better positioned to actually benefit from this type of generative AI.

[+] wrsh07|2 years ago|reply

1) as a developer or founder looking to experiment quickly and cheaply with llm ideas, this (and llama etc) are huge gifts

2) for the research community, making this work available helps everyone (even OpenAI and Google, insofar as they've done something not yet tried at those larger orgs)

3) Mistral is well positioned to get money from investors or as consultants for large companies looking to fine tune or build models for super custom use cases

The world is big and there's plenty of room for everyone!! Google and OpenAI haven't tried all permutations of research ideas - most researchers at the cutting edge have dozens of ideas they still want to try, so having smaller orgs trying things at smaller scales is really great for pushing the frontier!

Of course it's always possible that some major tech co playing from behind (ahem, apple) might acquire some LLM expertise too

[+] nuz|2 years ago|reply

They might be willing to do things like crawl libgen which google possibly isn't, giving them an advantage. They might be more skilled at generating useful synthetic data which is a bit of an art and subject to taste, which other competitors might not be as good at.

[+] data-ottawa|2 years ago|reply

Google BARD/AI isn’t available in Canada or the EU, so there’s one big competitive advantage.

OpenAI is of course the big incumbent to beat and is in those markets.

They only started this year, so beating ChatGPT3.5 is I think a great milestone for 6 months of work.

Plus they will get a strategic investment as the EU’s answer to AI, which may become incredibly valuable to control and regulate.

Edit: I fact checked myself and bard is available in the EU, I was working off outdated information.

https://support.google.com/bard/answer/13575153?hl=en

[+] jstummbillig|2 years ago|reply

Is there a non-obvious reason that models keep getting compared to GPT-3.5 instead of 4?

[+] ensocode|2 years ago|reply

Is anyone using this models self-hosted in production? What cloud hosting provider/plan do you use and how is it performance wise?

[+] maelito|2 years ago|reply

Just tried to register, haven't received the confirmation email.

[+] mercat|2 years ago|reply

Can someone please explain how this works to a software engineer who used to work with heuristically observable functions and algorithms? I'm having a hard time comprehending how a mix of experts can work.

In SE, to me, it would look like (sorting example):

- Having 8 functions that do some stuff in parallel

- There's 1 function that picks the output of a function that (let's say) did the fastest sorting calculation and takes the result further

But how does that work in ML? How can you mix and match what seems like simple matrix transformations in a way that resembles if/else flowchart logic?

[+] HarHarVeryFunny|2 years ago|reply

Interesting to see Mistral in the news raising EUR 450M at a EUR 2B valuation. tbh I'd not even heard of them before this Mixtral release. Amazing how fast this field is developing!

[+] unknown|2 years ago|reply

[deleted]

[+] legel|2 years ago|reply

The Sparse Mixture of Experts neural network architecture is actually an absolutely brilliant move here.

It scales fantastically, when you consider that (1) GPU RAM is way too expensive, in financial dollars, (2) SSD / CPU RAM are relatively cheap, and (3) you can have "experts" running on their own computers, i.e. it's a natural distributed computing partitioning strategy for neural networks.

I did my M.S. thesis on large-scale distributed deep neural networks in 2013 and can say that I'm delighted to point our where this came from.

In 2017, it emerged from a Geoffrey Hinton / Jeff Dean / Quoc Le publication called "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer".

Here is the abstract: "The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost."

So, here's a big A.I. idea for you: what if we all get one of these sparse Mixture of Experts (MoEs) that's a 100 GB on our SSDs, contains all of the "outrageously large" neural network insights that would otherwise take specialized computers, and is designed to run effectively on a normal GPU or even smaller (e.g. smartphone)?

Source: https://arxiv.org/abs/1701.06538

[+] i8comments|2 years ago|reply

The claim that it is better than GPT3.5, in practice, should be taken with a grain of salt, since the benchmarks themselves aren't... ideal.

Despite the questionable marketing claim, it is a great LLM for other reasons.

[+] inChargeOfIT|2 years ago|reply

It sounds like the same requirements as a 70b+ model, but if someone manages to get inference running locally on a single rtx4090 (AMD 7950x3D w/ 64GB ddr5) reasonably well, please let me know.

[+] gardenhedge|2 years ago|reply

They're only paying 80,000€ for a full stack dev and want the candidate to have a "portfolio of successful projects"

[+] bilsbie|2 years ago|reply

I wonder if the brain uses a mixture of experts?

300 comments