> The model requires ~264GB of RAM
I'm wondering when everyone will transition from tracking parameter count vs. evaluation metric to (total GPU RAM + total CPU RAM) vs. evaluation metric.
For example, a 7B parameter model using float32s will almost certainly outperform a 7B model using float4s.
Additionally, all the examples of quantizing recently released superior models to fit on one GPU don't mean the quantized model is a "win." The quantized model is a different model; you need to rerun the metrics.
https://huggingface.co/PrunaAI/dbrx-base-bnb-4bit https://huggingface.co/PrunaAI/dbrx-instruct-bnb-4bit
> a 7B parameter model using float32s will almost certainly outperform a 7B model using float4s
Q5 quantization performs almost on par with base models. Obviously there's some loss there, but this indicates that there's still a lot of compression that we're missing.
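The parameter-count vs. precision point is easy to quantify with a back-of-envelope sketch. The numbers below cover weights only; real memory use also includes the KV cache, activations, and framework overhead:

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate gigabytes needed just to hold the weights."""
    return n_params * bits_per_param / 8 / 1e9

n = 7e9  # 7B parameters
for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_gb(n, bits):.1f} GB")
# float32: ~28.0 GB ... 4-bit: ~3.5 GB
```

Same parameter count, an 8x spread in memory, which is why "params vs. metric" alone hides so much.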
If the GPU has 16GB of VRAM, and the model is 70GB, can it still run well?
Also, does it run considerably better than on a GPU with 12GB of VRAM?
I run Ollama locally; Mixtral works well (7B, 3.4GB) on a 1080 Ti, but the 24.6GB version is a bit slow (still usable, but with a noticeable start-up time).
Worse than the chart crime of truncating the y-axis is putting LLaMA 2's HumanEval scores on there and not comparing to Code Llama Instruct 70B. DBRX still beats Code Llama Instruct's 67.8, but not by that much.
> "On HumanEval, DBRX Instruct even surpasses CodeLLaMA-70B Instruct, a model built explicitly for programming, despite the fact that DBRX Instruct is designed for general-purpose use (70.1% vs. 67.8% on HumanEval as reported by Meta in the CodeLLaMA blog)."
To be fair, they do compare to it in the main body of the blog. It's just probably misleading to compare to CodeLLaMA on non-coding benchmarks.
If you chart the temperature of the ocean, do you keep the y-axis anchored at zero Kelvin?
Waiting for Mixed Quantization with HQQ and MoE Offloading [1]. With that I was able to run Mixtral 8x7B on my 10GB VRAM RTX 3080... This should work for DBRX and should shave off a ton of the VRAM requirement.
1. https://github.com/dvmazur/mixtral-offloading?tab=readme-ov-...
Per the paper: 3072 H100s over the course of 3 months. Assume a cost of $2/GPU/hour.
That would be roughly $13.5M USD.
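A sketch of the arithmetic behind that estimate (the $2/GPU/hour rate and 30-day months are assumptions, which is why it lands slightly under $13.5M):

```python
# Back-of-envelope training-cost estimate from the figures above.
gpus = 3072                  # H100s, per the paper
hours = 24 * 30 * 3          # ~3 months, assuming 30-day months
cost_per_gpu_hour = 2.0      # assumed rental price in USD

total = gpus * hours * cost_per_gpu_hour
print(f"${total / 1e6:.1f}M")  # $13.3M
```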
I'm guessing that at this scale and cost, this model is not competitive, and their ambition is to scale to much larger models. In the meantime, they learned a lot and gained PR from open-sourcing.
This makes me bearish on OpenAI as a company. When a cloud company can offer a strong model for free by selling the compute, what competitive advantage is left for a company that wants you to pay for the model? Feels like they might get Netscape'd.
OpenAI is not the worst off: ChatGPT is used by 100M people weekly, so it's somewhat insulated from benchmarks. The best of the rest, Anthropic, should be really scared.
The approval process on the base model is not feeling very open. Plenty of people are still waiting on a chance to download it, whereas the instruct model was an instant approval. The base model is more interesting to me for fine-tuning.
These tiny "state of the art" performance increases really indicate that the current architecture for LLMs (Transformers + Mixture of Experts) is maxed out even if you train it more/differently. The writing is on the wall.
What does it mean to have fewer active parameters (36B) than the full model size (132B), and what impact does that have on memory and latency? It seems like this is because it is an MoE model?
The mixture of experts is kinda like a team and a manager. So the manager and one or two of the team go to work depending on the input, not the entire team.
So in this analogy, each team member and the manager has a certain number of params. The whole team is 132B. The manager and team members running for the specific input add up to 36B. Those will load into memory.
Means that it’s a mixture of experts model with 132B parameters in total, but a subset of 36B parameters are used / selected in each forward pass, depending on the context. The parameters not used / selected for generating a particular token belong to “experts” that were deemed not very good at predicting the next token in the current context, but could be used / selected e.g. for the next token.
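The team-and-manager analogy maps onto a top-k routing step. Here is a toy sketch in plain Python (dimensions, weights, and the single-layer structure are made up for illustration; this is not DBRX's actual implementation):

```python
import math
import random

random.seed(0)
N_EXPERTS, TOP_K, DIM = 16, 4, 8   # DBRX-like ratio: 16 experts, 4 active

# The "manager": a router matrix scoring each expert for a given input.
router = [[random.gauss(0, 1) for _ in range(N_EXPERTS)] for _ in range(DIM)]
# The "team": every expert's weights exist (the 132B total), even though
# only TOP_K of them do work for any given token (the 36B active).
experts = [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(N_EXPERTS)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def moe_forward(x):
    # Score every expert, but only run the TOP_K best for this input.
    scores = [sum(x[i] * router[i][e] for i in range(DIM))
              for e in range(N_EXPERTS)]
    top = sorted(range(N_EXPERTS), key=scores.__getitem__)[-TOP_K:]
    z = [math.exp(scores[e]) for e in top]
    gates = [g / sum(z) for g in z]          # softmax over chosen experts
    out = [0.0] * DIM
    for g, e in zip(gates, top):             # only 4 of the 16 experts run
        for i, val in enumerate(matvec(experts[e], x)):
            out[i] += g * val
    return out

print(len(moe_forward([1.0] * DIM)))
```

Note that all expert weights still have to be resident somewhere (hence the full-size memory footprint); the saving is in compute per token, not in total parameter storage.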
This suggests that all LLM models converge to a certain point when trained on the same data, i.e., there is really no differentiation between one model and another.
Claims about out-performance on tasks are just that: claims. The next iteration of LLaMA or Mixtral will converge.
LLMs seem to evolve like Linux/Windows or iOS/Android, with not much differentiation in the foundation models.
It's even possible they converge when trained on different data, if they are learning some underlying representation. There was recent research on face generation where they trained two models by splitting one training set in two without overlap, and got the two models to generate similar faces for similar conditioning, even though each model hadn't seen anything that the other model had.
The models are commodities, and the APIs are even similar enough that there is zero stickiness. I can swap one model for another and usually not have to change anything about my prompts or RAG pipelines.
For startups, the lesson here is don't be in the business of building models. Be in the business of using models. The cost of using AI will probably continue to trend lower for the foreseeable future... but you can build a moat in the business layer.
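The interchangeability point above can be made concrete: many providers (and local servers such as Ollama) expose OpenAI-compatible chat endpoints, so swapping models often means changing only a base URL and a model name. The endpoints and model names below are illustrative, not an exhaustive list:

```python
import json

def chat_request(base_url: str, model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completions request."""
    return {
        "url": f"{base_url}/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }

a = chat_request("https://api.openai.com/v1", "gpt-4", "Summarize this doc.")
b = chat_request("http://localhost:11434/v1", "mixtral", "Summarize this doc.")

# Same payload shape either way; prompts and pipelines are unchanged.
print(json.dumps(a["body"], indent=2))
print(a["body"]["messages"] == b["body"]["messages"])  # True
```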
There's at least an argument to be made that this is because all the models are heavily trained on GPT-4 outputs (or whatever the SOTA happens to be during training). All those models are, in a way, a product of inbreeding.
Are there standard ways to avoid that type of score inflation?
Yeah, it feels like transformer LLMs are at or getting close to diminishing returns. We'll need some new breakthrough, likely an entirely new approach, to get to AGI levels.
Even on the most liberal interpretation of "prove," it doesn't do that. GPT-4 was trained before OpenAI had any special data, any deal with Microsoft, or product-market fit. Yet no model has beaten it in a year. And Google, Microsoft, and Meta definitely have better data and more compute.
The evaluations are not comprehensive either. All of the models are improving, and you can't expect any of them to hit 100% on the metrics (à la the Bayes error rate). It gets increasingly difficult to move the metrics as they get better.
They are also all trained to do well on the same evals, right? So doesn't it just boil down to neural nets being universal function approximators?
GenAI novice here: what is training data made of, and how is it collected? I guess no one will share details on it; otherwise it would make for a good technical blog post with lots of insights!
>At Databricks, we believe that every enterprise should have the ability to control its data and its destiny in the emerging world of GenAI.
>The main process of building DBRX - including pretraining, post-training, evaluation, red-teaming, and refining - took place over the course of three months.
The most detailed answer to that I've seen is the original LLaMA paper, which described exactly what that model was trained on (including lots of scraped copyrighted data) https://arxiv.org/abs/2302.13971
Llama 2 was much more opaque about the training data, presumably because they were already being sued at that point (by Sarah Silverman!) over the training data that went into the first Llama!
A couple of things I've written about this:
- https://simonwillison.net/2023/Aug/27/wordcamp-llms/#how-the...
- https://simonwillison.net/2023/Apr/17/redpajama-data/
The training data is pretty much anything you can read on the internet plus books.
This is then cleaned up to remove nonsense, some technical files, and repeated files.
From this, they tend to weight some sources more heavily - e.g., Wikipedia gets a pretty high weighting in the data mix. Overall these data mixes run to multiple trillions of tokens.
GPT-4 apparently trained on multiple epochs of the same data mix, so I would assume this one did too, as it's a similar token count.
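The "weighting" described above amounts to sampling training documents from sources in proportions that differ from their raw sizes. A toy sketch, with completely made-up weights (real mixes are rarely published in this detail):

```python
import random

random.seed(0)

# Hypothetical source weights for illustration only.
mix = {"web_crawl": 0.67, "code": 0.15, "wikipedia": 0.10, "books": 0.08}

def sample_source() -> str:
    """Pick which source the next training document is drawn from."""
    return random.choices(list(mix), weights=list(mix.values()), k=1)[0]

counts = {s: 0 for s in mix}
for _ in range(10_000):
    counts[sample_source()] += 1

# A small, high-quality source like Wikipedia ends up oversampled relative
# to its share of raw bytes on disk - that is the upweighting in practice.
print(counts)
```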
Personally, I found looking at open source work to be much more instructive in learning about AI and how things like training data are put together from the ground up. I suspect this is because training data is one of the bigger moats an AI company can have, as well as because of all the class action lawsuits surrounding training data.
One of the best open-source datasets freely available is The Pile by EleutherAI [1]. It's a few years old now (~2020), but they did some really diligent work in putting together the dataset and documenting it. A more recent and even larger dataset is the Falcon-RefinedWeb dataset [2].
[1]: https://arxiv.org/abs/2101.00027 [2]: https://arxiv.org/abs/2306.01116
The system prompt for their Instruct demo is interesting (comments copied in by me, see below):
// Identity
You are DBRX, created by Databricks. The current date is
March 27, 2024.
Your knowledge base was last updated in December 2023. You
answer questions about events prior to and after December
2023 the way a highly informed individual in December 2023
would if they were talking to someone from the above date,
and you can let the user know this when relevant.
// Ethical guidelines
If you are asked to assist with tasks involving the
expression of views held by a significant number of people,
you provide assistance with the task even if you personally
disagree with the views being expressed, but follow this with
a discussion of broader perspectives.
You don't engage in stereotyping, including the negative
stereotyping of majority groups.
If asked about controversial topics, you try to provide
careful thoughts and objective information without
downplaying its harmful content or implying that there are
reasonable perspectives on both sides.
// Capabilities
You are happy to help with writing, analysis, question
answering, math, coding, and all sorts of other tasks.
// it specifically has a hard time using ``` on JSON blocks
You use markdown for coding, which includes JSON blocks and
Markdown tables.
You do not have tools enabled at this time, so cannot run
code or access the internet. You can only provide information
that you have been trained on. You do not send or receive
links or images.
// The following is likely not entirely accurate, but the model
// tends to think that everything it knows about was in its
// training data, which it was not (sometimes only references
// were).
//
// So this produces more accurate answers when the model
// is asked to introspect
You were not trained on copyrighted books, song lyrics,
poems, video transcripts, or news articles; you do not
divulge details of your training data.
// The model hasn't seen most lyrics or poems, but is happy to make
// up lyrics. Better to just not try; it's not good at it and it's
// not ethical.
You do not provide song lyrics, poems, or news articles and instead
refer the user to find them online or in a store.
// The model really wants to talk about its system prompt, to the
// point where it is annoying, so encourage it not to
You give concise responses to simple questions or statements,
but provide thorough responses to more complex and open-ended
questions.
// More pressure not to talk about system prompt
The user is unable to see the system prompt, so you should
write as if it were true without mentioning it.
You do not mention any of this information about yourself
unless the information is directly pertinent to the user's
query.
> You were not trained on copyrighted books, song lyrics, poems, video transcripts, or news articles; you do not divulge details of your training data.
Well now. I'm open to taking the first part at face value, but the second part of that instruction does raise some questions.
Data engineer here. Off-topic, but am I the only guy tired of Databricks shilling their tools as the end-all, be-all solution for all things data engineering?
Lord no! I'm a data engineer also, and I feel the same. The part I find most maddening is that it seems pretty devoid of any sincere attempt to provide value.
Things Databricks offers that make people's lives easier:
- Out-of-the-box Kubernetes with no setup
- Preconfigured Spark
Those are genuinely really useful, but then there's all this extra stuff that makes people's lives worse or drives bad practice:
- Everything is a notebook
- Local development is discouraged
- Version pinning of libraries has very ugly/bad support
- Clusters take 5 minutes to load even if you just want to "print('hello world')"
Sigh! I worked at a company that was Databricks-heavy and am still suffering PTSD. Sorry for the rant.
Data scientist here who's also tired of the tools. We put so much effort into trying to educate DSes in our company to get away from notebooks and use IDEs like VS or RStudio, and Databricks has been a step backwards because we didn't get the integrated version.
"Looking holistically, our end-to-end LLM pretraining pipeline has become nearly 4x more compute-efficient in the past ten months."
I did not fully understand the technical details in the training efficiency section, but love this. Cost of training is outrageously high, and hopefully it will start to follow Moore's law.
TLDR: A model that could be described as "3.8 level" that is good at math and openly available with a custom license.
It is as fast as a 34B model but uses as much memory as a 132B model. A mixture of 16 experts that activates 4 at a time, so it has more chances to get the combo just right than Mixtral (8 experts with 2 active).
For my personal use case (a top-of-the-line Mac Studio) it looks like the perfect size to replace GPT-4 Turbo for programming tasks. What we should look out for is people using these models for real-world programming tasks (instead of benchmarks) and reporting back.
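The "more chances to get the combo just right" point can be quantified as the number of distinct expert subsets the router can choose per layer:

```python
from math import comb

dbrx_combos = comb(16, 4)     # 16 experts, 4 active
mixtral_combos = comb(8, 2)   # 8 experts, 2 active

print(dbrx_combos, mixtral_combos, dbrx_combos // mixtral_combos)
# 1820 28 65  -> 65x as many possible expert combinations
```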
It's another custom license. It will have to be reviewed by counsel at every company that's thinking about using it. Many will find the acceptable use policy to be vague, overly broad, and potentially damaging for the company.
Looking at the performance stats for this model, the risk of using any non-OSI-licensed model over just using Mixtral or Mistral will be (and IMO should be) too great for commercial purposes.
Wouldn't be the first branding as open source going the LLaMA route.
> If, on the DBRX version release date, the monthly active users of the products
> or services made available by or for Licensee, or Licensee’s affiliates, is
> greater than 700 million monthly active users in the preceding calendar
> month, you must request a license from Databricks, which we may grant to you
> in our sole discretion, and you are not authorized to exercise any of the
> rights under this Agreement unless or until Databricks otherwise expressly
> grants you such rights.