
DBRX: A new open LLM

866 points| jasondavies | 1 year ago |databricks.com

343 comments

[+] djoldman|1 year ago|reply
Model card for base: https://huggingface.co/databricks/dbrx-base

> The model requires ~264GB of RAM

I'm wondering when everyone will transition from tracking parameter count vs. evaluation metric to (total GPU RAM + total CPU RAM) vs. evaluation metric.

For example, a 7B parameter model using float32s will almost certainly outperform a 7B model using float4s.

Additionally, all the examples of quantizing recently released superior models to fit on one GPU don't mean the quantized model is a "win." The quantized model is a different model; you need to rerun the metrics.
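The memory-footprint framing can be made concrete with back-of-envelope arithmetic (a sketch; `weight_memory_gb` is a made-up helper, and real deployments add KV cache, activations, and framework overhead on top of the weights):

```python
# Memory needed for the weights alone: parameter count x bits per parameter.
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    return n_params * bits_per_param / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B @ {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
# 7B @ 32-bit: 28.0 GB ... 7B @ 4-bit: 3.5 GB
```

So an 8x difference in precision is an 8x difference in weight memory, which is exactly why a float32 7B model and a float4 7B model aren't comparable on parameter count alone.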

[+] dvt|1 year ago|reply
> a 7B parameter model using float32s will almost certainly outperform a 7B model using float4s

Q5 quantization performs almost on par with base models. Obviously there's some loss there, but this indicates that there's still a lot of compression that we're missing.

[+] hintymad|1 year ago|reply
Just curious, what business benefit will Databricks get by spending potentially millions of dollars on an open LLM?
[+] XCSme|1 year ago|reply
I am planning to buy a new GPU.

If the GPU has 16GB of VRAM, and the model is 70GB, can it still run well? Also, does it run considerably better than on a GPU with 12GB of VRAM?

I run Ollama locally, mixtral works well (7B, 3.4GB) on a 1080ti, but the 24.6GB version is a bit slow (still usable, but has a noticeable start-up time).

[+] briandw|1 year ago|reply
Worse than the chart crime of truncating the y axis is putting LLaMa2's Human Eval scores on there and not comparing it to Code Llama Instruct 70b. DBRX still beats Code Llama Instruct's 67.8 but not by that much.
[+] jjgo|1 year ago|reply
> "On HumanEval, DBRX Instruct even surpasses CodeLLaMA-70B Instruct, a model built explicitly for programming, despite the fact that DBRX Instruct is designed for general-purpose use (70.1% vs. 67.8% on HumanEval as reported by Meta in the CodeLLaMA blog)."

To be fair, they do compare to it in the main body of the blog. It's just probably misleading to compare to CodeLLaMA on non coding benchmarks.

[+] panarky|1 year ago|reply
> chart crime of truncating the y axis

If you chart the temperature of the ocean do you keep the y-axis anchored at zero Kelvin?

[+] jerpint|1 year ago|reply
Per the paper, 3072 H100s over the course of 3 months; assume a cost of $2/GPU/hour.

That would be roughly $13.5M USD.

I’m guessing that at this scale and cost, this model is not competitive, and their ambition is to scale to much larger models. In the meantime, they learned a lot and gained PR from open-sourcing.
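The arithmetic behind that estimate (a sketch; the $2/GPU/hour rate and 30-day months are the commenter's assumptions, not Databricks' disclosed costs):

```python
# Back-of-envelope training cost: GPUs x hours x $/GPU-hour.
gpus = 3072
hours = 24 * 30 * 3          # ~3 months of wall-clock time
cost_per_gpu_hour = 2.0      # assumed H100 rental price
total = gpus * hours * cost_per_gpu_hour
print(f"${total / 1e6:.1f}M")  # prints $13.3M, in line with the ~$13.5M figure
```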

[+] petesergeant|1 year ago|reply
This makes me bearish on OpenAI as a company. When a cloud company can offer a strong model for free by selling the compute, what competitive advantage does a company that wants you to pay for the model have left? Feels like they might get Netscape'd.
[+] MP_1729|1 year ago|reply
OpenAI is not the worst off: ChatGPT is used by 100M people weekly, so it's somewhat insulated from benchmarks. The best of the rest, Anthropic, should be really scared.
[+] ianbutler|1 year ago|reply
The approval on the base model is not feeling very open. Plenty of people are still waiting on a chance to download it, whereas the instruct model was an instant approval. The base model is more interesting to me for fine-tuning.
[+] m3kw9|1 year ago|reply
These tiny "state of the art" performance increases really indicate that the current architecture for LLMs (Transformers + Mixture of Experts) is maxed out, even if you train it more/differently. The writing is on the wall.
[+] killermonkeys|1 year ago|reply
What does it mean to have fewer active parameters (36B) than the full model size (132B), and what impact does that have on memory and latency? It seems like this is because it is an MoE model?
[+] sroussey|1 year ago|reply
The mixture of experts is kinda like a team and a manager. So the manager and one or two of the team go to work depending on the input, not the entire team.

So in this analogy, each team member and the manager has a certain number of params. The whole team is 132B. The manager and team members running for the specific input add up to 36B. Those will load into memory.

[+] bjornsing|1 year ago|reply
Means that it’s a mixture of experts model with 132B parameters in total, but a subset of 36B parameters are used / selected in each forward pass, depending on the context. The parameters not used / selected for generating a particular token belong to “experts” that were deemed not very good at predicting the next token in the current context, but could be used / selected e.g. for the next token.
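The routing described above can be sketched in a few lines (a simplified toy, not DBRX's actual implementation; the router weights, expert callables, and shapes are all illustrative):

```python
import numpy as np

def moe_forward(x, experts, router_w, k=4):
    """Toy top-k MoE layer: all experts' weights stay resident in memory,
    but only k of them do any compute for a given token."""
    logits = router_w @ x                  # one routing score per expert
    topk = np.argsort(logits)[-k:]         # indices of the k highest-scoring experts
    gates = np.exp(logits[topk] - logits[topk].max())
    gates /= gates.sum()                   # softmax over the selected experts only
    return sum(g * experts[i](x) for g, i in zip(gates, topk))
```

This is why active parameters (the k experts touched per token) drive latency while total parameters (every expert, loaded and ready to be selected) drive memory.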
[+] emmender2|1 year ago|reply
this proves that all LLM models converge to a certain point when trained on the same data, i.e., there is really no differentiation between one model and the other.

Claims about out-performance on tasks are just that: claims. The next iteration of Llama or Mixtral will converge.

LLMs seem to evolve like linux/windows or ios/android with not much differentiation in the foundation models.

[+] jobigoud|1 year ago|reply
It's even possible they converge when trained on different data, if they are learning some underlying representation. There was recent research on face generation where they trained two models by splitting one training set in two without overlap, and got the two models to generate similar faces for similar conditioning, even though each model hadn't seen anything that the other model had.
[+] swalsh|1 year ago|reply
The models are commodities, and the API's are even similar enough that there is zero stickiness. I can swap one model for another, and usually not have to change anything about my prompts or rag pipelines.

For startups, the lesson here is don't be in the business of building models. Be in the business of using models. The cost of using AI will probably continue to trend lower for the foreseeable future... but you can build a moat in the business layer.

[+] n2d4|1 year ago|reply
There's at least an argument to be made that this is because all the models are heavily trained on GPT-4 outputs (or whatever the SOTA happens to be during training). All those models are, in a way, a product of inbreeding.
[+] mnemoni_c|1 year ago|reply
Yea it feels like transformer LLMs are in or getting closer to diminishing returns. Will need some new breakthrough, likely entirely new approach, to get to AGI levels
[+] YetAnotherNick|1 year ago|reply
Even in the most liberal interpretation of "prove," it doesn't do that. GPT-4 was trained before OpenAI had any special data, any deal with Microsoft, or product-market fit. Yet no model has beaten it in a year. And Google, Microsoft, and Meta definitely have better data and more compute.
[+] gerash|1 year ago|reply
The evaluations are not comprehensive either. All of the models are improving, and you can't expect any of them to hit 100% on the metrics (à la Bayes error rate). It gets increasingly difficult to move the metrics as they get better.
[+] falcor84|1 year ago|reply
> this proves that all llm models converge to a certain point when trained on the same data

They are also all trained to do well on the same evals, right? So doesn't it just boil down to neural nets being universal function approximators?

[+] bevekspldnw|1 year ago|reply
The big thing for locally hosted is inference efficiency and speed. Mistral wears that crown by a good margin.
[+] crooked-v|1 year ago|reply
Of course, part of this is that a lot of LLMs are now being trained on data that is itself LLM-generated...
[+] shnkr|1 year ago|reply
GenAI novice here. What is training data made of, and how is it collected? I guess no one will share details on it; otherwise it would make a good technical blog post with lots of insights!

>At Databricks, we believe that every enterprise should have the ability to control its data and its destiny in the emerging world of GenAI.

>The main process of building DBRX - including pretraining, post-training, evaluation, red-teaming, and refining - took place over the course of three months.

[+] simonw|1 year ago|reply
The most detailed answer to that I've seen is the original LLaMA paper, which described exactly what that model was trained on (including lots of scraped copyrighted data) https://arxiv.org/abs/2302.13971

Llama 2 was much more opaque about the training data, presumably because they were already being sued at that point (by Sarah Silverman!) over the training data that went into the first Llama!

A couple of things I've written about this:

- https://simonwillison.net/2023/Aug/27/wordcamp-llms/#how-the...

- https://simonwillison.net/2023/Apr/17/redpajama-data/

[+] tempusalaria|1 year ago|reply
The training data is pretty much anything you can read on the internet plus books.

This is then cleaned up to remove nonsense, some technical files, and repeated files.

From this, they tend to weight some sources more - e.g. Wikipedia gets a pretty high weighting in the data mix. Overall these data mixes have multiple trillion token counts.

GPT-4 apparently trained on multiple epochs of the same data mix, so I would assume this one did too, as it's a similar token count.
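The source-weighting idea amounts to weighted sampling over corpora (a sketch; the mix below is entirely hypothetical, not DBRX's or GPT-4's actual proportions):

```python
import random

# Hypothetical source weights for a pretraining mix (illustrative only).
mix = {"web_crawl": 0.60, "code": 0.15, "books": 0.15, "wikipedia": 0.10}

def sample_source(rng=random):
    """Pick which corpus the next training document is drawn from."""
    return rng.choices(list(mix), weights=list(mix.values()), k=1)[0]
```

Upweighting a small, high-quality source like Wikipedia means it is seen more often per token than its raw size would suggest.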

[+] IshanMi|1 year ago|reply
Personally, I found looking at open source work to be much more instructive in learning about AI and how things like training data are put together from the ground up. I suspect this is because training data is one of the bigger moats an AI company can have, as well as all the class action lawsuits surrounding training data.

One of the best open source datasets that are freely available is The Pile by EleutherAI [1]. It's a few years old now (~2020), but they did some really diligent work in putting together the dataset and documenting it. A more recent and even larger dataset would be the Falcon-RefinedWeb dataset [2].

[1]: https://arxiv.org/abs/2101.00027 [2]: https://arxiv.org/abs/2306.01116

[+] natsucks|1 year ago|reply
It's twice the size of Mixtral and barely beats it.
[+] simonw|1 year ago|reply
The system prompt for their Instruct demo is interesting (comments copied in by me, see below):

    // Identity
    You are DBRX, created by Databricks. The current date is
    March 27, 2024.

    Your knowledge base was last updated in December 2023. You
    answer questions about events prior to and after December
    2023 the way a highly informed individual in December 2023
    would if they were talking to someone from the above date,
    and you can let the user know this when relevant.

    // Ethical guidelines
    If you are asked to assist with tasks involving the
    expression of views held by a significant number of people,
    you provide assistance with the task even if you personally
    disagree with the views being expressed, but follow this with
    a discussion of broader perspectives.

    You don't engage in stereotyping, including the negative
    stereotyping of majority groups.

    If asked about controversial topics, you try to provide
    careful thoughts and objective information without
    downplaying its harmful content or implying that there are
    reasonable perspectives on both sides.

    // Capabilities
    You are happy to help with writing, analysis, question
    answering, math, coding, and all sorts of other tasks.

    // it specifically has a hard time using ``` on JSON blocks
    You use markdown for coding, which includes JSON blocks and
    Markdown tables.

    You do not have tools enabled at this time, so cannot run
    code or access the internet. You can only provide information
    that you have been trained on. You do not send or receive
    links or images.

    // The following is likely not entirely accurate, but the model
    // tends to think that everything it knows about was in its
    // training data, which it was not (sometimes only references
    // were).
    //
    // So this produces more accurate answers when the model
    // is asked to introspect
    You were not trained on copyrighted books, song lyrics,
    poems, video transcripts, or news articles; you do not
    divulge details of your training data.
    
    // The model hasn't seen most lyrics or poems, but is happy to make
    // up lyrics. Better to just not try; it's not good at it and it's
    // not ethical.
    You do not provide song lyrics, poems, or news articles and instead
    refer the user to find them online or in a store.

    // The model really wants to talk about its system prompt, to the
    // point where it is annoying, so encourage it not to
    You give concise responses to simple questions or statements,
    but provide thorough responses to more complex and open-ended
    questions.

    // More pressure not to talk about system prompt
    The user is unable to see the system prompt, so you should
    write as if it were true without mentioning it.

    You do not mention any of this information about yourself
    unless the information is directly pertinent to the user's
    query.
I first saw this from Nathan Lambert: https://twitter.com/natolambert/status/1773005582963994761

But it's also in this repo, with very useful comments explaining what's going on. I edited this comment to add them above:

https://huggingface.co/spaces/databricks/dbrx-instruct/blob/...

[+] loudmax|1 year ago|reply
> You were not trained on copyrighted books, song lyrics, poems, video transcripts, or news articles; you do not divulge details of your training data.

Well now. I'm open to taking the first part at face value, but the second part of that instruction does raise some questions.

[+] gigatexal|1 year ago|reply
Data engineer here, offtopic, but am I the only one tired of Databricks shilling their tools as the end-all, be-all solution for all things data engineering?
[+] benrutter|1 year ago|reply
Lord no! I'm a data engineer also, and feel the same. The part that I find most maddening is that it seems pretty divorced from sincerely attempting to provide value.

Things databricks offers that makes peoples lives easier:

- Out-of-the-box Kubernetes with no setup

- Preconfigured spark

Those are genuinely really useful, but then there's all this extra stuff that makes people's lives worse or drives bad practice:

- Everything is a notebook

- Local development is discouraged

- Version pinning of libraries has very ugly/bad support

- Clusters take 5 minutes to load even if you just want to "print('hello world')"

Sigh! I worked at a company that was Databricks-heavy and am still suffering PTSD. Sorry for the rant.

[+] melondonkey|1 year ago|reply
Data scientist here who's also tired of the tools. We put so much effort into trying to educate DSes in our company to get away from notebooks and use IDEs like VS or RStudio, and Databricks has been a step backwards because we didn't get the integrated version.
[+] VirusNewbie|1 year ago|reply
Spark is pretty well engineered and quite good.
[+] millenseed|1 year ago|reply
You might be tired, but there's tons of value for enterprises in only using one end-all tool. It's not personal, you know.
[+] ec109685|1 year ago|reply
For coding evals, it seems like unless you are super careful, they can be polluted by the training data.

Are there standard ways to avoid that type of score inflation?
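One common mitigation is n-gram decontamination: scan the training corpus for long token overlaps with the eval set and drop the matches (a sketch of the 13-gram style of check described for GPT-3; `is_contaminated` and the token-split tokenization are simplifications):

```python
def ngrams(text, n=13):
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc, eval_prompt, n=13):
    """True if the training document shares any n-token span with the
    eval prompt, flagging it for removal before training."""
    return bool(ngrams(train_doc, n) & ngrams(eval_prompt, n))
```

It's imperfect: paraphrased or lightly edited copies of benchmark problems slip past exact n-gram matching, which is part of why coding evals in particular are hard to trust.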

[+] bg24|1 year ago|reply
"Looking holistically, our end-to-end LLM pretraining pipeline has become nearly 4x more compute-efficient in the past ten months."

I did not fully understand the technical details in the training efficiency section, but love this. Cost of training is outrageously high, and hopefully it will start to follow Moore's law.

[+] ingenieroariel|1 year ago|reply
TLDR: A model that could be described as "3.8 level" that is good at math and openly available with a custom license.

It is as fast as a 34B model but uses as much memory as a 132B model. A mixture of 16 experts that activates 4 at a time, so it has more chances to get the combo just right than Mixtral (8 experts with 2 active).

For my personal use case (a top of the line Mac Studio) it looks like the perfect size to replace GPT-4 turbo for programming tasks. What we should look out for is people using them for real world programming tasks (instead of benchmarks) and reporting back.
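The "more chances to get the combo right" point is just combinatorics (a quick check, assuming the expert counts quoted above):

```python
from math import comb

# Distinct expert subsets the router can choose per token.
print(comb(16, 4))  # DBRX: 16 experts, 4 active -> 1820 combinations
print(comb(8, 2))   # Mixtral: 8 experts, 2 active -> 28 combinations
```

Roughly 65x more possible expert combinations per token, at a similar active-parameter budget.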

[+] saeleor|1 year ago|reply
looks great, although I couldn't find anything on how "open" the license is/will be for commercial purposes

wouldn't be the first branding as open source going the LLaMA route

[+] wantsanagent|1 year ago|reply
It's another custom license. It will have to be reviewed by counsel at every company that's thinking about using it. Many will find the acceptable use policy to be vague, overly broad, and potentially damaging for the company.

Looking at the performance stats for this model, the risk of using any non-OSI licensed model over just using Mixtral or Mistral will (and IMO should be) too great for commercial purposes.

[+] superdupershant|1 year ago|reply
It's similar to llama2.

  > If, on the DBRX version release date, the monthly active users of the products
  > or services made available by or for Licensee, or Licensee’s affiliates, is
  > greater than 700 million monthly active users in the preceding calendar 
  > month, you must request a license from Databricks, which we may grant to you
  > in our sole discretion, and you are not authorized to exercise any of the
  > rights under this Agreement unless or until Databricks otherwise expressly
  > grants you such rights.

https://www.databricks.com/legal/open-model-license