
databricks / dbrx

2.4K stars · 38 watchers · 229 forks · 65 KB

Code examples and resources for DBRX, a large language model developed by Databricks

Home Page: https://www.databricks.com/

License: Other

Python 100.00%
databricks gen-ai generative-ai llm llm-inference llm-training mosaic-ai

dbrx's People

Contributors

asnelling, bandish-shah, eltociear, hanlint, megha95


dbrx's Issues

HumanEval

When evaluating on HumanEval, does DBRX use any special prompts to improve performance?

Missing tokenizer when using vLLM

  File "/home/paas/vllm/vllm/engine/llm_engine.py", line 222, in _init_tokenizer
    self.tokenizer: BaseTokenizerGroup = get_tokenizer_group(
  File "/home/paas/vllm/vllm/transformers_utils/tokenizer_group/__init__.py", line 20, in get_tokenizer_group
    return TokenizerGroup(**init_kwargs)
  File "/home/paas/vllm/vllm/transformers_utils/tokenizer_group/tokenizer_group.py", line 23, in __init__
    self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
  File "/home/paas/vllm/vllm/transformers_utils/tokenizer.py", line 66, in get_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/paas/miniconda3/envs/naie/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 822, in from_pretrained
    return tokenizer_class.from_pretrained(
  File "/home/paas/miniconda3/envs/naie/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2086, in from_pretrained
    return cls._from_pretrained(
  File "/home/paas/miniconda3/envs/naie/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2327, in _from_pretrained
    raise OSError(
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
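For reference, a minimal sketch of the vLLM load path that hits this error (the local model directory is a placeholder; DBRX ships a custom tiktoken-based tokenizer, so trust_remote_code appears to be required, though that is an assumption on my part):

from vllm import LLM

llm = LLM(
    model="/path/to/dbrx-instruct",      # placeholder local path
    tokenizer="/path/to/dbrx-instruct",  # same directory; must contain the tokenizer files
    trust_remote_code=True,
    tensor_parallel_size=8,              # adjust to your GPU count
)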

Is DBRX the most powerful open-source LLM yet?

In the rapidly evolving landscape of natural language processing (NLP), the emergence of DBRX marks a significant milestone. Developed by Databricks, DBRX represents a quantum leap in the realm of large language models (LLMs), boasting unparalleled performance and accuracy across a multitude of benchmarks. This article delves into the intricate details and remarkable statistics that underscore the prowess of DBRX, positioning it as a frontrunner in the field.

Performance on Composite Benchmarks:

DBRX's superiority becomes evident when evaluated against established open and closed models across composite benchmarks. Notably, on the Hugging Face Open LLM Leaderboard, DBRX achieves an exceptional score of 74.5%, surpassing its closest competitor by a significant margin of 1.8%. Similarly, on the Databricks Model Gauntlet, DBRX outshines its peers with a commanding score of 66.8%, underscoring its unrivaled proficiency in diverse tasks encompassing world knowledge, commonsense reasoning, and language understanding.

Dominance in Specialized Domains:

Where DBRX truly shines is in specialized domains such as programming and mathematics. On benchmarks tailored to assess programming prowess like HumanEval and GSM8k, DBRX demonstrates remarkable superiority. For instance, on HumanEval, DBRX achieves an impressive score of 70.1%, outperforming Grok-1 by 6.9%, Mixtral Instruct by 15.3%, and the best-performing LLaMA2-70B variant by 37.9%. Similarly, on GSM8k, DBRX secures a notable 66.9%, surpassing competitors by margins ranging from 4.0% to 12.8%.

Unprecedented Versatility:

DBRX's exceptional performance across diverse domains underscores its versatility and adaptability. Whether tackling complex programming challenges or unraveling intricate linguistic nuances, DBRX consistently delivers unparalleled results. This versatility positions DBRX as a formidable tool for a wide array of applications, ranging from natural language understanding to specialized tasks in programming and mathematics.

Efficiency in Training and Inference:

Beyond its remarkable performance, DBRX also excels in training and inference efficiency. Leveraging a fine-grained mixture-of-experts (MoE) architecture, DBRX achieves superior FLOP efficiency compared to dense models, enabling faster training and inference without compromising on quality. Additionally, DBRX's inference throughput surpasses that of its counterparts, offering up to 150 tokens per second on Mosaic AI Model Serving.
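As a rough illustration of the FLOP argument (a simplified sketch using the common approximation of about 2 × active parameters FLOPs per generated token; not Databricks' own calculation):

# Rough illustration: decoding costs about 2 * (active parameters) FLOPs per token,
# so an MoE with 36B active parameters does roughly half the per-token compute of a
# 70B dense model even though it stores 132B total weights.
def flops_per_token(active_params):
    return 2.0 * active_params

print(f"DBRX (36B active):  ~{flops_per_token(36e9) / 1e9:.0f} GFLOPs/token")
print(f"Llama2-70B (dense): ~{flops_per_token(70e9) / 1e9:.0f} GFLOPs/token")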

Conclusion:

In conclusion, DBRX represents a paradigm shift in the landscape of large language models. Its exceptional performance, unmatched versatility, and superior efficiency position it as a frontrunner in the field, setting new standards for accuracy and proficiency. As the pinnacle of Databricks' innovation in NLP, DBRX promises to empower enterprises and researchers alike, heralding a new era of breakthroughs in natural language understanding and AI-driven applications.

What do you think?

How inference efficiency is measured

The tech report described the methodology of the inference efficiency measurement, but not in detail. It compared Llama2-70B and DBRX, and we are very interested in that comparison, so we also carried out some tests in which we spawned different numbers of synchronous clients to stress the service at different QPS levels. The performance we measured differs from the tech report: DBRX is faster than Llama2-70B when the traffic is below 0.35 QPS, but the latency-vs-QPS curve flips after that. By the way, we used the same prompt length and output length as in the tech report.

So I wonder if you could give more details about how the performance was tested.
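A minimal sketch of the kind of load test we mean (simplified; the endpoint URL, payload schema, and request counts are placeholders, not our actual harness):

import threading
import time
import requests  # assumes a plain HTTP text-generation endpoint

ENDPOINT = "http://localhost:8000/generate"      # placeholder URL
PAYLOAD = {"prompt": "...", "max_tokens": 256}   # stand-in for the tech report's prompt/output lengths

def client(latencies, n_requests=20):
    # Each synchronous client sends requests back-to-back; more clients means higher QPS.
    for _ in range(n_requests):
        start = time.time()
        requests.post(ENDPOINT, json=PAYLOAD, timeout=600)
        latencies.append(time.time() - start)

def run(num_clients):
    latencies, threads = [], []
    start = time.time()
    for _ in range(num_clients):
        t = threading.Thread(target=client, args=(latencies,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    elapsed = time.time() - start
    print(f"clients={num_clients}  QPS={len(latencies) / elapsed:.2f}  "
          f"mean latency={sum(latencies) / len(latencies):.2f}s")

for n in (1, 2, 4, 8):
    run(n)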

SiLU or GLU activation?

According to the model card on Hugging Face:

DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA).

However, when I run

from transformers import AutoConfig

config = AutoConfig.from_pretrained('/models/dbrx-instruct/')
print(config.ffn_config)

It shows:

DbrxFFNConfig {
  "ffn_act_fn": {
    "name": "silu"
  },
  "ffn_hidden_size": 10752,
  "moe_jitter_eps": 0,
  "moe_loss_weight": 0.05,
  "moe_normalize_expert_weights": 1,
  "moe_num_experts": 16,
  "moe_top_k": 4,
  "transformers_version": "4.38.1",
  "uniform_expert_assignment": false
}

This is somewhat misleading and confusing.
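For what it's worth, the two statements are consistent if the FFN is a gated linear unit whose gate nonlinearity is SiLU (i.e. SwiGLU). A minimal sketch of such a block, assuming that is what the "silu" in ffn_act_fn refers to (my own illustration, not the actual DbrxFFN code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    # Hypothetical sketch of a GLU feed-forward block with a SiLU gate (SwiGLU).
    def __init__(self, d_model: int, ffn_hidden_size: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, ffn_hidden_size, bias=False)
        self.w_up = nn.Linear(d_model, ffn_hidden_size, bias=False)
        self.w_down = nn.Linear(ffn_hidden_size, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "GLU": elementwise product of a gated branch and a linear branch;
        # the gate nonlinearity is SiLU, which is why the config reports "silu".
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))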

How to get hands-on experience as a newbie

My first option is to run quantized versions.

Quantized

I read this https://github.com/databricks/dbrx#mlx

and then went to https://huggingface.co/mlx-community/dbrx-instruct-4bit

I read this

On my MacBook Pro M2 with 96GB of unified memory, DBRX Instruct in 4-bit eats 70.2GB of RAM for the above prompt.

I am on a MacBook Pro M1 Max with 64GB of memory.

I guess that's not enough?
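In case it does fit on a bigger machine, a minimal generation sketch with the mlx-lm package (assuming its load/generate helpers and the mlx-community/dbrx-instruct-4bit repo linked above) would be something like:

# Minimal sketch using mlx-lm; assumes "pip install mlx-lm" and enough unified memory.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/dbrx-instruct-4bit")
print(generate(model, tokenizer, prompt="What is Apache Spark?", max_tokens=100))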

Computing

My next option is to figure out a cheap way to run the model, but the details confuse me.

Can anyone help?

Loading over multiple GPUs in 8-bit and 4-bit with the transformers loader

I can load the instruct model using the transformers loader with 8-bit bitsandbytes, and it loads evenly across multiple GPUs.

However, I cannot seem to load the model in 4-bit precision over multiple GPUs. I managed to get the model to load across one 24GB GPU and then start loading onto a second GPU of equivalent size, but it will not move on to any of the remaining GPUs (7 in total). It OOMs on the second GPU while the others sit empty.

I've loaded other transformers-based models in 4-bit and never experienced this heavily unbalanced loading before.
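For reference, a minimal 4-bit multi-GPU loading sketch with transformers and bitsandbytes (the model path and per-device memory caps are placeholders, not a verified fix for the imbalance):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Hypothetical sketch: spread a 4-bit quantized model across all visible GPUs by
# capping per-device memory so device_map="auto" does not overfill one card.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
max_memory = {i: "20GiB" for i in range(torch.cuda.device_count())}  # headroom on 24GB cards

model = AutoModelForCausalLM.from_pretrained(
    "databricks/dbrx-instruct",          # or a local path
    quantization_config=bnb_config,
    device_map="auto",
    max_memory=max_memory,
    trust_remote_code=True,
)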

I have encountered a problem: LayerNorm.__init__() got an unexpected keyword argument 'bias'

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/home/roo/train/dbrx-instruct/generate.py", line 39, in <module>
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/roo/anaconda3/envs/Meditron/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/home/roo/anaconda3/envs/Meditron/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3404, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/roo/.cache/huggingface/modules/transformers_modules/model/modeling_dbrx.py", line 1261, in __init__
    self.transformer = DbrxModel(config)
  File "/home/roo/.cache/huggingface/modules/transformers_modules/model/modeling_dbrx.py", line 1013, in __init__
    self.blocks = nn.ModuleList([
  File "/home/roo/.cache/huggingface/modules/transformers_modules/model/modeling_dbrx.py", line 1014, in <listcomp>
    DbrxBlock(config, block_idx) for block_idx in range(config.n_layers)
  File "/home/roo/.cache/huggingface/modules/transformers_modules/model/modeling_dbrx.py", line 856, in __init__
    self.norm_attn_norm = DbrxNormAttentionNorm(
  File "/home/roo/.cache/huggingface/modules/transformers_modules/model/modeling_dbrx.py", line 642, in __init__
    self.norm_1 = nn.LayerNorm(hidden_size, bias=False)
TypeError: LayerNorm.__init__() got an unexpected keyword argument 'bias'
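For what it's worth, nn.LayerNorm only accepts a bias argument in newer PyTorch releases (2.1 or later, as far as I know), so this error usually points to an older torch install. A quick check:

import torch
from packaging import version  # packaging is installed alongside transformers

print(torch.__version__)
# nn.LayerNorm(..., bias=False) is assumed here to require PyTorch >= 2.1.
assert version.parse(torch.__version__) >= version.parse("2.1"), \
    "upgrade torch to pass bias=False to nn.LayerNorm"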

`convert_ids_to_tokens` not working as expected.

from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained('/models/dbrx-instruct/')
t.encode('请问你是谁')
# [15225, 57107, 57668, 21043, 39013, 223]
t.decode([15225, 57107, 57668, 21043, 39013, 223])
# '请问你是谁'
print(t.convert_ids_to_tokens(15225))
# '请'

I suppose it should output token text like

How to use API?

Hi, thank you for your wonderful work! However, I need to learn how to use the API in our project, so please show us how to use the API the next time you update README.md. Thank you so much~

generate.py : tiktoken.py throws Encoding import error

After setting everything up locally, both the generate.py from this GitHub repo and the minimal Python script on the Hugging Face page throw the same error.
I followed all the steps in this repo, and my brand-new venv has been populated with "pip install -r requirements.txt".


Traceback (most recent call last):
  File "/media/models/dbrx/generate.py", line 34, in <module>
    tokenizer = AutoTokenizer.from_pretrained(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/models/dbrx/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 822, in from_pretrained
    return tokenizer_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/models/dbrx/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2086, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/media/models/dbrx/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2325, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/tiktoken.py", line 105, in __init__
    from tiktoken import Encoding  # type: ignore (thirdParty)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: cannot import name 'Encoding' from 'tiktoken' (/media/models/dbrx/tiktoken.py)
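Judging from the last line, Python is resolving tiktoken to the file /media/models/dbrx/tiktoken.py rather than the installed tiktoken package (that is my reading of the traceback, not a confirmed diagnosis). A quick way to check which module wins:

import tiktoken
# If this prints a path inside the model/checkout directory instead of
# site-packages, a local tiktoken.py is shadowing the real package.
print(tiktoken.__file__)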

training slow

Hello, I find it slow to train the MoE model, because the DbrxExperts forward pass in MoE training is serial.
[screenshot of the serial for loop in DbrxExperts]

Can the per-expert computation in this for loop be parallelized?
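For concreteness, here is a minimal sketch (my own illustration, not the actual DbrxExperts code) of the kind of serial per-expert loop in question; each iteration only touches the tokens routed to that expert, which is why batched or grouped expert kernels can in principle replace the Python loop:

import torch
import torch.nn.functional as F

def moe_forward(hidden, router_logits, experts, top_k=4):
    # hidden: (num_tokens, d_model); experts: list of per-expert FFN modules.
    weights = F.softmax(router_logits, dim=-1)       # (num_tokens, num_experts)
    topk_w, topk_idx = weights.topk(top_k, dim=-1)   # (num_tokens, top_k)
    out = torch.zeros_like(hidden)
    for e, expert in enumerate(experts):             # serial loop over experts
        token_idx, slot = (topk_idx == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        # Each expert sees only its routed tokens; iterations are independent.
        out[token_idx] += topk_w[token_idx, slot].unsqueeze(-1) * expert(hidden[token_idx])
    return out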

What's the optimal parallel strategy using TensorRT-LLM?

First of all, thanks for your great efforts. I read the PR you opened in the TensorRT-LLM repo and noticed that EP + TP, PP + TP, and plain TP are supported during inference. May I ask which one is optimal? Specifically, for the MoE layer, does EP or TP yield better performance?

Fine Tuning?

Do you support fine-tuning this model, for example with LoRA, DeepSpeed, etc.?

Real performance versus Llama2-70B?

I have a question about the inference data posted in this blog:
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

For an MoE model with 36B active parameters and 132B total parameters, inference performance should act roughly like a 90B dense model at a 2000-token prompt and 256 output tokens. How can it always perform better than the Llama2-70B dense model? As the batch size increases, it should outperform Llama2-70B at first, but then fall behind from batch size 3 or 4, because more and more experts are activated and all 132B parameters have to be loaded.

[chart from the blog comparing DBRX and Llama2-70B inference performance]
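To make the concern concrete, here is a rough back-of-the-envelope sketch (my own simplification; it ignores the KV cache, activations, and kernel efficiency) of how the expert weights that must be read per decoding step grow with batch size for a top-4-of-16 MoE versus a dense 70B model:

# Back-of-the-envelope sketch using the 36B-active / 132B-total figures from the blog.
TOTAL, ACTIVE = 132e9, 36e9
NUM_EXPERTS, TOP_K = 16, 4

per_expert = (TOTAL - ACTIVE) / (NUM_EXPERTS - TOP_K)  # roughly 8B params per expert
shared = ACTIVE - TOP_K * per_expert                   # roughly 4B always-active params

for batch in (1, 2, 3, 4, 8):
    # Worst case: every token picks different experts until all 16 are in use.
    experts_touched = min(NUM_EXPERTS, TOP_K * batch)
    moe_read = shared + experts_touched * per_expert
    print(f"batch={batch}  MoE weights read ~{moe_read / 1e9:.0f}B  vs dense 70B")

Under this simplification the MoE reads about 36B of weights at batch size 1 but approaches the full 132B around batch size 3 or 4, which matches the crossover described above.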
