safeailab / eagle

Official Implementation of EAGLE-1 and EAGLE-2

Home Page: https://arxiv.org/pdf/2406.16858

License: Apache License 2.0

Python 100.00%
large-language-models llm-inference speculative-decoding

eagle's Introduction

EAGLE

| Paper (EAGLE) | Paper (EAGLE-2) | Blog | Demo |


(benchmark figure)

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a new baseline for fast decoding of Large Language Models (LLMs) that provably preserves the output distribution. The approach extrapolates the second-to-top-layer contextual feature vectors of the LLM, enabling a significant boost in generation efficiency.
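
For intuition, the draft model is a single lightweight decoder layer: it takes the base model's second-to-top-layer feature together with the embedding of the token that has already been sampled and extrapolates the next feature, which the base model's frozen LM head then turns into draft-token logits. The sketch below only illustrates this idea; the layer sizes and the decoder_layer placeholder are assumptions, not the repository's actual classes.

import torch
import torch.nn as nn

class FeatureExtrapolationHead(nn.Module):
    """Illustrative sketch of EAGLE-style feature extrapolation, not the repo's exact code."""

    def __init__(self, hidden_size: int, decoder_layer: nn.Module):
        super().__init__()
        # Fuse [token embedding ; previous second-to-top-layer feature] back to hidden_size.
        self.fc = nn.Linear(2 * hidden_size, hidden_size)
        # A single transformer decoder layer performs the autoregressive extrapolation.
        self.decoder_layer = decoder_layer

    def forward(self, prev_features: torch.Tensor, token_embeddings: torch.Tensor) -> torch.Tensor:
        # prev_features, token_embeddings: [batch, seq_len, hidden_size]
        fused = self.fc(torch.cat([token_embeddings, prev_features], dim=-1))
        # The extrapolated features are later mapped to draft-token logits by the
        # frozen lm_head of the base model.
        return self.decoder_layer(fused)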

  • EAGLE is:
    • certified by third-party evaluation as the fastest speculative method so far.
    • able to achieve a 2x speedup on gpt-fast.
    • 3x faster than vanilla decoding (13B).
    • 2x faster than Lookahead (13B).
    • 1.6x faster than Medusa (13B).
    • provably consistent with vanilla decoding in the distribution of generated texts (see the sketch after this list).
    • trainable within 1-2 days and testable on 8x RTX 3090 GPUs, so even the GPU-poor can afford it.
    • combinable with other parallel techniques such as vLLM, DeepSpeed, Mamba, FlashAttention, quantization, and hardware optimization.
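
The consistency guarantee follows the standard speculative-sampling acceptance rule: a drafted token is accepted with probability min(1, p/q) under the target (p) and draft (q) distributions, and on rejection a replacement is drawn from the normalized residual. Below is a generic, single-position sketch of that rule for reference; it is not the code used in this repository.

import torch

def speculative_accept(p: torch.Tensor, q: torch.Tensor, draft_token: int) -> int:
    """Textbook speculative-sampling acceptance test (a sketch, not EAGLE's implementation).

    p: target-model probabilities over the vocabulary at this position.
    q: draft-model probabilities that were used to propose draft_token.
    """
    # Accept with probability min(1, p/q); this keeps the output distribution
    # identical to sampling from the target model alone.
    if torch.rand(()) < torch.clamp(p[draft_token] / q[draft_token], max=1.0):
        return draft_token
    # On rejection, resample from the normalized residual max(p - q, 0).
    residual = torch.clamp(p - q, min=0.0)
    return int(torch.multinomial(residual / residual.sum(), num_samples=1))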

EAGLE-2 uses the draft model's confidence scores to approximate acceptance rates and dynamically adjusts the draft tree structure, which further enhances performance (a minimal sketch of this reranking idea follows the list below).

  • EAGLE-2 is:
    • 4x faster than vanilla decoding (13B).
    • 1.4x faster than EAGLE-1 (13B).
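
Conceptually, EAGLE-2 treats a draft path's cumulative confidence (the product of per-step confidences) as a proxy for its acceptance rate and keeps only the most promising nodes, so the tree's shape adapts to the context instead of being fixed. The sketch below illustrates that reranking step; the tuple layout and names are illustrative assumptions, not the repository's data structures.

import heapq

def select_draft_nodes(candidates, budget):
    """Keep the `budget` draft-tree nodes with the highest cumulative confidence.

    candidates: iterable of (cumulative_confidence, token_id, parent_index) tuples
    gathered while expanding the draft model layer by layer. Ranking by cumulative
    confidence approximates ranking by acceptance rate, which is how EAGLE-2
    decides where to grow the draft tree.
    """
    return heapq.nlargest(budget, candidates, key=lambda node: node[0])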

(demo GIF)

Using EAGLE-2, the inference speed on 2 RTX 3060 GPUs can be faster than vanilla autoregressive decoding on an A100 GPU.

Update

2024.8.23: EAGLE is merged with vLLM.

2024.8.8: We now support Qwen-2.

2024.6.27: EAGLE-2 is released.

2024.5.25: EAGLE is merged into Intel® LLM Library for PyTorch.

2024.5.9: EAGLE is merged into Intel® Extension for Transformers.

2024.2.25: EAGLE is certified by the third-party evaluation as the fastest speculative method.

2024.1.17: We now support Mixtral-8x7B-Instruct.

2023.12.8: EAGLE v1.0 is released.

Todo

  • Support non-greedy inference (provably maintaining text distribution).
  • Support more LLMs such as Mixtral 8x7B.
  • Support LLaMA-3.
  • Support Qwen-2.

The default main branch is the implementation of EAGLE-2. For using EAGLE-1, please switch to the v1 branch.


Setup & Installation

git clone https://github.com/SafeAILab/EAGLE.git
cd EAGLE
pip install -r requirements.txt

EAGLE Weights

Note: When Qwen2 is the target model, please use bf16 precision instead of fp16 to avoid numerical overflow. The training dataset for the Qwen2 draft model is ShareGPT with non-English data removed; if you want to use it on non-English data such as Chinese, please train the draft model on the corresponding data.

Compared to EAGLE-1, EAGLE-2 requires no additional training and uses the same weights.

Base Model                   EAGLE on Hugging Face                  # EAGLE Parameters
Vicuna-7B-v1.3               yuhuili/EAGLE-Vicuna-7B-v1.3           0.24B
Vicuna-13B-v1.3              yuhuili/EAGLE-Vicuna-13B-v1.3          0.37B
Vicuna-33B-v1.3              yuhuili/EAGLE-Vicuna-33B-v1.3          0.56B
LLaMA2-Chat 7B               yuhuili/EAGLE-llama2-chat-7B           0.24B
LLaMA2-Chat 13B              yuhuili/EAGLE-llama2-chat-13B          0.37B
LLaMA2-Chat 70B              yuhuili/EAGLE-llama2-chat-70B          0.99B
Mixtral-8x7B-Instruct-v0.1   yuhuili/EAGLE-mixtral-instruct-8x7B    0.28B
LLaMA3-Instruct 8B           yuhuili/EAGLE-LLaMA3-Instruct-8B       0.25B
LLaMA3-Instruct 70B          yuhuili/EAGLE-LLaMA3-Instruct-70B      0.99B
Qwen2-7B-Instruct            yuhuili/EAGLE-Qwen2-7B-Instruct        0.26B
Qwen2-72B-Instruct           yuhuili/EAGLE-Qwen2-72B-Instruct       1.05B

Inference

The provided inference code automatically allocates model weights across multiple GPUs, allowing you to run models that exceed the memory of a single GPU.

With UI

We provide a suggested web interface, which you can launch with the following command. After the model is fully loaded, a URL will be printed in the terminal; open it in your browser to access the interface.

python -m eagle.application.webui --ea-model-path [path of EAGLE weight] \
		--base-model-path [path of the original model] \
		--model-type [vicuna|llama2|llama3] \
		--total-token [int]

The total-token argument is the number of draft tokens. For smaller models and more capable GPUs, this value can be set larger; tuning it to the specific device and model gives better results. If it is set to -1, EAGLE-2 configures this parameter automatically.

With Code

You can use the provided "eagenerate" for accelerated generation just like using "generate" from Hugging Face. Here is an example.

import torch
from eagle.model.ea_model import EaModel
from fastchat.model import get_conversation_template

model = EaModel.from_pretrained(
    base_model_path=base_model_path,    # path of the original model
    ea_model_path=EAGLE_model_path,     # path of the EAGLE weights
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    total_token=-1                      # -1 lets EAGLE-2 pick the draft-token budget automatically
)
model.eval()

your_message = "Hello"
conv = get_conversation_template("vicuna")
conv.append_message(conv.roles[0], your_message)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = model.tokenizer([prompt]).input_ids
input_ids = torch.as_tensor(input_ids).cuda()
output_ids = model.eagenerate(input_ids, temperature=0.5, max_new_tokens=512)
output = model.tokenizer.decode(output_ids[0])

Note: Vicuna, LLaMA2-Chat, and LLaMA3-Instruct are all chat models. You need to use the correct chat template; otherwise the model will produce abnormal output and EAGLE's performance will suffer.
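
For LLaMA2-Chat, for example, you can build the prompt with FastChat's llama-2-chat template instead of the vicuna one used above (the system prompt below is only an example):

from fastchat.model import get_conversation_template

conv = get_conversation_template("llama-2-chat")
conv.system_message = "You are a helpful, respectful and honest assistant."
conv.append_message(conv.roles[0], "Hello")
conv.append_message(conv.roles[1], None)
# The trailing space matches the repository's LLaMA2-Chat examples.
prompt = conv.get_prompt() + " "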

Train

Generate Train Data

You can run the following command to generate the training data.

python -m eagle.ge_data.allocation --outdir [path of data]
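
Under the hood, this step runs the base model over the training corpus (ShareGPT by default) and stores the hidden states that the draft head will later learn to extrapolate; allocation.py splits this work across GPUs. The following is a rough sketch of what each worker records, assuming a Vicuna-13B target; the real scripts in eagle/ge_data handle batching, conversation formatting, and the on-disk format differently.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-13b-v1.3")
model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-13b-v1.3", torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def record_features(text: str) -> dict:
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model(**inputs, output_hidden_states=True)
    # hidden_states[-1] holds the features fed into the LM head, which is what
    # the draft head is trained to extrapolate.
    return {
        "input_ids": inputs.input_ids.cpu(),
        "hidden_state": outputs.hidden_states[-1].to(torch.float16).cpu(),
    }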

Train the Auto-regression Head

accelerate launch -m --mixed_precision=bf16 eagle.train.main --tmpdir [path of data] \
		--cpdir [path of checkpoints] --configpath [path of config file]

eagle/train provides example configuration files.

You can also use DeepSpeed for training.

cd eagle/train
deepspeed main_deepspeed.py --deepspeed_config ds_config.json

Inference on custom models

If the original LLM's architecture differs from LLaMA and Mixtral, you can adapt EAGLE as follows:

Copy modeling_basemodelname.py from the Transformers library and modify it to use the pre-allocated kv_cache, which speeds up the base model. You can refer to model/modeling_llama_kv.py for guidance; the places that require modification are annotated with # [MODIFIED], and these modifications are minimal.
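
The gist of the # [MODIFIED] blocks is that the attention layers write new keys and values into a pre-allocated cache tensor in place instead of concatenating tensors on every step. A schematic version of that pattern is shown below; shapes and method names are simplified for illustration and do not mirror model/kv_cache.py exactly.

import torch

class PreallocatedKVCache:
    """Schematic pre-allocated KV cache, simplified for illustration."""

    def __init__(self, max_len: int, num_heads: int, head_dim: int,
                 dtype=torch.float16, device="cpu"):
        # One fixed buffer per layer and per key/value; never reallocated during decoding.
        self.data = torch.zeros(1, num_heads, max_len, head_dim, dtype=dtype, device=device)
        self.current_length = 0

    def cat(self, new_states: torch.Tensor) -> torch.Tensor:
        # Copy the new key/value states into the buffer in place, then return a
        # view of everything cached so far.
        added = new_states.shape[2]
        self.data[:, :, self.current_length:self.current_length + added] = new_states
        self.current_length += added
        return self.data[:, :, :self.current_length]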

Evaluation

You can test the speed of EAGLE on MT-bench using the following command.

python -m eagle.evaluation.gen_ea_answer_vicuna (or gen_ea_answer_llama2chat) \
		--ea-model-path [path of EAGLE weight] \
		--base-model-path [path of the original model]

If you need specific acceleration ratios, you also need to run the following command to measure the speed of vanilla autoregressive decoding.

python -m eagle.evaluation.gen_baseline_answer_vicuna (or gen_baseline_answer_llama2chat) \
		--ea-model-path [path of EAGLE weight] \
		--base-model-path [path of the original model]

The above two commands will each generate a .jsonl file recording the generation results and wall time. You can then use evaluation/speed.py to calculate the speed ratio.
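
If you want to compute the ratio by hand, the idea is simply total generated tokens divided by total wall time for each run, and then the quotient of the two throughputs. A minimal sketch is below; the .jsonl field names and file paths are illustrative assumptions, and evaluation/speed.py remains the authoritative script.

import json

def throughput(jsonl_path: str, token_key: str = "new_tokens", time_key: str = "wall_time") -> float:
    # Tokens per second over all records in one generation log (field names assumed).
    tokens, seconds = 0, 0.0
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            value, elapsed = record[token_key], record[time_key]
            tokens += sum(value) if isinstance(value, list) else value
            seconds += sum(elapsed) if isinstance(elapsed, list) else elapsed
    return tokens / seconds

speedup = throughput("eagle_answers.jsonl") / throughput("baseline_answers.jsonl")
print(f"speedup: {speedup:.2f}x")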

🌟 Our Contributors

A heartfelt thank you to all our contributors.


Reference

For technical details and full experimental results, please check the paper of EAGLE and the paper of EAGLE-2.

@inproceedings{li2024eagle, 
	author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang}, 
	title = {EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty}, 
	booktitle = {International Conference on Machine Learning},
	year = {2024}
}
@misc{li2024eagle2fasterinferencelanguage,
      title={EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees}, 
      author={Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
      year={2024},
      eprint={2406.16858},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.16858}, 
}

Acknowledgements

This project has been influenced by many excellent projects in the LLM community, such as Medusa, FastChat, and others. The logo is designed by GPT-4. We also appreciate many valuable discussions with Tianle Cai, Hao Zhang, Ziteng Sun, and others.

eagle's People

Contributors

andreslavescu, cyli-tiger, dtlzhuangz, eltociear, hongyanz, liyuhui-12, sonald, wejoncy, yanjunplay


eagle's Issues

bsne1 branch: "last_hidden = out_hidden[ab,last_nopadding][:,None]" fails

Environment:
cuda 11.8
python 3.8
pip install -r requirements.txt
git checkout bsne1
python example.py

example.py:

from model.ea_model import EaModel
from fastchat.model import get_conversation_template
import torch
model = EaModel.from_pretrained(
    base_model_path="/home/server/models/llamla-2-7b-chat",
    ea_model_path="/home/server/models/EAGLE-LLAMA2-CHAT-7b",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"
)
# left padding
model.eval()
model.tokenizer.padding_side = "left"
model.tokenizer.pad_token = model.tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

sys_p = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."

your_message="Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions."
conv = get_conversation_template("llama-2-chat")
conv.system_message = sys_p
conv.append_message(conv.roles[0], your_message)
conv.append_message(conv.roles[1], None)
prompt1 = conv.get_prompt()+" "

your_message="Hello"
conv = get_conversation_template("llama-2-chat")
conv.system_message = sys_p
conv.append_message(conv.roles[0], your_message)
conv.append_message(conv.roles[1], None)
prompt2 = conv.get_prompt()+" "

input_s=model.tokenizer([prompt1,prompt2],return_tensors="pt",padding=True).to("cuda")
output_ids=model.eagenerate(input_s.input_ids,input_s.attention_mask,temperature=0.0,max_new_tokens=512,top_k=15)
output=model.tokenizer.batch_decode(output_ids)
print(output)

got this failure:

../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [31,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [31,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [31,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [31,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
.....
....
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [2,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [2,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "example.py", line 34, in <module>
    output_ids=model.eagenerate(input_s.input_ids,input_s.attention_mask,temperature=0.0,max_new_tokens=512,top_k=15)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/server/EAGLE-bsne1/model/ea_model.py", line 242, in eagenerate
    input_ids, tree_logits, new_token, hidden_state, sample_token,attention_mask,newfinish_flag,new_outs = update_inference_inputs(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/server/EAGLE-bsne1/model/utils.py", line 514, in update_inference_inputs
    tree_logits = model.ea_layer.topK_genrate(draft_hidden,
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/server/EAGLE-bsne1/model/cnets.py", line 772, in topK_genrate
    last_hidden = out_hidden[ab,last_nopadding][:,None]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

A simple question regarding the paper.

I'm not sure if this is the right place to ask, but I want to check my understanding.

At the end of Section 4.1, I quote:

EAGLE generates a tree-structured draft. To enhance efficiency, we implement tree attention,
enabling the creation of a draft tree with a depth of m through m forward passes,
thereby encompassing more than m tokens.

Should it be "enabling the creation of a draft tree with a depth of m through one forward pass"? Isn't the point of the draft tree to complete a tree of guessed predictions in a single forward pass?

Thanks!

7B config

I'm trying to replicate the training results for the 7B head. Could you share the training config used in main.py please?

About tree_buffer

Hi, thank you for your wonderful work! I would like to ask, why are there two tree_buffers in eagle? One comes from cnets here:

self.tree_buffer=generate_tree_buffers(self.tree,"cpu")
, and the other comes from EaModel:
tree_choices, device=self.base_model.model.layers[-1].self_attn.q_proj.weight.device
. In my test, the tree_indices of these two tree_buffers are different. Why is this? Looking forward to your reply!

Can I get a test setting??

I tried to reproduce the result on 2 A100 80GB GPUs with Llama2 70B Chat, but the speedup was only ~1.4x. Could this be because my base_model generation was already too fast?

Could you share the average number of tokens generated per base-model (Llama2 70B Chat in this case) forward pass, or any other metric used to validate the speed?

Inference on the CPU, the speed improvement is limited

Dear EAGLE Team,

I've made modifications to the EAGLE code to accommodate the Qwen model, and the speed results are quite promising on GPU, with performance enhancements ranging from 2.3 to 3 times faster than the baseline model. However, when running inference on the CPU, the speed results on MT-bench are as follows:

Speed: 6.706688090355101
Speed0: 5.750603114818664
Ratio: 1.1662582091733495

Unfortunately, the speed improvement is only around 1.16x. Could you please provide some suggestions on how to improve the speed on the CPU? Additionally, I'm curious what results you obtained when running inference on the CPU.

cpu's config:
model name : Intel(R) Xeon(R) Platinum 8452Y
cpu MHz : 2000.000
cache size : 69120 KB
physical id : 1
siblings : 72
core id : 35
cpu cores : 36
apicid : 199
initial apicid : 199

Difference between Eagle and SpecInfer

Hi EAGLE team, I am one of the authors of SpecInfer (https://arxiv.org/pdf/2305.09781.pdf). I just noticed your work on speculative inference of LLMs, and I am curious about some content in your blog (https://sites.google.com/view/eagle-llm), such as the token tree structure and multi-round speculative sampling: what is the difference compared with the token tree design and multi-step speculative sampling in our SpecInfer paper? As you mentioned Medusa, they cited our work to acknowledge our contributions on tree attention (https://sites.google.com/view/medusa-llm).

train configuration

train_config={
"lr":args.lr,
"bs":args.bs,
"gradient_accumulation_steps":args.gradient_accumulation_steps,
"datapath":f"{args.tmpdir}",
"is_warmup":True,
"num_epochs":200,
"num_warmup_steps":2000,
"total_steps":800000,
"p_w":0.1,
"v_w":1.0,
"head_w":0.1,
"num_workers":2,
"embeding":True,
"act":"No",
"data_noise":True,
"noise":"uniform",
"mean":0.0,
"std":0.2,
"residual":"true,norm",
"max_len":2048,
"config_path":args.configpath,
"b1":0.9,
"b2": 0.95,
"grad_clip": 0.5,
}
I'm trying to retrain the autoregression head with your training code.

Is this train_config used for every autoregression head on the Hugging Face hub? 200 epochs seems like a lot to me. If this is not the exact config used for training https://huggingface.co/yuhuili/EAGLE-llama2-chat-70B, could you share the train_config used for yuhuili/EAGLE-llama2-chat-70B?

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:7)

Hello, I have reproduced it with vicuna-7b and llama-2-7b-chat on 8x L40S, and the results are quite amazing:

vicuna-7b-v1.3:
speed 85.4037544546903
speed0 28.96434457451452
ratio 2.9485823245534903
llama-2-7b-chat:
speed 83.18835319185268
speed0 29.447255352096626
ratio 2.824995137821214

However, when I tried llama13b-chat, I encountered the following problem:

# python3 -m evaluation.gen_ea_answer_llama2chat --base-model-path /mnt/data3/LLaMa2-13B-chat-hf/LLaMa2-13B-chat-hf/ --ea-model-path /mnt/data3/models/EAGLE-llama2-chat-13B/ --model-id llama-2-13B-ea
Output to data/mt_bench/model_answer/llama-2-13B-ea-temperature-1.0.jsonl
Loading checkpoint shards: 100%|██████████████████████| 3/3 [00:22<00:00,  7.42s/it]
Check model training state: False
CUDA VISIBLE DEVICES: 0,1,2,3,4,5,6,7
Traceback (most recent call last):
  File "/home/condaenv/.python_libs/conda_env/eagle/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/condaenv/.python_libs/conda_env/eagle/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/xxx/EAGLE/evaluation/gen_ea_answer_llama2chat.py", line 477, in <module>
    run_eval(
  File "/home/xxx/EAGLE/evaluation/gen_ea_answer_llama2chat.py", line 150, in run_eval
    get_answers_func(
  File "/home/condaenv/.python_libs/conda_env/eagle/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/xxx/EAGLE/evaluation/gen_ea_answer_llama2chat.py", line 239, in get_model_answers
    output_ids, new_token, idx = ea_forward(
  File "/home/xxx/EAGLE/evaluation/gen_ea_answer_llama2chat.py", line 63, in ea_forward
    tree_logits, logits,hidden_state,sample_token = initialize_tree(
  File "/home/xxx/EAGLE/model/utils.py", line 164, in initialize_tree
    tree_logits, outputs, logits,hidden_state,sample_token = model(
  File "/home/condaenv/.python_libs/conda_env/eagle/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xxx/EAGLE/model/ea_model.py", line 143, in forward
    ea_logits = self.ea_layer.topK_genrate(hidden_states, input_ids, self.base_model.lm_head, logits_processor)
  File "/home/condaenv/.python_libs/conda_env/eagle/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/xxx/EAGLE/model/cnets.py", line 830, in topK_genrate
    select_index=topk_index[self.tree_buffer['tree_indices'][i]]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:7)

When I run the following command it works well:

# python3 -m evaluation.gen_baseline_answer_llama2chat --base-model-path /mnt/data3/LLaMa2-13B-chat-hf/LLaMa2-13B-chat-hf/ --ea-model-path /mnt/data3/models/EAGLE-llama2-chat-13B/ --model-id llama-2-13B-base

No pytorch.bin is generated after training is done

I have rerun the whole training pipeline for the vicuna 7b model.
The training ended successfully. However, no pytorch.bin was generated. May I know why?

The following folders and files are what I get.
Folders:

state_0  state_10  state_12  state_15  state_17  state_2   state_3  state_6  state_8
state_1  state_11  state_13  state_16  state_18  state_20  state_5  state_7

Files:

config.json  model_1.safetensors  model.safetensors  optimizer.bin  random_states_0.pkl  scheduler.bin

Support Chinese Task

We tested a small number of Chinese tasks (about 50) on Vicuna (7B, 13B) and found that the acceleration ratio for Chinese tasks was lower than for English tasks. Is this in line with expectations? Here are some results:

                      vicuna-7b    vicuna-13b
Baseline (tokens/s)   39.63        23.13
EAGLE (tokens/s)      65.59        42.27
Ratio                 1.66         1.83

CUDA out of memory

When I run the script as python ge_data_all_vicuna.py --start=0 --end=17000 --index=0 --gpu_index 0 --outdir ./data_generated/sharegpt_0_67999_mufp16, I get "CUDA out of memory". I use a single A100 GPU with 80 GB of memory and the vicuna-13B model. I checked the data and the longest input_ids sequence is 20000+ tokens, which undoubtedly exceeds the memory limit. How do you generate the data?

Question on draft process

I had a question about the sampling strategy used while building the token tree. I noticed that you sample the next layer using torch.multinomial; why not use torch.topk instead? Choosing the top-k most likely tokens seems more reasonable, and SpecInfer also uses the top-k candidates to expand its token tree.

vicuna 7b oom

I'm trying to retrain the autoregression head with your training code on Vicuna 7B.
I have 8 V100 GPUs with 32 GB each, but even with bs=1 it still runs out of memory.

My environment is as follow:
CUDA 11.7
python 3.10
pytorch 2.1.2
transformers 4.37.2
accelerate 0.27.2

By the way, I used the code from the bsne1 branch.

Best regards

Can I change the choices tree as I want??

Is it possible to change choices.py and use a different tree architecture, or is there a way to run without the tree-related decoding process?
Also, do you have any results on how many draft tokens are accepted per base-model verification?

Small Typo in Blog

The algorithm definition has a typo on line 7:

  • missing closing bracket for the norm function.

Head accuracy

Can you provide the code to measure Head accuracy?

VLLM contribution

Thanks for this great repo. I would like to run EAGLE with vLLM and to contribute to the vLLM implementation. If you are already working on a branch, I would like to help; if not, could you point me to the changes that need to be made? That would be most helpful.

Support vLLM

How is the work on vLLM support progressing?

About project structure

Thank you for your work! Regarding the project structure, I would like to know the design purpose of the modeling_eagle and ea_model source files. It appears that both describe the structure of the original model plus a single decoder layer. Is modeling_eagle specifically designed for inference with a custom model?

Generation quality loss

Hi, I noticed you mentioned in paper that

evaluating the quality of EAGLE’s generated results is both unnecessary and meaningless.

But based on my experiments on both llama2-chat-7b and Qwen-chat-7b, EAGLE's generation quality declined on C-Eval and HumanEval. The results are attached as an image.

Baseline is the output from hf model.generate() , eagle is the output from ea_model.eagenerate().

Have you done similar experiments? Any clues?

run with vllm

Thanks for this great repo. I would like to know the progress on vLLM support, or could you point me to the major changes that would need to be made? That would be very helpful.

failed to reproduce given speedup results

hi EAGLE team,

thank you for the great work!
We failed to reproduce your speedup results on MI250 (ROCm). Our experimental results for llama2-13b-chat are attached as a screenshot.
Are your results really dependent on the hardware, or are we missing something crucial?

best.

Release Mixtral Training Code

Hi, I am interested in training an EAGLE head for Mixtral. Are there plans to release the training code anytime soon?

The number of candidate nodes and the maximum prediction length setting problem in the candidate tree

Thank you for your excellent work. I have read your code and have some questions about the candidate tokens. In the code, the designed tree has 26 nodes, which means the seq_len in the decode layer is 26 each time. Does this 26 have any special meaning, or is it simply the best result from experiments? On some edge devices, or when computing power is insufficient, the long seq_len combined with a low hit rate causes performance degradation, so I wonder whether I could reduce the number of candidate tokens, or whether this could be exposed as a hyperparameter. I hope to hear your thoughts.
@Liyuhui-12

RuntimeError (porting inference on kaggle)

RuntimeError: Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
Failed to import transformers.generation.utils because of the following error (look up to see its traceback):
cannot import name 'builder' from 'google.protobuf.internal' (/opt/conda/lib/python3.10/site-packages/google/protobuf/internal/__init__.py)

OSError when download weights from huggingface

OSError: Can't load tokenizer for 'yuhuili/EAGLE-llama2-chat-7B'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'yuhuili/EAGLE-llama2-chat-7B' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

Data processing script (ge_data/allocation.py) script does not work out of the box

There are a few small issues:

  1. The model is set to load from a local file instead of from the Hugging Face Hub (https://github.com/SafeAILab/EAGLE/blob/main/ge_data/ge_data_all_vicuna.py#L22). To fix this, I just set bigname='lmsys/vicuna-13b-v1.3' at that line in ge_data_all_vicuna.py.

  2. The ShareGPT dataset loads from local disk, without instructions for how to download it. To fix this, I downloaded the ShareGPT dataset from https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V4.3_unfiltered_cleaned_split.json
    wget https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V4.3_unfiltered_cleaned_split.json .

ValueError: not enough values to unpack (expected 2, got 1)

Hi, I encountered a problem when benchmarking throughput. Here is the traceback:

Traceback (most recent call last):
File "/disk1/EAGLE/test_throughput.py", line 14, in
output_ids=model.eagenerate(input_ids,temperature=0.5,max_new_tokens=512)
File "/home/disk1/.conda/envs/EAGLE/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/disk1/EAGLE/model/ea_model.py", line 203, in eagenerate
tree_logits, logits, hidden_state, sample_token = initialize_tree(
File "/disk1/EAGLE/model/utils.py", line 164, in initialize_tree
tree_logits, outputs, logits,hidden_state,sample_token = model(
File "/home/disk1/.conda/envs/EAGLE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/disk1/EAGLE/model/ea_model.py", line 146, in forward
ea_logits = self.ea_layer.topK_genrate(hidden_states,input_ids,self.base_model.lm_head,logits_processor)
File "/home/disk1/.conda/envs/EAGLE/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/disk1/EAGLE/model/cnets.py", line 801, in topK_genrate
out_hidden, past_key_values = self(hidden_states, input_ids=input_ids, use_cache=True)
ValueError: not enough values to unpack (expected 2, got 1)

Maybe there is a bug in the forward? I've noticed that this forward only returns one value, hidden_states.

Weird Runtime Error during Inference

I have a fine-tuned Llama-2 13B chat model saved locally on my system [Linux OS].

When I run the following code:

from model.ea_model import EaModel
from pathlib import Path
import torch
from fastchat.model import get_conversation_template
import os

device = "cuda:0"
base_model_path = "llama"
EAGLE_model_path = "yuhuili/EAGLE-llama2-chat-13B"
model = EaModel.from_pretrained(
    base_model_path=base_model_path,
    ea_model_path=EAGLE_model_path,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    # map_location='cuda:0'
    device_map=device
)
model.eval()

ues_llama_2_chat = 1
use_vicuna = 0
your_message = "Hello"

if ues_llama_2_chat:
    conv = get_conversation_template("llama-2-chat")
    sys_p = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
    conv.system_message = sys_p
    conv.append_message(conv.roles[0], your_message)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt() + " "

if use_vicuna:
    conv = get_conversation_template("vicuna")
    conv.append_message(conv.roles[0], your_message)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

input_ids = model.tokenizer([prompt]).input_ids
input_ids = torch.as_tensor(input_ids).to(device)
output_ids = model.eagenerate(input_ids, temperature=0.5, max_new_tokens=512)
output = model.tokenizer.decode(output_ids[0])
print(output)

I am getting the following error -

../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [267,0,0], thread: [100,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [267,0,0], thread: [101,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [267,0,0], thread: [102,0,0] Assertion srcIndex < srcSelectDimSize failed.
... (the same assertion repeats for threads [103,0,0] through [127,0,0]) ...
Traceback (most recent call last):
File "/home/azureuser/tensorrtllm_backend/llama-experiments/EAGLE/sample_code.py", line 45, in
output_ids=model.eagenerate(input_ids,temperature=0.5,max_new_tokens=512)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/azureuser/tensorrtllm_backend/llama-experiments/EAGLE/model/ea_model.py", line 218, in eagenerate
tree_logits, logits, hidden_state, sample_token = initialize_tree(
File "/home/azureuser/tensorrtllm_backend/llama-experiments/EAGLE/model/utils.py", line 164, in initialize_tree
tree_logits, outputs, logits,hidden_state,sample_token = model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/azureuser/tensorrtllm_backend/llama-experiments/EAGLE/model/ea_model.py", line 159, in forward
ea_logits = self.ea_layer.topK_genrate(hidden_states, input_ids, self.base_model.lm_head, logits_processor)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/azureuser/tensorrtllm_backend/llama-experiments/EAGLE/model/cnets.py", line 867, in topK_genrate
topk_index,topk_prob=self.sample(last_headout,logits_processor,k=top_k,)
File "/home/azureuser/tensorrtllm_backend/llama-experiments/EAGLE/model/cnets.py", line 805, in sample
sampled_indices = torch.multinomial(probabilities, k, replacement=False)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

When I run the above code after setting os.environ["CUDA_VISIBLE_DEVICES"] = "1", I get the following runtime error:

Traceback (most recent call last):
File "/home/azureuser/tensorrtllm_backend/llama-experiments/EAGLE/sample_code.py", line 12, in
model = EaModel.from_pretrained(
File "/home/azureuser/tensorrtllm_backend/llama-experiments/EAGLE/model/ea_model.py", line 118, in from_pretrained
ea_layer_state_dict = torch.load(load_model_path,
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 809, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1172, in _load
result = unpickler.load()
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1142, in persistent_load
typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1116, in load_tensor
wrap_storage=restore_location(storage, location),
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1086, in restore_location
return default_restore_location(storage, str(map_location))
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 217, in default_restore_location
result = fn(storage, location)
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 182, in _cuda_deserialize
device = validate_cuda_device(location)
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 173, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on CUDA device '
RuntimeError: Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0. Please use torch.load with map_location to map your storages to an existing device.

KV Cache initialization throwing an error

When running the sample code for bs>1 (on that branch), I get the following error when the KV cache is initialized. I'm running this inference directly on a Xeon CPU.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[16], line 35
     32 prompt2 = conv.get_prompt()+" "
     34 input_s=model.tokenizer([prompt1,prompt2],return_tensors="pt",padding=True).to("cpu")
---> 35 output_ids=model.eagenerate(input_s.input_ids,input_s.attention_mask,temperature=0.0,max_new_tokens=512,top_k=15)
     36 output=model.tokenizer.batch_decode(output_ids)
     37 print(output)

File ~/anaconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File /mnt/BigDisk1T/haim/EAGLE/model/ea_model.py:205, in EaModel.eagenerate(self, input_ids, attention_mask, temperature, top_p, top_k, max_new_tokens, max_length, tree_choices, log)
    199     current_length_data.zero_()
    200 else:
    201     (
    202         past_key_values,
    203         past_key_values_data,
    204         current_length_data,
--> 205     ) = initialize_past_key_values(self.base_model,bs=bs)
    206     self.past_key_values = past_key_values
    207     self.past_key_values_data = past_key_values_data

File /mnt/BigDisk1T/haim/EAGLE/model/kv_cache.py:142, in initialize_past_key_values(model, bs)
    139         bias=0
    140         start_data_m=data_m
    141     past_key_values.append(
--> 142         [
    143             KVCache(past_key_values_data_list[data_m-devices[0].index][2*bias + j], current_length_data[i * 2 + j])
    144             for j in range(2)
    145         ]
    146     )
    147     bias+=1
    148 return past_key_values, past_key_values_data_list, current_length_data

File /mnt/BigDisk1T/haim/EAGLE/model/kv_cache.py:143, in <listcomp>(.0)
    139         bias=0
    140         start_data_m=data_m
    141     past_key_values.append(
    142         [
--> 143             KVCache(past_key_values_data_list[data_m-devices[0].index][2*bias + j], current_length_data[i * 2 + j])
    144             for j in range(2)
    145         ]
    146     )
    147     bias+=1
    148 return past_key_values, past_key_values_data_list, current_length_data

TypeError: unsupported operand type(s) for -: 'NoneType' and 'NoneType'

Question about manual datasets & some details of interest

Thanks for your great work. I just have a couple of questions out of curiosity.

First, wouldn't it be too large to store the hidden states directly as a .ckpt file? How big is it?
Second, could you provide ablation experiments on the cnet and the decoding methods?

runtime

(screenshot of the runtime error; the original attachment name was garbled)
How can I solve this problem? I cloned the project as-is, transferred it to the machine, and ran this evaluation directly without changing the code; the models were downloaded from Hugging Face. The only changes I made were targeted fixes for runtime errors, which were needed to get it to run at all (without them, errors are still reported).

About reproduction baseline results

Thank for sharing your great works!

We are doing reproduction of your method for research purpose and found that the Medusa inference for baseline is also reported in your blog. We tried to check the speed of both EAGLE and Medusa methods with Llama2 70B Chat, but I guess official repo of Medusa doesn’t support Llama2 architecture at inference(maybe Medusa KV cache doesn’t match with Llama2).
It would be thankful if you can provide your Medusa inference code with Llama2 70B chat so that we can cross-check EAGLE has far better acceleration on baseline models.

Thanks for reading.

Can EAGLE actually improve throughput?

It seems that EAGLE always needs a complete forward pass over the tree candidates for verification, which appears to increase the overall FLOPs. For example, with "mc_sim_7b_63", each iteration requires computing 26 candidate tokens, yet only about two tokens end up being accepted.

Can Eagle improve the inference throughput in continuous batch mode?

I have a question for the authors: in continuous batching mode, when compute is already fully utilized, is there no way to further increase inference throughput with EAGLE? After EAGLE generates candidate tokens/sequences, the verification phase still has to call the original model. Since EAGLE essentially spends the compute of x autoregressive-head passes in the generation phase plus one original-model pass to generate multiple tokens, it reduces inference latency; but because verification still demands substantial compute, does the throughput gain become less significant? If using EAGLE requires reducing the batch size, it may not be cost-effective. Should we run a baseline in continuous batching mode without EAGLE, fully utilizing compute, and measure throughput, and then do the same with EAGLE to see both the throughput and how large a batch it can support?
