vinairesearch / phogpt
PhoGPT: Generative Pre-training for Vietnamese (2023)
License: Apache License 2.0
I see the instructions use CUDA; can I run this model on a CPU instead? What configuration would I need for it to run well?
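(A hedged sketch, not from the official docs: the config used in the README example exposes init_device, so CPU loading should amount to setting it to "cpu" and loading float32 weights, since many CPUs lack fast bfloat16. Expect slow generation for a 7.5B-parameter model.)
# Untested sketch for CPU-only loading; same API as the README example,
# with init_device switched to "cpu" and float32 weights.
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_path = "vinai/PhoGPT-7B5-Instruct"
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
config.init_device = "cpu"

model = AutoModelForCausalLM.from_pretrained(
    model_path, config=config, torch_dtype=torch.float32, trust_remote_code=True
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)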
Hi!
Thank you for your excellent work!
I got this error when trying to run the model on Google Colab, using the example code from the README file:
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_path = "vinai/PhoGPT-7B5-Instruct"
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
config.init_device = "cuda"
# config.attn_config['attn_impl'] = 'triton' # Enable if "triton" is installed!
model = AutoModelForCausalLM.from_pretrained(
    model_path, config=config, torch_dtype=torch.bfloat16, trust_remote_code=True
)
# If your GPU does not support bfloat16:
# model = AutoModelForCausalLM.from_pretrained(model_path, config=config, torch_dtype=torch.float16, trust_remote_code=True)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
PROMPT = "### Câu hỏi:\n{instruction}\n\n### Trả lời:"
input_prompt = PROMPT.format_map(
    {"instruction": "Làm thế nào để cải thiện kỹ năng quản lý thời gian?"}
)
input_ids = tokenizer(input_prompt, return_tensors="pt")
outputs = model.generate(
    inputs=input_ids["input_ids"].to("cuda"),
    attention_mask=input_ids["attention_mask"].to("cuda"),
    do_sample=True,
    temperature=1.0,
    top_k=50,
    top_p=0.9,
    max_new_tokens=1024,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
response = response.split("### Trả lời:")[1]
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
[<ipython-input-8-2cf92868f260>](https://localhost:8080/#) in <cell line: 13>()
11 # )
12 # If your GPU does not support bfloat16:
---> 13 model = AutoModelForCausalLM.from_pretrained(model_path, config=config, torch_dtype=torch.float16, trust_remote_code=True)
14 model.eval()
15
11 frames
[~/.cache/huggingface/modules/transformers_modules/vinai/PhoGPT-7B5-Instruct/8083375bebd52681090be6ebaf8bae7aee491f73/hf_prefixlm_converter.py](https://localhost:8080/#) in <module>
13 import torch
14 from transformers.models.bloom.modeling_bloom import BaseModelOutputWithPastAndCrossAttentions, BloomForCausalLM, BloomModel, CausalLMOutputWithCrossAttentions, CrossEntropyLoss
---> 15 from transformers.models.bloom.modeling_bloom import _expand_mask as _expand_mask_bloom
16 from transformers.models.bloom.modeling_bloom import _make_causal_mask as _make_causal_mask_bloom
17 from transformers.models.bloom.modeling_bloom import logging
ImportError: cannot import name '_expand_mask' from 'transformers.models.bloom.modeling_bloom' (/usr/local/lib/python3.10/dist-packages/transformers/models/bloom/modeling_bloom.py)
I don't know how to fix it, but I found the same issue at: https://huggingface.co/mosaicml/mpt-7b/discussions/83
Can you tell me how to run it?
Thank you!
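(A possible workaround, inferred from the linked mpt-7b discussion rather than confirmed for PhoGPT: the _expand_mask helper was removed from transformers' modeling_bloom in the attention-mask refactor, so pinning an earlier release, e.g. pip install "transformers<4.35", may restore the import.)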
Hi team,
As the model card says, PhoGPT uses ALiBi for context-length extrapolation. Has the team tested the maximum effective context length of PhoGPT?
Thanks for the first GPT foundation model for Vietnamese
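(For reference, the GGUF metadata dumped later in this thread reports mpt.context_length = 8192 as the training context; ALiBi is generally expected to extrapolate somewhat beyond the training length, but how far remains an empirical question.)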
Hello, I did some testing on the 4-bit and 8-bit versions of PhoGPT. I ran into an issue with the 4-bit version; details below:
Environment:
PhoGPT Version: 4-bit
Execution Environment: Google Colab with T4 GPU
Issue Description:
When using the 4-bit version of PhoGPT with the initialization code provided in the documentation, the model returns an incomplete response. Specifically, it returns only a newline character \n, in contrast to the 8-bit version, which functions correctly and returns a comprehensive output.
Steps to Reproduce:
Initialize the 4-bit PhoGPT model using the sample code from the official documentation.
Use instruction = "Viết bài văn nghị luận xã hội về an toàn giao thông"
Observe that the response is only a newline character, indicating an incomplete or failed generation.
Expected Behavior:
The 4-bit version of PhoGPT should return a complete and coherent response similar to the 8-bit version, which returns detailed and lengthy outputs.
Actual Behavior:
The 4-bit version outputs only a newline character \n, indicating an error or issue in processing the input prompt.
Congratulations! This is great work!!! Particularly, for the Vietnamese community around the world!
Do you have any plans for releasing an online demo of PhoGPT (perhaps a HuggingFace online demo)? :)
Hi Mr Dat,
First of all, thank you so much for this great project.
I am reaching out to seek clarification regarding the distinctions between the non-instruct and instruct versions of PhoGPT.
I've come across discussions about these two versions and would greatly appreciate it if you could provide some insights into their key differences. Specifically, I am curious about how their functionalities vary and what specific use cases each version is optimized for.
Thank you very much for your time and assistance. I look forward to gaining a better understanding of the distinctions between the non-instruct and instruct versions of PhoGPT.
I ran the sample fine-tuning with default settings on an Nvidia A100 40GB but got OOM. Can fine-tuning succeed on Nvidia A100 40GB hardware? Are there any updated settings to make it work? Thank you.
Hello. Thank you very much for your work.
I have carefully reviewed and adhered to the guidelines provided for fine-tuning the PhoGPT-4B model. Nonetheless, despite specifying a maximum sequence length of 2048 and a global training batch size of 1, I am encountering Out of Memory issues. My GPU is an RTX 4090 with 24 GB.
Do you have any ideas to solve this problem?
Wishing you all the best!
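(Not from the PhoGPT docs, but a common way to cut fine-tuning memory on a single consumer GPU is parameter-efficient fine-tuning with LoRA via the peft library instead of full fine-tuning. A minimal sketch; the target_modules name "Wqkv" is an assumption based on the MPT-style attention projection, not verified for PhoGPT:)
# Hedged sketch: LoRA fine-tuning setup with peft; "Wqkv" is an assumed
# MPT-style attention projection name, not confirmed for PhoGPT.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "vinai/PhoGPT-4B", torch_dtype=torch.bfloat16, trust_remote_code=True
)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["Wqkv"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable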
I hope that PhoGPT will have an AWQ or GPTQ version for running on low VRAM GPUs.
Do you have any plans for quantization? It will make this LLM more popular for students and individual researchers who have limited computing resources.
Thank you for your hard work!
Dear VinAI team,
Thank you for sharing your work with us. I tried to use your model (PhoGPT tokenizer) and set the max length to 8192, but the tokenizer's output did not add any padding tokens. Here is an example:
phogpt_tokenizer = AutoTokenizer.from_pretrained("vinai/PhoGPT-4B", trust_remote_code=True)
print(
    phogpt_tokenizer(
        "Đây là câu hỏi",
        max_length=8192,
        truncation=True,
        padding=True,
    )
)
The output is:
{'input_ids': [2985, 270, 1117, 1378], 'attention_mask': [1, 1, 1, 1]}
You can see that the output token list only has 4 tokens. Should it be 8192 tokens instead?
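(For what it's worth, in the standard Hugging Face tokenizer API, padding=True only pads to the longest sequence in the batch, which is a no-op for a single input; padding to a fixed length requires padding="max_length". A minimal sketch under that assumption:)
# Sketch: padding="max_length" pads a single sequence up to max_length with
# <pad> tokens, whereas padding=True pads only to the longest item in the batch.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/PhoGPT-4B", trust_remote_code=True)
encoded = tokenizer(
    "Đây là câu hỏi",
    max_length=8192,
    truncation=True,
    padding="max_length",
)
print(len(encoded["input_ids"]))  # expected: 8192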
I successfully converted the model to the GGUF format using llama.cpp's convert-hf-to-gguf.py script:
cd ~/.models
git clone --progress --verbose https://huggingface.co/vinai/PhoGPT-4B-Chat
cd ~/llama.cpp
python3 convert-hf-to-gguf.py ~/.models/PhoGPT-4B-Chat --outfile ~/.models/pho.gguf
Output
Loading model: PhoGPT-4B-Chat
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
...
output_norm.bias, n_dims = 1, torch.bfloat16 --> float32
Model successfully exported to '/home/username/.models/pho.gguf'
I tried inference and this error showed up:
cd
./llama.cpp/main -m ./.models/pho.gguf -p "xin chào"
Log start
main: build = 2101 (b7b74cef)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed = 1707968957
llama_model_loader: loaded meta data with 18 key-value pairs and 388 tensors from ./models/pho.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mpt
llama_model_loader: - kv 1: general.name str = PhoGPT-4B-Chat
llama_model_loader: - kv 2: mpt.context_length u32 = 8192
llama_model_loader: - kv 3: mpt.embedding_length u32 = 3072
llama_model_loader: - kv 4: mpt.block_count u32 = 32
llama_model_loader: - kv 5: mpt.feed_forward_length u32 = 12288
llama_model_loader: - kv 6: mpt.attention.head_count u32 = 24
llama_model_loader: - kv 7: mpt.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 8: mpt.attention.max_alibi_bias f32 = 8.000000
llama_model_loader: - kv 9: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 10: tokenizer.ggml.tokens arr[str,20480] = ["<unk>", "<s>", "</s>", "<pad>", "!"...
llama_model_loader: - kv 11: tokenizer.ggml.token_type arr[i32,20480] = [3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 12: tokenizer.ggml.merges arr[str,20266] = ["á »", "á º", "Ġ t", "n g", "Ġ...
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 15: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 16: tokenizer.ggml.padding_token_id u32 = 3
llama_model_loader: - kv 17: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - type f32: 258 tensors
llama_model_loader: - type f16: 130 tensors
llm_load_vocab: special tokens definition check successful ( 4/20480 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mpt
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 20480
llm_load_print_meta: n_merges = 20266
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 24
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 3072
llm_load_print_meta: n_embd_v_gqa = 3072
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 8.0e+00
llm_load_print_meta: n_ff = 12288
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = all F32 (guessed)
llm_load_print_meta: model params = 3.75 B
llm_load_print_meta: model size = 6.99 GiB (16.01 BPW)
llm_load_print_meta: general.name = PhoGPT-4B-Chat
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 3 '<pad>'
llm_load_print_meta: LF token = 130 'Ä'
llm_load_tensors: ggml ctx size = 0.15 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 388, got 195
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './.models/pho.gguf'
main: error: unable to load model
I am running this code with the non-instruct model, but the example code seems adapted to the instruct version only. Could you provide the PROMPT format for the model without instruction tuning?
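(Not an official answer; a common assumption for base checkpoints is that no template is needed at all: the non-instruct model is a plain causal LM, so you can prompt it with raw text and let it continue. A minimal sketch:)
# Sketch (assumption): the base, non-instruct checkpoint is prompted with
# plain text and simply continues it; no "### Câu hỏi / ### Trả lời" template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "vinai/PhoGPT-4B"  # illustrative choice of base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.eval()

inputs = tokenizer("Hà Nội là", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])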
What are the minimum and recommended hardware requirements for running PhoGPT?
Thank you for publishing the project.
I would like to test the model on my local computer through an OpenAI-compatible API, and I see that vLLM is the appropriate project to make that happen.
I would appreciate some advice on the changes needed to get the code running on vLLM.
I truly appreciate your help.
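(A hedged sketch of vLLM's offline Python API, assuming vLLM's MPT backend covers PhoGPT's architecture; not verified against this model. vLLM also ships an OpenAI-compatible HTTP server on top of the same engine.)
# Sketch: offline generation through vLLM, assuming its MPT support loads
# PhoGPT via trust_remote_code; swap in the checkpoint you are testing.
from vllm import LLM, SamplingParams

llm = LLM(model="vinai/PhoGPT-4B-Chat", trust_remote_code=True)
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["### Câu hỏi:\nXin chào!\n\n### Trả lời:"], params)
print(outputs[0].outputs[0].text)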
To fully fine-tune vinai/PhoGPT-7B5 or vinai/PhoGPT-7B5-Instruct on a single A100 GPU with 40GB memory, it is advisable to employ the decoupled_lionw optimizer with device_train_microbatch_size set to 1.
What about the minimum and recommended hardware requirements when fine-tuning vinai/PhoGPT-7B5 with vLLM?
Hello,
The model has been trained on the A100 GPU. However, I am wondering about the GPU memory cost during inference.
Currently, I have a 3060 GPU with 12 GB of VRAM. Can it be used for running inference?
Thank you
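(Rough arithmetic, not a benchmark: the GGUF dump earlier in this thread reports about 3.75 B parameters and roughly 7 GiB at 16 bits per weight, so the 4B model should fit in 12 GB of VRAM for inference; the 4-bit bitsandbytes recipe quoted later in this thread would reduce that further.)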
I want to perform fine-tuning on the PhoGPT model with the goal of answering questions from a given text source. What format should the data have to make this possible?
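(Not an official spec; a hedged sketch that simply mirrors the "### Câu hỏi / ### Trả lời" template from the README: store instruction-response pairs, e.g. as JSONL, and render each pair into that template for causal-LM fine-tuning.)
# Hypothetical data-prep sketch: render (instruction, response) pairs into
# the README's prompt template. File name and field names are illustrative.
import json

TEMPLATE = "### Câu hỏi:\n{instruction}\n\n### Trả lời:\n{response}"

examples = [
    {"instruction": "Dựa vào đoạn văn sau, ...", "response": "..."},
]
with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps({"text": TEMPLATE.format_map(ex)}, ensure_ascii=False) + "\n")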
Hello,
I was able to use PhoGPT in CPU-only mode with llama.cpp. I would like to convert the model into llama.cpp's own format (GGUF) and post it to HuggingFace (similar to this repo, for example). I have gone through the LICENSE file and it looks like this would be fine. However, I still want to make sure: am I allowed to perform such an action?
Regards,
We temporarily set the instruct version private to reinvestigate its safety level. We will make it public within this week.
As far as I know from reading your paper, you used data created by ChatGPT, yet you release this as a commercial model. OpenAI's terms of use state that you cannot use output from the Services to develop models that compete with OpenAI, so please check your license.
https://openai.com/policies/terms-of-use
Hi, thanks for the great work.
I'm new to the Vietnamese language-modeling scene. I came across some major articles from 2019-2021 where people treated word segmentation as the standard step before tokenization, which I appreciate but am still not quite sure is actually necessary. I cannot find much information to answer that myself.
Then I took a look at your paper: you trained a BPE tokenizer (a kind of sub-word tokenization). I have a few questions:
Hello,
I am curious whether you have any plans to build a bilingual English-Vietnamese language model in the future?
Thank you
See: https://huggingface.co/docs/transformers/main/en/quantization#bitsandbytes
import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "vinai/PhoGPT-4B-Chat",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)
Or:
import torch
from transformers import BitsAndBytesConfig, AutoConfig, AutoModelForCausalLM, AutoTokenizer

config = AutoConfig.from_pretrained("vinai/PhoGPT-4B-Chat", trust_remote_code=True)
config.init_device = "cuda"
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "vinai/PhoGPT-4B-Chat",
    quantization_config=quantization_config,
    config=config,
    trust_remote_code=True,
)