vinairesearch / phogpt
PhoGPT: Generative Pre-training for Vietnamese (2023)
License: Apache License 2.0
I see the instructions use CUDA; can I run this model on a CPU instead? What configuration would I need for it to run well?
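(A hedged sketch, not from the official docs: the config used in the README example exposes init_device, so CPU loading should amount to setting it to "cpu" and loading float32 weights, since many CPUs lack fast bfloat16. Expect slow generation for a 7.5B-parameter model.)
# Untested sketch for CPU-only loading; same API as the README example,
# with init_device switched to "cpu" and float32 weights.
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_path = "vinai/PhoGPT-7B5-Instruct"
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
config.init_device = "cpu"

model = AutoModelForCausalLM.from_pretrained(
    model_path, config=config, torch_dtype=torch.float32, trust_remote_code=True
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)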
Hi!
Thank you for your excellent work!
I got this error when trying to run the model on Google Colab, using the example code from the README file:
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_path = "vinai/PhoGPT-7B5-Instruct"
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
config.init_device = "cuda"
# config.attn_config['attn_impl'] = 'triton' # Enable if "triton" is installed!
model = AutoModelForCausalLM.from_pretrained(
    model_path, config=config, torch_dtype=torch.bfloat16, trust_remote_code=True
)
# If your GPU does not support bfloat16:
# model = AutoModelForCausalLM.from_pretrained(model_path, config=config, torch_dtype=torch.float16, trust_remote_code=True)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
PROMPT = "### Câu hỏi:\n{instruction}\n\n### Trả lời:"
input_prompt = PROMPT.format_map(
    {"instruction": "Làm thế nào để cải thiện kỹ năng quản lý thời gian?"}
)
input_ids = tokenizer(input_prompt, return_tensors="pt")
outputs = model.generate(
    inputs=input_ids["input_ids"].to("cuda"),
    attention_mask=input_ids["attention_mask"].to("cuda"),
    do_sample=True,
    temperature=1.0,
    top_k=50,
    top_p=0.9,
    max_new_tokens=1024,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
response = response.split("### Trả lời:")[1]
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
[<ipython-input-8-2cf92868f260>](https://localhost:8080/#) in <cell line: 13>()
11 # )
12 # If your GPU does not support bfloat16:
---> 13 model = AutoModelForCausalLM.from_pretrained(model_path, config=config, torch_dtype=torch.float16, trust_remote_code=True)
14 model.eval()
15
11 frames
[~/.cache/huggingface/modules/transformers_modules/vinai/PhoGPT-7B5-Instruct/8083375bebd52681090be6ebaf8bae7aee491f73/hf_prefixlm_converter.py](https://localhost:8080/#) in <module>
13 import torch
14 from transformers.models.bloom.modeling_bloom import BaseModelOutputWithPastAndCrossAttentions, BloomForCausalLM, BloomModel, CausalLMOutputWithCrossAttentions, CrossEntropyLoss
---> 15 from transformers.models.bloom.modeling_bloom import _expand_mask as _expand_mask_bloom
16 from transformers.models.bloom.modeling_bloom import _make_causal_mask as _make_causal_mask_bloom
17 from transformers.models.bloom.modeling_bloom import logging
ImportError: cannot import name '_expand_mask' from 'transformers.models.bloom.modeling_bloom' (/usr/local/lib/python3.10/dist-packages/transformers/models/bloom/modeling_bloom.py)
I don't know how to fix it, but I found the same issue at: https://huggingface.co/mosaicml/mpt-7b/discussions/83
Can you tell me how to run it?
Thank you!
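(A possible workaround, inferred from the linked mpt-7b discussion rather than confirmed for PhoGPT: the _expand_mask helper was removed from transformers' modeling_bloom in the attention-mask refactor, so pinning an earlier release, e.g. pip install "transformers<4.35", may restore the import.)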
Hi team,
As the model card says, PhoGPT uses ALiBi for context-length extrapolation. Has the team tested the maximum effective context length of PhoGPT?
Thanks for the first GPT foundation model for Vietnamese
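(For reference, the GGUF metadata dumped later in this thread reports mpt.context_length = 8192 as the training context; ALiBi is generally expected to extrapolate somewhat beyond the training length, but how far remains an empirical question.)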
Hello, I did some testing on the 4-bit and 8-bit versions of PhoGPT. I ran into an issue with the 4-bit version; details below:
Environment:
PhoGPT Version: 4-bit
Execution Environment: Google Colab with T4 GPU
Issue Description:
When using the 4-bit version of PhoGPT with the initialization code provided in the documentation, the model returns an incomplete response. Specifically, it returns only a newline character \n, in contrast to the 8-bit version, which functions correctly and returns a comprehensive output.
Steps to Reproduce:
Initialize the 4-bit PhoGPT model using the sample code from the official documentation.
Use instruction = "Viết bài văn nghị luận xã hội về an toàn giao thông"
Observe that the response is only a newline character, indicating an incomplete or failed generation.
Expected Behavior:
The 4-bit version of PhoGPT should return a complete and coherent response similar to the 8-bit version, which returns detailed and lengthy outputs.
Actual Behavior:
The 4-bit version outputs only a newline character \n, indicating an error or issue in processing the input prompt.
Congratulations! This is great work!!! Particularly, for the Vietnamese community around the world!
Do you have any plans for releasing an online demo of PhoGPT (perhaps a HuggingFace online demo)? :)
Hi Mr Dat,
First of all, thank you so much for this great project.
I am reaching out to seek clarification regarding the distinctions between the non-instruct and instruct versions of PhoGPT.
I've come across discussions about these two versions and would greatly appreciate it if you could provide some insights into their key differences. Specifically, I am curious about how their functionalities vary and what specific use cases each version is optimized for.
Thank you very much for your time and assistance. I look forward to gaining a better understanding of the distinctions between the non-instruct and instruct versions of PhoGPT.
I ran the sample fine-tuning with default settings on an Nvidia A100 40GB but got OOM. Can fine-tuning succeed on Nvidia A100 40GB hardware? Are there any updated settings to make it work? Thank you.
Hello. Thank you very much for your work.
I have carefully reviewed and adhered to the guidelines provided for fine-tuning the PhoGPT-4B model. Nonetheless, despite specifying a maximum sequence length of 2048 and a global training batch size of 1, I am encountering Out of Memory issues. My GPU is an RTX 4090 with 24 GB.
Do you have any ideas to solve this problem?
Wishing you all the best!
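(Not from the PhoGPT docs, but a common way to cut fine-tuning memory on a single consumer GPU is parameter-efficient fine-tuning with LoRA via the peft library instead of full fine-tuning. A minimal sketch; the target_modules name "Wqkv" is an assumption based on the MPT-style attention projection, not verified for PhoGPT:)
# Hedged sketch: LoRA fine-tuning setup with peft; "Wqkv" is an assumed
# MPT-style attention projection name, not confirmed for PhoGPT.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "vinai/PhoGPT-4B", torch_dtype=torch.bfloat16, trust_remote_code=True
)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["Wqkv"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable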
I hope that PhoGPT will have an AWQ or GPTQ version for running on low VRAM GPUs.
Do you have any plans for quantization? It will make this LLM more popular for students and individual researchers who have limited computing resources.
Thank you for your hard work!
Dear VinAI team,
Thank you for sharing your work with us. I tried to use your model (PhoGPT tokenizer) and set the max length to 8192, but the tokenizer's output did not add any padding tokens. Here is an example:
phogpt_tokenizer = AutoTokenizer.from_pretrained("vinai/PhoGPT-4B", trust_remote_code=True)
print(
    phogpt_tokenizer(
        "Đây là câu hỏi",
        max_length=8192,
        truncation=True,
        padding=True,
    )
)
The output is:
{'input_ids': [2985, 270, 1117, 1378], 'attention_mask': [1, 1, 1, 1]}
You can see that the output token list only has 4 tokens. Should it be 8192 tokens instead?
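(For what it's worth, in the standard Hugging Face tokenizer API, padding=True only pads to the longest sequence in the batch, which is a no-op for a single input; padding to a fixed length requires padding="max_length". A minimal sketch under that assumption:)
# Sketch: padding="max_length" pads a single sequence up to max_length with
# <pad> tokens, whereas padding=True pads only to the longest item in the batch.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/PhoGPT-4B", trust_remote_code=True)
encoded = tokenizer(
    "Đây là câu hỏi",
    max_length=8192,
    truncation=True,
    padding="max_length",
)
print(len(encoded["input_ids"]))  # expected: 8192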
I successfully converted the model to the GGUF format using llama.cpp's convert-hf-to-gguf.py script:
cd ~/.models
git clone --progress --verbose https://huggingface.co/vinai/PhoGPT-4B-Chat
cd ~/llama.cpp
python3 convert-hf-to-gguf.py ~/.models/PhoGPT-4B-Chat --outfile ~/.models/pho.gguf
Output
Loading model: PhoGPT-4B-Chat
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
...
output_norm.bias, n_dims = 1, torch.bfloat16 --> float32
Model successfully exported to '/home/username/.models/pho.gguf'
I tried inference and this error showed up:
cd
./llama.cpp/main -m ./.models/pho.gguf -p "xin chào"
Log start
main: build = 2101 (b7b74cef)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed = 1707968957
llama_model_loader: loaded meta data with 18 key-value pairs and 388 tensors from ./models/pho.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mpt
llama_model_loader: - kv 1: general.name str = PhoGPT-4B-Chat
llama_model_loader: - kv 2: mpt.context_length u32 = 8192
llama_model_loader: - kv 3: mpt.embedding_length u32 = 3072
llama_model_loader: - kv 4: mpt.block_count u32 = 32
llama_model_loader: - kv 5: mpt.feed_forward_length u32 = 12288
llama_model_loader: - kv 6: mpt.attention.head_count u32 = 24
llama_model_loader: - kv 7: mpt.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 8: mpt.attention.max_alibi_bias f32 = 8.000000
llama_model_loader: - kv 9: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 10: tokenizer.ggml.tokens arr[str,20480] = ["<unk>", "<s>", "</s>", "<pad>", "!"...
llama_model_loader: - kv 11: tokenizer.ggml.token_type arr[i32,20480] = [3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 12: tokenizer.ggml.merges arr[str,20266] = ["á »", "á º", "Ġ t", "n g", "Ġ...
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 15: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 16: tokenizer.ggml.padding_token_id u32 = 3
llama_model_loader: - kv 17: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - type f32: 258 tensors
llama_model_loader: - type f16: 130 tensors
llm_load_vocab: special tokens definition check successful ( 4/20480 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mpt
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 20480
llm_load_print_meta: n_merges = 20266
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 24
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 3072
llm_load_print_meta: n_embd_v_gqa = 3072
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 8.0e+00
llm_load_print_meta: n_ff = 12288
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = all F32 (guessed)
llm_load_print_meta: model params = 3.75 B
llm_load_print_meta: model size = 6.99 GiB (16.01 BPW)
llm_load_print_meta: general.name = PhoGPT-4B-Chat
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 3 '<pad>'
llm_load_print_meta: LF token = 130 'Ä'
llm_load_tensors: ggml ctx size = 0.15 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 388, got 195
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './.models/pho.gguf'
main: error: unable to load model
I am running this code with the non-instruct model, but the example code seems adapted to the instruct version only. Could you provide the PROMPT format for the model without instruction tuning?
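(Not an official answer; a common assumption for base checkpoints is that no template is needed at all: the non-instruct model is a plain causal LM, so you can prompt it with raw text and let it continue. A minimal sketch:)
# Sketch (assumption): the base, non-instruct checkpoint is prompted with
# plain text and simply continues it; no "### Câu hỏi / ### Trả lời" template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "vinai/PhoGPT-4B"  # illustrative choice of base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.eval()

inputs = tokenizer("Hà Nội là", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])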
What are the minimum and recommended hardware requirements for running PhoGPT?
Thank you for publishing the project.
I would like to test the model on my local computer through an OpenAI-compatible API, and I see that vLLM is the appropriate project to make that happen.
I would appreciate some advice on the changes needed to get the code running on vLLM.
I truly appreciate your help.
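(A hedged sketch of vLLM's offline Python API, assuming vLLM's MPT backend covers PhoGPT's architecture; not verified against this model. vLLM also ships an OpenAI-compatible HTTP server on top of the same engine.)
# Sketch: offline generation through vLLM, assuming its MPT support loads
# PhoGPT via trust_remote_code; swap in the checkpoint you are testing.
from vllm import LLM, SamplingParams

llm = LLM(model="vinai/PhoGPT-4B-Chat", trust_remote_code=True)
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["### Câu hỏi:\nXin chào!\n\n### Trả lời:"], params)
print(outputs[0].outputs[0].text)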
To fully fine-tune vinai/PhoGPT-7B5 or vinai/PhoGPT-7B5-Instruct on a single A100 GPU with 40GB memory, it is advisable to employ the decoupled_lionw optimizer with device_train_microbatch_size set to 1.
What about the minimum and recommended hardware requirements when fine-tuning vinai/PhoGPT-7B5 with vLLM?
Hello,
The model has been trained on the A100 GPU. However, I am wondering about the GPU memory cost during inference.
Currently, I have a 3060 GPU with 12 GB of VRAM. Can it be used for running inference?
Thank you
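(Rough arithmetic, not a benchmark: the GGUF dump earlier in this thread reports about 3.75 B parameters and roughly 7 GiB at 16 bits per weight, so the 4B model should fit in 12 GB of VRAM for inference; the 4-bit bitsandbytes recipe quoted later in this thread would reduce that further.)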
I want to perform fine-tuning on the PhoGPT model with the goal of answering questions from a given text source. What format should the data have to make this possible?
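(Not an official spec; a hedged sketch that simply mirrors the "### Câu hỏi / ### Trả lời" template from the README: store instruction-response pairs, e.g. as JSONL, and render each pair into that template for causal-LM fine-tuning.)
# Hypothetical data-prep sketch: render (instruction, response) pairs into
# the README's prompt template. File name and field names are illustrative.
import json

TEMPLATE = "### Câu hỏi:\n{instruction}\n\n### Trả lời:\n{response}"

examples = [
    {"instruction": "Dựa vào đoạn văn sau, ...", "response": "..."},
]
with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps({"text": TEMPLATE.format_map(ex)}, ensure_ascii=False) + "\n")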
Hello,
I was able to use PhoGPT in CPU-only mode with llama.cpp. I would like to convert the model into llama.cpp's own format (GGUF) and post it to HuggingFace (similar to this repo, for example). I have gone through the LICENSE file and it looks like this would be fine. However, I still want to make sure: am I allowed to perform such an action?
Regards,
We temporarily set the instruct version private to reinvestigate its safety level. We will make it public within this week.
As far as I know from reading your paper, you used data created by ChatGPT, yet you release this as a commercial model. OpenAI's terms of use state that you cannot use output from the Services to develop models that compete with OpenAI, so please check your license.
https://openai.com/policies/terms-of-use
Hi, thanks for the great work.
I'm new to the Vietnamese language-modeling scene. I came across some major articles from 2019-2021 where people treated word segmentation as the standard step before tokenization, which I appreciate but am still not quite sure is actually necessary. I cannot find much information to answer that myself.
Then I took a look at your paper: you trained a BPE tokenizer (a kind of sub-word tokenization). I have a few questions:
Hello,
I am curious whether you have any plans to build a bilingual English-Vietnamese language model in the future?
Thank you
See: https://huggingface.co/docs/transformers/main/en/quantization#bitsandbytes
import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "vinai/PhoGPT-4B-Chat",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)
Or:
import torch
from transformers import BitsAndBytesConfig, AutoConfig, AutoModelForCausalLM, AutoTokenizer

config = AutoConfig.from_pretrained("vinai/PhoGPT-4B-Chat", trust_remote_code=True)
config.init_device = "cuda"
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "vinai/PhoGPT-4B-Chat",
    quantization_config=quantization_config,
    config=config,
    trust_remote_code=True,
)