vinairesearch / phogpt

PhoGPT: Generative Pre-training for Vietnamese (2023)

License: Apache License 2.0

Python 100.00%
generative-pre-trained-transformer gpt instruction-following llm phogpt vietnamese vietnamese-nlp

phogpt's Introduction

PhoGPT: Generative Pre-training for Vietnamese

We open-source a state-of-the-art 4B-parameter generative model series for Vietnamese, which includes the base pre-trained monolingual model PhoGPT-4B and its chat variant, PhoGPT-4B-Chat. The base model, PhoGPT-4B, with exactly 3.7B parameters, is pre-trained from scratch on a Vietnamese corpus of 102B tokens, with an 8192 context length and a vocabulary of 20K token types. The chat variant, PhoGPT-4B-Chat, is obtained by fine-tuning PhoGPT-4B on a dataset of 70K instructional prompts and their responses, along with an additional 290K conversations. We demonstrate its superior performance compared to previous open-source models.

Vietnamese truthful QA results

More details about the general architecture and experimental results of PhoGPT can be found in our technical report. All output responses of PhoGPT and baselines are available HERE for readers' self-evaluation. Please CITE our technical report when PhoGPT is used to help produce published results or is incorporated into other software:

@article{PhoGPT,
title     = {{PhoGPT: Generative Pre-training for Vietnamese}},
author    = {Dat Quoc Nguyen and Linh The Nguyen and Chi Tran and Dung Ngoc Nguyen and Dinh Phung and Hung Bui},
journal   = {arXiv preprint},
volume    = {arXiv:2311.02945},
year      = {2023}
}

Model download

Model                | Type                         | Model size | Context length | Vocab size | Training data size                                           | Note
vinai/PhoGPT-4B      | Base                         | 3.7B       | 8192           | 20K        | 2 training epochs on 482GB of texts                          | Loading "PhoGPT-4B" or "PhoGPT-4B-Chat" in float16 takes 7GB of GPU memory
vinai/PhoGPT-4B-Chat | Instruction following & chat | 3.7B       | 8192           | 20K        | 70K instructional prompt-response pairs & 290K conversations | PROMPT_TEMPLATE = "### Câu hỏi: {instruction}\n### Trả lời:"

Run the model

With vLLM, Text Generation Inference & llama.cpp

PhoGPT can run with inference engines such as vLLM, Text Generation Inference, and llama.cpp.
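For example, a minimal vLLM sketch (assuming the installed vLLM release supports PhoGPT's MPT-style architecture; the prompt template is the one from the model table above):

from vllm import LLM, SamplingParams

# Prompt template from the model card above.
PROMPT_TEMPLATE = "### Câu hỏi: {instruction}\n### Trả lời:"
prompt = PROMPT_TEMPLATE.format_map({"instruction": "Viết bài văn nghị luận xã hội về an toàn giao thông"})

# trust_remote_code is needed because PhoGPT ships custom model code.
llm = LLM(model="vinai/PhoGPT-4B-Chat", trust_remote_code=True)
sampling_params = SamplingParams(temperature=1.0, top_k=50, top_p=0.9, max_tokens=1024)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)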

With llama.cpp

  • Compile llama.cpp
  • Install Python dependencies from llama.cpp:
    cd llama.cpp
    python3 -m pip install -r requirements.txt
  • Convert the model to gguf FP16 format: python3 convert-hf-to-gguf.py <path_to_PhoGPT-4B-Chat_model> --outfile ./PhoGPT-4B-Chat.gguf
  • (Optional) Quantize the model to 4/8-bits:
    • ./quantize ./PhoGPT-4B-Chat.gguf ./PhoGPT-4B-Chat-Q4_K_M.gguf Q4_K_M
    • ./quantize ./PhoGPT-4B-Chat.gguf ./PhoGPT-4B-Chat-Q8_0.gguf Q8_0
  • Start inference on a gguf model: ./main -m ./PhoGPT-4B-Chat-Q4_K_M.gguf -n 1024 -p "### Câu hỏi: Viết bài văn nghị luận xã hội về an toàn giao thông\n### Trả lời:"

Converted gguf files are available at: vinai/PhoGPT-4B-Chat-gguf. Note that phogpt_4b_chat_preset.json might be needed for LM Studio to work properly with our gguf files.

With pure transformers

Instruction following

# coding: utf8
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_path = "vinai/PhoGPT-4B-Chat"  

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)  
config.init_device = "cuda"
# config.attn_config['attn_impl'] = 'flash' # If installed: this will use either Flash Attention V1 or V2 depending on what is installed

model = AutoModelForCausalLM.from_pretrained(model_path, config=config, torch_dtype=torch.bfloat16, trust_remote_code=True)
# If your GPU does not support bfloat16:
# model = AutoModelForCausalLM.from_pretrained(model_path, config=config, torch_dtype=torch.float16, trust_remote_code=True)
model.eval()  

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)  

PROMPT_TEMPLATE = "### Câu hỏi: {instruction}\n### Trả lời:"  

# Some instruction examples
# instruction = "Viết bài văn nghị luận xã hội về {topic}"
# instruction = "Viết bản mô tả công việc cho vị trí {job_title}"
# instruction = "Sửa lỗi chính tả:\n{sentence_or_paragraph}"
# instruction = "Dựa vào văn bản sau đây:\n{text}\nHãy trả lời câu hỏi: {question}"
# instruction = "Tóm tắt văn bản:\n{text}"

instruction = "Viết bài văn nghị luận xã hội về an toàn giao thông"
# instruction = "Sửa lỗi chính tả:\nTriệt phá băng nhóm kướp ô tô, sử dụng \"vũ khí nóng\""

input_prompt = PROMPT_TEMPLATE.format_map({"instruction": instruction})  

input_ids = tokenizer(input_prompt, return_tensors="pt")  

outputs = model.generate(  
    inputs=input_ids["input_ids"].to("cuda"),  
    attention_mask=input_ids["attention_mask"].to("cuda"),  
    do_sample=True,  
    temperature=1.0,  
    top_k=50,  
    top_p=0.9,  
    max_new_tokens=1024,  
    eos_token_id=tokenizer.eos_token_id,  
    pad_token_id=tokenizer.pad_token_id  
)  

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]  
response = response.split("### Trả lời:")[1]

Chat

messages = [
    {"role": "user", "content": "Kể tên một môn thể thao mạo hiểm"},
    {"role": "assistant", "content": "Nhảy Bungee."},
    {"role": "user", "content": "Bạn đã bao giờ đi nhảy bungee chưa"}
]

# Using apply_chat_template
tokenizer = AutoTokenizer.from_pretrained("vinai/PhoGPT-4B-Chat", trust_remote_code=True)
input_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
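The resulting prompt can then be tokenized and passed to model.generate exactly as in the instruction-following example above; a minimal continuation sketch, reusing the model, tokenizer, and generation settings already shown:

# Reuses `model` and `tokenizer` loaded in the instruction-following example above.
input_ids = tokenizer(input_prompt, return_tensors="pt")
outputs = model.generate(
    inputs=input_ids["input_ids"].to("cuda"),
    attention_mask=input_ids["attention_mask"].to("cuda"),
    do_sample=True,
    temperature=1.0,
    top_k=50,
    top_p=0.9,
    max_new_tokens=1024,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id
)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]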

Quantization with bitsandbytes

import torch
from transformers import BitsAndBytesConfig, AutoConfig, AutoModelForCausalLM, AutoTokenizer

config = AutoConfig.from_pretrained("vinai/PhoGPT-4B-Chat", trust_remote_code=True)  
config.init_device = "cuda"

# 8-bit quantization (trust_remote_code is required because PhoGPT ships custom model code)
model_8bit = AutoModelForCausalLM.from_pretrained("vinai/PhoGPT-4B-Chat", config=config, load_in_8bit=True, trust_remote_code=True)

Fine-tuning the model

See the llm-foundry docs for details. To fully fine-tune PhoGPT, users can find an example fine-tuning YAML configuration at fine-tuning-phogpt.yaml. The sample_instruction_following_dataset folder provides an example of an instruction-following dataset.

  • To install llm-foundry, see Section "Installation" in https://github.com/mosaicml/llm-foundry.
  • Run: cd llm-foundry/scripts/train/ and then composer --world_size <number_of_GPUs> train.py <path_to_yaml_configuration_file> (e.g. composer --world_size 1 train.py fine-tuning-phogpt.yaml).

Other fine-tuning options include transformers' Trainer (e.g., see stanford_alpaca as an example), lit-gpt, or LLaMA-Factory; a minimal Trainer sketch is shown below.
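For illustration, a minimal sketch of full fine-tuning with transformers' Trainer, assuming a JSON-lines dataset with "instruction" and "response" fields (the file name and field names are hypothetical, not part of this repo; memory-saving settings such as gradient checkpointing will likely be needed for a 3.7B model on a single GPU):

# coding: utf8
# Hypothetical sketch: full fine-tuning of PhoGPT-4B-Chat with transformers' Trainer.
# "train.jsonl" and its "instruction"/"response" fields are illustrative.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_path = "vinai/PhoGPT-4B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True)

PROMPT_TEMPLATE = "### Câu hỏi: {instruction}\n### Trả lời:"

def tokenize(example):
    # Concatenate prompt and response into a single causal-LM training sequence.
    text = PROMPT_TEMPLATE.format_map({"instruction": example["instruction"]}) + " " + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=2048)

raw = load_dataset("json", data_files="train.jsonl")["train"]
train_dataset = raw.map(tokenize, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="phogpt-4b-chat-finetuned",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        gradient_checkpointing=True,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=train_dataset,
    # mlm=False produces standard next-token (causal LM) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()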

Limitations

PhoGPT has certain limitations. For example, it is not good at tasks involving reasoning, coding, or mathematics. PhoGPT may generate harmful or biased responses, produce hate speech, or answer unsafe questions. Users should be cautious when interacting with PhoGPT, as it can produce factually incorrect output.

License

Copyright (c) 2023 VinAI Research

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

phogpt's People

Contributors

datquocnguyen


phogpt's Issues

Context window of PhoGPT?

Hi team,

The model card says that PhoGPT uses ALiBi for context-length extrapolation. Has the team tested the maximum effective context length of PhoGPT?

Thanks for the first GPT foundation model for Vietnamese

(Google Colab) ImportError: cannot import name '_expand_mask' from 'transformers.models.bloom.modeling_bloom'

Hi !
Thank you for your excellent work!

I got this error when trying to run the model on Google Colab, using the example code from the README file:

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
  
model_path = "vinai/PhoGPT-7B5-Instruct"  
  
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)  
config.init_device = "cuda"
# config.attn_config['attn_impl'] = 'triton' # Enable if "triton" installed!
  
model = AutoModelForCausalLM.from_pretrained(  
    model_path, config=config, torch_dtype=torch.bfloat16, trust_remote_code=True  
)
# If your GPU does not support bfloat16:
# model = AutoModelForCausalLM.from_pretrained(model_path, config=config, torch_dtype=torch.float16, trust_remote_code=True)
model.eval()  
  
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)  
  
PROMPT = "### Câu hỏi:\n{instruction}\n\n### Trả lời:"  
  
input_prompt = PROMPT.format_map(  
    {"instruction": "Làm thế nào để cải thiện kỹ năng quản lý thời gian?"}  
)  
  
input_ids = tokenizer(input_prompt, return_tensors="pt")  
  
outputs = model.generate(  
    inputs=input_ids["input_ids"].to("cuda"),  
    attention_mask=input_ids["attention_mask"].to("cuda"),  
    do_sample=True,  
    temperature=1.0,  
    top_k=50,  
    top_p=0.9,  
    max_new_tokens=1024,  
    eos_token_id=tokenizer.eos_token_id,  
    pad_token_id=tokenizer.pad_token_id  
)  
  
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]  
response = response.split("### Trả lời:")[1]

The error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
[<ipython-input-8-2cf92868f260>](https://localhost:8080/#) in <cell line: 13>()
     11 # )
     12 # If your GPU does not support bfloat16:
---> 13 model = AutoModelForCausalLM.from_pretrained(model_path, config=config, torch_dtype=torch.float16, trust_remote_code=True)
     14 model.eval()
     15 

11 frames
[~/.cache/huggingface/modules/transformers_modules/vinai/PhoGPT-7B5-Instruct/8083375bebd52681090be6ebaf8bae7aee491f73/hf_prefixlm_converter.py](https://localhost:8080/#) in <module>
     13 import torch
     14 from transformers.models.bloom.modeling_bloom import BaseModelOutputWithPastAndCrossAttentions, BloomForCausalLM, BloomModel, CausalLMOutputWithCrossAttentions, CrossEntropyLoss
---> 15 from transformers.models.bloom.modeling_bloom import _expand_mask as _expand_mask_bloom
     16 from transformers.models.bloom.modeling_bloom import _make_causal_mask as _make_causal_mask_bloom
     17 from transformers.models.bloom.modeling_bloom import logging

ImportError: cannot import name '_expand_mask' from 'transformers.models.bloom.modeling_bloom' (/usr/local/lib/python3.10/dist-packages/transformers/models/bloom/modeling_bloom.py)

I don't know how to fix it, but I found the same issue at : https://huggingface.co/mosaicml/mpt-7b/discussions/83

Can you tell me how to run it?
Thank you!

Error wrong number of tensors when serving vinai/PhoGPT-4B-Chat with llama.cpp

I successfully converted the model to the GGUF format using llama.cpp's convert-hf-to-gguf.py script:

cd ~/.models
git clone --progress --verbose https://huggingface.co/vinai/PhoGPT-4B-Chat
cd ~/llama.cpp
python3 convert-hf-to-gguf.py ~/.models/PhoGPT-4B-Chat --outfile ~/.models/pho.gguf

Output

Loading model: PhoGPT-4B-Chat
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer

...

output_norm.bias, n_dims = 1, torch.bfloat16 --> float32
Model successfully exported to '/home/username/.models/pho.gguf'

I tried inference and this error showed up:

cd
./llama.cpp/main -m ./.models/pho.gguf -p "xin chào"

Log start
main: build = 2101 (b7b74cef)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed  = 1707968957
llama_model_loader: loaded meta data with 18 key-value pairs and 388 tensors from ./models/pho.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mpt
llama_model_loader: - kv   1:                               general.name str              = PhoGPT-4B-Chat
llama_model_loader: - kv   2:                         mpt.context_length u32              = 8192
llama_model_loader: - kv   3:                       mpt.embedding_length u32              = 3072
llama_model_loader: - kv   4:                            mpt.block_count u32              = 32
llama_model_loader: - kv   5:                    mpt.feed_forward_length u32              = 12288
llama_model_loader: - kv   6:                   mpt.attention.head_count u32              = 24
llama_model_loader: - kv   7:           mpt.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   8:               mpt.attention.max_alibi_bias f32              = 8.000000
llama_model_loader: - kv   9:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  10:                      tokenizer.ggml.tokens arr[str,20480]   = ["<unk>", "<s>", "</s>", "<pad>", "!"...
llama_model_loader: - kv  11:                  tokenizer.ggml.token_type arr[i32,20480]   = [3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  12:                      tokenizer.ggml.merges arr[str,20266]   = ["á »", "á º", "Ġ t", "n g", "Ġ...
llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  15:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  16:            tokenizer.ggml.padding_token_id u32              = 3
llama_model_loader: - kv  17:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - type  f32:  258 tensors
llama_model_loader: - type  f16:  130 tensors
llm_load_vocab: special tokens definition check successful ( 4/20480 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = mpt
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 20480
llm_load_print_meta: n_merges         = 20266
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 24
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 8.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = all F32 (guessed)
llm_load_print_meta: model params     = 3.75 B
llm_load_print_meta: model size       = 6.99 GiB (16.01 BPW)
llm_load_print_meta: general.name     = PhoGPT-4B-Chat
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 3 '<pad>'
llm_load_print_meta: LF token         = 130 'Ä'
llm_load_tensors: ggml ctx size =    0.15 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 388, got 195
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './.models/pho.gguf'
main: error: unable to load model

Could you explain the difference between the non-instruct and instruct PhoGPT versions?

Hi Mr Dat,

First of all, thank you so much for this great project.

I am reaching out to seek clarification regarding the distinctions between the non-instruct and instruct versions of PhoGPT.

I've come across discussions about these two versions and would greatly appreciate it if you could provide some insights into their key differences. Specifically, I am curious about how their functionalities vary and what specific use cases each version is optimized for.

Thank you very much for your time and assistance. I look forward to gaining a better understanding of the distinctions between the non-instruct and instruct versions of PhoGPT.

Hardware specification for inference

Hello,

The model has been trained on the A100 GPU. However, I am wondering about the GPU memory cost during inference.

Currently, I have a 3060 GPU with 12 GB of VRAM. Can it be used for running inference?

Thank you

OutOfMemoryError when Fine-tuning model PhoGPT4B with llm-foundry

Hello. Thank you very much for your work.
I have carefully reviewed and adhered to the guidelines provided for fine-tuning the PhoGPT-4B model. Nonetheless, despite specifying a maximum sequence length of 2048 and a global training batch size of 1, I am encountering out-of-memory issues. My GPU is an RTX 4090 with 24 GB.
Do you have any ideas on how to solve this problem?

Wishing you all the best!

License problem

As far as I know from reading your paper, you used data created by ChatGPT, yet you still release this as a commercial-use model. OpenAI's terms of use state that you cannot use output from the Services to develop models that compete with OpenAI, so please check your license.
https://openai.com/policies/terms-of-use

Need some information about the tokenizer

Hi, thanks for the great work.

I'm new to the Vietnamese language modeling scene. I came across some major articles from 2019-2021 in which people treated word segmentation as a standard step before tokenization, which I appreciate, but I'm still not sure whether it is actually necessary. I cannot find much information to answer that myself.

Then I took a look at your paper: you trained a BPE tokenizer (a form of sub-word tokenization). I have a few questions:


  1. Is it correct that word segmentation is not used at all to create PhoGPT? If so, I would love to hear the reasoning.
  2. You used word segmentation in PhoBERT. Why didn't you use BPE back then?

Evaluation lack of context


Is the question supposed to lack context like this? Only the ground truth even mentions Vietnam.

[Question] Hardware requirements with vLLM

To fully fine-tune vinai/PhoGPT-7B5 or vinai/PhoGPT-7B5-Instruct on a single GPU A100 with 40GB memory, it is advisable to employ the decoupled_lionw optimizer with a device_train_microbatch_size set to 1.

What are the minimum and recommended hardware requirements when fine-tuning vinai/PhoGPT-7B5 with vLLM?

4/8-bit with bitsandbytes

See: https://huggingface.co/docs/transformers/main/en/quantization#bitsandbytes

import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

model_4bit = AutoModelForCausalLM.from_pretrained("vinai/PhoGPT-4B-Chat", quantization_config=quantization_config, device_map="auto", trust_remote_code=True)

Or:

import torch
from transformers import BitsAndBytesConfig, AutoConfig, AutoModelForCausalLM, AutoTokenizer

config = AutoConfig.from_pretrained("vinai/PhoGPT-4B-Chat", trust_remote_code=True)  
config.init_device = "cuda"

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

model_4bit = AutoModelForCausalLM.from_pretrained("vinai/PhoGPT-4B-Chat", quantization_config=quantization_config, config=config, trust_remote_code=True)

Will a quantized version be supported in the future?

I hope that PhoGPT will have an AWQ or GPTQ version for running on low-VRAM GPUs.
Do you have any plans for quantization? It would make this LLM more accessible to students and individual researchers who have limited computing resources.
Thank you for your hard work!

Hardware requirements for fine-tuning

I ran the sample fine-tuning with default settings on an Nvidia A100 40GB but got OOM. Can fine-tuning succeed on Nvidia A100 40GB hardware? Are there any updates to the settings that would make it work? Thank you.

Any online demo available?

Congratulations! This is great work, particularly for the Vietnamese community around the world!

Do you have any plans for releasing an online demo of PhoGPT (perhaps a HuggingFace online demo)? :)

How can I get the model to run on vLLM?

Thank you for publishing the project.

I would like to test the model on my local computer through an OpenAI-compatible API, and vLLM seems to be the appropriate project to make that happen.

I would appreciate some advice on what changes are needed to get the code running on vLLM.
I truly appreciate your help.

Incomplete Response from 4bit Version of PhoGPT

Hello, I did some testing on the 4bit and 8bit versions of PhoGPT. I ran into an issue with the 4bit version; details are below:

Environment:
PhoGPT Version: 4bit
Execution Environment: Google Colab with T4 GPU

Issue Description:
When using the 4bit version of PhoGPT with the provided initialization code from the documentation, the model returns an incomplete response. Specifically, it only returns a newline character \n, in contrast to the 8bit version, which functions correctly and returns a comprehensive output.

Steps to Reproduce:
Initialize the 4bit PhoGPT model using the sample code from the official documentation.
Use instruction = "Viết bài văn nghị luận xã hội về an toàn giao thông"
Observe that the response is only a newline character, indicating an incomplete or failed generation.

Expected Behavior:
The 4bit version of PhoGPT should return a complete and coherent response similar to the 8bit version, which returns detailed and lengthy outputs.

Actual Behavior:
The 4bit version outputs only a newline character \n, indicating an error or issue in processing the input prompt.

Screenshots of the 8bit and 4bit responses are attached.

HuggingFace tokenizer does not pad to max_length?

Dear VinAI team,

Thank you for sharing your work with us. I tried to use your model (PhoGPT tokenizer) and set the max length to 8192, but the tokenizer's output did not add any padding tokens. Here is an example:

phogpt_tokenizer= AutoTokenizer.from_pretrained("vinai/PhoGPT-4B", trust_remote_code= True)
print(
    phogpt_tokenizer(
        "Đây là câu hỏi",
        max_length= 8192,
        truncation= True,
        padding= True
    )
)

The output is:
{'input_ids': [2985, 270, 1117, 1378], 'attention_mask': [1, 1, 1, 1]}

You can see that the output token list only has 4 tokens. Should it be 8192 tokens instead?
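For reference, with the standard Hugging Face tokenizer API, padding=True only pads to the longest sequence in the batch (a no-op for a single short input), while padding="max_length" pads up to max_length; a minimal sketch:

from transformers import AutoTokenizer

phogpt_tokenizer = AutoTokenizer.from_pretrained("vinai/PhoGPT-4B", trust_remote_code=True)

# padding="max_length" pads every sequence to max_length (requires a defined pad token).
enc = phogpt_tokenizer(
    "Đây là câu hỏi",
    max_length=8192,
    truncation=True,
    padding="max_length",
)
print(len(enc["input_ids"]))  # expected: 8192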

Redistributing the model in other format

Hello,

I was able to use PhoGPT in CPU-only mode with llama.cpp. I would like to convert the model into llama.cpp's own format (GGUF) and post it to HuggingFace (similar to this repo, for example). I have gone through the LICENSE file, and it looks like this would be fine. However, I still want to make sure: am I allowed to do this?

Regards,
