This project provides traditional Chinese transformer models (including ALBERT, BERT, and GPT2) and NLP tools (including word segmentation, part-of-speech tagging, and named entity recognition).
You may use our models directly through HuggingFace's transformers library.
pip install -U transformers
Please use BertTokenizerFast as the tokenizer, and replace ckiplab/albert-tiny-chinese and ckiplab/albert-tiny-chinese-ws in the following example with whichever model you need.
from transformers import (
    BertTokenizerFast,
    AutoModelForMaskedLM,
    AutoModelForCausalLM,
    AutoModelForTokenClassification,
)
# masked language model (ALBERT, BERT)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForMaskedLM.from_pretrained('ckiplab/albert-tiny-chinese')  # or other models above

# causal language model (GPT2)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForCausalLM.from_pretrained('ckiplab/gpt2-base-chinese')  # or other models above

# NLP task model
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForTokenClassification.from_pretrained('ckiplab/albert-tiny-chinese-ws')  # or other models above
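As a quick sanity check, the following minimal sketch (not part of the original instructions) runs the masked language model through the transformers fill-mask pipeline; the example sentence is illustrative only.

# Minimal sketch: running the masked language model with the fill-mask pipeline.
# The example sentence is illustrative only.
from transformers import BertTokenizerFast, AutoModelForMaskedLM, pipeline

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForMaskedLM.from_pretrained('ckiplab/albert-tiny-chinese')

fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
print(fill_mask('台北是台灣的[MASK]都。'))  # prints the top predictions for the masked token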
Model Fine-Tuning
To fine-tune our models on your own datasets, please refer to the following examples from HuggingFace's transformers.
# masked language modeling; replace ckiplab/albert-tiny-chinese with any model above
python run_mlm.py \
    --model_name_or_path ckiplab/albert-tiny-chinese \
    --tokenizer_name bert-base-chinese \
    ...

# token classification; replace ckiplab/albert-tiny-chinese-ws with any model above
python run_ner.py \
    --model_name_or_path ckiplab/albert-tiny-chinese-ws \
    --tokenizer_name bert-base-chinese \
    ...
Model Performance
The following is a performance comparison between our models and other models.
Each task is evaluated on a traditional Chinese test set.
| Model                        | #Parameters | Perplexity† | WS (F1)‡ | POS (ACC)‡ | NER (F1)‡ |
|------------------------------|-------------|-------------|----------|------------|-----------|
| ckiplab/albert-tiny-chinese  | 4M          | 4.80        | 96.66%   | 94.48%     | 71.17%    |
| ckiplab/albert-base-chinese  | 11M         | 2.65        | 97.33%   | 95.30%     | 79.47%    |
| ckiplab/bert-tiny-chinese    | 12M         | 8.07        | 96.98%   | 95.11%     | 74.21%    |
| ckiplab/bert-base-chinese    | 102M        | 1.88        | 97.60%   | 95.67%     | 81.18%    |
| ckiplab/gpt2-tiny-chinese    | 4M          | 16.94       | --       | --         | --        |
| ckiplab/gpt2-base-chinese    | 102M        | 8.36        | --       | --         | --        |
| voidful/albert_chinese_tiny  | 4M          | 74.93       | --       | --         | --        |
| voidful/albert_chinese_base  | 11M         | 22.34       | --       | --         | --        |
| bert-base-chinese            | 102M        | 2.53        | --       | --         | --        |
† Perplexity; the smaller the better.
‡ WS: word segmentation; POS: part-of-speech; NER: named-entity recognition; the larger the better.
Training Corpus
The language models are trained on the ZhWiki and CNA datasets; the WS and POS tasks are trained on the ASBC dataset; the NER task is trained on the OntoNotes dataset.
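The sentence-segmentation and batching options described below are arguments to the CKIP NLP drivers. As context, the following sketch shows how the ws_driver, pos_driver, and ner_driver objects used in the examples are typically constructed from the ckip-transformers package; the model="bert-base" argument is an assumption, so check the package documentation for the constructor options of your installed version.

# Sketch: constructing the CKIP NLP drivers used in the examples below.
# The model="bert-base" argument is an assumption; consult the ckip-transformers
# documentation for the options supported by your version.
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker

ws_driver = CkipWordSegmenter(model="bert-base")   # word segmentation
pos_driver = CkipPosTagger(model="bert-base")      # part-of-speech tagging
ner_driver = CkipNerChunker(model="bert-base")     # named-entity recognition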
While running the model, the POS driver automatically segments each sentence internally on these characters: '，,。：:；;！!？?'. (The output sentences are concatenated back afterwards.) You may set delim_set to any characters you want.
You may set use_delim=False to disable this feature, or set use_delim=True in the WS and NER drivers to enable it.
# Enable sentence segmentation
ws = ws_driver(text, use_delim=True)
ner = ner_driver(text, use_delim=True)

# Disable sentence segmentation
pos = pos_driver(ws, use_delim=False)

# Use newline characters and tabs for sentence segmentation
pos = pos_driver(ws, delim_set='\n\t')
You may specify batch_size and max_length to better utilize your machine resources.
# Set the batch size and maximum sentence length
ws = ws_driver(text, batch_size=256, max_length=128)