retarfi / language-pretraining

Stars: 46 · Watchers: 4 · Forks: 6 · Size: 208 KB

Pre-training Language Models for Japanese

License: MIT License

Python 100.00%

nlp natural-language-processing pytorch transformer implementation language-models language-model bert electra japanese

language-pretraining's People

Forkers

magnetizedcell yosuken4 kirawang23 upura ykawakami-mhd stophobia

language-pretraining's Issues

Why is 5 added to line_per_file?

I would like to ask why the code uses line_per_file: int = line_all // num_files + 5 rather than line_per_file: int = line_all // num_files.

https://github.com/retarfi/language-pretraining/blob/ebf6b40ad7bfd6c76c415f86f41c0c35b9bab25f/train_tokenizer.py#LL39C9-L40C1
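The thread does not record the author's reason, so the following is only an illustration of the arithmetic, not an answer on the author's behalf. It assumes the corpus is cut into consecutive files of at most line_per_file lines each; count_chunks and the numbers are hypothetical. With plain floor division, the remainder lines spill into one extra file, while any positive margin keeps the file count at or below num_files.

```python
def count_chunks(line_all: int, line_per_file: int) -> int:
    """Number of files needed when each file holds at most line_per_file lines."""
    return -(-line_all // line_per_file)  # integer ceiling division


# Hypothetical corpus size and target number of files.
line_all, num_files = 103, 10

floor_size = line_all // num_files       # plain floor division
padded_size = line_all // num_files + 5  # the expression quoted in the issue

chunks_floor = count_chunks(line_all, floor_size)
chunks_padded = count_chunks(line_all, padded_size)

# Floor division needs one file more than num_files; the padded size does not.
print(chunks_floor, chunks_padded)
```

Under the same assumption, a margin of 1 would already be enough to stay within num_files, so the choice of 5 specifically remains a question for the author.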

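The loss value cannot be obtained

Thank you for providing the pre-trained models and the source code.
I am trying to run additional training with the provided source code, but the following error occurs.

Location: utils.trainer.py, line 727
Error: KeyError 'loss'

When I checked outputs.loss, the value was None.
I therefore believe the cause is that the loss is not being computed at line 717, but I have not been able to resolve it.

I hope this problem can be resolved.
Thank you in advance.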
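Whether this is the cause here is not confirmed in the thread, but one common way to get this symptom with Hugging Face Transformers models is a batch without a labels entry: the model then returns loss as None, and looking up the "loss" key on the output raises KeyError. A minimal diagnostic sketch, using plain dictionaries in place of real batches; the helper name missing_loss_inputs and the sample batches are hypothetical.

```python
from typing import Any, List, Mapping


def missing_loss_inputs(batch: Mapping[str, Any],
                        required: tuple = ("labels",)) -> List[str]:
    """Return the keys from `required` that are absent (or None) in the batch."""
    return [key for key in required if batch.get(key) is None]


# Hypothetical batches as they might leave a data collator.
batch_without_labels = {"input_ids": [[2, 15, 4, 3]], "attention_mask": [[1, 1, 1, 1]]}
batch_with_labels = {**batch_without_labels, "labels": [[-100, 15, -100, -100]]}

print(missing_loss_inputs(batch_without_labels))  # keys that would leave the loss uncomputed
print(missing_loss_inputs(batch_with_labels))
```

Calling such a check just before the forward pass would show whether the data collator or dataset column filtering is dropping the labels.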

get_indices_per_process fails when node_rank is -1

When node_rank=-1 on this line,

language-pretraining/utils/trainer_pt_utils.py

Line 28 in ebf6b40

 num_before = sum([batch_config[str(i)] for i in range(node_rank)]) + per_device_batch_size * (rank - node_rank * nproc_per_node) 

expand_idx becomes larger than the length of indices, so the call fails.

language-pretraining/utils/trainer_pt_utils.py

Line 32 in ebf6b40

indices = np.array(indices)[expand_idx].tolist()
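To make the failure concrete, the sketch below evaluates only the quoted num_before expression for node_rank=0 and node_rank=-1. The wrapper function, the batch_config contents, and the sizes are hypothetical; how expand_idx is built from num_before is not shown in the issue and is assumed here to start at that offset.

```python
def num_before(batch_config, node_rank, rank, nproc_per_node, per_device_batch_size):
    """Wrapper around the expression quoted from line 28 (wrapper itself is hypothetical)."""
    return sum([batch_config[str(i)] for i in range(node_rank)]) \
        + per_device_batch_size * (rank - node_rank * nproc_per_node)


# Hypothetical single-node setup: 2 processes x 4 samples = 8 samples per step.
batch_config = {"0": 8}
nproc_per_node, per_device_batch_size, rank = 2, 4, 0

offset_ok = num_before(batch_config, 0, rank, nproc_per_node, per_device_batch_size)
offset_bad = num_before(batch_config, -1, rank, nproc_per_node, per_device_batch_size)

# range(-1) is empty, and "- node_rank * nproc_per_node" turns into "+ nproc_per_node",
# so the offset is shifted by one whole node's worth of samples.
print(offset_ok, offset_bad)

indices = list(range(8))  # one step's worth of sample indices
try:
    indices[offset_bad]
except IndexError as err:
    print("out of range:", err)
```

A guard such as treating node_rank=-1 as 0 for single-node runs would avoid the shift, though the thread does not say which fix the author prefers.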

ipadic problem for 四半期連結会計期間末日満期手形

Thank you for releasing bert-small-japanese-fin and the ELECTRA models for FinTech. However, I have found that they tokenize "四半期連結会計期間末日満期手形" inconsistently:

>>> from transformers import AutoTokenizer
>>> tokenizer=AutoTokenizer.from_pretrained("izumi-lab/bert-small-japanese-fin")
>>> tokenizer.tokenize("四半期連結会計期間末日満期手形")
['四半期', '連結', '会計', '期間', '末日', '満期', '手形']
>>> tokenizer.tokenize("第3四半期連結会計期間末日満期手形")
['第', '3', '四半期連結会計期間末日満期手形']

This is caused by a bug in ipadic's 名詞,数 tokenization of 漢字 strings that begin with a 漢数字.

>>> import fugashi,ipadic
>>> parser=fugashi.GenericTagger(ipadic.MECAB_ARGS).parse
>>> print(parser("四半期連結会計期間末日満期手形"))
四半期  名詞,一般,*,*,*,*,四半期,シハンキ,シハンキ
連結    名詞,サ変接続,*,*,*,*,連結,レンケツ,レンケツ
会計    名詞,サ変接続,*,*,*,*,会計,カイケイ,カイケイ
期間    名詞,一般,*,*,*,*,期間,キカン,キカン
末日    名詞,一般,*,*,*,*,末日,マツジツ,マツジツ
満期    名詞,一般,*,*,*,*,満期,マンキ,マンキ
手形    名詞,一般,*,*,*,*,手形,テガタ,テガタ
EOS
>>> print(parser("第3四半期連結会計期間末日満期手形"))
第      接頭詞,数接続,*,*,*,*,第,ダイ,ダイ
3       名詞,数,*,*,*,*,*
四半期連結会計期間末日満期手形  名詞,数,*,*,*,*,*
EOS

I recommend using a tokenizer other than BertJapaneseTokenizer+ipadic. See the details in my diary (written in Japanese).
