gitabtion / BertBasedCorrectionModels
PyTorch implementations of BERT-based Spelling Error Correction Models.
License: Apache License 2.0
I can train the model correctly with the public datasets, but when I use my company's data to train it, an error occurs:
ValueError: Expected input batch_size (80) to match target batch_size (88)
Note: my data format is JSON, the same as the public datasets above; a sample record follows:
{
"id": "--",
"original_text": "播放我的世界之梦想大陆",
"wrong_ids": [],
"correct_text": "播放我的世界之梦想大陆"
}
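A likely cause (an assumption on my part, not confirmed by the repo): original_text and correct_text must tokenize to the same length, otherwise the correction labels no longer line up with the inputs and you get exactly this kind of batch-size mismatch. A minimal pre-flight check over a JSON file in the format above (the file path is hypothetical):

import json
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

with open('datasets/csc/my_train.json', encoding='utf-8') as f:  # hypothetical path
    records = json.load(f)

for rec in records:
    src = tokenizer.tokenize(rec['original_text'])
    tgt = tokenizer.tokenize(rec['correct_text'])
    if len(src) != len(tgt):
        # Records like this misalign inputs and targets and can trigger
        # "Expected input batch_size (M) to match target batch_size (N)".
        print(rec['id'], rec['original_text'], len(src), len(tgt))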
Great work,
I wonder if this approach can be adapted to other languages?
I followed the steps and got a different result.
DATALOADER:0 TEST RESULTS
{'val_loss': 0.08779423662285873}
For a bert correction model trained with your configuration, using your evaluation script:
Sentence Level:
acc:0.793636, precision:0.828810, recall:0.732472, f1:0.777669
Using the REALISE model's evaluation script:
{'sent-detect-acc': 82.18181818181817,
'sent-detect-p': 72.86689419795222,
'sent-detect-r': 78.9279112754159,
'sent-detect-f1': 75.77639751552793,
'sent-correct-acc': 79.9090909090909,
'sent-correct-p': 68.60068259385666,
'sent-correct-r': 74.3068391866913,
'sent-correct-f1': 71.33984028393967}
You only count false positives when src == tgt, so the FP count comes out too small; that shrinks the denominator of the precision calculation and inflates the final precision.
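For comparison, a minimal sketch of sentence-level counting in which every prediction that changes the input but misses the target counts as a false positive, regardless of whether src == tgt (names are illustrative, not the repo's):

def sentence_level_prf(srcs, tgts, preds):
    """Count TP/FP/FN over sentences; a prediction is positive whenever
    the model changed the input at all."""
    tp = fp = fn = 0
    for src, tgt, pred in zip(srcs, tgts, preds):
        changed = pred != src
        if changed and pred == tgt:
            tp += 1
        elif changed and pred != tgt:
            fp += 1  # covers both src == tgt and src != tgt cases
        elif not changed and src != tgt:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1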
Does the HYPER_PARAMS value in train_SoftMaskedBert refer to the relative weighting of the detection loss and the correction loss?
Is it the value applied in loss = self.w * outputs[1] + (1 - self.w) * outputs[0] under CscTrainingModel.training_step?
That is, 0.8 means 0.2 * detection + 0.8 * correction?
Can I adjust this value to make the model focus on improving detection precision/recall/F1?
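Assuming the line quoted above, with outputs[0] the detection loss and outputs[1] the correction loss (which is what the 0.8 reading implies), the weighting works like this sketch:

def combined_loss(det_loss, cor_loss, w):
    """Soft-Masked BERT loss weighting: w on correction, (1 - w) on detection."""
    return w * cor_loss + (1 - w) * det_loss

# HYPER_PARAMS: [0.8]  ->  0.2 * detection + 0.8 * correction
# Lowering w (e.g. 0.3) shifts training toward detection, which should
# improve detection P/R/F1 at some cost to correction quality.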
Could you release the SoftMaskedBert model for evaluation?
Dependencies were installed strictly with the specified version numbers.
D:\SoftRun\Anaconda3\envs\torch161\python.exe E:/nlpcode/BertBasedCorrectionModels-master/tools/train_csc.py
2021-04-30 13:58:02,496 bert4csc INFO: Namespace(config_file='', opts=[])
2021-04-30 13:58:02,496 bert4csc INFO: Loaded configuration file csc/train_bert4csc.yml
2021-04-30 13:58:02,496 bert4csc INFO:
MODEL:
  BERT_CKPT: "bert-base-chinese"
  DEVICE: "cuda:0"
  NAME: "bert4csc"
  HYPER_PARAMS: [ 1.0 ]
  GPU_IDS: [0]
DATASETS:
  TRAIN: "datasets/csc/train.json"
  VALID: "datasets/csc/dev.json"
  TEST: "datasets/csc/test.json"
SOLVER:
  BASE_LR: 0.001
  WEIGHT_DECAY: 0.00001
  BATCH_SIZE: 16
  WARMUP_EPOCHS: 8
  MAX_EPOCHS: 20
  ACCUMULATE_GRAD_BATCHES: 16
TEST:
  BATCH_SIZE: 16
TASK:
  NAME: "csc"
OUTPUT_DIR: "checkpoints/bert4csc"
2021-04-30 13:58:02,496 bert4csc INFO: Running with config:
DATALOADER:
  NUM_WORKERS: 4
DATASETS:
  TEST: datasets/csc/test.json
  TRAIN: datasets/csc/train.json
  VALID: datasets/csc/dev.json
INPUT:
  MAX_LEN: 512
MODE: ['train', 'test']
MODEL:
  BERT_CKPT: bert-base-chinese
  DEVICE: cuda:0
  GPU_IDS: [0]
  HYPER_PARAMS: [1.0]
  NAME: bert4csc
  NUM_CLASSES: 10
  WEIGHTS:
OUTPUT_DIR: checkpoints/bert4csc
SOLVER:
  ACCUMULATE_GRAD_BATCHES: 16
  BASE_LR: 0.001
  BATCH_SIZE: 16
  BIAS_LR_FACTOR: 2
  CHECKPOINT_PERIOD: 10
  GAMMA: 0.1
  LOG_PERIOD: 100
  MAX_EPOCHS: 20
  MOMENTUM: 0.9
  OPTIMIZER_NAME: AdamW
  STEPS: (30000,)
  WARMUP_EPOCHS: 8
  WARMUP_FACTOR: 0.3333333333333333
  WARMUP_ITERS: 500
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 1e-05
  WEIGHT_DECAY_BIAS: 0
TASK:
  NAME: csc
TEST:
  BATCH_SIZE: 16
  CKPT_FN:
Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\utilities\distributed.py:49: UserWarning: Checkpoint directory E:\nlpcode\BertBasedCorrectionModels-master\checkpoints\bert4csc exists and is not empty.
warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
102 M Trainable params
0 Non-trainable params
102 M Total params
Validation sanity check: 0it [00:00, ?it/s]2021-04-30 13:58:11,384 bert4csc INFO: Valid.
Traceback (most recent call last):
File "E:/nlpcode/BertBasedCorrectionModels-master/tools/train_csc.py", line 53, in
main()
File "E:/nlpcode/BertBasedCorrectionModels-master/tools/train_csc.py", line 49, in main
train(cfg, model, loaders, ckpt_callback)
File "E:\nlpcode\BertBasedCorrectionModels-master\tools\bases.py", line 78, in train
trainer.fit(model, train_loader, valid_loader)
File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 470, in fit
results = self.accelerator_backend.train()
File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\accelerators\gpu_accelerator.py", line 68, in train
results = self.train_or_test()
File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 69, in train_or_test
results = self.trainer.train()
File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 492, in train
self.run_sanity_check(self.get_model())
File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 690, in run_sanity_check
_, eval_results = self.run_evaluation(test_mode=False, max_batches=self.num_sanity_val_batches)
File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 593, in run_evaluation
for batch_idx, batch in enumerate(dataloader):
File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\torch\utils\data\dataloader.py", line 291, in iter
return _MultiProcessingDataLoaderIter(self)
File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\torch\utils\data\dataloader.py", line 737, in init
w.start()
File "D:\SoftRun\Anaconda3\envs\torch161\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "D:\SoftRun\Anaconda3\envs\torch161\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "D:\SoftRun\Anaconda3\envs\torch161\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "D:\SoftRun\Anaconda3\envs\torch161\lib\multiprocessing\popen_spawn_win32.py", line 65, in init
reduction.dump(process_obj, to_child)
File "D:\SoftRun\Anaconda3\envs\torch161\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'get_csc_loader.<locals>._collate_fn'
Process finished with exit code 1
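This is the usual Windows multiprocessing limitation: DataLoader workers are spawned and must pickle the collate function, but a function defined locally inside get_csc_loader cannot be pickled. Setting DATALOADER.NUM_WORKERS to 0 in the config avoids worker processes entirely; the structural fix is to hoist the collate logic to module level, e.g. (a sketch; the class name and batch layout are assumptions, not the repo's code):

class CscCollator:
    """Picklable replacement for the local _collate_fn closure."""
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, batch):
        # Assumed batch layout: (original_text, correct_text, wrong_ids) triples.
        ori_texts, cor_texts, wrong_idss = zip(*batch)
        return list(ori_texts), list(cor_texts), list(wrong_idss)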
Hello, when calling the load_model_directly() method in inference.py, the trained model cannot be loaded. The code is as follows:
① Code:
def load_model_directly():
    ckpt_file = 'SoftMaskedBert/epoch=05-val_loss=0.03253.ckpt'
    config_file = 'csc/train_SoftMaskedBert.yml'
    from bbcm.config import cfg
    cp = get_abs_path('checkpoints', ckpt_file)
    cfg.merge_from_file(get_abs_path('configs', config_file))
    tokenizer = BertTokenizer.from_pretrained(cfg.MODEL.BERT_CKPT)
    print("### tokenizer loaded")
    print("### tokenizer: ", tokenizer)
    if cfg.MODEL.NAME in ['bert4csc', 'macbert4csc']:
        model = BertForCsc.load_from_checkpoint(cp,
                                                cfg=cfg,
                                                tokenizer=tokenizer)
    else:
        print("### loading model")
        print("### cp: ", cp)
        model = SoftMaskedBertModel.load_from_checkpoint(cp,
                                                         cfg=cfg,
                                                         tokenizer=tokenizer)
    print("### model loaded")
    model.eval()
    model.to(cfg.MODEL.DEVICE)
    return model
② Problem:
This code does not seem to take effect: the ckpt file is not loaded, and the program still downloads the model from huggingface automatically.
model = SoftMaskedBertModel.load_from_checkpoint(cp, cfg=cfg, tokenizer=tokenizer)
I looked into load_from_checkpoint(), but I do not understand how the cp and cfg arguments are passed through.
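For what it's worth, in PyTorch Lightning load_from_checkpoint takes the checkpoint path as its first positional argument, and every extra keyword argument (cfg and tokenizer here) is forwarded to the module's __init__; the weights from the ckpt are then loaded into that fresh instance. The huggingface download is therefore likely expected: __init__ still builds the architecture via from_pretrained before the checkpoint weights overwrite it. Annotated call (same names as the snippet above):

model = SoftMaskedBertModel.load_from_checkpoint(
    cp,                   # path to the .ckpt file on disk
    cfg=cfg,              # forwarded to SoftMaskedBertModel.__init__
    tokenizer=tokenizer,  # forwarded likewise
)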
Sentence level: what makes a sentence count as a positive example versus a negative one? Does a sentence count as positive only when all of its errors have been corrected?
How to output the position of the wrong word when inferring?
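The repo does not expose this directly as far as I know, but since predictions here are per-character replacements of equal length, a simple diff recovers the positions (a sketch; the sample prediction is assumed):

def wrong_positions(src: str, pred: str):
    """Return (index, original_char, corrected_char) for each changed position."""
    return [(i, s, p) for i, (s, p) in enumerate(zip(src, pred)) if s != p]

texts = ['今天我很高心']
# preds = model.predict(texts)
preds = ['今天我很高兴']  # assumed model output
print(wrong_positions(texts[0], preds[0]))  # [(5, '心', '兴')]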
Hello! Thank you for the open-source model!
During testing I found that when the sentence to be corrected contains certain Chinese symbols, "[UNK]" appears in the result. How should this be handled?
texts = ['今天我很“高心”']
model.predict(texts)
results:
['今天我很[UNK]高兴[UNK]']
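The curly quotes are out-of-vocabulary for bert-base-chinese, so the tokenizer maps them to [UNK] and decoding cannot recover them. A common workaround (a sketch, assuming each [UNK] corresponds to exactly one source character) is to copy the original character back at each [UNK] position:

import re

def restore_unk(src: str, pred: str) -> str:
    """Each '[UNK]' stands for exactly one source character here,
    so walk both strings in lockstep and copy the original char back."""
    pieces = re.split(r'(\[UNK\])', pred)
    out, i = [], 0
    for piece in pieces:
        if piece == '[UNK]':
            out.append(src[i])
            i += 1
        else:
            out.append(piece)
            i += len(piece)
    return ''.join(out)

print(restore_unk('今天我很“高心”', '今天我很[UNK]高兴[UNK]'))
# -> 今天我很“高兴”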
Other correction models I used before load pinyin and visually-similar-character dictionaries; during correction, candidates from those dictionaries are substituted in, and the model scores each candidate.
But in this Soft-Masked BERT, once e' is fed into the correction network, it directly outputs the corrected characters. Where do these characters come from? From the vocab file?
The F1 reached 90?
On the first run, an encoding error occurred while preprocessing B1_training.sgml. The file was downloaded from the URL provided; I tried passing encoding='utf-8' to open(), but it did not help. Inspecting the file manually did not reveal the problem either. The error first seems to occur while processing line 5842.
<PASSAGE id="B1-0826-1">因為那是我的第一次去北京,我的朋友就是我的導遊。跟他我們一起去了北京特別的地方,必如說長城、故宮、天堂公園什麼的。</PASSAGE>
Traceback (most recent call last):
File "/home/BertBasedCorrect/tools/train_csc.py", line 51, in <module>
main()
File "/home/BertBasedCorrect/tools/train_csc.py", line 28, in main
preproc()
File "/home/BertBasedCorrect/bbcm/data/processors/csc.py", line 185, in preproc
for item in read_data(get_abs_path('datasets', 'csc')):
File "/home/BertBasedCorrect/bbcm/data/processors/csc.py", line 116, in read_data
for line in f:
File "/home/anaconda3/envs/torch/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 4867: invalid start byte
Process finished with exit code 1
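The SIGHAN sgml files are not uniformly UTF-8, and a stray byte like 0x80 suggests a different (or mixed) encoding. A small standalone check (assuming nothing about the repo) to locate and inspect the offending bytes:

# Scan the raw bytes and report every line that is not valid UTF-8.
with open('datasets/csc/B1_training.sgml', 'rb') as f:
    for lineno, raw in enumerate(f, 1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as e:
            print(lineno, e, raw[max(e.start - 10, 0):e.start + 10])

# If the bad bytes are isolated noise, decoding with errors='ignore'
# (or errors='replace') in read_data() lets preprocessing continue.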
Why is the accuracy of this run so much higher than reported in the papers? Even fine-tuned BERT reaches state of the art?
At first I thought the training data sizes differed and the gain came from removing dirty data, but after adding all the training data back the accuracy is still very high; even a plain BERT fine-tune would count as SOTA...
Why is that?
Hello!
Good Job!
I reproduced your training, generated the best model, and ran inference on my own business samples (I do not have much business data), but found an issue.
During inference the model overcorrects: it changes a correct word in the paragraph into a wrong one, like this:
{
  "paragraph": "在本合同中,除上下文另有规定外,下列用语应当具有如下含义:",
  "error_fragments": [
    {
      "error_init_id": 26,
      "error_end_id": 27,
      "src_fragment": "含",   <- the correct word in the paragraph
      "tgt_fragment": "涵"    <- the wrong word output by the model
    }
  ]
}
So there should be a white-list config plus code to fix this overcorrection issue (a sketch follows below).
I'd like to contribute this code to tools/inference.py as a pull request.
Do you approve, and do you have any good ideas? Let's keep talking, thanks.
Best Regards
Yazhou
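A minimal sketch of the proposed white-list post-filter (names are illustrative, not the repo's API): any model edit whose source fragment is on the white list is reverted before the result is returned.

WHITE_LIST = {'含义'}  # hypothetical entries: words the model must never touch

def apply_white_list(paragraph: str, pred: str, white_list=WHITE_LIST) -> str:
    """Revert corrections at positions covered by a white-listed word."""
    chars = list(pred)
    for word in white_list:
        start = paragraph.find(word)
        while start != -1:
            chars[start:start + len(word)] = paragraph[start:start + len(word)]
            start = paragraph.find(word, start + 1)
    return ''.join(chars)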
How do I switch between char level and sentence level?
When training on my own data, should NUM_CLASSES be changed to the number of entries in the vocab?
The config file used is as follows:
MODEL:
  BERT_CKPT: "bert-base-chinese"
  DEVICE: "cpu"
  NAME: "SoftMaskedBertModel"
  HYPER_PARAMS: [0.8]
DATASETS:
  TRAIN: "datasets/csc/train.json"
  VALID: "datasets/csc/dev.json"
  TEST: "datasets/csc/test.json"
SOLVER:
  BASE_LR: 0.0001
  WEIGHT_DECAY: 5e-8
  BATCH_SIZE: 32
  MAX_EPOCHS: 10
  ACCUMULATE_GRAD_BATCHES: 4
TEST:
  BATCH_SIZE: 16
TASK:
  NAME: "csc"
OUTPUT_DIR: "checkpoints/SoftMaskedBert"
Command run:
python tools/train_csc.py --config_file csc/train_SoftMaskedBert.yml
Error:
/Users//opt/anaconda3/envs/sc/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:49: UserWarning: Checkpoint directory /Users///Documents/personal/BertBasedCorrectionModels/checkpoints/SoftMaskedBert exists and is not empty.
warnings.warn(*args, **kwargs)
GPU available: False, used: False
TPU available: None, using: 0 TPU cores
Traceback (most recent call last):
File "tools/train_csc.py", line 52, in
main()
File "tools/train_csc.py", line 48, in main
train(cfg, model, loaders, ckpt_callback)
File "/Users///Documents/personal/BertBasedCorrectionModels/tools/bases.py", line 78, in train
trainer.fit(model, train_loader, valid_loader)
File "/Users//opt/anaconda3/envs/sc/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 454, in fit
self.accelerator_backend.setup(model)
File "/Users///opt/anaconda3/envs/sc/lib/python3.6/site-packages/pytorch_lightning/accelerators/cpu_accelerator.py", line 49, in setup
self.setup_optimizers(model)
File "/Users///opt/anaconda3/envs/sc/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 145, in setup_optimizers
optimizers, lr_schedulers, optimizer_frequencies = self.trainer.init_optimizers(model)
File "/Users///opt/anaconda3/envs/sc/lib/python3.6/site-packages/pytorch_lightning/trainer/optimizers.py", line 30, in init_optimizers
optim_conf = model.configure_optimizers()
File "/Users///Documents/personal/BertBasedCorrectionModels/bbcm/engine/bases.py", line 21, in configure_optimizers
scheduler = build_lr_scheduler(self.cfg, optimizer)
File "/Users///Documents/personal/BertBasedCorrectionModels/bbcm/solver/build.py", line 49, in build_lr_scheduler
scheduler = getattr(lr_scheduler, cfg.SOLVER.SCHED)(scheduler_args)
File "/Users///Documents/personal/BertBasedCorrectionModels/bbcm/solver/lr_scheduler.py", line 73, in init
super().init(optimizer, last_epoch, verbose)
TypeError: init() takes from 2 to 3 positional arguments but 4 were given
Can it handle English, and things like punctuation?
Hello, thanks for open-sourcing this. What is the meaning of det_labels during model training?
class BertForCsc(CscTrainingModel):
    def __init__(self, cfg, tokenizer):
        super().__init__(cfg)
        self.cfg = cfg
        self.bert = BertForMaskedLM.from_pretrained(cfg.MODEL.BERT_CKPT)
        self.tokenizer = tokenizer

    def forward(self, texts, cor_labels=None, det_labels=None):
        # print('text: ', texts)
        # print('cor_labels: ', cor_labels)
        # print('det labels: ', det_labels)
        if cor_labels is not None:
            # correct-label texts provided
            text_labels = self.tokenizer(cor_labels, padding=True, return_tensors='pt')['input_ids']
            text_labels = text_labels.to(self.cfg.MODEL.DEVICE)
            print('text labels: ', text_labels)
            # Tokens with indices set to -100 are ignored (masked)
            text_labels[text_labels == 0] = -100
I get it now... det_labels are for error detection, but it seems this model does not actually do detection, right? It appears to perform correction directly, which is why the computed det_acc is always 1.
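For reference, det_labels are normally derived from the wrong_ids field of the dataset: a binary sequence marking which character positions are wrong. A sketch of that construction (the helper name is illustrative; bert4csc simply has no detection head to consume it, unlike SoftMaskedBert):

def build_det_labels(text, wrong_ids):
    """1 at the character positions listed in wrong_ids, 0 elsewhere."""
    labels = [0] * len(text)
    for i in wrong_ids:
        labels[i] = 1
    return labels

# From the dataset format shown earlier:
print(build_det_labels('播放我的世界之梦想大陆', []))  # all zeros: nothing to detect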
Hello! Can this project only do spelling correction? Can it handle Chinese grammatical error correction, e.g., missing, redundant, or out-of-order characters?
When I run it, the output contains sentence-level results plus detection and correction results; I do not understand how the char-level results in your README were obtained.
Training with the author's dataset works, but with my own dataset it raises an error:
ValueError: Expected input batch_size (304) to match target batch_size (336).
If I merge my dataset into the author's, training also works; only training on my data alone fails. Do other parameters need to be changed as well?
At line 30 of BertBasedCorrectionModels/bbcm/modeling/csc/modeling_bert4csc.py:
text_labels[text_labels == 0] = -100
Why is the value set to -100 here? Could you briefly explain? Thanks!
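To my understanding, token id 0 is [PAD] in the bert-base-chinese vocab, and -100 is the default ignore_index of PyTorch's CrossEntropyLoss (which the masked-LM loss uses), so this line stops padding positions from contributing to the correction loss. A self-contained illustration:

import torch
import torch.nn.functional as F

logits = torch.randn(1, 4, 21128)           # (batch, seq_len, vocab_size)
labels = torch.tensor([[791, 1921, 0, 0]])  # last two tokens are [PAD] (id 0)

labels[labels == 0] = -100                  # -100 == default ignore_index
loss = F.cross_entropy(logits.view(-1, 21128), labels.view(-1))
# padded positions are now excluded from the loss average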
Hi guys,
Good job.
In the README.md file, part 2:
from tools.inference import *
ckpt_fn = 'SoftMaskedBert/epoch=02-val_loss=0.02904.ckpt' # find it in checkpoints/
config_file = 'csc/train_SoftMaskedBert.yml' # find it in configs/
model = load_model_directly(ckpt_fn=ckpt_fn, config_file=config_file)
texts = ['今天我很高心', '测试', '继续测试']
model.predict(texts)
I suggest updating this file to show the model's predict output and its format (the output data type is a list).
Or would you allow me to update it via a pull request?
Let's keep talking.
Best Regards
Yazhou