
longformer_zh's Introduction

中文预训练Longformer模型 | Longformer_ZH with PyTorch

相比于Transformer的O(n^2)复杂度,Longformer提供了一种以线性复杂度处理最长4K字符级别文档序列的方法。Longformer Attention包括了标准的自注意力与全局注意力机制,方便模型更好地学习超长序列的信息。

Compared with the O(n^2) complexity of the standard Transformer, Longformer provides an efficient method for processing document-level sequences of up to 4K tokens in linear complexity. Longformer's attention mechanism is a drop-in replacement for standard self-attention, combining local windowed attention with task-motivated global attention, which helps the model learn from very long sequences.
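As a toy sketch of the complexity claim above (illustrative only, not this repository's code): full self-attention scores every pair of positions, while a sliding window of radius w only scores pairs within distance w of each token, which is linear in sequence length for a fixed window.

```python
# Toy illustration of the complexity claim above; not code from this repo.
def full_attention_pairs(n: int) -> int:
    """Score computations for full self-attention: every pair of positions."""
    return n * n  # O(n^2)

def windowed_attention_pairs(n: int, w: int) -> int:
    """Score computations for a sliding window of radius w around each token."""
    return n * (2 * w + 1)  # O(n * w): linear in n for fixed w

print(full_attention_pairs(4096))           # 16777216
print(windowed_attention_pairs(4096, 256))  # 2101248
```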

我们注意到关于中文Longformer或超长序列任务的资源较少,因此在此开源了我们预训练的中文Longformer模型参数, 并提供了相应的加载方法,以及预训练脚本。

There are few resources for Chinese Longformer models or long-sequence Chinese NLP tasks, so we open-source our pretrained Longformer_zh weights, together with loading code and pretraining scripts, to help researchers.

加载模型 | Load the model

您可以使用谷歌云盘或百度网盘下载我们的模型
You can download Longformer_zh from Google Drive or Baidu Yun.

我们同样提供了Huggingface的自动下载
We also support automatic download via HuggingFace Transformers:

from Longformer_zh import LongformerZhForMaskedLM
model = LongformerZhForMaskedLM.from_pretrained('ValkyriaLenneth/longformer_zh')

注意事项 | Notice

  • 直接使用 transformers.LongformerModel.from_pretrained 加载模型

  • Please use transformers.LongformerModel.from_pretrained to load the model directly (a minimal loading sketch follows this list).

  • 以下内容已经被弃用

  • The following notices are deprecated; please ignore them.

  • 区别于英文原版Longformer, 中文Longformer的基础是Roberta_zh模型,其本质上属于 Transformers.BertModel 而非 RobertaModel, 因此无法使用原版代码直接加载。

  • Unlike the original English Longformer, Longformer_zh is based on Roberta_zh, which is essentially a Transformers.BertModel rather than a RobertaModel, so it cannot be loaded with the original code.

  • 我们提供了修改后的中文Longformer文件,您可以使用其加载参数。

  • We provide a modified Longformer_zh class that you can use directly to load the weights.

  • 如果您想将此参数用于更多任务,请参考Longformer_zh.py替换Attention Layer.

  • If you want to use our weights on more downstream tasks, please refer to Longformer_zh.py and replace the attention layers with Longformer attention layers.
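A minimal loading sketch following the notice above. The hub ID comes from the snippet in the previous section; pairing with BertTokenizer (rather than LongformerTokenizer) follows the tokenizer issue further down this page, which confirms that it works:

```python
import torch
from transformers import BertTokenizer, LongformerModel

# BertTokenizer, not LongformerTokenizer: the vocab is BERT-style,
# inherited from Roberta_zh (see the tokenizer issue below).
tokenizer = BertTokenizer.from_pretrained("ValkyriaLenneth/longformer_zh")
model = LongformerModel.from_pretrained("ValkyriaLenneth/longformer_zh")

inputs = tokenizer("这是一段用于测试的中文长文本。", return_tensors="pt")

# Global attention on [CLS]; all other tokens use the local sliding window.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```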

关于预训练 | About Pretraining

效果测试 | Evaluation

CCF Sentiment Analysis

  • 由于中文超长文本级别任务稀缺,我们采用了CCF-Sentiment-Analysis任务进行测试
  • Since open-sourced long-sequence-level Chinese NLP tasks are scarce, we use CCF-Sentiment-Analysis for evaluation.
| Model | Dev F1 |
| :-- | :-- |
| Bert | 80.3 |
| Bert-wwm-ext | 80.5 |
| Roberta-mid | 80.5 |
| Roberta-large | 81.25 |
| Longformer_SC | 79.37 |
| Longformer_ZH | 80.51 |

Pretraining BPC

  • 我们提供了预训练BPC(bits-per-character), BPC越小,代表语言模型性能更优。可视作PPL.
  • We also report pretraining BPC (bits-per-character) scores; the lower the BPC, the better the language model. It can be read like PPL (see the conversion sketch below the table).
| Model | BPC |
| :-- | :-- |
| Longformer before training | 14.78 |
| Longformer after training | 3.10 |
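The conversion referenced above is standard and not specific to this repository: a per-character cross-entropy loss H in nats corresponds to BPC = H / ln 2, and PPL = e^H.

```python
import math

def bpc_from_loss(loss_nats: float) -> float:
    # Per-character cross-entropy converted from nats to bits.
    return loss_nats / math.log(2)

def ppl_from_loss(loss_nats: float) -> float:
    # Perplexity is the exponentiated per-character cross-entropy.
    return math.exp(loss_nats)

# The post-training BPC of 3.10 above corresponds to a per-character
# loss of 3.10 * ln(2) ≈ 2.149 nats.
print(bpc_from_loss(3.10 * math.log(2)))  # -> 3.1 (up to float rounding)
```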

CMRC (Chinese Machine Reading Comprehension)

| Model | F1 | EM |
| :-- | :-- | :-- |
| Bert | 85.87 | 64.90 |
| Roberta | 86.45 | 66.57 |
| Longformer_zh | 86.15 | 66.84 |

Chinese Coreference Resolution

| Model | Conll-F1 | Precision | Recall |
| :-- | :-- | :-- | :-- |
| Bert | 66.82 | 70.30 | 63.67 |
| Roberta | 67.77 | 69.28 | 66.32 |
| Longformer_zh | 67.81 | 70.13 | 65.64 |

致谢 | Acknowledgements

感谢东京工业大学 奥村·船越研究室 提供算力。

Thanks to the Okumura-Funakoshi Lab at Tokyo Institute of Technology for providing the computing resources and the opportunity to finish this project.


longformer_zh's Issues

Sequence length should be multiple of 512. It can't be used directly for encoding

File "D:\Anaconda\envs\torch_1.7\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "D:\Anaconda\envs\torch_1.7\lib\site-packages\transformers\models\bert\modeling_bert.py", line 1068, in forward
return_dict=return_dict,
File "D:\Anaconda\envs\torch_1.7\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "D:\Anaconda\envs\torch_1.7\lib\site-packages\transformers\models\bert\modeling_bert.py", line 591, in forward
output_attentions,
File "D:\Anaconda\envs\torch_1.7\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "D:\Anaconda\envs\torch_1.7\lib\site-packages\transformers\models\bert\modeling_bert.py", line 476, in forward
past_key_value=self_attn_past_key_value,
File "D:\Anaconda\envs\torch_1.7\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "D:\Anaconda\envs\torch_1.7\lib\site-packages\transformers\models\bert\modeling_bert.py", line 408, in forward
output_attentions,
File "D:\Anaconda\envs\torch_1.7\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "I:\PycharmProject\zh_efficient-autogressive-EL\model\Longformer_zh.py", line 21, in forward
output_attentions=output_attentions)
File "D:\Anaconda\envs\torch_1.7\lib\site-packages\transformers\models\longformer\modeling_longformer.py", line 591, in forward
query_vectors, key_vectors, self.one_sided_attn_window_size
File "D:\Anaconda\envs\torch_1.7\lib\site-packages\transformers\models\longformer\modeling_longformer.py", line 803, in _sliding_chunks_query_key_matmul
), f"Sequence length should be multiple of {window_overlap * 2}. Given {seq_len}"
AssertionError: Sequence length should be multiple of 512. Given 158

Did you miss something that pads the sequence to a suitable length?
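One possible workaround, sketched under the assumption that the custom attention layer from Longformer_zh.py is wired into a BertModel as in the traceback above (the stock transformers.LongformerModel pads to the window size internally, but a hand-wired layer does not): pad inputs up to a multiple of 512 at tokenization time.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("ValkyriaLenneth/longformer_zh")

# pad_to_multiple_of rounds the padded length up to a multiple of 512,
# satisfying the assertion (window_overlap * 2 == 512); the returned
# attention_mask zeros out the padding tokens.
inputs = tokenizer(
    "一段长度不是512倍数的文本。",
    return_tensors="pt",
    padding=True,
    pad_to_multiple_of=512,
)
print(inputs["input_ids"].shape)  # sequence dimension is a multiple of 512
```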

spelling mistake?

我们同样提供了Huggingface的自动下载
We also provide auto load with HuggingFace.Transformers.
from Longformer_zh import LongformerZhForMaksedLM
LongformerZhForMaksedLM.from_pretrained('ValkyriaLenneth/longformer_zh')

ImportError: cannot import name 'LongformerZhForMaksedLM' from 'Longformer_zh'

LongformerZhForMaksedLM or LongformerZhForMaskedLM ?
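For what it's worth, the ImportError above suggests the class defined in the repository uses the "Masked" spelling, so the corrected import would presumably be:

```python
# Assumption: the class is named LongformerZhForMaskedLM; "Maksed" in the
# README's original import line appears to be a transposition typo.
from Longformer_zh import LongformerZhForMaskedLM

model = LongformerZhForMaskedLM.from_pretrained('ValkyriaLenneth/longformer_zh')
```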

Using LongformerTokenizer directly raises this error; should BertTokenizer be used instead?

I'm not sure if it's able to directly ask you questions in Chinese. If it caused misinterpretations, I can change to English.

Hello! I am currently using your pretrained model, with files downloaded from https://huggingface.co/ValkyriaLenneth/longformer_zh. If I use AutoTokenizer directly, the code automatically invokes LongformerTokenizer, which then raises the following error:

Traceback (most recent call last):
  File "mypath/trylongformerzh1.py", line 3, in <module>
    tokenizer = LongformerTokenizer.from_pretrained("pretrain_path/longformer_zh")
  File "virtualenv_path/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1744, in from_pretrained
    return cls._from_pretrained(
  File "virtualenv_path/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1872, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "virtualenv_path/lib/python3.8/site-packages/transformers/models/roberta/tokenization_roberta.py", line 159, in __init__
    super().__init__(
  File "virtualenv_path/lib/python3.8/site-packages/transformers/models/gpt2/tokenization_gpt2.py", line 179, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType

I noticed that your code uses BertTokenizerFast, so should the tokenizer for longformer_zh also be loaded with BertTokenizer?
Using BertTokenizer directly does indeed work for me.

Also: I load your model with transformers.LongformerModel.from_pretrained. I haven't tested other functionality yet, but loading the model directly seems to work.

My transformers version is 4.12.5, and the code that runs successfully for me is:

from transformers import BertTokenizer, LongformerModel

tokenizer = BertTokenizer.from_pretrained("pretrain_path/longformer_zh")

model = LongformerModel.from_pretrained("pretrain_path/longformer_zh")

If you have time to look at this issue, I would be very grateful!
