
openclap's Introduction

OpenCLaP: A Multi-Domain Open-Source Repository of Chinese Pre-trained Language Models

Table of Contents

  • Project Overview
  • Model Overview
  • Usage
  • Project Website
  • Citation
  • Authors and Acknowledgements

Project Overview

OpenCLaP (Open Chinese Language Pre-trained Model Zoo) is a multi-domain repository of Chinese pre-trained language models released by the Natural Language Processing and Social Humanities Computing Research Center of the Institute for Artificial Intelligence, Tsinghua University. By pre-training on large-scale text, a pre-trained language model can serve as the initial parameters or the input representations of downstream natural language processing models, improving their overall performance. The model zoo has the following features:

  • Multi-domain. We currently provide models pre-trained on legal documents and on Baidu Baike, offering a diverse set of models to choose from.
  • Strong capability. We adopt the mainstream BERT architecture for pre-training and support inputs of up to 512 tokens to fit a wider range of task requirements.
  • Continuously updated. We will add more pre-trained models in the near future, e.g. models trained on more diverse corpora and models trained with the recent Whole Word Masking strategy.

Model Overview

Below is an overview of the models we have released so far:

Name | Base model | Data source | Training data | Vocab size | Model size | Download
---- | ---------- | ----------- | ------------- | ---------- | ---------- | --------
民事文书BERT (Civil Documents BERT) | bert-base | all civil case documents | 26.54M documents | 22,554 | 370 MB | Download
刑事文书BERT (Criminal Documents BERT) | bert-base | all criminal case documents | 6.63M documents | 22,554 | 370 MB | Download
百度百科BERT (Baidu Baike BERT) | bert-base | Baidu Baike | 9.03M entries | 22,166 | 367 MB | Download

Usage

Our models can be used directly with the open-source project pytorch-pretrained-BERT. Taking 民事文书BERT (Civil Documents BERT) as an example, usage takes two steps:

  • First, download and unpack the model:
wget https://thunlp.oss-cn-qingdao.aliyuncs.com/bert/ms.zip
unzip ms.zip
  • At run time, point the toolkit to our model with --bert_model $model_folder (a Python loading sketch follows below).
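
If you load the model from Python rather than via command-line flags, the following is a minimal loading sketch with pytorch-pretrained-BERT. The directory name ms is an assumption based on the archive name in the download step above.

# Minimal loading sketch with pytorch-pretrained-BERT.
# The directory name "ms" is assumed from `unzip ms.zip` above.
from pytorch_pretrained_bert import BertTokenizer, BertModel

model_folder = "ms"
tokenizer = BertTokenizer.from_pretrained(model_folder)  # reads vocab.txt
model = BertModel.from_pretrained(model_folder)          # reads config + weights
model.eval()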

Project Website

Please visit http://zoo.thunlp.org for more information.

Citation

BibTeX:

@techreport{zhong2019openclap,
  title  = {Open Chinese Language Pre-trained Model Zoo},
  author = {Zhong, Haoxi and Zhang, Zhengyan and Liu, Zhiyuan and Sun, Maosong},
  year   = {2019},
  url    = {https://github.com/thunlp/openclap},
}

Authors and Acknowledgements

Haoxi Zhong (钟皓曦, M.S. student), Zhengyan Zhang (张正彦, undergraduate student), Zhiyuan Liu (刘知远, Associate Professor), Maosong Sun (孙茂松, Professor).

We thank 幂律智能 for their generous support of and help with this project.


openclap's Issues

Can it be used for Chinese speech recognition?

Hello, and thanks for your work. Can this language model be used to improve the accuracy of Chinese speech recognition? Could you provide usage instructions? Many thanks.

Two questions about training details

1. For next sentence prediction, which punctuation marks are used to split the text into sentence pairs (commas, periods, semicolons)?
2. Roughly how long is each sentence used in next sentence prediction?
Many thanks.

Warning when saving vocabulary

The following warning appears when saving the model:

>>> tokenizer.save_vocabulary(model_dir)
Saving vocabulary to vocab.txt: vocabulary indices are not consecutive. Please check that the vocabulary is not corrupted!

Why not run Chinese word segmentation before pre-training?

I'd like to ask: your pre-trained BERT models are character-based, correct? Why not first run Chinese word segmentation on the criminal/civil documents and then pre-train? Wouldn't that give better results? What is the reasoning behind not doing so?

Missing config.json file

I renamed the unzipped folder to 民事文书BERT and loaded it with the code below, which raised the following error.

from transformers import BertTokenizer, BertModel
import torch

model = BertModel.from_pretrained('./预训练模型/民事文书BERT')
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
~/miniconda3/lib/python3.8/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    496             # Load from URL or cache if already cached
--> 497             resolved_config_file = cached_path(
    498                 config_file,

~/miniconda3/lib/python3.8/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only)
   1343         # File, but it doesn't exist.
-> 1344         raise EnvironmentError(f"file {url_or_filename} not found")
   1345     else:

OSError: file ./预训练模型/民事文书BERT/config.json not found

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-11-693f89123985> in <module>
      2 import torch
      3 
----> 4 model = BertModel.from_pretrained('./预训练模型/民事文书BERT')

~/miniconda3/lib/python3.8/site-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   1071         if not isinstance(config, PretrainedConfig):
   1072             config_path = config if config is not None else pretrained_model_name_or_path
-> 1073             config, model_kwargs = cls.config_class.from_pretrained(
   1074                 config_path,
   1075                 *model_args,

~/miniconda3/lib/python3.8/site-packages/transformers/configuration_utils.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    438 
    439         """
--> 440         config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
    441         if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
    442             logger.warn(

~/miniconda3/lib/python3.8/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    515                 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\n\n"
    516             )
--> 517             raise EnvironmentError(msg)
    518 
    519         except json.JSONDecodeError:

OSError: Can't load config for './预训练模型/民事文书BERT'. Make sure that:

- './预训练模型/民事文书BERT' is a correct model identifier listed on 'https://huggingface.co/models'

- or './预训练模型/民事文书BERT' is the correct path to a directory containing a config.json file

Is my model-loading code wrong, or should I rename bert_config.json in the model folder to config.json?
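
A likely explanation (an assumption based on the error message, not an official answer): the released checkpoint ships bert_config.json, the file name used by the older pytorch-pretrained-BERT library, while the newer transformers library looks for config.json. Copying the file under the new name should let from_pretrained locate it:

# Assumed fix: transformers expects config.json, while the checkpoint
# ships bert_config.json (the old pytorch-pretrained-BERT file name).
import shutil

model_dir = "./预训练模型/民事文书BERT"  # path from the question above
shutil.copyfile(f"{model_dir}/bert_config.json", f"{model_dir}/config.json")

from transformers import BertModel
model = BertModel.from_pretrained(model_dir)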

How to pre-train models for other case types?

Hello, pre-trained models for civil and criminal cases are already provided. How can I train pre-trained models for other case types, e.g. administrative cases or state compensation cases? Thanks.

Problems encountered when fine-tuning the model

The parameter names in the THU model do not match the parameter names expected by pytorch_pretrained_bert, so loading the model for fine-tuning reports mismatched weights:
['bert.embeddings.LayerNorm.gamma', 'bert.embeddings.LayerNorm.beta', 'bert.encoder.layer.0.attention.output.LayerNorm.gamma', 'bert.encoder.layer.0.attention.output.LayerNorm.beta', 'bert.encoder.layer.0.output.LayerNorm.gamma', 'bert.encoder.layer.0.output.LayerNorm.beta', 'bert.encoder.layer.1.attention.output.LayerNorm.gamma', 'bert.encoder.layer.1.attention.output.LayerNorm.beta', 'bert.encoder.layer.1.output.LayerNorm.gamma', 'bert.encoder.layer.1.output.LayerNorm.beta', 'bert.encoder.layer.2.attention.output.LayerNorm.gamma', 'bert.encoder.layer.2.attention.output.LayerNorm.beta', 'bert.encoder.layer.2.output.LayerNorm.gamma', 'bert.encoder.layer.2.output.LayerNorm.beta', 'bert.encoder.layer.3.attention.output.LayerNorm.gamma', 'bert.encoder.layer.3.attention.output.LayerNorm.beta', 'bert.encoder.layer.3.output.LayerNorm.gamma', 'bert.encoder.layer.3.output.LayerNorm.beta', 'bert.encoder.layer.4.attention.output.LayerNorm.gamma', 'bert.encoder.layer.4.attention.output.LayerNorm.beta', 'bert.encoder.layer.4.output.LayerNorm.gamma', 'bert.encoder.layer.4.output.LayerNorm.beta', 'bert.encoder.layer.5.attention.output.LayerNorm.gamma', 'bert.encoder.layer.5.attention.output.LayerNorm.beta', 'bert.encoder.layer.5.output.LayerNorm.gamma', 'bert.encoder.layer.5.output.LayerNorm.beta', 'bert.encoder.layer.6.attention.output.LayerNorm.gamma', 'bert.encoder.layer.6.attention.output.LayerNorm.beta', 'bert.encoder.layer.6.output.LayerNorm.gamma', 'bert.encoder.layer.6.output.LayerNorm.beta', 'bert.encoder.layer.7.attention.output.LayerNorm.gamma', 'bert.encoder.layer.7.attention.output.LayerNorm.beta', 'bert.encoder.layer.7.output.LayerNorm.gamma', 'bert.encoder.layer.7.output.LayerNorm.beta', 'bert.encoder.layer.8.attention.output.LayerNorm.gamma', 'bert.encoder.layer.8.attention.output.LayerNorm.beta', 'bert.encoder.layer.8.output.LayerNorm.gamma', 'bert.encoder.layer.8.output.LayerNorm.beta', 'bert.encoder.layer.9.attention.output.LayerNorm.gamma', 'bert.encoder.layer.9.attention.output.LayerNorm.beta', 'bert.encoder.layer.9.output.LayerNorm.gamma', 'bert.encoder.layer.9.output.LayerNorm.beta', 'bert.encoder.layer.10.attention.output.LayerNorm.gamma', 'bert.encoder.layer.10.attention.output.LayerNorm.beta', 'bert.encoder.layer.10.output.LayerNorm.gamma', 'bert.encoder.layer.10.output.LayerNorm.beta', 'bert.encoder.layer.11.attention.output.LayerNorm.gamma', 'bert.encoder.layer.11.attention.output.LayerNorm.beta', 'bert.encoder.layer.11.output.LayerNorm.gamma', 'bert.encoder.layer.11.output.LayerNorm.beta']
11/19/2020 14:22:02 - INFO - pytorch_pretrained_bert.modeling - Weights from pretrained model not used in BertModel: ['bert.embeddings.LayerNorm.weight', 'bert.embeddings.LayerNorm.bias', 'bert.encoder.layer.0.attention.output.LayerNorm.weight', 'bert.encoder.layer.0.attention.output.LayerNorm.bias', 'bert.encoder.layer.0.output.LayerNorm.weight', 'bert.encoder.layer.0.output.LayerNorm.bias', 'bert.encoder.layer.1.attention.output.LayerNorm.weight', 'bert.encoder.layer.1.attention.output.LayerNorm.bias', 'bert.encoder.layer.1.output.LayerNorm.weight', 'bert.encoder.layer.1.output.LayerNorm.bias', 'bert.encoder.layer.2.attention.output.LayerNorm.weight', 'bert.encoder.layer.2.attention.output.LayerNorm.bias', 'bert.encoder.layer.2.output.LayerNorm.weight', 'bert.encoder.layer.2.output.LayerNorm.bias', 'bert.encoder.layer.3.attention.output.LayerNorm.weight', 'bert.encoder.layer.3.attention.output.LayerNorm.bias', 'bert.encoder.layer.3.output.LayerNorm.weight', 'bert.encoder.layer.3.output.LayerNorm.bias', 'bert.encoder.layer.4.attention.output.LayerNorm.weight', 'bert.encoder.layer.4.attention.output.LayerNorm.bias', 'bert.encoder.layer.4.output.LayerNorm.weight', 'bert.encoder.layer.4.output.LayerNorm.bias', 'bert.encoder.layer.5.attention.output.LayerNorm.weight', 'bert.encoder.layer.5.attention.output.LayerNorm.bias', 'bert.encoder.layer.5.output.LayerNorm.weight', 'bert.encoder.layer.5.output.LayerNorm.bias', 'bert.encoder.layer.6.attention.output.LayerNorm.weight', 'bert.encoder.layer.6.attention.output.LayerNorm.bias', 'bert.encoder.layer.6.output.LayerNorm.weight', 'bert.encoder.layer.6.output.LayerNorm.bias', 'bert.encoder.layer.7.attention.output.LayerNorm.weight', 'bert.encoder.layer.7.attention.output.LayerNorm.bias', 'bert.encoder.layer.7.output.LayerNorm.weight', 'bert.encoder.layer.7.output.LayerNorm.bias', 'bert.encoder.layer.8.attention.output.LayerNorm.weight', 'bert.encoder.layer.8.attention.output.LayerNorm.bias', 'bert.encoder.layer.8.output.LayerNorm.weight', 'bert.encoder.layer.8.output.LayerNorm.bias', 'bert.encoder.layer.9.attention.output.LayerNorm.weight', 'bert.encoder.layer.9.attention.output.LayerNorm.bias', 'bert.encoder.layer.9.output.LayerNorm.weight', 'bert.encoder.layer.9.output.LayerNorm.bias', 'bert.encoder.layer.10.attention.output.LayerNorm.weight', 'bert.encoder.layer.10.attention.output.LayerNorm.bias', 'bert.encoder.layer.10.output.LayerNorm.weight', 'bert.encoder.layer.10.output.LayerNorm.bias', 'bert.encoder.layer.11.attention.output.LayerNorm.weight', 'bert.encoder.layer.11.attention.output.LayerNorm.bias', 'bert.encoder.layer.11.output.LayerNorm.weight', 'bert.encoder.layer.11.output.LayerNorm.bias']
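
A hedged workaround, assuming (as the log suggests) that the installed pytorch_pretrained_bert build is an older one that still names LayerNorm parameters .gamma/.beta while the checkpoint uses .weight/.bias: rename the checkpoint keys before loading. Upgrading to a newer pytorch-pretrained-BERT or transformers release, which use the .weight/.bias names, should also resolve this. The checkpoint path "ms/pytorch_model.bin" below is an assumption.

# Workaround sketch: map the checkpoint's LayerNorm .weight/.bias keys
# to the .gamma/.beta names expected by older pytorch_pretrained_bert.
import torch
from pytorch_pretrained_bert import BertModel

state_dict = torch.load("ms/pytorch_model.bin", map_location="cpu")
renamed = {}
for key, value in state_dict.items():
    if "LayerNorm" in key:
        key = key.replace(".weight", ".gamma").replace(".bias", ".beta")
    renamed[key] = value

# Pass the renamed weights explicitly instead of letting the library
# read pytorch_model.bin from disk.
model = BertModel.from_pretrained("ms", state_dict=renamed)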

Training details

Could you share details of the training setup, such as hardware, training time, and hyperparameter settings?

How to output word vectors?

After loading this model, how do I output word vectors? Google's model can output 768-dimensional vectors; how do I do the same here?
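
A minimal sketch of extracting 768-dimensional vectors with pytorch-pretrained-BERT (the folder name "ms" and the sample text are assumptions): the last encoder layer yields one 768-dim vector per character-level token, and the pooled output is a single 768-dim vector for the whole sequence.

# Extract 768-dim vectors; the folder name "ms" is assumed from the README.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("ms")
model = BertModel.from_pretrained("ms")
model.eval()

tokens = ["[CLS]"] + tokenizer.tokenize("合同纠纷") + ["[SEP]"]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    last_layer, pooled = model(input_ids, output_all_encoded_layers=False)

print(last_layer.shape)  # [1, seq_len, 768]: one vector per token
print(pooled.shape)      # [1, 768]: pooled [CLS] vector for the sequence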
