Comments (3)
Could someone answer this? Without knowing which vocabulary is used, pre-training is a non-starter.
from paddlenlp.
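I looked into this just now. Here is what I found:
- The demo uses the vocabulary data/demo_config/vocab.txt [1].
- For the example in the screenshot, the plain text before conversion to ids can be found in data/demo_wiki_tokens.txt [2]:
龙 江 ic ( 平 假 名 : ) 是 位 于 长 野 县 饭 田 市 的 三 远 南 信 自 动 车 道 之 交 流 道 。 现 时 还 未 启 用 。
- Decompressing the id-converted file data/train/demo_wiki_train.gz [3] and inspecting its contents shows that, after conversion, every sentence starts with 1 and ends with 2. This is probably a consequence of how tokenization is implemented.
As for the create_train_data.py that was asked about, it should correspond to train.py [4] and tokenization.py [5].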
[1] https://github.com/PaddlePaddle/models/blob/release/1.8/PaddleNLP/pretrain_language_models/BERT/data/demo_config/vocab.txt
[2] https://github.com/PaddlePaddle/models/blob/release/1.8/PaddleNLP/pretrain_language_models/BERT/data/demo_wiki_tokens.txt
[3] https://github.com/PaddlePaddle/models/blob/release/1.8/PaddleNLP/pretrain_language_models/BERT/data/train/demo_wiki_train.gz
[4] https://github.com/PaddlePaddle/models/blob/release/1.8/PaddleNLP/pretrain_language_models/BERT/train.py
[5] https://github.com/PaddlePaddle/models/blob/release/1.8/PaddleNLP/pretrain_language_models/BERT/tokenization.py
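To check the observations above yourself, a small script along these lines may help. It is only a sketch under several assumptions that I have not verified against the repository: that vocab.txt holds one token per line (optionally followed by a tab and an explicit id), that each line of the decompressed training file holds several `;`-separated fields, and that the first field is the space-separated token ids. The function names `load_vocab`, `decode_ids`, and `iter_token_ids` are made up for this illustration and are not part of the PaddleNLP code.

```python
import gzip


def load_vocab(path):
    """Build an id -> token map from a vocab file.

    Assumption: one token per line; if a line has a tab-separated second
    column it is used as the id, otherwise the 0-based line index is used.
    """
    id_to_token = {}
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f):
            parts = line.rstrip("\n").split("\t")
            token = parts[0].strip()
            idx = int(parts[1]) if len(parts) == 2 else line_no
            id_to_token[idx] = token
    return id_to_token


def decode_ids(ids, id_to_token, unk="[UNK]"):
    """Map a list of integer ids back to tokens; unknown ids become `unk`."""
    return [id_to_token.get(i, unk) for i in ids]


def iter_token_ids(gz_path, field_sep=";"):
    """Yield the token ids of each non-empty line in a gzip-compressed file.

    Assumption: fields are separated by `field_sep` and the first field
    contains the space-separated token ids.
    """
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                first_field = line.split(field_sep)[0]
                yield [int(x) for x in first_field.split()]
```

Calling `load_vocab("data/demo_config/vocab.txt")` and then `decode_ids(next(iter_token_ids("data/train/demo_wiki_train.gz")), vocab)` should show which tokens ids 1 and 2 stand for. My guess is that they are sentence-boundary special tokens, but that is something to confirm from the output rather than take on trust.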
https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/bert