Comments (3)

waywaywayw commented on May 14, 2024

Could someone answer this? Without knowing which vocabulary is used, pre-training can't even get started.

from paddlenlp.

songzy12 commented on May 14, 2024

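I just looked into this. Here is what I found:

  1. The vocabulary used by the demo is data/demo_config/vocab.txt [1].
  2. For the example in the screenshot, the plain text before conversion to ids can be found in data/demo_wiki_tokens.txt [2]:
    龙 江 ic ( 平 假 名 : ) 是 位 于 长 野 县 饭 田 市 的 三 远 南 信 自 动 车 道 之 交 流 道 。 现 时 还 未 启 用 。
  3. Decompressing the id-converted file data/train/demo_wiki_train.gz [3] and inspecting its contents shows that:
    after conversion to ids, every sentence starts with 1 and ends with 2. This is probably due to how tokenization is implemented.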
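
The check in step 3 can be reproduced with a short script. This is only a rough sketch, not the repository's own tooling. It assumes that each line of the decompressed file holds several `;`-separated fields with the space-separated token ids first, and that the vocabulary file lists one token per line (optionally followed by a tab and an explicit id), so that a token's id is its line index. Neither assumption is confirmed in this thread, and `load_vocab` and `first_and_last_ids` are hypothetical helper names. The sample data below is synthetic, not taken from demo_wiki_train.gz.

```python
import gzip
import io


def load_vocab(lines):
    """Build an id -> token map.

    Assumption: one token per line, optionally followed by a tab and an
    explicit id; otherwise the id is the zero-based line index.
    """
    id_to_token = {}
    for index, line in enumerate(lines):
        parts = line.rstrip("\n").split("\t")
        if not parts[0]:
            continue  # skip blank lines but keep counting line indices
        token_id = int(parts[1]) if len(parts) > 1 else index
        id_to_token[token_id] = parts[0]
    return id_to_token


def first_and_last_ids(gz_bytes, field_sep=";"):
    """Yield (first_id, last_id) of the token-id field for each line.

    Assumption: fields are separated by `field_sep` and the first field
    holds the space-separated token ids.
    """
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as handle:
        for line in handle:
            token_ids = [int(t) for t in line.strip().split(field_sep)[0].split()]
            if token_ids:
                yield token_ids[0], token_ids[-1]


# Synthetic stand-ins for vocab.txt and the gzipped training file.
vocab = load_vocab(["[PAD]\n", "[CLS]\n", "[SEP]\n", "[MASK]\n"])
sample = gzip.compress(
    "1 7 8 2;0 0 0 0;0 1 2 3;0\n1 9 2;0 0 0;0 1 2;1\n".encode("utf-8")
)
pairs = list(first_and_last_ids(sample))
for first, last in pairs:
    print(first, vocab.get(first), last, vocab.get(last))
```

To run it against the real files, read data/train/demo_wiki_train.gz in binary mode and pass the bytes to `first_and_last_ids`, and pass the lines of data/demo_config/vocab.txt to `load_vocab`. Looking up ids 1 and 2 in the resulting map shows which tokens mark the start and end of each sentence.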

As for the create_train_data.py that was asked for, it should correspond to train.py [4] and tokenization.py [5].

[1] https://github.com/PaddlePaddle/models/blob/release/1.8/PaddleNLP/pretrain_language_models/BERT/data/demo_config/vocab.txt
[2] https://github.com/PaddlePaddle/models/blob/release/1.8/PaddleNLP/pretrain_language_models/BERT/data/demo_wiki_tokens.txt
[3] https://github.com/PaddlePaddle/models/blob/release/1.8/PaddleNLP/pretrain_language_models/BERT/data/train/demo_wiki_train.gz
[4] https://github.com/PaddlePaddle/models/blob/release/1.8/PaddleNLP/pretrain_language_models/BERT/train.py
[5] https://github.com/PaddlePaddle/models/blob/release/1.8/PaddleNLP/pretrain_language_models/BERT/tokenization.py

ZeyuChen commented on May 14, 2024

https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/bert
