❓ Questions and Help Hello, I'm a bit confused a

token_text as outputs about espresso HOT 2 CLOSED

freewym commented on July 16, 2024


Comments (2)

freewym commented on July 16, 2024

Hi,

Maybe you can rename the field token_text to text in the json files, and then remove the args --bpe sentencepiece --sentencepiece-model ${sentencepiece_model}.model passed to train.py. This will take the text as is, without any sentencepiece encoding.
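A minimal sketch of that rename, in case it helps. It assumes only that the data files are plain JSON and that token_text appears as a dictionary key somewhere inside them; the exact layout of your files is not known here, so the helper walks the whole structure. The names rename_key and rename_in_file are made up for this example.

```python
import json
from pathlib import Path


def rename_key(obj, old="token_text", new="text"):
    """Return a copy of a decoded JSON value with every dict key `old` renamed to `new`."""
    if isinstance(obj, dict):
        if old in obj and new in obj:
            # Refuse to silently overwrite an existing field.
            raise ValueError(f"both {old!r} and {new!r} are present in the same entry")
        return {(new if k == old else k): rename_key(v, old, new) for k, v in obj.items()}
    if isinstance(obj, list):
        return [rename_key(v, old, new) for v in obj]
    return obj  # strings, numbers, booleans and None are left untouched


def rename_in_file(path, old="token_text", new="text"):
    """Rewrite one JSON file in place (hypothetical helper; keep a backup first)."""
    path = Path(path)
    data = json.loads(path.read_text(encoding="utf-8"))
    renamed = rename_key(data, old, new)
    path.write_text(json.dumps(renamed, ensure_ascii=False, indent=4), encoding="utf-8")
```

Calling rename_in_file once per json file (train, valid, test) should be enough; the ValueError is there so that an entry which already has a text field is not overwritten by accident.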

Alternatively, I think you can define your own bpe class, similar to https://github.com/freewym/espresso/blob/master/fairseq/data/encoders/sentencepiece_bpe.py, that takes your additional tags into consideration, and pass the arg --bpe <your-bpe-name> to train.py.
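A rough sketch of what such a class could look like, independent of fairseq so that it runs on its own. Everything here is an assumption for illustration: the class name TagAwareBPE, the idea that tags look like <something> with no spaces inside, and the toy character-level encoder standing in for a real sentencepiece model. To be usable with --bpe it would still have to be registered the same way sentencepiece_bpe.py is; that wiring is left out.

```python
import re


class TagAwareBPE:
    """Hypothetical encoder that keeps tags such as <laugh> whole.

    Text outside the tags is passed to `encode_fn` / `decode_fn`, which
    stand in for whatever subword model is actually used.
    """

    # Assumed tag format: angle brackets with no whitespace inside.
    TAG_PATTERN = re.compile(r"(<[^<>\s]+>)")

    def __init__(self, encode_fn, decode_fn):
        self.encode_fn = encode_fn
        self.decode_fn = decode_fn

    def _map_chunks(self, x, fn):
        # re.split with a capturing group keeps the tags in the result list.
        pieces = []
        for chunk in self.TAG_PATTERN.split(x):
            chunk = chunk.strip()
            if not chunk:
                continue
            is_tag = self.TAG_PATTERN.fullmatch(chunk) is not None
            pieces.append(chunk if is_tag else fn(chunk))
        return " ".join(pieces)

    def encode(self, x: str) -> str:
        return self._map_chunks(x, self.encode_fn)

    def decode(self, x: str) -> str:
        return self._map_chunks(x, self.decode_fn)


def toy_encode(text: str) -> str:
    """Toy stand-in for a subword model: one piece per character."""
    return " ".join("\u2581" + text.replace(" ", "\u2581"))


def toy_decode(pieces: str) -> str:
    """Inverse of toy_encode: drop piece separators, restore word boundaries."""
    return pieces.replace(" ", "").replace("\u2581", " ").strip()


bpe = TagAwareBPE(toy_encode, toy_decode)
encoded = bpe.encode("hello <laugh> world")
print(encoded)
print(bpe.decode(encoded))
```

The point of the design is only that tags never reach the subword model, so they end up as single tokens in the encoded string and come back unchanged after decoding.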

edit: a later commit, 4c86e23, does on-the-fly tokenization and removes token_text from the code entirely. If you are using a version after that, I see where your confusion comes from.


valentinp72 commented on July 16, 2024

Thank you,

By renaming token_text to text and removing the --bpe argument, fairseq did not complain, and I managed to train my model.

Now I'm just having trouble when decoding, as there is no difference between the WER and CER metrics (decoded_char_results and decoded_results are the same). I think creating my own tokeniser/bpe class should work; I might try that later. Otherwise, I'll look at the decoding functions and add some custom code.
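For reference, a small standalone sketch of how the two metrics are expected to differ. It does not use or reproduce the espresso decoding code; it only computes a plain edit distance over words for WER and over characters (spaces dropped, which is a choice made here) for CER. If both numbers come out identical on real output, that suggests both results are being scored on the same units.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(ref: str, hyp: str) -> float:
    """Word error rate: units are whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref: str, hyp: str) -> float:
    """Character error rate: units are characters, with spaces removed."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


ref, hyp = "the cat sat", "the cat sit"
print(wer(ref, hyp), cer(ref, hyp))
```

In this example one word out of three is wrong but only one character out of nine, so the two rates differ.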

