twairball / fairseq-zh-en

NMT for chinese-english using fairseq

Languages: Python 0.05%, Jupyter Notebook 99.94%, Shell 0.01%

fairseq-zh-en's Issues

fairseq command not found

Besides the "fairseq: command not found" error, I am hitting numerous "No such file or directory" errors. I have already installed https://github.com/pytorch/fairseq

Could anyone advise?

[phung@archlinux fairseq-zh-en]$ ls
challenger.md data-bin 'merge blanks.ipynb' nltk_data README.md tmp wmt17_generate.sh wmt17_train.sh
data 'Dataset misaligned.ipynb' mosesdecoder preprocess subword-nmt trainings wmt17_prepare.sh
[phung@archlinux fairseq-zh-en]$ sh ./wmt17_prepare.sh
Building prefix dict from the default dictionary ...
DEBUG:jieba:Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
DEBUG:jieba:Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.834 seconds.
DEBUG:jieba:Loading model cost 0.834 seconds.
Prefix dict has been built succesfully.
DEBUG:jieba:Prefix dict has been built succesfully.
INFO:prepare:tokenizing: tmp/wmt17_en_zh/training/news-commentary-v12.zh-en.en
INFO:tokenizer: [0] nltk.word_tokenize: 1929 or 1989?

Traceback (most recent call last):
File "./preprocess/wmt.py", line 58, in <module>
prepare.prepare_dataset(DATA_DIR, TMP_DIR, ds)
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/fairseq/fairseq-zh-en/preprocess/prepare.py", line 79, in prepare_dataset
tokenized = tokenizer.tokenize_file(tmp_filepath)
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/fairseq/fairseq-zh-en/preprocess/tokenizer.py", line 60, in tokenize_file
_tokenized = tokenize(line, is_sgm, is_zh, lower_case, delim)
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/fairseq/fairseq-zh-en/preprocess/tokenizer.py", line 40, in tokenize
_tok = jieba.cut(_line.rstrip('\r\n')) if is_zh else nltk.word_tokenize(_line)
File "/usr/lib/python3.6/site-packages/nltk/tokenize/__init__.py", line 128, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
File "/usr/lib/python3.6/site-packages/nltk/tokenize/__init__.py", line 94, in sent_tokenize
tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
File "/usr/lib/python3.6/site-packages/nltk/data.py", line 836, in load
opened_resource = _open(resource_url)
File "/usr/lib/python3.6/site-packages/nltk/data.py", line 954, in open
return find(path, path + ['']).open()
File "/usr/lib/python3.6/site-packages/nltk/data.py", line 675, in find
raise LookupError(resource_not_found)
LookupError:

Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

import nltk
nltk.download('punkt')

Searched in:
- '/home/phung/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/usr/nltk_data'
- '/usr/share/nltk_data'
- '/usr/lib/nltk_data'
- ''

./wmt17_prepare.sh: line 12: ../mosesdecoder/scripts/training/clean-corpus-n.perl: No such file or directory
./wmt17_prepare.sh: line 13: ../mosesdecoder/scripts/training/clean-corpus-n.perl: No such file or directory
./wmt17_prepare.sh: line 14: ../mosesdecoder/scripts/training/clean-corpus-n.perl: No such file or directory
Encoding subword with BPE using ops=32000
./wmt17_prepare.sh: line 23: data/wmt17_en_zh/train.clean.en: No such file or directory
./wmt17_prepare.sh: line 24: data/wmt17_en_zh/train.clean.zh: No such file or directory
Applying vocab to training
./wmt17_prepare.sh: line 27: data/wmt17_en_zh/train.clean.en: No such file or directory
./wmt17_prepare.sh: line 28: data/wmt17_en_zh/train.clean.zh: No such file or directory
Generating vocab: vocab.32000.bpe.en
./wmt17_prepare.sh: line 32: ../subword-nmt/get_vocab.py: No such file or directory
cat: data/wmt17_en_zh/train.32000.bpe.en: No such file or directory
Generating vocab: vocab.32000.bpe.zh
./wmt17_prepare.sh: line 35: ../subword-nmt/get_vocab.py: No such file or directory
cat: data/wmt17_en_zh/train.32000.bpe.zh: No such file or directory
Applying vocab to valid
./wmt17_prepare.sh: line 39: data/wmt17_en_zh/valid.clean.en: No such file or directory
./wmt17_prepare.sh: line 40: data/wmt17_en_zh/valid.clean.zh: No such file or directory
Applying vocab to test
./wmt17_prepare.sh: line 44: data/wmt17_en_zh/test.clean.en: No such file or directory
./wmt17_prepare.sh: line 45: data/wmt17_en_zh/test.clean.zh: No such file or directory
Preprocessing datasets...
./wmt17_prepare.sh: line 52: fairseq: command not found
[phung@archlinux fairseq-zh-en]$
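
The log above points at three separate gaps. The NLTK punkt resource is missing, and the traceback itself names the fix, nltk.download('punkt'). The script appears to look for ../mosesdecoder and ../subword-nmt one directory above the working directory, although the ls output shows both inside it. Finally, no fairseq executable is on PATH. A minimal preflight sketch, assuming only the paths and the command name that appear in the log; the function name missing_prerequisites is hypothetical and not part of the repository:

```python
import os
import shutil


def missing_prerequisites(base_dir, relative_paths, commands):
    """Return the files and commands that cannot be found.

    base_dir: directory the relative paths are resolved against.
    relative_paths: paths the script expects, e.g. taken from its error log.
    commands: executable names expected on PATH.
    """
    missing = []
    for rel in relative_paths:
        full = os.path.normpath(os.path.join(base_dir, rel))
        if not os.path.exists(full):
            missing.append(full)
    for cmd in commands:
        if shutil.which(cmd) is None:
            missing.append(cmd)
    return missing


if __name__ == "__main__":
    # Paths and command copied from the error log; adjust to your checkout.
    problems = missing_prerequisites(
        os.getcwd(),
        [
            "../mosesdecoder/scripts/training/clean-corpus-n.perl",
            "../subword-nmt/get_vocab.py",
        ],
        ["fairseq"],
    )
    for item in problems:
        print("missing:", item)
```

Running this from the directory where wmt17_prepare.sh is invoked lists whatever the script will fail to find, before the long tokenization step starts.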

jieba tokenizer

Hello! Appreciate your work on this.

In preprocess/process.py, you mention using Jieba to tokenize the Chinese (zh) text, but I don't see it implemented there. Could you clarify?

Pretrained model cannot be loaded?

[root@localhost fairseq-zh-en]# ./wmt17_generate.sh
optimizing fconv for decoding
decoding to tmp/wmt17_en_zh/fconv_test
/root/torch/install/bin/luajit: .../install/share/lua/5.1/fairseq/models/ensemble_model.lua:134: inconsistent tensor size, expected r_ [10 x 33859], t [10 x 33859] and src [10 x 20490] to have the same number of elements, but got 338590, 338590 and 204900 elements respectively at /root/torch/pkg/torch/lib/TH/generic/THTensorMath.c:887
stack traceback:
[C]: in function 'add'
.../install/share/lua/5.1/fairseq/models/ensemble_model.lua:134: in function 'generate'
...torch/install/share/lua/5.1/fairseq/scripts/generate.lua:213: in main chunk
[C]: in function 'require'
...install/lib/luarocks/rocks/fairseq-cpu/scm-1/bin/fairseq:17: in main chunk
[C]: at 0x004064f0
| [zh] Dictionary: 33859 types
| [en] Dictionary: 29243 types
| IndexedDataset: loaded data-bin/wmt17_en_zh with 2000 examples
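
The element counts in the error are consistent with a vocabulary-size mismatch: 10 x 33859 = 338590 matches the "[zh] Dictionary: 33859 types" line, while 10 x 20490 = 204900 comes from a tensor with only 20490 columns. One possible reading, which the log alone does not confirm, is that the pretrained model was built with a different dictionary than the one in data-bin/wmt17_en_zh. The sketch below only reproduces the arithmetic; element_count and same_shape are hypothetical helper names:

```python
from math import prod


def element_count(shape):
    """Number of elements in a tensor with the given shape."""
    return prod(shape)


def same_shape(*shapes):
    """True when all shapes are identical."""
    return all(s == shapes[0] for s in shapes)


# Shapes copied from the error message: r_, t and src.
r_shape, t_shape, src_shape = (10, 33859), (10, 33859), (10, 20490)

print(element_count(r_shape), element_count(t_shape), element_count(src_shape))
# prints: 338590 338590 204900
print(same_shape(r_shape, t_shape, src_shape))
# prints: False
```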

wmt17_prepare.sh takes a long time but does not produce the processed dataset

Thanks for sharing. I have a problem when processing the data to get the BPE data: I ran wmt17_prepare.sh, but something goes wrong with the test dataset. Could you give me some advice?

Training Dataset

Hi, did you get the reported result by training only on the News Commentary v12 dataset (0.2 million pairs)? I ask because your preprocessing script only downloads the news dataset. However, I cannot reproduce your result, not even close.

Could you please provide more description of the dataset you used for training?

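May I ask what hardware the training times in the table at the top of the README correspond to?

I am new to this area and have no feel for the machine resources required, so I would appreciate some guidance.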

Error during tokenization

In news-commentary-v12.zh-en.en, around line 98000, there is a passage of other text with a different encoding, which raises: UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 21: ordinal not in range(128)
How can this be resolved?
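
This kind of error usually means a string containing a non-ASCII character (here U+2013, an en dash) is being encoded with the ascii codec, which can happen when a file or stream is opened without an explicit encoding under an ASCII locale. Whether that is the cause in this repository is an assumption. The sketch below only shows the failure and the common remedy of passing encoding="utf-8" when opening files; the sample text and file name are made up for illustration:

```python
import os
import tempfile

text = "1929\u20131989"  # sample line with an en dash (U+2013); illustrative only

# Encoding with the ascii codec reproduces the error class from the report.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Opening files with an explicit UTF-8 encoding avoids depending on the locale.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.en")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text + "\n")
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read().rstrip("\n")

print(ascii_ok, round_trip == text)
# prints: False True
```

If the error comes from print rather than from file writes, setting the environment variable PYTHONIOENCODING=utf-8 before running the script is another commonly used workaround.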

Could you share the pretrained model?

Could you share the pretrained model? Thanks.

fairseq_py

Does this support fairseq_py?
