access's Issues

python scripts/train.py

Sorry to ask for your help again.

  1. python scripts/evaluate.py: OK

  2. python scripts/generate.py < my_file.complex: OK

  3. python scripts/train.py: fails with the following error:
    Training a model from scratch
    method_name='fairseq_train_and_evaluate'
    args=()
    kwargs={'arch': 'transformer', 'warmup_updates': 4000, 'parametrization_budget': 256, 'beam': 8, 'dataset': 'wikilarge', 'dropout': 0.2, 'fp16': False, 'label_smoothing': 0.54, 'lr': 0.00011, 'lr_scheduler': 'fixed', 'max_epoch': 100, 'max_tokens': 5000, 'metrics_coefs': [0, 1, 0], 'optimizer': 'adam', 'preprocessors_kwargs': {'LengthRatioPreprocessor': {'target_ratio': 0.8}, 'LevenshteinPreprocessor': {'target_ratio': 0.8}, 'WordRankRatioPreprocessor': {'target_ratio': 0.8}, 'DependencyTreeDepthRatioPreprocessor': {'target_ratio': 0.8}, 'SentencePiecePreprocessor': {'vocab_size': 10000}}}
    Creating /home/qwh/桌面/access/resources/datasets/wikilarge/fairseq_preprocessed...
    usage: train.py [-h] [--no-progress-bar] [--log-interval N]
    [--log-format {json,none,simple,tqdm}]
    [--tensorboard-logdir DIR] [--seed N] [--cpu] [--fp16]
    [--memory-efficient-fp16] [--fp16-init-scale FP16_INIT_SCALE]
    [--fp16-scale-window FP16_SCALE_WINDOW]
    [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
    [--min-loss-scale D]
    [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
    [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ]
    [--criterion {sentence_prediction,binary_cross_entropy,cross_entropy,sentence_ranking,legacy_masked_lm_loss,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,composite_loss,adaptive_loss,masked_lm,nat_loss}]
    [--tokenizer {moses,nltk,space}]
    [--bpe {gpt2,sentencepiece,bert,subword_nmt,fastbpe}]
    [--optimizer {nag,adam,adafactor,adamax,sgd,adadelta,adagrad}]
    [--lr-scheduler {cosine,polynomial_decay,triangular,inverse_sqrt,tri_stage,reduce_lr_on_plateau,fixed}]
    [--task TASK] [-s SRC] [-t TARGET] [--trainpref FP]
    [--validpref FP] [--testpref FP] [--align-suffix FP]
    [--destdir DIR] [--thresholdtgt N] [--thresholdsrc N]
    [--tgtdict FP] [--srcdict FP] [--nwordstgt N] [--nwordssrc N]
    [--alignfile ALIGN] [--dataset-impl FORMAT]
    [--joined-dictionary] [--only-source] [--padding-factor N]
    [--workers N]
    train.py: error: unrecognized arguments: --output-format raw
    Error: Rolling back creation of directory /home/qwh/桌面/access/resources/datasets/wikilarge/fairseq_preprocessed

  4. I feel the problem is here, in access/fairseq/base.py, so I deleted '--output-format', 'raw':
    def fairseq_preprocess(dataset):
        dataset_dir = get_dataset_dir(dataset)
        with lock_directory(dataset_dir):
            preprocessed_dir = dataset_dir / 'fairseq_preprocessed'
            with create_directory_or_skip(preprocessed_dir):
                preprocessing_parser = options.get_preprocessing_parser()
                preprocess_args = preprocessing_parser.parse_args([
                    '--source-lang',
                    'complex',
                    '--target-lang',
                    'simple',
                    '--trainpref',
                    os.path.join(dataset_dir, f'{dataset}.train'),
                    '--validpref',
                    os.path.join(dataset_dir, f'{dataset}.valid'),
                    '--testpref',
                    os.path.join(dataset_dir, f'{dataset}.test'),
                    '--destdir',
                    str(preprocessed_dir),
                    '--output-format',
                    'raw',
                ])
                preprocess.main(preprocess_args)
        return preprocessed_dir

  5. Then I ran python scripts/train.py again:
    Training a model from scratch
    method_name='fairseq_train_and_evaluate'
    args=()
    kwargs={'arch': 'transformer', 'warmup_updates': 4000, 'parametrization_budget': 256, 'beam': 8, 'dataset': 'wikilarge', 'dropout': 0.2, 'fp16': False, 'label_smoothing': 0.54, 'lr': 0.00011, 'lr_scheduler': 'fixed', 'max_epoch': 100, 'max_tokens': 5000, 'metrics_coefs': [0, 1, 0], 'optimizer': 'adam', 'preprocessors_kwargs': {'LengthRatioPreprocessor': {'target_ratio': 0.8}, 'LevenshteinPreprocessor': {'target_ratio': 0.8}, 'WordRankRatioPreprocessor': {'target_ratio': 0.8}, 'DependencyTreeDepthRatioPreprocessor': {'target_ratio': 0.8}, 'SentencePiecePreprocessor': {'vocab_size': 10000}}}
    usage: train.py [-h] [--no-progress-bar] [--log-interval N]
    [--log-format {json,none,simple,tqdm}]
    [--tensorboard-logdir DIR] [--seed N] [--cpu] [--fp16]
    [--memory-efficient-fp16] [--fp16-init-scale FP16_INIT_SCALE]
    [--fp16-scale-window FP16_SCALE_WINDOW]
    [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
    [--min-loss-scale D]
    [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
    [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ]
    [--criterion {sentence_prediction,binary_cross_entropy,cross_entropy,sentence_ranking,legacy_masked_lm_loss,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,composite_loss,adaptive_loss,masked_lm,nat_loss}]
    [--tokenizer {moses,nltk,space}]
    [--bpe {gpt2,sentencepiece,bert,subword_nmt,fastbpe}]
    [--optimizer {nag,adam,adafactor,adamax,sgd,adadelta,adagrad}]
    [--lr-scheduler {cosine,polynomial_decay,triangular,inverse_sqrt,tri_stage,reduce_lr_on_plateau,fixed}]
    [--task TASK] [--num-workers N]
    [--skip-invalid-size-inputs-valid-test] [--max-tokens N]
    [--max-sentences N] [--required-batch-size-multiple N]
    [--dataset-impl FORMAT] [--train-subset SPLIT]
    [--valid-subset SPLIT] [--validate-interval N]
    [--fixed-validation-seed N] [--disable-validation]
    [--max-tokens-valid N] [--max-sentences-valid N]
    [--curriculum N] [--distributed-world-size N]
    [--distributed-rank DISTRIBUTED_RANK]
    [--distributed-backend DISTRIBUTED_BACKEND]
    [--distributed-init-method DISTRIBUTED_INIT_METHOD]
    [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID]
    [--distributed-no-spawn] [--ddp-backend {c10d,no_c10d}]
    [--bucket-cap-mb MB] [--fix-batches-to-gpus]
    [--find-unused-parameters] [--fast-stat-sync] --arch ARCH
    [--max-epoch N] [--max-update N] [--clip-norm NORM]
    [--sentence-avg] [--update-freq N1,N2,...,N_K]
    [--lr LR_1,LR_2,...,LR_N] [--min-lr LR] [--use-bmuf]
    [--save-dir DIR] [--restore-file RESTORE_FILE]
    [--reset-dataloader] [--reset-lr-scheduler] [--reset-meters]
    [--reset-optimizer] [--optimizer-overrides DICT]
    [--save-interval N] [--save-interval-updates N]
    [--keep-interval-updates N] [--keep-last-epochs N] [--no-save]
    [--no-epoch-checkpoints] [--no-last-checkpoints]
    [--no-save-optimizer-state]
    [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
    [--maximize-best-checkpoint-metric]
    [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
    [--dropout D] [--attention-dropout D] [--activation-dropout D]
    [--encoder-embed-path STR] [--encoder-embed-dim N]
    [--encoder-ffn-embed-dim N] [--encoder-layers N]
    [--encoder-attention-heads N] [--encoder-normalize-before]
    [--encoder-learned-pos] [--decoder-embed-path STR]
    [--decoder-embed-dim N] [--decoder-ffn-embed-dim N]
    [--decoder-layers N] [--decoder-attention-heads N]
    [--decoder-learned-pos] [--decoder-normalize-before]
    [--share-decoder-input-output-embed] [--share-all-embeddings]
    [--no-token-positional-embeddings]
    [--adaptive-softmax-cutoff EXPR]
    [--adaptive-softmax-dropout D] [--no-cross-attention]
    [--cross-self-attention] [--layer-wise-attention]
    [--encoder-layerdrop D] [--decoder-layerdrop D]
    [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP]
    [--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP]
    [--layernorm-embedding] [--no-scale-embedding]
    [--label-smoothing D] [--adam-betas B] [--adam-eps D]
    [--weight-decay WD] [--force-anneal N] [--lr-shrink LS]
    [--warmup-updates N] [-s SRC] [-t TARGET] [--lazy-load]
    [--raw-text] [--load-alignments] [--left-pad-source BOOL]
    [--left-pad-target BOOL] [--max-source-positions N]
    [--max-target-positions N]
    [--upsample-primary UPSAMPLE_PRIMARY] [--truncate-source]
    data
    train.py: error: unrecognized arguments: --validations-before-sari-early-stopping 10

  6. I debugged it and found that it stops at access/fairseq/base.py, line 172.

  7. I also ran pip install fairseq@git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification, but the problem is still there (see the sanity check sketched below).
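
For what it is worth, both unrecognized flags (--output-format raw and --validations-before-sari-early-stopping) only exist in the controllable-sentence-simplification fork of fairseq, so errors like these usually mean a stock fairseq install is shadowing the fork. A minimal sanity check (my own sketch, not repository code):

    import fairseq
    from fairseq import options

    # Should point into the forked install, not the stock PyPI package.
    print(fairseq.__file__)

    # The fairseq build that access expects accepts --output-format; with a
    # stock build the flag comes back as unknown.
    parser = options.get_preprocessing_parser()
    _, unknown = parser.parse_known_args(['--output-format', 'raw'])
    print('unrecognized:', unknown)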

The evaluate.py script fails: PyTorch version issue

See the following stack trace upon running python scripts/evaluate.py:

Traceback (most recent call last):
  File "scripts/evaluate.py", line 27, in <module>
    print(evaluate_simplifier_on_turkcorpus(simplifier, phase='test'))
  File "/content/access/access/evaluation/general.py", line 25, in evaluate_simplifier_on_turkcorpus
    pred_filepath = get_prediction_on_turkcorpus(simplifier, phase)
  File "/content/access/access/evaluation/general.py", line 20, in get_prediction_on_turkcorpus
    simplifier(source_filepath, pred_filepath)
  File "/content/access/access/simplifiers.py", line 32, in wrapped
    simplifier(complex_filepath, pred_filepath)
  File "/content/access/access/simplifiers.py", line 66, in preprocessed_simplifier
    simplifier(preprocessed_complex_filepath, preprocessed_output_pred_filepath)
  File "/content/access/access/simplifiers.py", line 32, in wrapped
    simplifier(complex_filepath, pred_filepath)
  File "/content/access/access/simplifiers.py", line 46, in fairseq_simplifier
    fairseq_generate(complex_filepath, output_pred_filepath, exp_dir, **kwargs)
  File "/content/access/access/fairseq/base.py", line 283, in fairseq_generate
    batch_size=batch_size)
  File "/content/access/access/fairseq/base.py", line 238, in _fairseq_generate
    generate.main(generate_args)
  File "/usr/local/lib/python3.6/dist-packages/fairseq_cli/generate.py", line 106, in main
    hypos = task.inference_step(generator, models, sample, prefix_tokens)
  File "/usr/local/lib/python3.6/dist-packages/fairseq/tasks/fairseq_task.py", line 242, in inference_step
    return generator.generate(models, sample, prefix_tokens=prefix_tokens)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/fairseq/sequence_generator.py", line 378, in generate
    scores.view(bsz, beam_size, -1)[:, :, :step],
  File "/usr/local/lib/python3.6/dist-packages/fairseq/search.py", line 83, in step
    torch.div(self.indices_buf, vocab_size, out=self.beams_buf)
RuntimeError: Integer division of tensors using div or / is no longer supported, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.

To reproduce, run python scripts/evaluate.py
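
This RuntimeError is the integer tensor division that newer PyTorch releases removed. Pinning torch to the version listed in requirements.txt should avoid it; alternatively, a local workaround (a sketch based on the traceback above, not an upstream patch) is to switch the offending line in fairseq/search.py to explicit floor division:

    # In fairseq/search.py, at the line shown in the traceback, replace
    #     torch.div(self.indices_buf, vocab_size, out=self.beams_buf)
    # with
    torch.floor_divide(self.indices_buf, vocab_size, out=self.beams_buf)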

Speed of the simplification

I am using this code to simplify sentence by sentence:

def paraphrase():
    source_filepath = get_temp_filepath()
    pred_filepath = get_temp_filepath()
    data = request.get_data().decode()

    write_lines([word_tokenize(data)], source_filepath)
    start = time.time()
    simplifier(source_filepath, pred_filepath)
    print(time.time() - start)
    for line in yield_lines(pred_filepath):
        output = line
    os.remove(source_filepath)
    os.remove(pred_filepath)
    return output

The transformation time ranges from 0.8 to 2.0 seconds. Is it possible to speed this up? I need the response time to be below 0.3 seconds.

generate.py opens too many temporary files

Is there a way to stop generate.py from creating so many temporary files? This is an issue because it quickly eats up memory in the /tmp directory. Thanks!
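
One generic workaround (my own sketch, assuming the repository's temp-file helpers go through Python's tempfile module) is to redirect temporary files into a scratch directory for the duration of the run and delete it in one sweep afterwards:

    import shutil
    import tempfile

    # Collect every temp file created during generation in one scratch directory.
    scratch = tempfile.mkdtemp(prefix='access_generate_')
    old_tempdir = tempfile.tempdir
    tempfile.tempdir = scratch
    try:
        pass  # run the generation code here
    finally:
        tempfile.tempdir = old_tempdir
        shutil.rmtree(scratch, ignore_errors=True)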

Reproduce the result in paper

I cloned the submission branch.
Installed all the libraries in requirements.txt.
Then I ran scripts/train.py without changing anything.
The validation SARI stays around 39+ and does not reach 41+.

Could you tell me what I am missing?

Using ACCESS as a streaming service?

I implemented your model for sentence simplification and would like to test it with streams of news titles and see the output. However, I have problems when using it as an API, since for each title it generates input, output and some temp files. I was wondering whether it is possible to use it as a service, i.e. the input would be a (UTF-8 encoded) string and the model would return a simplified string.
Any help on this?
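
A minimal string-in/string-out wrapper can be built on top of the file-based simplifier. The sketch below reuses the helper names that appear in the snippets in this thread (get_temp_filepath, write_lines, yield_lines), assumed to live in access.utils.helpers; treat it as an illustration rather than an official API:

    import os

    from access.utils.helpers import get_temp_filepath, write_lines, yield_lines

    def simplify_string(simplifier, text):
        # Write the input string to a temp file, run the file-based simplifier,
        # read the single output line back, and clean up both temp files.
        source_filepath = get_temp_filepath()
        pred_filepath = get_temp_filepath()
        try:
            write_lines([text], source_filepath)
            simplifier(source_filepath, pred_filepath)
            return next(yield_lines(pred_filepath))
        finally:
            for path in (source_filepath, pred_filepath):
                if os.path.exists(path):
                    os.remove(path)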

Reversing the results

Hi,

Thanks for releasing access, it's quite useful. I was wondering if you could tell me how to reverse the outputs, meaning that if I give a simple sentence as input, it should paraphrase it as a complex sentence. Looking forward to your reply.

Regards

Ajay Sahu

Met with a problem when training (python ./scripts/train.py)

Hi:

I want to train a model myself, following your default settings in train.py, but I ran into a problem I couldn't solve.

I ran train.py and the exception is as follows:

Training a model from scratch
Extracting...
method_name='fairseq_train_and_evaluate'
args=()
kwargs={'arch': 'transformer', 'warmup_updates': 4000, 'parametrization_budget': 256, 'beam': 8, 'dataset': 'wikilarge', 'dropout': 0.2, 'fp16': False, 'label_smoothing': 0.54, 'lr': 0.00011, 'lr_scheduler': 'fixed', 'max_epoch': 100, 'max_tokens': 5000, 'metrics_coefs': [0, 1, 0], 'optimizer': 'adam', 'preprocessors_kwargs': {'LengthRatioPreprocessor': {'target_ratio': 0.8}, 'LevenshteinPreprocessor': {'target_ratio': 0.8}, 'WordRankRatioPreprocessor': {'target_ratio': 0.8}, 'DependencyTreeDepthRatioPreprocessor': {'target_ratio': 0.8}, 'SentencePiecePreprocessor': {'vocab_size': 10000}}}
Creating /home/access-main/resources/datasets/_f56b9888a6a6550a1d060813416e5298...
Creating preprocessed dataset with LengthRatioPreprocessor(target_ratio=0.8): wikilarge -> _f56b9888a6a6550a1d060813416e5298
Creating /home/access-main/resources/datasets/_d6002a05838f1e5b3a3fc0d98c9fa7bd...
Creating preprocessed dataset with LevenshteinPreprocessor(bucket_size=0.05, noise_std=0, target_ratio=0.8): _f56b9888a6a6550a1d060813416e5298 -> _d6002a05838f1e5b3a3fc0d98c9fa7bd
Creating /home/access-main/resources/datasets/_e382828c4d4db04ef23094dbd9e38f9c...
Creating preprocessed dataset with WordRankRatioPreprocessor(target_ratio=0.8): _d6002a05838f1e5b3a3fc0d98c9fa7bd -> _e382828c4d4db04ef23094dbd9e38f9c
Error: Rolling back creation of directory /home/access-main/resources/datasets/_e382828c4d4db04ef23094dbd9e38f9c
Traceback (most recent call last):
  File "./scripts/train.py", line 49, in <module>
    fairseq_train_and_evaluate(**kwargs)
  File "/home/access-main/access/utils/training.py", line 18, in wrapped_func
    return func(*args, **kwargs)
  File "/home/access-main/access/utils/training.py", line 29, in wrapped_func
    return func(*args, **kwargs)
  File "/home/access-main/access/utils/training.py", line 38, in wrapped_func
    result = func(*args, **kwargs)
  File "/home/access-main/access/utils/training.py", line 50, in wrapped_func
    result = func(*args, **kwargs)
  File "/home/access-main/access/fairseq/main.py", line 117, in fairseq_train_and_evaluate
    dataset = create_preprocessed_dataset(dataset, preprocessors, n_jobs=1)
  File "/home/access-main/access/resources/datasets.py", line 72, in create_preprocessed_dataset
    dataset = create_preprocessed_dataset_one_preprocessor(dataset, preprocessor, n_jobs)
  File "/home/access-main/access/resources/datasets.py", line 55, in create_preprocessed_dataset_one_preprocessor
    new_filepaths_dict[phase, 'complex'], new_filepaths_dict[phase, 'simple'])
  File "/home/access-main/access/preprocessors.py", line 144, in encode_file_pair
    output_files.write(self.encode_sentence_pair(complex_line, simple_line))
  File "/home/access-main/access/preprocessors.py", line 244, in encode_sentence_pair
    remove_special_tokens(simple_sentence))))
  File "/home/access-main/access/preprocessors.py", line 277, in get_feature_value
    return min(safe_division(self.feature_extractor(simple_sentence), self.feature_extractor(complex_sentence)), 2)
  File "/home/access-main/access/feature_extraction.py", line 44, in get_lexical_complexity_score
    words = [word for word in words if word in get_word2rank()]
  File "/home/access-main/access/feature_extraction.py", line 44, in <listcomp>
    words = [word for word in words if word in get_word2rank()]
  File "/home/access-main/access/feature_extraction.py", line 25, in get_word2rank
    next(line_generator)  # Skip the first line (header)
  File "/home/access-main/access/utils/helpers.py", line 77, in yield_lines
    with open(filepath, 'r') as f:
IsADirectoryError: [Errno 21] Is a directory: '/home/access-main/resources/various/fasttext-vectors/wiki.en.vec'

Looking forward to your reply

Best Regards
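
The IsADirectoryError suggests that resources/various/fasttext-vectors/wiki.en.vec ended up as a directory rather than the vectors file, which typically happens when the fastText archive is downloaded or extracted into the wrong place. A quick diagnostic (my own sketch, path taken from the traceback above):

    from pathlib import Path

    vec_path = Path('resources/various/fasttext-vectors/wiki.en.vec')
    if vec_path.is_dir():
        # The real wiki.en.vec is probably nested one level deeper, or the
        # download was interrupted: either move the file here or delete the
        # directory so it can be re-downloaded.
        print('wiki.en.vec is a directory containing:', sorted(p.name for p in vec_path.iterdir()))
    else:
        print('wiki.en.vec is a regular file of', vec_path.stat().st_size, 'bytes')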

Dependency Version Conflict

Hi,

An issue occurred during installation of the requirements. This install command fails with the following error:

pip3 install -e .

ERROR: Cannot install access and access==0.1 because these package versions have conflicting dependencies.

The conflict is caused by:
    access 0.1 depends on nltk==3.4.5
    easse 0.2.1 depends on nltk==3.4.3

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

Removing the version pin from nltk seems to resolve this issue. The new requirements.txt file looks like this:

    dill==0.3.0
    GitPython==3.1.0
    imohash==1.0.4
    joblib==0.13.2
    nevergrad==0.2.3
    nltk
    numpy==1.17.2
    pandas==0.25.1
    sentencepiece==0.1.83
    spacy==2.1.3
    tabulate==0.8.4
    torch==1.2.0
    tqdm==4.36.1
    easse@git+git://github.com/feralvam/easse.git@090855e73dee5e26ea0cda01d4aa4f51044d9af9
    fairseq@git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification

Please let me know whether or not you have an alternative solution.

Double tokenization

Thanks a lot for your paper.
I have a question regarding your evaluation script:
Why do you tokenize all the data while calculating the SARI score if it's already tokenized?

    return evaluate_system_output(f'turkcorpus_{phase}_legacy',
                                  sys_sents_path=pred_filepath,
                                  metrics='bleu,sari_legacy,fkgl',
                                  quality_estimation=True)

If the tokenizer is not specified, easse uses 13a underneath, but the data (including the one in turkcorpus_test_legacy) seems to be already tokenized.

A question on Scripts/generate.py

Hello!
I tried to use generate.py for sentence simplification. I used the command "python scripts/generate.py < my_file.complex", but I could neither find any generated files nor see any output.
my_file.complex just contains original sentences, one per line.
Thank you for your reply!

fairseq_simplifier in Mem

I am currently using the generate.py script to paraphrase sentences in a given input file. However, I was wondering whether there is a way to use the code (fairseq_simplifier) in an API fashion, i.e. sentence by sentence, with the best model kept in RAM so that the output is faster.
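
For the model-in-RAM part, mainline fairseq exposes a hub interface that loads a checkpoint once and translates strings in memory. Whether it works with the pinned fairseq fork and the ACCESS checkpoints is untested here, so treat the following as an assumption-laden sketch: the paths are placeholders, and the input still has to be run through the same preprocessors (control tokens plus SentencePiece) that were applied at training time.

    from fairseq.models.transformer import TransformerModel

    # Sketch only: load the trained checkpoint once and keep it resident.
    # Both paths are placeholders for your experiment directory and the
    # fairseq_preprocessed directory created during training.
    model = TransformerModel.from_pretrained(
        'experiments/fairseq/local_.../checkpoints',
        checkpoint_file='checkpoint_best.pt',
        data_name_or_path='resources/datasets/wikilarge/fairseq_preprocessed',
        source_lang='complex',
        target_lang='simple',
    )
    model.eval()

    # The sentence must already be encoded with the same ACCESS preprocessors
    # (control tokens + SentencePiece) that were used for training.
    sentence = 'an already-preprocessed complex sentence goes here'
    print(model.translate(sentence))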

Generation without using a file

Hi Louis,

I am trying to input a single sentence for simplification instead of a file, but I run into the following exception:
Exception: Could not infer language pair, please provide it explicitly

Since you explicitly copy the input complex file to 'exp_dir' and create a dummy simple file there, how should we tackle this if we do not want to deal with files?

No module named `ts` in `fairseq_cli/train.py`

I tried to run the model and train it from scratch following the instructions given. I am getting the following error.

| model transformer, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 59637760 (num. trained: 59637760)
| training on 1 GPUs
| max tokens per GPU = 5000 and max sentences per GPU = None
| no existing checkpoint found /home/ubuntu/access/experiments/fairseq/local_1574732696143/checkpoints/checkpoint_last.pt
| epoch 001:   1000 / 2567 loss=12.345, nll_loss=10.559, ppl=1508.20, wps=3578, ups=1, wpb=2960.115, bsz=115.516, num_updates=1001, lr=2.75275e-05, gnorm=1.459, clip=1.000, oom=0.000, wall=830, train_wall=820
| epoch 001:   2000 / 2567 loss=12.007, nll_loss=9.792, ppl=886.70, wps=3564, ups=1, wpb=2952.709, bsz=115.842, num_updates=2001, lr=5.50275e-05, gnorm=1.309, clip=1.000, oom=0.000, wall=1660, train_wall=1641
| epoch 001 | loss 11.877 | nll_loss 9.490 | ppl 719.11 | wps 3574 | ups 1 | wpb 2959.828 | bsz 115.466 | num_updates 2567 | lr 7.05925e-05 | gnorm 1.231 | clip 1.000 | oom 0.000 | wall 2128 | train_wall 2105
Traceback (most recent call last):
  File "scripts/train.py", line 49, in <module>
    fairseq_train_and_evaluate(**kwargs)
  File "/home/ubuntu/access/access/utils/training.py", line 18, in wrapped_func
    return func(*args, **kwargs)
  File "/home/ubuntu/access/access/utils/training.py", line 29, in wrapped_func
    return func(*args, **kwargs)
  File "/home/ubuntu/access/access/utils/training.py", line 38, in wrapped_func
    result = func(*args, **kwargs)
  File "/home/ubuntu/access/access/utils/training.py", line 50, in wrapped_func
    result = func(*args, **kwargs)
  File "/home/ubuntu/access/access/fairseq/main.py", line 121, in fairseq_train_and_evaluate
    fairseq_train(preprocessed_dir, exp_dir=exp_dir, **train_kwargs)
  File "/home/ubuntu/access/access/fairseq/base.py", line 175, in fairseq_train
    train.main(train_args)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fairseq_cli/train.py", line 111, in main
    valid_losses = sari_validate(args, trainer, task, epoch_itr, valid_subsets)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fairseq_cli/train.py", line 272, in sari_validate
    from ts.resources.paths import get_data_filepath
ModuleNotFoundError: No module named 'ts'

Which module is this ts? I understand that you are using a customized version of fairseq_cli/train.py which has sari_validate. Can you give me some pointers on how to resolve this issue?

Thanks
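
From the import path, ts looks like an older internal name for what is now the access package, so one workaround to try (a sketch built on that assumption, not a confirmed fix) is to alias the module before training starts, e.g. at the top of scripts/train.py:

    import sys

    import access
    import access.resources.paths

    # Assumption: ts.* mirrors access.*, so that the forked fairseq's
    # `from ts.resources.paths import get_data_filepath` resolves against the
    # installed access package.
    sys.modules['ts'] = access
    sys.modules['ts.resources'] = access.resources
    sys.modules['ts.resources.paths'] = access.resources.paths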

Why is the model replacing a proper noun with a pronoun? E.g. "Oxygen" to "It"

The model is replacing a proper noun with a pronoun.
For example:

Here is the input statement:
oxygen is a chemical element with symbol o and atomic number 8 .

The output is:
It has the chemical symbol o . It has the atomic number 8 .

The expected output should be like this:
oxygen is a chemical element with symbol o or oxygen has the chemical symbol o
oxygen has the atomic number 8

How can I stop such replacements?
Can anyone please help me figure out this problem?

Thanks,
Bhavika

What/where is turkcorpus_{phase}_legacy?

Hi @louismartin, thanks so much for providing such a usable resource and responding to issues. I really appreciate it!

So sorry if this is a simple question: what/where is the turkcorpus_{phase}_legacy, which is referenced in access/evaluation/general.py?

For some context, I'm trying to modify scripts/evaluate.py to evaluate on some of my own data. To do so, I modified general.py in access/evaluation, specifically allowing for a directory parameter for get_prediction_on_turkcorpus and evaluate_simplifier_on_turkcorpus.

# My modifications
def get_prediction_on_directory(directory, simplifier, phase):
    source_filepath = get_data_filepath(directory, phase, 'complex')
    pred_filepath = get_temp_filepath()
    simplifier(source_filepath, pred_filepath)
    return pred_filepath


def evaluate_simplifier_on_directory(directory, simplifier, phase):
    pred_filepath = get_prediction_on_directory(directory, simplifier, phase)
    pred_filepath = lowercase_file(pred_filepath)
    pred_filepath = to_lrb_rrb_file(pred_filepath)
    return evaluate_system_output(get_data_filepath(directory, phase, 'simple'),
                                  sys_sents_path=pred_filepath,
                                  metrics=['bleu', 'sari_legacy', 'fkgl'],
                                  quality_estimation=True)

I don't quite understand the first parameter to evaluate_system_output, which you have set to f'turkcorpus_{phase}_legacy'. I attempted to replace this with my own .simple file, but when I try this I get the following error:

Traceback (most recent call last):                                                                                                                                                                                                                                                                                
  File "scripts/evaluate.py", line 28, in <module>
    print(evaluate_simplifier_on_directory('simplification', simplifier, 'test'))
  File "/home/.../general.py", line 47, in evaluate_simplifier_on_directory
    quality_estimation=True)
  File "/home/.../cli.py", line 124, in evaluate_system_output
    orig_sents, refs_sents = get_orig_and_refs_sents(test_set, orig_sents_path, refs_sents_paths)
  File "/home/.../cli.py", line 38, in get_orig_and_refs_sents
    orig_sents = get_orig_sents(test_set)
  File "/home/.../resources.py", line 91, in get_orig_sents
    return read_lines(TEST_SETS_PATHS[(test_set, 'orig')])
KeyError: (PosixPath('/home/.../resources/datasets/simplification/simplification.test.simple'), 'orig')

I've also tried looking for a directory with the turkcorpus_test_legacy name, but to no avail.

Thanks so much for the help, and please let me know if there is any additional information I can provide to clarify my problem.
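
For reference (not an authoritative answer): turkcorpus_test_legacy is one of the test-set names registered inside EASSE, which is why passing a file path raises the KeyError in TEST_SETS_PATHS. If your EASSE version supports it, custom data is usually passed by naming the test set 'custom' and supplying the original and reference files explicitly. A hedged sketch of the modified call, with parameter names taken from the traceback above:

    # Sketch: evaluate on your own files instead of a registered test set.
    # 'custom', orig_sents_path and refs_sents_paths are assumptions about the
    # EASSE interface; check easse/cli.py in your installed version.
    return evaluate_system_output('custom',
                                  sys_sents_path=pred_filepath,
                                  orig_sents_path=get_data_filepath(directory, phase, 'complex'),
                                  refs_sents_paths=[get_data_filepath(directory, phase, 'simple')],
                                  metrics=['bleu', 'sari_legacy', 'fkgl'],
                                  quality_estimation=True)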

No module named access

I am trying to run the training script as below, but it gives me an error. How can I fix it?

          python scripts/train.py
          Traceback (most recent call last):
          File "scripts/train.py", line 8, in <module>
          from access.fairseq.main import fairseq_train_and_evaluate
          ModuleNotFoundError: No module named 'access'

Control token values during training

Hi, I'm going through the preprocessing and training code. I'm wondering where to find the values of control tokens during training, together with the source and target sides.
Just like the example given in Table 1 of the paper:

Source: <NbChars 0.3><LevSim 0.4>He settled in London , devoting himself chiefly to practical teaching .
Target: He teaches in London .

Is there a way to access these values during training? Also, what dependency parser did you use for the DepTreeDepth token and what are the frequencies for WordRank based on (which corpus, WikiLarge or something else)?
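
One way to inspect these values is to run the preprocessors on a single sentence pair and look at what they prepend. The sketch below is pieced together from the class and method names visible in the tracebacks in this thread (LengthRatioPreprocessor, LevenshteinPreprocessor, encode_sentence_pair), so the exact signatures and return format are assumptions:

    from access.preprocessors import LengthRatioPreprocessor, LevenshteinPreprocessor

    complex_sentence = 'He settled in London , devoting himself chiefly to practical teaching .'
    simple_sentence = 'He teaches in London .'

    # Assumed behaviour: encode_sentence_pair returns the (complex, simple) pair
    # with the control token for the observed ratio prepended to the complex side.
    for preprocessor in (LengthRatioPreprocessor(target_ratio=0.8),
                         LevenshteinPreprocessor(target_ratio=0.8)):
        encoded = preprocessor.encode_sentence_pair(complex_sentence, simple_sentence)
        print(type(preprocessor).__name__, '->', encoded)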

Typo in the License File

The first line in the license file is missing an 'A'.

Current text: ttribution-NonCommercial 4.0 International
Expected text: Attribution-NonCommercial 4.0 International

License files are an important part of source code and should not have errors. This helps developers correctly reference the license when integrating the source code into their applications.

I will create a pull request for the same.

python scripts/evaluate.py

Traceback (most recent call last):
  File "/home/qwh/acc/scripts/evaluate.py", line 32, in <module>
    evaluate_simplifier_on_turkcorpus(simplifier, phase='test')
  File "/home/qwh/acc/access/evaluation/general.py", line 30, in evaluate_simplifier_on_turkcorpus
    quality_estimation=True)
  File "/home/qwh/acc/easse/cli.py", line 115, in evaluate_system_output
    orig_sents, refs_sents = get_orig_and_refs_sents(test_set, orig_sents_path, refs_sents_paths)
  File "/home/qwh/acc/easse/cli.py", line 39, in get_orig_and_refs_sents
    orig_sents = get_orig_sents(test_set)
  File "/home/qwh/acc/easse/utils/resources.py", line 76, in get_orig_sents
    return read_lines(TEST_SETS_PATHS[(test_set, 'orig')])
KeyError: ('turk', 'orig')

Process finished with exit code 1

Error at the end of train.py

Hi @louismartin,
I'm sorry to bother you, but I have a problem at the end of the training.
Everything goes well up to the last epoch, but at the end it gives me the following error:

File "scripts/train.py", line 53, in
fairseq_train_and_evaluate(**kwargs)
File "/home/usr/Scrivania/folder/access/access/utils/training.py", line 18, in wrapped_func
return func(*args, **kwargs)
File "/home/usr/Scrivania/folder/access/access/utils/training.py", line 29, in wrapped_func
return func(*args, **kwargs)
File "/home/usr/Scrivania/folder/access/access/utils/training.py", line 38, in wrapped_func
result = func(*args, **kwargs)
File "/home/usr/Scrivania/folder/access/access/utils/training.py", line 50, in wrapped_func
result = func(*args, **kwargs)
File "/home/usr/Scrivania/folder/access/access/fairseq/main.py", line 125, in fairseq_train_and_evaluate
parametrization_budget)
File "/home/usr/Scrivania/folder/access/access/fairseq/main.py", line 91, in find_best_parametrization
recommendation = optimizer.optimize(evaluate_parametrization, verbosity=0)
File "/home/usr/Scrivania/folder/venv/lib/python3.6/site-packages/nevergrad/optimization/base.py", line 543, in optimize
return self.minimize(objective_function, executor=executor, batch_mode=batch_mode, verbosity=verbosity)
File "/home/usr/Scrivania/folder/venv/lib/python3.6/site-packages/nevergrad/optimization/base.py", line 503, in minimize
self.tell(x, job.result())
File "/home/usr/Scrivania/folder/venv/lib/python3.6/site-packages/nevergrad/optimization/utils.py", line 150, in result
self._result = self.func(*self.args, **self.kwargs)
File "/home/usr/Scrivania/folder/access/access/fairseq/main.py", line 62, in evaluate_parametrization
return combine_metrics(scores['BLEU'], scores['SARI'], scores['FKGL'], metrics_coefs)
KeyError: 'BLEU'

I can't understand why it gives me this error; easse and fairseq are up to date, as are all the other libraries.

Can you help me?
Thank you in advance

Permission denied

I have an authentication problem when I try to clone the repo:
Cloning into 'access'... git@github.com: Permission denied (publickey). fatal: Could not read from remote repository.
Could you help me with this?

requirement "torch==1.2.0" unfound

ERROR: Could not find a version that satisfies the requirement torch==1.2.0 (from access)
ERROR: No matching distribution found for torch==1.2.0

It is impossible to install a version below 1.4.0 with pip.

How to accelerate the generation?

Hi
I would like to generate 1M sentences using ACCESS. However, generate.py seems slow; I would appreciate it if you could give me some ideas to accelerate it.
Thank you!

How to use `DependencyTreeDepthRatioPreprocessor` during generation?

I see that in scripts/train.py it is possible to modify the DependencyTreeDepthRatioPreprocessor value, which, according to the paper, may help split a complex sentence into several simple sentences. However, it was removed in generate.py.
Is it possible to add this ratio to the generation phase so that I can use it to split sentences with deep dependency trees? Thanks!
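
Assuming generate.py builds a preprocessors_kwargs dictionary like the one train.py logs above (an assumption about the script, not a confirmed fix), re-adding the dependency-tree control would look something like this, with the target_ratio values as placeholders:

    # Sketch: mirror the training-time preprocessors_kwargs and put the
    # dependency-tree ratio back.
    preprocessors_kwargs = {
        'LengthRatioPreprocessor': {'target_ratio': 0.8},
        'LevenshteinPreprocessor': {'target_ratio': 0.8},
        'WordRankRatioPreprocessor': {'target_ratio': 0.8},
        'DependencyTreeDepthRatioPreprocessor': {'target_ratio': 0.8},  # re-added
        'SentencePiecePreprocessor': {'vocab_size': 10000},
    }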
