microsoft / GLUECoS
A benchmark for code-switched NLP, ACL 2020
Home Page: https://microsoft.github.io/GLUECoS
License: MIT License
Hi @ssitaram,
I was trying to download the GLUECoS datasets but got multiple log messages saying "Tweet doesn't exist" and "Contact author for complete dataset", possibly because some tweets that were in the dataset no longer exist. It would be great if you could share what we should do in such scenarios, since the train/test datasets might not match what is required for evaluation.
Thanks,
Gaurav
Hi, I'm trying to download the dataset but the Twitter part is not downloading. I'm getting the error Tweet not found :: t_id 666613019510775808 :: Contact author for full data-set. I'm not sure what exactly I'm doing wrong.
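One way to see which IDs are genuinely gone, rather than failing for some other reason, is to look them up directly against the Twitter API. Below is a minimal sketch, assuming Tweepy v4 and a bearer token from a developer account; the IDs and names are illustrative and this is not part of the GLUECoS download scripts.

```python
# Hypothetical check: which tweet IDs from the benchmark are no longer retrievable?
# Assumes Tweepy v4 (tweepy.Client) and a valid bearer token.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# IDs taken from the error messages above; replace with the full list from the data files.
tweet_ids = ["666613019510775808", "424300532652462080"]
resp = client.get_tweets(ids=tweet_ids)

found = {str(t.id) for t in (resp.data or [])}
for tid in tweet_ids:
    print(tid, "still available" if tid in found else "deleted or protected")
```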
Hi, I'm unable to get my submission evaluated. I have even tried copying zip files from other people's working submissions, but I still cannot get the results from the eval script to display.
Referencing submission in pull request #43
I'm trying to run bash train.sh facebook/mbart-large-cc25 mbart MT_EN_HI
but I'm getting the following error:
Fine-tuning facebook/mbart-large-cc25 on MT_EN_HI
03/22/2021 19:49:10 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: True
03/22/2021 19:49:10 - INFO - __main__ - Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/tmp/mt_model', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<IntervalStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=4, per_device_eval_batch_size=2, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=0, logging_dir='runs/Mar22_19-49-10_iiitb-ThinkStation-P920', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=1500, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=6000, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level='O1', fp16_backend='auto', fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1500, dataloader_num_workers=0, past_index=-1, run_name='/tmp/mt_model', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, sortish_sampler=False, predict_with_generate=True)
03/22/2021 19:49:11 - WARNING - datasets.builder - Using custom data configuration default-48b337e1e0f30e1a
03/22/2021 19:49:11 - WARNING - datasets.builder - Reusing dataset csv (/home/vibhav/.cache/huggingface/datasets/csv/default-48b337e1e0f30e1a/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
loading configuration file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/config.json from cache at /home/vibhav/.cache/huggingface/transformers/36135304685d914515720daa48fc1adae57803e32ab82d5bde85ef78479e9765.b548f7e307531070391a881374674824b374f829e5d8f68857012de63fe2681a
Model config MBartConfig {
"_num_labels": 3,
"activation_dropout": 0.0,
"activation_function": "gelu",
"add_bias_logits": false,
"add_final_layer_norm": true,
"architectures": [
"MBartForConditionalGeneration"
],
"attention_dropout": 0.0,
"bos_token_id": 0,
"classif_dropout": 0.0,
"classifier_dropout": 0.0,
"d_model": 1024,
"decoder_attention_heads": 16,
"decoder_ffn_dim": 4096,
"decoder_layerdrop": 0.0,
"decoder_layers": 12,
"dropout": 0.1,
"encoder_attention_heads": 16,
"encoder_ffn_dim": 4096,
"encoder_layerdrop": 0.0,
"encoder_layers": 12,
"eos_token_id": 2,
"forced_eos_token_id": 2,
"gradient_checkpointing": false,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1",
"2": "LABEL_2"
},
"init_std": 0.02,
"is_encoder_decoder": true,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1,
"LABEL_2": 2
},
"max_length": 1024,
"max_position_embeddings": 1024,
"model_type": "mbart",
"normalize_before": true,
"normalize_embedding": true,
"num_beams": 5,
"num_hidden_layers": 12,
"output_past": true,
"pad_token_id": 1,
"scale_embedding": true,
"static_position_embeddings": false,
"task_specific_params": {
"translation_en_to_ro": {
"decoder_start_token_id": 250020
}
},
"transformers_version": "4.4.2",
"use_cache": true,
"vocab_size": 250027
}
loading configuration file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/config.json from cache at /home/vibhav/.cache/huggingface/transformers/36135304685d914515720daa48fc1adae57803e32ab82d5bde85ef78479e9765.b548f7e307531070391a881374674824b374f829e5d8f68857012de63fe2681a
Model config MBartConfig {
"_num_labels": 3,
"activation_dropout": 0.0,
"activation_function": "gelu",
"add_bias_logits": false,
"add_final_layer_norm": true,
"architectures": [
"MBartForConditionalGeneration"
],
"attention_dropout": 0.0,
"bos_token_id": 0,
"classif_dropout": 0.0,
"classifier_dropout": 0.0,
"d_model": 1024,
"decoder_attention_heads": 16,
"decoder_ffn_dim": 4096,
"decoder_layerdrop": 0.0,
"decoder_layers": 12,
"dropout": 0.1,
"encoder_attention_heads": 16,
"encoder_ffn_dim": 4096,
"encoder_layerdrop": 0.0,
"encoder_layers": 12,
"eos_token_id": 2,
"forced_eos_token_id": 2,
"gradient_checkpointing": false,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1",
"2": "LABEL_2"
},
"init_std": 0.02,
"is_encoder_decoder": true,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1,
"LABEL_2": 2
},
"max_length": 1024,
"max_position_embeddings": 1024,
"model_type": "mbart",
"normalize_before": true,
"normalize_embedding": true,
"num_beams": 5,
"num_hidden_layers": 12,
"output_past": true,
"pad_token_id": 1,
"scale_embedding": true,
"static_position_embeddings": false,
"task_specific_params": {
"translation_en_to_ro": {
"decoder_start_token_id": 250020
}
},
"transformers_version": "4.4.2",
"use_cache": true,
"vocab_size": 250027
}
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/sentencepiece.bpe.model from cache at /home/vibhav/.cache/huggingface/transformers/83d419fb34e90155a8d95f7799f7a7316a327dc28c7ee6bee15b5a62d3c5ca6b.00628a9eeb8baf4080d44a0abe9fe8057893de20c7cb6e6423cddbf452f7d4d8
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/tokenizer_config.json from cache at None
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/tokenizer.json from cache at /home/vibhav/.cache/huggingface/transformers/16e85cac0e7a8c2938ac468199d0adff7483341305c7e848063b72dcf5f22538.39607a8bede9bcd2666ea442230a9d382f57e4fea127c9cc5b6fc6caf527d682
loading weights file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/pytorch_model.bin from cache at /home/vibhav/.cache/huggingface/transformers/58963b41815ac5618d9910411e018d60a3ae7d4540a66e6cf70adf29a748ca1b.bef0d2e3352d6c4bf1213c6207738ec5ecf458de355c65b2aead6671bc612138
All model checkpoint weights were used when initializing MBartForConditionalGeneration.
All the weights of MBartForConditionalGeneration were initialized from the model checkpoint at facebook/mbart-large-cc25.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MBartForConditionalGeneration for predictions without further training.
03/22/2021 19:50:10 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/vibhav/.cache/huggingface/datasets/csv/default-48b337e1e0f30e1a/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0/cache-a5481e6d57bfbce0.arrow
03/22/2021 19:50:10 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/vibhav/.cache/huggingface/datasets/csv/default-48b337e1e0f30e1a/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0/cache-41fcbb9a83397c4b.arrow
Using amp fp16 backend
***** Running training *****
Num examples = 8060
Num Epochs = 5
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 4
Gradient Accumulation steps = 1
Total optimization steps = 10075
0%| | 0/10075 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/hdd1/vibhav/Thesis/GLUECoS/Code/run_seq2seq.py", line 584, in <module>
main()
File "/home/hdd1/vibhav/Thesis/GLUECoS/Code/run_seq2seq.py", line 529, in main
train_result = trainer.train()
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/trainer.py", line 1053, in train
tr_loss += self.training_step(model, inputs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/trainer.py", line 1441, in training_step
loss = self.compute_loss(model, inputs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/trainer.py", line 1475, in compute_loss
outputs = model(**inputs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 1303, in forward
return_dict=return_dict,
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 1166, in forward
return_dict=return_dict,
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 803, in forward
output_attentions=output_attentions,
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 317, in forward
output_attentions=output_attentions,
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 181, in forward
query_states = self.q_proj(hidden_states) * self.scaling
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 94, in forward
return F.linear(input, self.weight, self.bias)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/functional.py", line 1753, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`
0%| | 0/10075 [00:00<?, ?it/s]
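For what it's worth, CUBLAS_STATUS_INTERNAL_ERROR at cublasCreate usually points at the local PyTorch/CUDA setup (a driver/toolkit mismatch, or the GPU running out of memory when the handle is created) rather than at the GLUECoS code. A minimal sanity check, independent of the repo:

```python
# If this tiny matmul fails with the same cuBLAS error, the problem is the
# PyTorch/CUDA installation rather than run_seq2seq.py.
import torch

print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    x = torch.randn(8, 8, device="cuda")
    print((x @ x).sum().item())  # exercises cuBLAS directly
```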
Can someone provide the baseline BLEU score obtained when you ran mBART?
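If you want to score your own generations in the meantime, sacreBLEU gives a comparable corpus-level number. A short sketch, assuming line-aligned plain-text hypothesis and reference files (the file names are illustrative):

```python
# Corpus BLEU with sacreBLEU over line-aligned hypothesis/reference files.
import sacrebleu

with open("predictions.txt", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("references.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])  # a single reference stream
print(round(bleu.score, 2))
```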
There's no test set for QA, so the scores shown after the git PR would be on the same dev set, I believe. Since the dev set does have labels, we should have been able to use the F1 scores printed locally (which look okay, around 72 for lr=5e-6, bs=2, epochs=16, max_seq=512, seed=32). I don't understand why the scores retrieved via the pull request differ and are extremely poor (around 25.3). Please let me know if there is anything I could be missing here, or what explains this inconsistency.
PS: the model is bert-base-multilingual-cased.
The code breaks if we try to use the Azure Translator from any region other than southeastasia. I have modified the code in my fork and would be happy to make a pull request :)
Here's the commit: rohanrajpal@89fa5c3
Traceback (most recent call last):
File "transliterator.py", line 88, in <module>
main()
File "transliterator.py", line 79, in main
trans = get_transliteration(vocab, headers)
File "transliterator.py", line 38, in get_transliteration
trans.update({body[j]['text']:i['text']})
TypeError: string indices must be integers
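The crash likely happens because, on a key/region mismatch, the Translator service returns an error payload instead of the list of transliterations the script expects, and indexing into that payload hits strings. Below is a hedged sketch of a region-aware call using the public Translator v3.0 transliterate endpoint; the function and variable names are illustrative and are not taken from transliterator.py.

```python
# Region-aware transliteration call (Roman Hindi -> Devanagari) with a defensive
# check on the response shape. Illustrative sketch, not the repo's actual code.
import requests

def transliterate(words, subscription_key, region):
    url = "https://api.cognitive.microsofttranslator.com/transliterate"
    params = {"api-version": "3.0", "language": "hi", "fromScript": "Latn", "toScript": "Deva"}
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Ocp-Apim-Subscription-Region": region,  # e.g. "centralindia"; required for regional keys
        "Content-Type": "application/json",
    }
    body = [{"Text": w} for w in words]
    data = requests.post(url, params=params, headers=headers, json=body).json()
    if isinstance(data, dict):  # error payloads come back as an object, not a list
        raise RuntimeError(f"Translator API error: {data}")
    return {w: item["text"] for w, item in zip(words, data)}
```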
Hi, I followed the instructions provided in the README file to download the data. I did not encounter any errors during the downloading of the data. However, on testing I am getting the following message when I submit the results.
QA_EN_HI: Missing prediction for 234
Error. Either examples are missing or ids of instances are incorrect
Also, I tried downloading the data again, but the error is the same.
For the NLI datasets, the training data has two labels: entailment and contradictory. But the gold labels of the test set appear to have three labels! I submitted two results.zip files, one with all entailment and one with all contradictory, and both scored 33.3%, so the test set must contain a third label (perhaps "neutral"). Is the NLI task really trained on two-label data and tested on three-label data? Please check your datasets carefully.
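Before assuming the data is wrong, it may help to print the label inventory of each split directly. A quick sketch, assuming the processed NLI files are tab-separated with the label in the last column (the layout and paths are assumptions; adjust them to the actual GLUECoS format):

```python
# Hedged check of the label set per split; the TSV-with-label-last layout and
# the paths below are assumptions, not the confirmed GLUECoS file format.
from collections import Counter

def label_counts(path):
    with open(path, encoding="utf-8") as f:
        return Counter(line.rstrip("\n").split("\t")[-1] for line in f if line.strip())

for split in ("train.txt", "validation.txt"):
    print(split, label_counts(f"Data/Processed_Data/NLI_EN_HI/{split}"))
```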
Hi, I'm trying to run ./download_data.sh $SUBSCRIPTION_KEY and I ran into some issues. I've tried both indictrans and the Microsoft Translator subscription key. With the indictrans package, I seem to be running into issues with ndarray shapes not matching (I think this is an issue with indictrans itself). With the subscription key passed, I get the following output:
Failed to import No module named 'indictrans'
./download_data.sh: line 50: wget: command not found
./download_data.sh: line 54: wget: command not found
./download_data.sh: line 59: wget: command not found
unzip: cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_HI/temp/ICON_POS.zip, /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_HI/temp/ICON_POS.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_HI/temp/ICON_POS.zip.ZIP.
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_hi.py", line 192, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_hi.py", line 175, in main
make_temp_file(original_path)
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_hi.py", line 11, in make_temp_file
shutil.copy(original_path_validation,new_path_validation)
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 427, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 264, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/LID_EN_HI/temp//HindiEnglish_FIRE2013_AnnotatedDev.txt'
Downloaded LID EN HI
./download_data.sh: line 98: wget: command not found
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_hi.py", line 143, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_hi.py", line 107, in main
make_temp_file(original_path)
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_hi.py", line 11, in make_temp_file
with open(original_path +'/annotatedData.csv','r',encoding='utf-8')as f:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/NER_EN_HI/temp//annotatedData.csv'
Downloaded NER EN HI
./download_data.sh: line 221: wget: command not found
./download_data.sh: line 224: wget: command not found
./download_data.sh: line 227: wget: command not found
./download_data.sh: line 230: wget: command not found
./download_data.sh: line 233: wget: command not found
./download_data.sh: line 236: wget: command not found
./download_data.sh: line 239: wget: command not found
./download_data.sh: line 242: wget: command not found
./download_data.sh: line 245: wget: command not found
./download_data.sh: line 248: wget: command not found
./download_data.sh: line 251: wget: command not found
./download_data.sh: line 254: wget: command not found
./download_data.sh: line 257: wget: command not found
./download_data.sh: line 260: wget: command not found
./download_data.sh: line 263: wget: command not found
./download_data.sh: line 266: wget: command not found
./download_data.sh: line 269: wget: command not found
./download_data.sh: line 272: wget: command not found
./download_data.sh: line 275: wget: command not found
./download_data.sh: line 278: wget: command not found
./download_data.sh: line 281: wget: command not found
./download_data.sh: line 284: wget: command not found
./download_data.sh: line 287: wget: command not found
./download_data.sh: line 290: wget: command not found
./download_data.sh: line 293: wget: command not found
./download_data.sh: line 296: wget: command not found
./download_data.sh: line 299: wget: command not found
./download_data.sh: line 302: wget: command not found
./download_data.sh: line 305: wget: command not found
./download_data.sh: line 308: wget: command not found
./download_data.sh: line 311: wget: command not found
./download_data.sh: line 314: wget: command not found
./download_data.sh: line 317: wget: command not found
./download_data.sh: line 320: wget: command not found
./download_data.sh: line 323: wget: command not found
./download_data.sh: line 326: wget: command not found
./download_data.sh: line 329: wget: command not found
./download_data.sh: line 332: wget: command not found
./download_data.sh: line 335: wget: command not found
./download_data.sh: line 338: wget: command not found
./download_data.sh: line 341: wget: command not found
./download_data.sh: line 344: wget: command not found
./download_data.sh: line 347: wget: command not found
./download_data.sh: line 350: wget: command not found
./download_data.sh: line 353: wget: command not found
./download_data.sh: line 356: wget: command not found
./download_data.sh: line 359: wget: command not found
./download_data.sh: line 362: wget: command not found
./download_data.sh: line 365: wget: command not found
./download_data.sh: line 368: wget: command not found
./download_data.sh: line 371: wget: command not found
./download_data.sh: line 374: wget: command not found
./download_data.sh: line 377: wget: command not found
./download_data.sh: line 380: wget: command not found
./download_data.sh: line 383: wget: command not found
./download_data.sh: line 386: wget: command not found
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_es.py", line 104, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_es.py", line 90, in main
make_split_file(id_dir+'/train_ids.txt','temp_word.txt',new_path+'/train.txt',mode='train')
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_es.py", line 45, in make_split_file
with open(input_file,'r') as infile:
FileNotFoundError: [Errno 2] No such file or directory: 'temp_word.txt'
Downloaded POS EN ES
./download_data.sh: line 140: wget: command not found
unzip: cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_FG/temp/ICON_POS.zip, /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_FG/temp/ICON_POS.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_FG/temp/ICON_POS.zip.ZIP.
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_hi_fg.py", line 63, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_hi_fg.py", line 43, in main
shutil.copy(original_path+'Romanized/train.txt',new_path+'Romanized/train.txt')
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 427, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 264, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_FG/temp/ICON_POS/Processed Data/Romanized/train.txt'
Downloaded POS EN HI FG
./download_data.sh: line 171: wget: command not found
unzip: cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/Sentiment_EN_HI/temp/SAIL_2017.zip, /Users/iglee/GLUECoS/Data/Original_Data/Sentiment_EN_HI/temp/SAIL_2017.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/Sentiment_EN_HI/temp/SAIL_2017.zip.ZIP.
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_sent_en_hi.py", line 63, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_sent_en_hi.py", line 43, in main
shutil.copy(original_path+'Romanized/train.txt',new_path+'Romanized/train.txt')
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 427, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 264, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/Sentiment_EN_HI/temp/SAIL_2017/Processed Data/Romanized/train.txt'
Downloaded Sentiment EN HI
./download_data.sh: line 187: wget: command not found
Downloaded QA EN HI
./download_data.sh: line 113: wget: command not found
unzip: cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_UD/temp/master.zip, /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_UD/temp/master.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_UD/temp/master.zip.ZIP.
(Patch is indented 4 spaces.)
patch: **** Can't find file /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_UD/temp/UD_Hindi_English-master/crawl_tweets.py : No such file or directory
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_hi_ud.py", line 179, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_hi_ud.py", line 164, in main
scrape_tweets(original_path)
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_hi_ud.py", line 16, in scrape_tweets
os.chdir(original_path)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_UD/temp/UD_Hindi_English-master'
Downloaded POS EN HI UD
./download_data.sh: line 200: wget: command not found
unzip: cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/NLI_EN_HI/temp/all_keys_json.zip, /Users/iglee/GLUECoS/Data/Original_Data/NLI_EN_HI/temp/all_keys_json.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/NLI_EN_HI/temp/all_keys_json.zip.ZIP.
./download_data.sh: line 207: wget: command not found
./download_data.sh: line 207: wget: command not found
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/iglee/opt/anaconda3/lib/python3.9/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/Users/iglee/opt/anaconda3/lib/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/Users/iglee/opt/anaconda3/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/iglee/opt/anaconda3/lib/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_nli_en_hi.py", line 125, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_nli_en_hi.py", line 113, in main
process_files(original_path+'all_keys_json/Final_Key.json',args.data_dir+'/NLI_EN_HI/temp/all_only_id.json')
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_nli_en_hi.py", line 12, in process_files
with open(final_key_path,'r') as infile:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/NLI_EN_HI/temp/all_keys_json/Final_Key.json'
Downloaded NLI EN HI
./download_data.sh: line 156: wget: command not found
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_sent_en_es.py", line 218, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_sent_en_es.py", line 197, in main
download_tweets(tweet_keys,original_path)
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_sent_en_es.py", line 12, in download_tweets
lines = [line.strip() for line in open(original_path_text,'r').readlines()]
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/Sentiment_EN_ES/temp//cs-en-es-corpus-wassa2015.txt'
Downloaded Sentiment EN ES
./download_data.sh: line 26: wget: command not found
./download_data.sh: line 29: wget: command not found
./download_data.sh: line 33: wget: command not found
unzip: cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_ES/temp/Release.zip, /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_ES/temp/Release.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_ES/temp/Release.zip.ZIP.
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_es.py", line 168, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_es.py", line 139, in main
download_tweets(original_path)
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_es.py", line 13, in download_tweets
shutil.copy('twitter_authentication.txt',original_path+'/Release/twitter_auth.txt')
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 427, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 266, in copyfile
with open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/LID_EN_ES/temp//Release/twitter_auth.txt'
Downloaded LID EN ES
./download_data.sh: line 75: wget: command not found
./download_data.sh: line 78: wget: command not found
./download_data.sh: line 82: wget: command not found
unzip: cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/NER_EN_ES/temp/Release.zip, /Users/iglee/GLUECoS/Data/Original_Data/NER_EN_ES/temp/Release.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/NER_EN_ES/temp/Release.zip.ZIP.
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_es.py", line 152, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_es.py", line 134, in main
download_tweets(original_path)
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_es.py", line 13, in download_tweets
shutil.copy('twitter_authentication.txt',original_path+'/Release/twitter_auth.txt')
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 427, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 266, in copyfile
with open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/NER_EN_ES/temp//Release/twitter_auth.txt'
Downloaded NER EN ES
I'm wondering why it's still trying to use indictrans despite my passing the subscription key. If someone could help me with this, I'd really appreciate it. Thanks!
Hi, I have observed that many tweets were not downloaded because they have been deleted. Also, for some strange reason, for some tasks the number of sentences is higher than the number reported in the paper (specifically for the English-Spanish datasets). Please find below a table comparing the statistics reported in the paper with the statistics obtained after downloading the data.
English-Hindi

| Corpus | Train (Paper) | Train (Downloaded) | Dev (Paper) | Dev (Downloaded) | Test (Paper) | Test (Downloaded) |
|---|---|---|---|---|---|---|
| FIRE LID (D) | 2631 | 2098 | 500 | 500 | 406 | 406 |
| UD POS (D) | 1384 | 1344 | 215 | 209 | 215 | 225 |
| FG POS (R) | 2104 | 2098 | 263 | 261 | 264 | 264 |
| IIITH NER (R) | 2467 | 2467 | 308 | 308 | 309 | 307 |
| SAIL Sentiment (R) | 10080 | 10080 | 1260 | 1260 | 1261 | 1261 |

English-Spanish

| Corpus | Train (Paper) | Train (Downloaded) | Dev (Paper) | Dev (Downloaded) | Test (Paper) | Test (Downloaded) |
|---|---|---|---|---|---|---|
| EMNLP 2014 | 10259 | 7192 | 1140 | 824 | 3014 | 2981 |
| Bangor POS | 2192 | 2167 | 274 | 269 | 274 | 270 |
| CALCS NER | 27366 | 28381 | 3420 | 3537 | 3421 | 3577 |
| Sentiment | 1681 | 1851 | 211 | 231 | 211 | 232 |
Could this inconsistency in the datasets lead to an unfair comparison of results?
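For anyone who wants to reproduce the counts above on their own download, a small helper like the following works for the token-level tasks, assuming the processed files are CoNLL-style with a blank line between sentences (sentence-level tasks are one example per line); the path is illustrative:

```python
# Hedged sentence counter for CoNLL-style splits (blank line = sentence boundary).
def count_sentences(path):
    count, in_sentence = 0, False
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                in_sentence = True
            elif in_sentence:
                count += 1
                in_sentence = False
    return count + (1 if in_sentence else 0)

for split in ("train", "validation", "test"):
    print(split, count_sentences(f"Data/Processed_Data/POS_EN_HI_UD/Devanagari/{split}.txt"))
```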
Applying for a Twitter developer account is not easy, which prevents users from downloading the dataset.
Would you mind sharing the data via Google Drive or another online storage platform?
Hi,
I followed the README and ran the download_data.sh script.
I noticed there are plenty of cases where I get tweet missing error/warning.
An example:
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 427482958912450560
Tweet doesn't exist: 427482958912450560
Tweet doesn't exist: 427482958912450560
Tweet doesn't exist: 427482958912450560
Tweet doesn't exist: 427482958912450560
Tweet doesn't exist: 427482958912450560
Is it possible for you to share the script for preparing the data to reproduce the experimental setup for the modified mBERT?
It requires additional code-switching data preparation.
Thank you!
Hi,
Thanks for setting up this repo and the leaderboard. I had the following questions regarding the machine translation task.
Hi, I'm getting the following errors on running the first script, download_data.sh
ICON_POS.zip 100%[=================================================>] 588.98K 436KB/s in 1.3s
Traceback (most recent call last):
File "/scratch/aditya.srivastava/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_hi.py", line 192, in <module>
main()
File "/scratch/aditya.srivastava/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_hi.py", line 175, in main
make_temp_file(original_path)
File "/scratch/aditya.srivastava/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_hi.py", line 11, in make_temp_file
shutil.copy(original_path_validation,new_path_validation)
File "/home/aditya.srivastava/opt/Python-3.8.2/lib/python3.8/shutil.py", line 415, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/home/aditya.srivastava/opt/Python-3.8.2/lib/python3.8/shutil.py", line 261, in copyfile
with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/aditya.srivastava/GLUECoS/Data/Original_Data/LID_EN_HI/temp//HindiEnglish_FIRE2013_AnnotatedDev.txt'
Downloaded LID EN HI
annotatedData.csv 100%[=================================================>] 1.52M --.-KB/s in 0.1s
Can someone help me resolve this?
Hi!
I was trying to replicate the results for the NLI task using the multilingual BERT model. The GLUECoS paper says mBERT gives 61.09, or 57.74 according to the leaderboard. When I run the sample NLI script here with default parameters, my test accuracy comes out very low, around 33. Could anyone confirm whether the numbers in the paper are for this baseline? Also, I see the data was updated; are these numbers for an older version of the data?
Thanks!
Even with the same number of lines as in the test set, I am getting the 'lines mismatch' error for the Sentiment_EN_HI Romanized test dataset.
It would be helpful if you could add to the README that spaCy version 2.1.0 is required to run DrQA. This isn't mentioned in their repo either, and that repository seems dead, so it probably won't be updated there anyway. If one tries to run their script, it installs spaCy v3, which is incompatible with their code. I had to find the correct spaCy version by looking up the date at which their scripts were written and comparing it to the version release dates.
As the title says, is there any way to add an evaluation script for the transliteration task? I am currently working on creating a transliteration dataset and training a neural model on the extracted data, and I wanted to use this framework, but it doesn't look like transliteration is part of it. Is any work happening in this direction?
For transliteration from Roman to Devanagari, it would be convenient to add support for more transliteration services.
Are the trained models for each task publicly available? Is it possible for the authors to share them? Specifically, I'm looking for code-mixed NMT and LID.
TIA!