microsoft / GLUECoS
A benchmark for code-switched NLP, ACL 2020
Home Page: https://microsoft.github.io/GLUECoS
License: MIT License
Hi @ssitaram,
I was trying to download the GLUECoS datasets but got multiple log messages saying "Tweet doesn't exist" and "Contact author for complete dataset", possibly because some tweets that were in the dataset no longer exist. It would be great if you could share what we should do in such scenarios, since the train/test datasets might not match what is required for evaluation.
Thanks,
Gaurav
Hi, I'm trying to download the dataset but the Twitter part is not downloading. I'm getting the error Tweet not found :: t_id 666613019510775808 :: Contact author for full data-set. I'm not sure what exactly I'm doing wrong.
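One way to see which IDs are genuinely gone, rather than failing for some other reason, is to look them up directly against the Twitter API. Below is a minimal sketch, assuming Tweepy v4 and a bearer token from a developer account; the IDs and names are illustrative and this is not part of the GLUECoS download scripts.

```python
# Hypothetical check: which tweet IDs from the benchmark are no longer retrievable?
# Assumes Tweepy v4 (tweepy.Client) and a valid bearer token.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# IDs taken from the error messages above; replace with the full list from the data files.
tweet_ids = ["666613019510775808", "424300532652462080"]
resp = client.get_tweets(ids=tweet_ids)

found = {str(t.id) for t in (resp.data or [])}
for tid in tweet_ids:
    print(tid, "still available" if tid in found else "deleted or protected")
```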
Hi, I'm unable to get my submission evaluated. I have even tried copying zip files from other people's working submissions, but I still cannot get the results from the eval script to display.
Referencing submission in pull request #43
I'm trying to run bash train.sh facebook/mbart-large-cc25 mbart MT_EN_HI
but I'm getting the following error:
Fine-tuning facebook/mbart-large-cc25 on MT_EN_HI
03/22/2021 19:49:10 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: True
03/22/2021 19:49:10 - INFO - __main__ - Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/tmp/mt_model', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<IntervalStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=4, per_device_eval_batch_size=2, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=0, logging_dir='runs/Mar22_19-49-10_iiitb-ThinkStation-P920', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=1500, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=6000, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level='O1', fp16_backend='auto', fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1500, dataloader_num_workers=0, past_index=-1, run_name='/tmp/mt_model', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, sortish_sampler=False, predict_with_generate=True)
03/22/2021 19:49:11 - WARNING - datasets.builder - Using custom data configuration default-48b337e1e0f30e1a
03/22/2021 19:49:11 - WARNING - datasets.builder - Reusing dataset csv (/home/vibhav/.cache/huggingface/datasets/csv/default-48b337e1e0f30e1a/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
loading configuration file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/config.json from cache at /home/vibhav/.cache/huggingface/transformers/36135304685d914515720daa48fc1adae57803e32ab82d5bde85ef78479e9765.b548f7e307531070391a881374674824b374f829e5d8f68857012de63fe2681a
Model config MBartConfig {
"_num_labels": 3,
"activation_dropout": 0.0,
"activation_function": "gelu",
"add_bias_logits": false,
"add_final_layer_norm": true,
"architectures": [
"MBartForConditionalGeneration"
],
"attention_dropout": 0.0,
"bos_token_id": 0,
"classif_dropout": 0.0,
"classifier_dropout": 0.0,
"d_model": 1024,
"decoder_attention_heads": 16,
"decoder_ffn_dim": 4096,
"decoder_layerdrop": 0.0,
"decoder_layers": 12,
"dropout": 0.1,
"encoder_attention_heads": 16,
"encoder_ffn_dim": 4096,
"encoder_layerdrop": 0.0,
"encoder_layers": 12,
"eos_token_id": 2,
"forced_eos_token_id": 2,
"gradient_checkpointing": false,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1",
"2": "LABEL_2"
},
"init_std": 0.02,
"is_encoder_decoder": true,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1,
"LABEL_2": 2
},
"max_length": 1024,
"max_position_embeddings": 1024,
"model_type": "mbart",
"normalize_before": true,
"normalize_embedding": true,
"num_beams": 5,
"num_hidden_layers": 12,
"output_past": true,
"pad_token_id": 1,
"scale_embedding": true,
"static_position_embeddings": false,
"task_specific_params": {
"translation_en_to_ro": {
"decoder_start_token_id": 250020
}
},
"transformers_version": "4.4.2",
"use_cache": true,
"vocab_size": 250027
}
loading configuration file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/config.json from cache at /home/vibhav/.cache/huggingface/transformers/36135304685d914515720daa48fc1adae57803e32ab82d5bde85ef78479e9765.b548f7e307531070391a881374674824b374f829e5d8f68857012de63fe2681a
Model config MBartConfig {
"_num_labels": 3,
"activation_dropout": 0.0,
"activation_function": "gelu",
"add_bias_logits": false,
"add_final_layer_norm": true,
"architectures": [
"MBartForConditionalGeneration"
],
"attention_dropout": 0.0,
"bos_token_id": 0,
"classif_dropout": 0.0,
"classifier_dropout": 0.0,
"d_model": 1024,
"decoder_attention_heads": 16,
"decoder_ffn_dim": 4096,
"decoder_layerdrop": 0.0,
"decoder_layers": 12,
"dropout": 0.1,
"encoder_attention_heads": 16,
"encoder_ffn_dim": 4096,
"encoder_layerdrop": 0.0,
"encoder_layers": 12,
"eos_token_id": 2,
"forced_eos_token_id": 2,
"gradient_checkpointing": false,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1",
"2": "LABEL_2"
},
"init_std": 0.02,
"is_encoder_decoder": true,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1,
"LABEL_2": 2
},
"max_length": 1024,
"max_position_embeddings": 1024,
"model_type": "mbart",
"normalize_before": true,
"normalize_embedding": true,
"num_beams": 5,
"num_hidden_layers": 12,
"output_past": true,
"pad_token_id": 1,
"scale_embedding": true,
"static_position_embeddings": false,
"task_specific_params": {
"translation_en_to_ro": {
"decoder_start_token_id": 250020
}
},
"transformers_version": "4.4.2",
"use_cache": true,
"vocab_size": 250027
}
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/sentencepiece.bpe.model from cache at /home/vibhav/.cache/huggingface/transformers/83d419fb34e90155a8d95f7799f7a7316a327dc28c7ee6bee15b5a62d3c5ca6b.00628a9eeb8baf4080d44a0abe9fe8057893de20c7cb6e6423cddbf452f7d4d8
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/tokenizer_config.json from cache at None
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/tokenizer.json from cache at /home/vibhav/.cache/huggingface/transformers/16e85cac0e7a8c2938ac468199d0adff7483341305c7e848063b72dcf5f22538.39607a8bede9bcd2666ea442230a9d382f57e4fea127c9cc5b6fc6caf527d682
loading weights file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/pytorch_model.bin from cache at /home/vibhav/.cache/huggingface/transformers/58963b41815ac5618d9910411e018d60a3ae7d4540a66e6cf70adf29a748ca1b.bef0d2e3352d6c4bf1213c6207738ec5ecf458de355c65b2aead6671bc612138
All model checkpoint weights were used when initializing MBartForConditionalGeneration.
All the weights of MBartForConditionalGeneration were initialized from the model checkpoint at facebook/mbart-large-cc25.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MBartForConditionalGeneration for predictions without further training.
03/22/2021 19:50:10 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/vibhav/.cache/huggingface/datasets/csv/default-48b337e1e0f30e1a/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0/cache-a5481e6d57bfbce0.arrow
03/22/2021 19:50:10 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/vibhav/.cache/huggingface/datasets/csv/default-48b337e1e0f30e1a/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0/cache-41fcbb9a83397c4b.arrow
Using amp fp16 backend
***** Running training *****
Num examples = 8060
Num Epochs = 5
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 4
Gradient Accumulation steps = 1
Total optimization steps = 10075
0%| | 0/10075 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/hdd1/vibhav/Thesis/GLUECoS/Code/run_seq2seq.py", line 584, in <module>
main()
File "/home/hdd1/vibhav/Thesis/GLUECoS/Code/run_seq2seq.py", line 529, in main
train_result = trainer.train()
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/trainer.py", line 1053, in train
tr_loss += self.training_step(model, inputs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/trainer.py", line 1441, in training_step
loss = self.compute_loss(model, inputs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/trainer.py", line 1475, in compute_loss
outputs = model(**inputs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 1303, in forward
return_dict=return_dict,
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 1166, in forward
return_dict=return_dict,
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 803, in forward
output_attentions=output_attentions,
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 317, in forward
output_attentions=output_attentions,
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 181, in forward
query_states = self.q_proj(hidden_states) * self.scaling
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 94, in forward
return F.linear(input, self.weight, self.bias)
File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/functional.py", line 1753, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`
0%| | 0/10075 [00:00<?, ?it/s]
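For what it's worth, CUBLAS_STATUS_INTERNAL_ERROR at cublasCreate usually points at the local PyTorch/CUDA setup (a driver/toolkit mismatch, or the GPU running out of memory when the handle is created) rather than at the GLUECoS code. A minimal sanity check, independent of the repo:

```python
# If this tiny matmul fails with the same cuBLAS error, the problem is the
# PyTorch/CUDA installation rather than run_seq2seq.py.
import torch

print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    x = torch.randn(8, 8, device="cuda")
    print((x @ x).sum().item())  # exercises cuBLAS directly
```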
Can someone provide the baseline BLEU score obtained when you ran mBART?
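If you want to score your own generations in the meantime, sacreBLEU gives a comparable corpus-level number. A short sketch, assuming line-aligned plain-text hypothesis and reference files (the file names are illustrative):

```python
# Corpus BLEU with sacreBLEU over line-aligned hypothesis/reference files.
import sacrebleu

with open("predictions.txt", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("references.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])  # a single reference stream
print(round(bleu.score, 2))
```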
There's no test set for QA, so the scores shown after the git PR would be on the same dev set, I believe. Since the dev set does have labels, we should have been able to use the F1 scores printed locally (which look okay, around 72 for lr=5e-6, bs=2, epochs=16, max_seq=512, seed=32). I don't understand why the scores retrieved via the pull request differ and are extremely poor (around 25.3). Please let me know if there is anything I could be missing here, or what explains this inconsistency.
PS: the model is bert-base-multilingual-cased.
The code breaks if we try to use the Azure Translator from any region other than southeastasia. I have modified the code in my fork and would be happy to make a pull request :)
Here's the commit: rohanrajpal@89fa5c3
Traceback (most recent call last):
File "transliterator.py", line 88, in <module>
main()
File "transliterator.py", line 79, in main
trans = get_transliteration(vocab, headers)
File "transliterator.py", line 38, in get_transliteration
trans.update({body[j]['text']:i['text']})
TypeError: string indices must be integers
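The crash likely happens because, on a key/region mismatch, the Translator service returns an error payload instead of the list of transliterations the script expects, and indexing into that payload hits strings. Below is a hedged sketch of a region-aware call using the public Translator v3.0 transliterate endpoint; the function and variable names are illustrative and are not taken from transliterator.py.

```python
# Region-aware transliteration call (Roman Hindi -> Devanagari) with a defensive
# check on the response shape. Illustrative sketch, not the repo's actual code.
import requests

def transliterate(words, subscription_key, region):
    url = "https://api.cognitive.microsofttranslator.com/transliterate"
    params = {"api-version": "3.0", "language": "hi", "fromScript": "Latn", "toScript": "Deva"}
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Ocp-Apim-Subscription-Region": region,  # e.g. "centralindia"; required for regional keys
        "Content-Type": "application/json",
    }
    body = [{"Text": w} for w in words]
    data = requests.post(url, params=params, headers=headers, json=body).json()
    if isinstance(data, dict):  # error payloads come back as an object, not a list
        raise RuntimeError(f"Translator API error: {data}")
    return {w: item["text"] for w, item in zip(words, data)}
```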
Hi, I followed the instructions provided in the README file to download the data. I did not encounter any errors during the downloading of the data. However, on testing I am getting the following message when I submit the results.
QA_EN_HI: Missing prediction for 234
Error. Either examples are missing or ids of instances are incorrect
Also, I tried downloading the data again, but the error is the same.
For the NLI datasets, the training data has two labels: entailment and contradictory. But the gold labels of the test set appear to have three labels! I submitted two results.zip files, one with all entailment and one with all contradictory, and both scored 33.3%, so the test set must contain a third label (perhaps "neutral"). Is the NLI task really trained on two-label data and tested on three-label data? Please check your datasets carefully.
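Before assuming the data is wrong, it may help to print the label inventory of each split directly. A quick sketch, assuming the processed NLI files are tab-separated with the label in the last column (the layout and paths are assumptions; adjust them to the actual GLUECoS format):

```python
# Hedged check of the label set per split; the TSV-with-label-last layout and
# the paths below are assumptions, not the confirmed GLUECoS file format.
from collections import Counter

def label_counts(path):
    with open(path, encoding="utf-8") as f:
        return Counter(line.rstrip("\n").split("\t")[-1] for line in f if line.strip())

for split in ("train.txt", "validation.txt"):
    print(split, label_counts(f"Data/Processed_Data/NLI_EN_HI/{split}"))
```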
Hi, I'm trying to run ./download_data.sh $SUBSCRIPTION_KEY and I ran into some issues. I've tried both indictrans and the Microsoft Translator subscription key. With the indictrans package, I seem to be running into issues with ndarray shapes not matching (I think this is an issue with indictrans itself). With the subscription key passed, I get the following output:
Failed to import No module named 'indictrans'
./download_data.sh: line 50: wget: command not found
./download_data.sh: line 54: wget: command not found
./download_data.sh: line 59: wget: command not found
unzip: cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_HI/temp/ICON_POS.zip, /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_HI/temp/ICON_POS.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_HI/temp/ICON_POS.zip.ZIP.
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_hi.py", line 192, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_hi.py", line 175, in main
make_temp_file(original_path)
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_hi.py", line 11, in make_temp_file
shutil.copy(original_path_validation,new_path_validation)
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 427, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 264, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/LID_EN_HI/temp//HindiEnglish_FIRE2013_AnnotatedDev.txt'
Downloaded LID EN HI
./download_data.sh: line 98: wget: command not found
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_hi.py", line 143, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_hi.py", line 107, in main
make_temp_file(original_path)
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_hi.py", line 11, in make_temp_file
with open(original_path +'/annotatedData.csv','r',encoding='utf-8')as f:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/NER_EN_HI/temp//annotatedData.csv'
Downloaded NER EN HI
./download_data.sh: line 221: wget: command not found
./download_data.sh: line 224: wget: command not found
./download_data.sh: line 227: wget: command not found
./download_data.sh: line 230: wget: command not found
./download_data.sh: line 233: wget: command not found
./download_data.sh: line 236: wget: command not found
./download_data.sh: line 239: wget: command not found
./download_data.sh: line 242: wget: command not found
./download_data.sh: line 245: wget: command not found
./download_data.sh: line 248: wget: command not found
./download_data.sh: line 251: wget: command not found
./download_data.sh: line 254: wget: command not found
./download_data.sh: line 257: wget: command not found
./download_data.sh: line 260: wget: command not found
./download_data.sh: line 263: wget: command not found
./download_data.sh: line 266: wget: command not found
./download_data.sh: line 269: wget: command not found
./download_data.sh: line 272: wget: command not found
./download_data.sh: line 275: wget: command not found
./download_data.sh: line 278: wget: command not found
./download_data.sh: line 281: wget: command not found
./download_data.sh: line 284: wget: command not found
./download_data.sh: line 287: wget: command not found
./download_data.sh: line 290: wget: command not found
./download_data.sh: line 293: wget: command not found
./download_data.sh: line 296: wget: command not found
./download_data.sh: line 299: wget: command not found
./download_data.sh: line 302: wget: command not found
./download_data.sh: line 305: wget: command not found
./download_data.sh: line 308: wget: command not found
./download_data.sh: line 311: wget: command not found
./download_data.sh: line 314: wget: command not found
./download_data.sh: line 317: wget: command not found
./download_data.sh: line 320: wget: command not found
./download_data.sh: line 323: wget: command not found
./download_data.sh: line 326: wget: command not found
./download_data.sh: line 329: wget: command not found
./download_data.sh: line 332: wget: command not found
./download_data.sh: line 335: wget: command not found
./download_data.sh: line 338: wget: command not found
./download_data.sh: line 341: wget: command not found
./download_data.sh: line 344: wget: command not found
./download_data.sh: line 347: wget: command not found
./download_data.sh: line 350: wget: command not found
./download_data.sh: line 353: wget: command not found
./download_data.sh: line 356: wget: command not found
./download_data.sh: line 359: wget: command not found
./download_data.sh: line 362: wget: command not found
./download_data.sh: line 365: wget: command not found
./download_data.sh: line 368: wget: command not found
./download_data.sh: line 371: wget: command not found
./download_data.sh: line 374: wget: command not found
./download_data.sh: line 377: wget: command not found
./download_data.sh: line 380: wget: command not found
./download_data.sh: line 383: wget: command not found
./download_data.sh: line 386: wget: command not found
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_es.py", line 104, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_es.py", line 90, in main
make_split_file(id_dir+'/train_ids.txt','temp_word.txt',new_path+'/train.txt',mode='train')
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_es.py", line 45, in make_split_file
with open(input_file,'r') as infile:
FileNotFoundError: [Errno 2] No such file or directory: 'temp_word.txt'
Downloaded POS EN ES
./download_data.sh: line 140: wget: command not found
unzip: cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_FG/temp/ICON_POS.zip, /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_FG/temp/ICON_POS.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_FG/temp/ICON_POS.zip.ZIP.
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_hi_fg.py", line 63, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_hi_fg.py", line 43, in main
shutil.copy(original_path+'Romanized/train.txt',new_path+'Romanized/train.txt')
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 427, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 264, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_FG/temp/ICON_POS/Processed Data/Romanized/train.txt'
Downloaded POS EN HI FG
./download_data.sh: line 171: wget: command not found
unzip: cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/Sentiment_EN_HI/temp/SAIL_2017.zip, /Users/iglee/GLUECoS/Data/Original_Data/Sentiment_EN_HI/temp/SAIL_2017.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/Sentiment_EN_HI/temp/SAIL_2017.zip.ZIP.
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_sent_en_hi.py", line 63, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_sent_en_hi.py", line 43, in main
shutil.copy(original_path+'Romanized/train.txt',new_path+'Romanized/train.txt')
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 427, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 264, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/Sentiment_EN_HI/temp/SAIL_2017/Processed Data/Romanized/train.txt'
Downloaded Sentiment EN HI
./download_data.sh: line 187: wget: command not found
Downloaded QA EN HI
./download_data.sh: line 113: wget: command not found
unzip: cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_UD/temp/master.zip, /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_UD/temp/master.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_UD/temp/master.zip.ZIP.
(Patch is indented 4 spaces.)
patch: **** Can't find file /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_UD/temp/UD_Hindi_English-master/crawl_tweets.py : No such file or directory
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_hi_ud.py", line 179, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_hi_ud.py", line 164, in main
scrape_tweets(original_path)
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_hi_ud.py", line 16, in scrape_tweets
os.chdir(original_path)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_UD/temp/UD_Hindi_English-master'
Downloaded POS EN HI UD
./download_data.sh: line 200: wget: command not found
unzip: cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/NLI_EN_HI/temp/all_keys_json.zip, /Users/iglee/GLUECoS/Data/Original_Data/NLI_EN_HI/temp/all_keys_json.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/NLI_EN_HI/temp/all_keys_json.zip.ZIP.
./download_data.sh: line 207: wget: command not found
./download_data.sh: line 207: wget: command not found
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/iglee/opt/anaconda3/lib/python3.9/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/Users/iglee/opt/anaconda3/lib/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/Users/iglee/opt/anaconda3/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/iglee/opt/anaconda3/lib/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_nli_en_hi.py", line 125, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_nli_en_hi.py", line 113, in main
process_files(original_path+'all_keys_json/Final_Key.json',args.data_dir+'/NLI_EN_HI/temp/all_only_id.json')
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_nli_en_hi.py", line 12, in process_files
with open(final_key_path,'r') as infile:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/NLI_EN_HI/temp/all_keys_json/Final_Key.json'
Downloaded NLI EN HI
./download_data.sh: line 156: wget: command not found
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_sent_en_es.py", line 218, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_sent_en_es.py", line 197, in main
download_tweets(tweet_keys,original_path)
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_sent_en_es.py", line 12, in download_tweets
lines = [line.strip() for line in open(original_path_text,'r').readlines()]
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/Sentiment_EN_ES/temp//cs-en-es-corpus-wassa2015.txt'
Downloaded Sentiment EN ES
./download_data.sh: line 26: wget: command not found
./download_data.sh: line 29: wget: command not found
./download_data.sh: line 33: wget: command not found
unzip: cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_ES/temp/Release.zip, /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_ES/temp/Release.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_ES/temp/Release.zip.ZIP.
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_es.py", line 168, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_es.py", line 139, in main
download_tweets(original_path)
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_es.py", line 13, in download_tweets
shutil.copy('twitter_authentication.txt',original_path+'/Release/twitter_auth.txt')
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 427, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 266, in copyfile
with open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/LID_EN_ES/temp//Release/twitter_auth.txt'
Downloaded LID EN ES
./download_data.sh: line 75: wget: command not found
./download_data.sh: line 78: wget: command not found
./download_data.sh: line 82: wget: command not found
unzip: cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/NER_EN_ES/temp/Release.zip, /Users/iglee/GLUECoS/Data/Original_Data/NER_EN_ES/temp/Release.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/NER_EN_ES/temp/Release.zip.ZIP.
Traceback (most recent call last):
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_es.py", line 152, in <module>
main()
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_es.py", line 134, in main
download_tweets(original_path)
File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_es.py", line 13, in download_tweets
shutil.copy('twitter_authentication.txt',original_path+'/Release/twitter_auth.txt')
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 427, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 266, in copyfile
with open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/NER_EN_ES/temp//Release/twitter_auth.txt'
Downloaded NER EN ES
I'm wondering why it's still trying to use indictrans despite my passing the subscription key. If someone could help me with this, I'd really appreciate it. Thanks!
Hi, I have observed that many tweets were not downloaded because they have been deleted. Also, for some strange reason, for some tasks the number of sentences is higher than the number reported in the paper (specifically for the English-Spanish datasets). Please find below a table comparing the statistics reported in the paper with the statistics obtained after downloading the data.
English-Hindi

| Corpus | Train (Paper) | Train (Downloaded) | Dev (Paper) | Dev (Downloaded) | Test (Paper) | Test (Downloaded) |
|---|---|---|---|---|---|---|
| FIRE LID (D) | 2631 | 2098 | 500 | 500 | 406 | 406 |
| UD POS (D) | 1384 | 1344 | 215 | 209 | 215 | 225 |
| FG POS (R) | 2104 | 2098 | 263 | 261 | 264 | 264 |
| IIITH NER (R) | 2467 | 2467 | 308 | 308 | 309 | 307 |
| SAIL Sentiment (R) | 10080 | 10080 | 1260 | 1260 | 1261 | 1261 |

English-Spanish

| Corpus | Train (Paper) | Train (Downloaded) | Dev (Paper) | Dev (Downloaded) | Test (Paper) | Test (Downloaded) |
|---|---|---|---|---|---|---|
| EMNLP 2014 | 10259 | 7192 | 1140 | 824 | 3014 | 2981 |
| Bangor POS | 2192 | 2167 | 274 | 269 | 274 | 270 |
| CALCS NER | 27366 | 28381 | 3420 | 3537 | 3421 | 3577 |
| Sentiment | 1681 | 1851 | 211 | 231 | 211 | 232 |
Could this inconsistency in the datasets lead to an unfair comparison of results?
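For anyone who wants to reproduce the counts above on their own download, a small helper like the following works for the token-level tasks, assuming the processed files are CoNLL-style with a blank line between sentences (sentence-level tasks are one example per line); the path is illustrative:

```python
# Hedged sentence counter for CoNLL-style splits (blank line = sentence boundary).
def count_sentences(path):
    count, in_sentence = 0, False
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                in_sentence = True
            elif in_sentence:
                count += 1
                in_sentence = False
    return count + (1 if in_sentence else 0)

for split in ("train", "validation", "test"):
    print(split, count_sentences(f"Data/Processed_Data/POS_EN_HI_UD/Devanagari/{split}.txt"))
```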
Applying for a Twitter developer account is not easy, which prevents users from downloading the dataset.
Would you mind sharing the data via Google Drive or another online storage platform?
Hi,
I followed the README and ran the download_data.sh script.
I noticed there are plenty of cases where I get tweet missing error/warning.
An example:
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 424300532652462080
Tweet doesn't exist: 427482958912450560
Tweet doesn't exist: 427482958912450560
Tweet doesn't exist: 427482958912450560
Tweet doesn't exist: 427482958912450560
Tweet doesn't exist: 427482958912450560
Tweet doesn't exist: 427482958912450560
Is it possible for you to share the script for preparing the data to reproduce the experimental setup for the modified mBERT?
It requires additional code-switching data preparation.
Thank you!
Hi,
Thanks for setting up this repo and the leaderboard. I had the following questions regarding the machine translation task.
Hi, I'm getting the following errors on running the first script, download_data.sh
ICON_POS.zip 100%[=================================================>] 588.98K 436KB/s in 1.3s
Traceback (most recent call last):
File "/scratch/aditya.srivastava/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_hi.py", line 192, in <module>
main()
File "/scratch/aditya.srivastava/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_hi.py", line 175, in main
make_temp_file(original_path)
File "/scratch/aditya.srivastava/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_hi.py", line 11, in make_temp_file
shutil.copy(original_path_validation,new_path_validation)
File "/home/aditya.srivastava/opt/Python-3.8.2/lib/python3.8/shutil.py", line 415, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/home/aditya.srivastava/opt/Python-3.8.2/lib/python3.8/shutil.py", line 261, in copyfile
with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/aditya.srivastava/GLUECoS/Data/Original_Data/LID_EN_HI/temp//HindiEnglish_FIRE2013_AnnotatedDev.txt'
Downloaded LID EN HI
annotatedData.csv 100%[=================================================>] 1.52M --.-KB/s in 0.1s
Can someone help me resolve this?
Hi!
I was trying to replicate the results for the NLI task using the multilingual BERT model. The GLUECoS paper says mBERT gives 61.09, or 57.74 according to the leaderboard. When I run the sample NLI script here with default parameters, my test accuracy comes out very low, around 33. Could anyone confirm whether the numbers in the paper are for this baseline? Also, I see the data was updated; are these numbers for an older version of the data?
Thanks!
Even with the same number of lines as in the test set, I am getting the 'lines mismatch' error for the Sentiment_EN_HI Romanized test dataset.
It would be helpful if you could add to the README that spaCy version 2.1.0 is required to run DrQA. This isn't mentioned in their repo either, and that repository seems dead, so it probably won't be updated there anyway. If one tries to run their script, it installs spaCy v3, which is incompatible with their code. I had to find the correct spaCy version by looking up the date at which their scripts were written and comparing it to the version release dates.
As the title says, is there any way to add an evaluation script for the transliteration task? I am currently working on creating a transliteration dataset and training a neural model on the extracted data, and I wanted to use this framework, but it doesn't look like transliteration is part of it. Is any work happening in this direction?
For transliteration from Roman to Devanagari, it would be convenient to add support for more transliteration services.
Are the trained models for each task publicly available? Is it possible for the authors to share them? Specifically, I'm looking for code-mixed NMT and LID.
TIA!