When preprocessing the raw Inuktitut data, the dev and test sets seem to contain an abnormally high <unk>
replacement ratio (about 4% for en and 33-34% for iu, versus 0.0% on the training set):
(fairseq-py3.8) [jonnesaleva@gpu-1-1 hansard]$ cat preprocess.log
Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=True, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/en-iu/en_sp32k_iu_sp32k/hansard', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='en', srcdict=None, target_lang='iu', task='translation', tensorboard_logdir=None, testpref='/home/jonnesaleva/datasets/mrl_nmt22/processed/en-iu/en_sp32k_iu_sp32k/hansard/en-iu.test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, tpu=False, trainpref='/home/jonnesaleva/datasets/mrl_nmt22/processed/en-iu/en_sp32k_iu_sp32k/hansard/en-iu.train', user_dir=None, validpref='/home/jonnesaleva/datasets/mrl_nmt22/processed/en-iu/en_sp32k_iu_sp32k/hansard/en-iu.dev', workers=40)
[en] Dictionary: 32096 types
[en] /home/jonnesaleva/datasets/mrl_nmt22/processed/en-iu/en_sp32k_iu_sp32k/hansard/en-iu.train.en: 1293439 sents, 22538091 tokens, 0.0% replaced by <unk>
[en] Dictionary: 32096 types
[en] /home/jonnesaleva/datasets/mrl_nmt22/processed/en-iu/en_sp32k_iu_sp32k/hansard/en-iu.dev.en: 2674 sents, 74173 tokens, 4.0% replaced by <unk>
[en] Dictionary: 32096 types
[en] /home/jonnesaleva/datasets/mrl_nmt22/processed/en-iu/en_sp32k_iu_sp32k/hansard/en-iu.test.en: 3602 sents, 98545 tokens, 4.06% replaced by <unk>
[iu] Dictionary: 32256 types
[iu] /home/jonnesaleva/datasets/mrl_nmt22/processed/en-iu/en_sp32k_iu_sp32k/hansard/en-iu.train.iu: 1293439 sents, 16342339 tokens, 0.0% replaced by <unk>
[iu] Dictionary: 32256 types
[iu] /home/jonnesaleva/datasets/mrl_nmt22/processed/en-iu/en_sp32k_iu_sp32k/hansard/en-iu.dev.iu: 2674 sents, 56715 tokens, 33.5% replaced by <unk>
[iu] Dictionary: 32256 types
[iu] /home/jonnesaleva/datasets/mrl_nmt22/processed/en-iu/en_sp32k_iu_sp32k/hansard/en-iu.test.iu: 3602 sents, 80675 tokens, 34.3% replaced by <unk>
Wrote preprocessed data to data-bin/en-iu/en_sp32k_iu_sp32k/hansard
iu-BLEU-13a-mixed 2.17573
iu-BLEU-intl-mixed 2.15823
iu-BLEU-char-mixed 0.96862
iu-BLEU-spm-mixed 1.52165
iu-BLEU-none-mixed 0.03631
iu-BLEU-13a-lc 2.17573
iu-BLEU-intl-lc 2.15823
iu-BLEU-char-lc 0.96862
iu-BLEU-spm-lc 1.52213
iu-BLEU-none-lc 0.03631
iu-CHRF3-min1-max6 4.43602
iu-CHRF2-char6-word0 3.17176
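
One way to narrow down where the <unk> replacements in the dev and test sets come from is to compare the segmented text against the dictionary directly and list the most frequent out-of-vocabulary pieces. The sketch below is only an illustration, not part of the original pipeline: the helper names `load_vocab` and `oov_stats` are hypothetical, and it assumes the dictionary is a plain-text file with one `<symbol> <count>` pair per line and that the text is already segmented into space-separated pieces. The ratio may differ slightly from the one in the log, since fairseq may also count an end-of-sentence symbol per line.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def load_vocab(dict_lines: Iterable[str]) -> Set[str]:
    """Collect symbols from lines assumed to look like `<symbol> <count>`."""
    vocab = set()
    for line in dict_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Treat everything before the last space as the symbol.
        symbol, sep, _count = line.rpartition(" ")
        vocab.add(symbol if sep else line)
    return vocab


def oov_stats(
    text_lines: Iterable[str], vocab: Set[str], top_k: int = 10
) -> Tuple[float, List[Tuple[str, int]]]:
    """Return the fraction of tokens missing from vocab and the top_k offenders."""
    total = 0
    missing = Counter()
    for line in text_lines:
        for tok in line.split():
            total += 1
            if tok not in vocab:
                missing[tok] += 1
    ratio = sum(missing.values()) / total if total else 0.0
    return ratio, missing.most_common(top_k)


if __name__ == "__main__":
    # Toy in-memory data; real use would pass open file handles instead.
    toy_dict = ["the 10", "cat 5", "s 3"]
    toy_text = ["the cat s", "the dog"]
    ratio, top = oov_stats(toy_text, load_vocab(toy_dict))
    print(f"{ratio:.1%} OOV", top)
```

For the data above, the inputs would be the segmented en-iu.dev.iu file and the iu dictionary written to the destdir (assumed here to be named dict.iu.txt). If the most frequent missing pieces look like a different segmentation or script than the training data, that would point at a mismatch in how dev/test were processed rather than at genuinely rare tokens.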