Comments (8)
Hi @datarpit, thank you for your comment. The baseline reported in README does use num_train_epochs=1, but please feel free to do further hyperparameter search.
The GPT2-based model does seem to have a bit of train/inference variability, but it's odd that the trained model is achieving accuracy less than 1%. Which metric (act/slot F1/prec/recall) are you referring to? Also please note that the reported numbers are fractions (not percentages), so the maximum value is 1.0 (= 100%) for all metrics.
from simmc.
The numbers I see are consistently low, around 0.0012. During inference, for most of devtest the model doesn't predict anything for the belief state, only the system response. When I increased the number of epochs to 100, the results changed to those below, but they are still much lower than what the README describes.
fashion
{
"joint_accuracy": 0.14777937901218394,
"act_rec": 0.2267784619415695,
"act_prec": 0.7445161290322581,
"act_f1": 0.34766017272544686,
"slot_rec": 0.24814931485273273,
"slot_prec": 0.7452696310312205,
"slot_f1": 0.37232659813304975
}
fashion_to
{
"joint_accuracy": 0.052142014935150006,
"act_rec": 0.09236211188261496,
"act_prec": 0.7654723127035831,
"act_f1": 0.16483516483516483,
"slot_rec": 0.09867695700110253,
"slot_prec": 0.6976614699331849,
"slot_f1": 0.17289913067476195
}
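As a quick sanity check on the reports above: F1 is the harmonic mean of precision and recall, so each reported F1 can be recomputed from the two values next to it. The sketch below does this for the `fashion` act metrics. The helper name `f1_score` is illustrative and is not part of the SIMMC evaluation script.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the "fashion" report above.
act_prec = 0.7445161290322581
act_rec = 0.2267784619415695
reported_act_f1 = 0.34766017272544686

recomputed = f1_score(act_prec, act_rec)
print(f"recomputed={recomputed:.6f} reported={reported_act_f1:.6f}")
```

High precision combined with low recall is consistent with a model that emits no belief state for most turns: the few predictions it makes are often right, but most targets are missed.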
Another thing I wanted to mention is that the evaluation script seems to pick targets from a folder gpt2_dst/data/v2; however, there is no such folder, and I had to change the script to pick from gpt2_dst/data/. Can you please check that everything is in order and that there isn't a silly bug?
Hi @datarpit, thank you for sharing the results. Would you mind sharing the train configuration you used as well (e.g. --n_gpu, --nocuda, batchsize, fp16 training, etc.), and perhaps the version of the gpt2 model please? Alternatively, if you have a log file of the training process, I'll take a look. We'll soon share the baseline checkpoints to mitigate the issue for now.
Also, thank you for catching gpt2_dst/data/v2; it should read gpt2_dst/data, and the fix is now pushed.
I didn't change anything in the script, except that I ran it with CUDA_VISIBLE_DEVICES=0. Sharing the baseline checkpoints would be great.
Same issue. I ran the baseline using the code in the repo (without any changes), and the F1 numbers that I see are lower than what's reported.
Here's what I got for "Fashion (multimodal)".
Obtained results:
~/simmc/mm_dst/gpt2_dst/results/fashion
❯ jq . fashion_devtest_dials_report.json
{
"joint_accuracy": 0.06052666055286257,
"act_rec": 0.09603039434036421,
"act_prec": 0.5154711673699015,
"act_f1": 0.16189950303699616,
"slot_rec": 0.08459525885990644,
"slot_prec": 0.5049692380501657,
"slot_f1": 0.1449137579790846
}
Expected/Reported baseline results:
Baseline | Dialog Act F1 | Slot F1 |
---|---|---|
GPT2 - Fashion (multimodal) | 44.3 | 46.6 |
Note:
I ran this on just 1 GPU, because multi-GPU training was throwing the following error during the eval step.
07/17/2020 00:15:53 - INFO - __main__ - ***** Running evaluation *****
07/17/2020 00:15:53 - INFO - __main__ - Num examples = 3513
07/17/2020 00:15:53 - INFO - __main__ - Batch size = 32
Evaluating: 99%|█████████▉| 109/110 [00:26<00:00, 4.18it/s]
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ec2-user/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/chetnaik/simmc/mm_dst/gpt2_dst/scripts/run_language_modeling.py", line 821, in <module>
main()
File "/home/chetnaik/simmc/mm_dst/gpt2_dst/scripts/run_language_modeling.py", line 813, in main
result = evaluate(args, model, tokenizer, prefix=prefix)
File "/home/chetnaik/simmc/mm_dst/gpt2_dst/scripts/run_language_modeling.py", line 459, in evaluate
outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
return self.gather(outputs, self.output_device)
File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/cuda/comm.py", line 165, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [2, 1, 12, 168, 64], but expected [2, 4, 12, 168, 64]
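One possible reading of the error above, offered as a sketch rather than a confirmed diagnosis: the log shows 3513 examples with batch size 32, so the last batch is smaller than the rest, and `DataParallel` splits it unevenly across devices. The GPT-2 cached key/value tensors have the shape `[2, batch, heads, seq_len, head_dim]`, so their batch size sits in dim 1, while `DataParallel` gathers along dim 0 and needs all other dims to match. The arithmetic below assumes 8 GPUs (inferred from batch size 32 and the expected per-GPU size of 4, not stated in the thread) and a `torch.chunk`-style split. `chunk_sizes` is a hypothetical helper written in plain Python.

```python
import math


def chunk_sizes(n_items: int, n_chunks: int) -> list:
    """Sizes produced by a torch.chunk-style split: ceil-sized chunks, last one smaller."""
    size = math.ceil(n_items / n_chunks)
    sizes = []
    remaining = n_items
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


num_examples = 3513   # "Num examples" from the log
batch_size = 32       # "Batch size" from the log
assumed_gpus = 8      # ASSUMPTION: inferred, not stated in the thread

num_batches = math.ceil(num_examples / batch_size)
last_batch = num_examples - (num_batches - 1) * batch_size
per_gpu = chunk_sizes(last_batch, assumed_gpus)

print(f"batches={num_batches} last_batch={last_batch} per_gpu={per_gpu}")
```

Under these assumptions the final batch holds 25 examples, and one replica receives a single example while the others receive 4, which matches the `got [2, 1, ...] but expected [2, 4, ...]` shapes in the message.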
Hi @chetannaik, re. multi-GPU crashes I've seen the same and have a patch which I'll try to push this week.
Hi @chetannaik, the patch for the issue above has just been pushed.
@datarpit @chetannaik - please take a look at the model snapshots for the MM-DST baselines (link), which should give a good starting point. Please feel free to fine-tune them further. You can download them and put them under /simmc/mm_dst/save/.
The README file has been updated with the results obtained with these snapshots (trained with 2 GPUs; you can load training_args.bin for more details). Since the n_gpu of the machine effectively changes the batch size for training (to which the GPT2 model is very sensitive), it is recommended that you find the epoch count and batch size that work best (among other hyperparameters) to avoid overfitting and underfitting. Please feel free to re-open this if the issue persists after a hyperparameter sweep. Thank you!
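To make the point about n_gpu concrete, here is a minimal sketch of the arithmetic. It assumes the training script follows the common convention of taking a per-GPU batch size and multiplying it by the number of GPUs and any gradient accumulation steps. The function and parameter names are illustrative, not taken from the SIMMC scripts.

```python
def effective_batch_size(per_gpu_batch_size: int,
                         n_gpu: int,
                         gradient_accumulation_steps: int = 1) -> int:
    """Number of examples contributing to each optimizer step (assumed convention)."""
    # max(1, n_gpu) keeps the CPU-only case (n_gpu == 0) meaningful.
    return per_gpu_batch_size * max(1, n_gpu) * gradient_accumulation_steps


# Same per-GPU setting, different machines: the optimizer sees different batches.
for gpus in (1, 2, 8):
    print(f"n_gpu={gpus} -> effective batch size {effective_batch_size(4, gpus)}")
```

With the same command line, a 1-GPU run and a 2-GPU run therefore train with different effective batch sizes, so the epoch count or learning rate may need retuning when the hardware changes.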
I encountered the same issue, and I was using Transformers v3.0.2.
Switching to v2.8.0 seems to solve the problem.
@shanemoon Can you share your exact version? Thanks. 😃
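Since the results seem to depend on the installed library version, a small guard at the top of a run script can flag a mismatch early. This is a sketch using only the standard library (`importlib.metadata`, Python 3.8+). `check_pinned` is a hypothetical helper, and the pinned version 2.8.0 is simply the one reported to work in the comment above.

```python
from importlib import metadata


def check_pinned(package: str, expected: str) -> str:
    """Return 'ok', 'mismatch', or 'missing' for an installed package version."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return "missing"
    return "ok" if installed == expected else "mismatch"


# 2.8.0 is the version reported to work in this thread.
status = check_pinned("transformers", "2.8.0")
print(f"transformers version check: {status}")
```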