rowanz / r2c

Recognition to Cognition Networks (code for the model in "From Recognition to Cognition: Visual Commonsense Reasoning", CVPR 2019)

Home Page: https://visualcommonsense.com

License: MIT License

Python 100.00%
visual reasoning vision vcr visual-commonsense-reasoning commonsense

r2c's Introduction

From Recognition to Cognition: Visual Commonsense Reasoning (CVPR 2019 Oral)

This repository contains data and PyTorch code for the paper From Recognition to Cognition: Visual Commonsense Reasoning (arXiv). For more info, check out the project page at visualcommonsense.com. For updates, or to ask for help, check out and join our Google group!

visualization

This repo should be ready to replicate my results from the paper. If you have any issues getting it set up, please file a GitHub issue. Still, the paper is just an arXiv version, so there might be more updates in the future. I'm super excited about VCR, but it should be viewed as knowledge that's still in the making :)

Background on the Recognition to Cognition model

This repository is for the new task of Visual Commonsense Reasoning. A model is given an image, objects, a question, and four answer choices. The model has to decide which answer choice is correct. Then, it's given four rationale choices, and it has to decide which of those is the best rationale that explains why its answer is right.
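
To make the task format concrete, here is a rough Python sketch of what a single example carries (the field names and values are illustrative only, not necessarily the exact keys of the released annotations):

# Hypothetical illustration of one VCR example; field names are illustrative only.
example = {
    "img_fn": "some_movie/some_frame.jpg",           # image the question is about
    "objects": ["person", "person", "dog"],          # detected object tags, referenced by index
    "question": ["Why", "is", [0], "smiling", "?"],  # tokens; [0] points at the first object
    "answer_choices": [...],                         # four candidate answers (same token format)
    "answer_label": 2,                               # index of the correct answer
    "rationale_choices": [...],                      # four candidate rationales
    "rationale_label": 0,                            # index of the correct rationale
}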

In particular, I have code and checkpoints for the Recognition to Cognition (R2C) model, as discussed in the VCR paper. Here's a diagram that explains what's going on:

modelfig

We'll treat going from Q->A and QA->R as two separate tasks: in each, the model is given a 'query' (question, or question+answer) and 'response choices' (answer, or rationale). Essentially, we'll use BERT and detection regions to ground the words in the query, then contextualize the query with the response. We'll perform several steps of reasoning on top of a representation consisting of the response choice in question, the attended query, and the attended detection regions. See the paper for more details.
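
As a rough illustration of the attention flow described above, here is a self-contained toy sketch (this is not the actual R2C code in models/multiatt/model.py; the dimensions, module choices, and pooling are placeholders):

# Toy sketch: each response word attends over the query words and over the
# detection regions, and an RNN "reasons" over the fused representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyQueryResponseAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.reason = nn.LSTM(3 * dim, dim, batch_first=True)
        self.classify = nn.Linear(dim, 1)

    def forward(self, query, response, objects):
        # query:    (batch, n_query_words, dim)  question (or question+answer) features
        # response: (batch, n_resp_words, dim)   response-choice features
        # objects:  (batch, n_objects, dim)      detection-region features
        att_q = F.softmax(torch.einsum('brd,bqd->brq', response, query), dim=-1)
        att_o = F.softmax(torch.einsum('brd,bod->bro', response, objects), dim=-1)
        attended_query = att_q @ query
        attended_objects = att_o @ objects
        fused = torch.cat([response, attended_query, attended_objects], dim=-1)
        out, _ = self.reason(fused)                        # "reasoning" steps (one LSTM here)
        return self.classify(out.mean(dim=1)).squeeze(-1)  # one logit per response choice

# Score four random answer choices for one (question, image) pair.
model = TinyQueryResponseAttention(dim=64)
logits = model(torch.randn(4, 10, 64), torch.randn(4, 7, 64), torch.randn(4, 5, 64))
print(logits.shape)  # torch.Size([4]); softmax over these to pick an answer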

What this repo has / doesn't have

I have code and checkpoints for replicating my R2C results. You might find the dataloader useful (in dataloaders/vcr.py), as it handles loading the data in a nice way using the allennlp library. You can submit to the leaderboard using my script in models/eval_for_leaderboard.py.

You can train a model using models/train.py. This also has code to obtain model predictions. Use models/eval_q2ar.py to get validation results combining the Q->A and QA->R components (a rough sketch of that combination is below).
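
For reference, the joint Q->AR metric counts an example as correct only when both the answer and the rationale are predicted correctly. A minimal sketch of that combination (the file names and array shapes below are assumptions for illustration, not the exact eval_q2ar.py interface):

import numpy as np

# Assumed inputs: per-example scores over the 4 choices, plus ground-truth labels.
answer_probs = np.load('answer_preds.npy')          # (n_examples, 4)
rationale_probs = np.load('rationale_preds.npy')    # (n_examples, 4)
answer_labels = np.load('answer_labels.npy')        # (n_examples,)
rationale_labels = np.load('rationale_labels.npy')  # (n_examples,)

answer_hits = answer_probs.argmax(axis=1) == answer_labels
rationale_hits = rationale_probs.argmax(axis=1) == rationale_labels
print('Q->A  accuracy:', answer_hits.mean())
print('QA->R accuracy:', rationale_hits.mean())
print('Q->AR accuracy:', (answer_hits & rationale_hits).mean())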

Setting up and using the repo

  1. Get the dataset. Follow the steps in data/README.md. This includes the steps to get the pretrained BERT embeddings. Note (as of Jan 23rd) you'll need to re-download the test embeddings if you downloaded them before, as there was a bug in the version I had uploaded (essentially the 'anonymized' code didn't condition on the right context).

  2. Install CUDA 9.0 if it's not available already. You might want to follow this guide, but using CUDA 9.0. I use the following commands (my OS is Ubuntu 16.04):

wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda_9.0.176_384.81_linux-run
chmod +x cuda_9.0.176_384.81_linux-run
./cuda_9.0.176_384.81_linux-run --extract=$HOME
sudo ./cuda-linux.9.0.176-22781540.run
sudo ln -s /usr/local/cuda-9.0/ /usr/local/cuda
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/
  3. Install Anaconda if it's not available already, and create a new environment. You need to install a few things, namely PyTorch 1.0, torchvision (from the layers branch, which has ROI pooling), and allennlp.
wget https://repo.anaconda.com/archive/Anaconda3-5.2.0-Linux-x86_64.sh
conda update -n base -c defaults conda
conda create --name r2c python=3.6
source activate r2c

conda install numpy pyyaml setuptools cmake cffi tqdm scipy ipython mkl mkl-include cython typing h5py pandas nltk spacy numpydoc scikit-learn jpeg

conda install pytorch cudatoolkit=9.0 -c pytorch
pip install git+git://github.com/pytorch/vision.git@24577864e92b72f7066e1ed16e978e873e19d13d

pip install -r allennlp-requirements.txt
pip install --no-deps allennlp==0.8.0
python -m spacy download en_core_web_sm


# this one is optional but it should help make things faster
pip uninstall pillow && CC="cc -mavx2" pip install -U --force-reinstall pillow-simd
  4. If you don't want to train from scratch, download my checkpoints:
wget https://s3-us-west-2.amazonaws.com/ai2-rowanz/r2c/flagship_answer/best.th -P models/saves/flagship_answer/
wget https://s3-us-west-2.amazonaws.com/ai2-rowanz/r2c/flagship_rationale/best.th -P models/saves/flagship_rationale/
  5. That's it! Now to set up the environment, run source activate r2c && export PYTHONPATH=/home/rowan/code/r2c (or wherever you have this directory). A quick environment sanity check is sketched below.
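
A minimal, optional check that the key pieces are importable (a hypothetical helper snippet, not a script shipped with this repo):

import torch

print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())

try:
    import allennlp  # noqa: F401
    print('allennlp: OK')
except ImportError as e:
    print('allennlp missing:', e)

try:
    # the detector code expects the ROI layers from the pinned torchvision commit
    import torchvision.layers  # noqa: F401
    print('torchvision.layers: OK')
except ImportError as e:
    print('torchvision.layers missing; reinstall torchvision from the pinned commit:', e)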

Help

Feel free to open an issue if you encounter trouble getting it to work! Or, post in the Google group.

Bibtex

@inproceedings{zellers2019vcr,
    author = {Zellers, Rowan and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin},
    title = {From Recognition to Cognition: Visual Commonsense Reasoning},
    booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    month = {June},
    year = {2019}
}

r2c's People

Contributors

cclauss, jizecao, rowanz


r2c's Issues

meaning of ctx_answer

I'm looking at the BERT representation of the question. It seems that ctx_answer0, ctx_answer1, ctx_answer2, and ctx_answer3 should all be the same representation of the question for each sample, but in the downloaded BERT data they are not. Why does this happen?

KeyError: "Unable to open object (object '194190' doesn't exist)"

Thanks for your great code.
I got the error below.
I don't know what went wrong or what I can do about it.
I hope I can get some help.

0%| Traceback (most recent call last):
  File "train.py", line 119, in <module>
    for b, (time_per_batch, batch) in enumerate(time_batch(train_loader if args.no_tqdm else tqdm(train_loader), reset_every=ARGS_RESET_EVERY)):
  File "/home/songzijie/r2cmaster/utils/pytorch_misc.py", line 29, in time_batch
    for i, item in enumerate(gen):
  File "/home/songzijie/.conda/envs/r2c/lib/python3.6/site-packages/tqdm/std.py", line 1127, in __iter__
    for obj in iterable:
  File "/home/songzijie/.conda/envs/r2c/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/home/songzijie/.conda/envs/r2c/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 606, in _process_next_batch
    raise Exception("KeyError:" + batch.exc_msg)
Exception: KeyError:Traceback (most recent call last):
  File "/home/songzijie/.conda/envs/r2c/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/songzijie/.conda/envs/r2c/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/songzijie/r2cmaster/dataloaders/vcr.py", line 235, in __getitem__
    grp_items = {k: np.array(v, dtype=np.float16) for k, v in h5[str(index)].items()}
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/songzijie/.conda/envs/r2c/lib/python3.6/site-packages/h5py/_hl/group.py", line 264, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object '194190' doesn't exist)"

No corresponding module found in AllenNLP

Hi, thanks for the code for helping load the dataset.

But I found two small issues in the code:
in dataloaders/vcr.py: from allennlp.data.dataset import Batch; Batch cannot be found.
in dataloaders/bert_field.py: from allennlp.data.token_indexers.token_indexer import TokenIndexer, TokenType; TokenType cannot be found.

Maybe it's due to the AllenNLP version used when writing the code. I'm using version 2.10.0.

Many thanks!

Submission to leaderboard

Hi! Can results be submitted to the leaderboard? I tried contacting you via email, but I didn't receive a reply. Thank you!

AttributeError: 'ScatterableList' object has no attribute 'cuda'

I have a problem like this:
Traceback (most recent call last):
File "eval_for_leaderboard.py", line 110, in
batch = _to_gpu(batch)
File "eval_for_leaderboard.py", line 74, in _to_gpu
td[k] = {k2: v.cuda(async=True) for k2, v in td[k].items()} if isinstance(td[k], dict) else td[k].cuda(
AttributeError: 'ScatterableList' object has no attribute 'cuda'

What can I do?

Thank you :)

Import error : undefined symbol issue

Screenshot from 2019-10-24 10-20-22

I've followed the process from creating a new conda environment, but
I cannot get the leaderboard output; I get the import issue about CUDA.

Do you have any idea?

Baseline for Q->AR

First up, thanks for the great work and releasing the code!

I'm trying to repro the baselines from the code, and it works like a charm for the Q->A and QA->R tasks, but I don't see any code for the Q->AR task. Could you please share some details as to how this is computed?

Is the baseline validation accuracy of 43.1 mentioned in the paper for Q->AR task obtained by first running Q->A task, and conditioned on those predicted answers, running QA->R? If so, I believe this would mean the bert_da embeddings for ctx_rationale<i> needs to be recomputed based on the (question + predicted answer) as opposed to what's been precomputed (question + ground-truth answer). To avoid having to pretrain the bert_da embeddings as mentioned in https://github.com/rowanz/r2c/blob/master/data/get_bert_embeddings/README.md, would you be able to share the init_checkpoint file that I could use in extract_features.py?

Thank you!

Corrupted zip file

In a Linux terminal, while unzipping the vcr1images.zip file with the command unzip vcr1images.zip, the process gets killed midway through extraction. I tried to unzip on Windows with WinRAR, and the error shown in the attached screenshot comes up.
Has anyone else faced this problem?

I got an error "OSError: broken data stream when reading image file"

Training ran successfully before,
but now I get the error below.
I don't know what happened.
What should I do? T^T

  0%|▏                                               | 5/1019 [00:45<2:49:41, 10.04s/it]Traceback (most recent call last):
  File "train.py", line 132, in <module>
    for b, (time_per_batch, batch) in enumerate(time_batch(train_loader if args.no_tqdm else tqdm(train_loader), reset_every=ARGS_RESET_EVERY)):
  File "/home/ailab/r2c/utils/pytorch_misc.py", line 29, in time_batch
    for i, item in enumerate(gen):
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/tqdm/std.py", line 1104, in __iter__
    for obj in iterable:
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 568, in __next__
    return self._process_next_batch(batch)
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
OSError: Traceback (most recent call last):
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/ailab/r2c/dataloaders/vcr.py", line 392, in __getitem__
    image = load_image(os.path.join(VCR_IMAGES_DIR, item['img_fn']))
  File "/home/ailab/r2c/dataloaders/box_utils.py", line 15, in load_image
    return default_loader(img_fn)
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/torchvision/datasets/folder.py", line 147, in default_loader
    return pil_loader(path)
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/torchvision/datasets/folder.py", line 130, in pil_loader
    return img.convert('RGB')
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/PIL/Image.py", line 930, in convert
    self.load()
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/PIL/ImageFile.py", line 272, in load
    raise_ioerror(err_code)
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/PIL/ImageFile.py", line 59, in raise_ioerror
    raise IOError(message + " when reading image file")
OSError: broken data stream when reading image file

Loss not decreasing on default config settings

@rowanz
Hi, I am trying to train the model from scratch, but am not able to reproduce the reported results. Specifically, the loss is not decreasing across epochs. I ran it for 20 epochs and the results are below. Has anyone faced such an issue, or does anyone know the possible reason for this? Any suggestions would be a great help. Thank you.

TRAIN EPOCH 0:
loss 1.356284
crl 0.144345
accuracy 0.311996
sec_per_batch 1.702358
hr_per_epoch 1.048369
dtype: float64

Val epoch 0 has acc 0.249 and loss 1.386
Best validation performance so far. Copying weights to 'saves/flagship_rationale/best.th'.

TRAIN EPOCH 1:
loss 1.386393
crl 0.089470
accuracy 0.249471
sec_per_batch 2.008696
hr_per_epoch 1.237022
dtype: float64

Val epoch 1 has acc 0.249 and loss 1.386

TRAIN EPOCH 2:
loss 1.386381
crl 0.075422
accuracy 0.251220
sec_per_batch 1.946174
hr_per_epoch 1.198519
dtype: float64

Epoch 2: reducing learning rate of group 0 to 1.0000e-04.
Val epoch 2 has acc 0.249 and loss 1.386

TRAIN EPOCH 3:
loss 1.386379
crl 0.050537
accuracy 0.248640
sec_per_batch 1.870728
hr_per_epoch 1.152057
dtype: float64

Val epoch 3 has acc 0.249 and loss 1.386

TRAIN EPOCH 4:
loss 1.386330
crl 0.042339
accuracy 0.250779
sec_per_batch 2.006369
hr_per_epoch 1.235589
dtype: float64

Val epoch 4 has acc 0.249 and loss 1.386

TRAIN EPOCH 5:
loss 1.386332
crl 0.037035
accuracy 0.250581
sec_per_batch 1.735174
hr_per_epoch 1.068578
dtype: float64

Val epoch 5 has acc 0.249 and loss 1.386

TRAIN EPOCH 6:
loss 1.386333
crl 0.032566
accuracy 0.249394
sec_per_batch 2.384569
hr_per_epoch 1.468497
dtype: float64

Epoch 6: reducing learning rate of group 0 to 5.0000e-05.
Val epoch 6 has acc 0.249 and loss 1.386

TRAIN EPOCH 7:
loss 1.386345
crl 0.020694
accuracy 0.247829
sec_per_batch 2.088539
hr_per_epoch 1.286192
dtype: float64

Val epoch 7 has acc 0.249 and loss 1.386

TRAIN EPOCH 8:
loss 1.386309
crl 0.017643
accuracy 0.251004
sec_per_batch 1.965981
hr_per_epoch 1.210717
dtype: float64

Val epoch 8 has acc 0.249 and loss 1.386

TRAIN EPOCH 9:
loss 1.386299
crl 0.015537
accuracy 0.251415
sec_per_batch 1.872479
hr_per_epoch 1.153135
dtype: float64

Val epoch 9 has acc 0.249 and loss 1.386

TRAIN EPOCH 10:
loss 1.386302
crl 0.014494
accuracy 0.251420
sec_per_batch 1.644809
hr_per_epoch 1.012928
dtype: float64

Epoch 10: reducing learning rate of group 0 to 2.5000e-05.
Val epoch 10 has acc 0.249 and loss 1.386

TRAIN EPOCH 11:
loss 1.386306
crl 0.009551
accuracy 0.252025
sec_per_batch 1.408009
hr_per_epoch 0.867099
dtype: float64

Val epoch 11 has acc 0.249 and loss 1.386

TRAIN EPOCH 12:
loss 1.386314
crl 0.007876
accuracy 0.250382
sec_per_batch 1.419217
hr_per_epoch 0.874001
dtype: float64

Val epoch 12 has acc 0.249 and loss 1.386

TRAIN EPOCH 13:
loss 1.386337
crl 0.007333
accuracy 0.248957
sec_per_batch 1.800047
hr_per_epoch 1.108529
dtype: float64

Val epoch 13 has acc 0.249 and loss 1.386

TRAIN EPOCH 14:
loss 1.386308
crl 0.006972
accuracy 0.251202
sec_per_batch 1.691500
hr_per_epoch 1.041682
dtype: float64

Epoch 14: reducing learning rate of group 0 to 1.2500e-05.
Val epoch 14 has acc 0.249 and loss 1.386

TRAIN EPOCH 15:
loss 1.386294
crl 0.004941
accuracy 0.250033
sec_per_batch 1.976553
hr_per_epoch 1.217227
dtype: float64

Val epoch 15 has acc 0.249 and loss 1.386

TRAIN EPOCH 16:
loss 1.386299
crl 0.004361
accuracy 0.250594
sec_per_batch 2.385966
hr_per_epoch 1.469357
dtype: float64

Val epoch 16 has acc 0.249 and loss 1.386

TRAIN EPOCH 17:
loss 1.386329
crl 0.004206
accuracy 0.249658
sec_per_batch 2.463118
hr_per_epoch 1.516870
dtype: float64

Val epoch 17 has acc 0.249 and loss 1.386

TRAIN EPOCH 18:
loss 1.386311
crl 0.003819
accuracy 0.249090
sec_per_batch 2.041939
hr_per_epoch 1.257494
dtype: float64

Epoch 18: reducing learning rate of group 0 to 6.2500e-06.
Val epoch 18 has acc 0.249 and loss 1.386

TRAIN EPOCH 19:
loss 1.386334
crl 0.003092
accuracy 0.249248
sec_per_batch 1.784414
hr_per_epoch 1.098902
dtype: float64

Val epoch 19 has acc 0.249 and loss 1.386

Issue with dataset extraction using Colab.

I tried uploading both zipped and unzipped files to Google Drive, and both ways it shows the files in the dataset might be corrupted. And since the dataset is huge, Colab won't support the computation. Could you recommend some way to work on this dataset?

Why not read h5 file in VCR __init__ function

r2c/dataloaders/vcr.py

Lines 229 to 231 in 71ee684

# grp_items = {k: np.array(v, dtype=np.float16) for k, v in self.get_h5_group(index).items()}
with h5py.File(self.h5fn, 'r') as h5:
grp_items = {k: np.array(v, dtype=np.float16) for k, v in h5[str(index)].items()}

Hi @rowanz, thanks for sharing your nice code again!
I am confused about this and want to know why the h5 file is read in __getitem__ rather than in __init__.

Thank you very much!

I used BERT large for pretraining on VCR and encountered the error ResourceExhaustedError: OOM when allocating tensor

I tried using BERT large instead of the BERT model used in the original code, and modified three parameters (hidden size=1024, hidden layers=24, attention heads=16) in the BERT config.
Here's the error log:
https://gist.github.com/AeroXi/d4d273da9f443c0f2cf9f6d6872eeffe
My device is 4x 1080 Ti.
Maybe I can skip domain adaptation and just extract features? However, the generated filenames start with "bert" instead of "bert_da", so I can't use them directly when training r2c, even after changing to the correct filename. Should I make other modifications?

How do you use the test data set?

I want to test the original r2c model and check its accuracy,
but eval_q2ar.py only uses the validation data set.

Is the test dataset only used for eval_for_leaderboard?

Thank you

How can I download vcr v1.0 data?

Hi rowanz, thanks for your great work. But I have a problem: I cannot find the link to download the VCR data, i.e. the images and annotations. I click 'Annotations' and 'Images', but they don't give any response.
image

Error

When I run your code, there is an error 'no module named torchvision.layers', and I cannot use the link

pip install git+git://github.com/pytorch/vision.git@24577864e92b72f7066e1ed16e978e873e19d13d.

When I run sudo apt-get install git-core and pip install git+git://github.com/pytorch/vision.git@24577864e92b72f7066e1ed16e978e873e19d13d, it also does not work.

need ~10 hrs to run one epoch

Hi,

Thanks for your great work! I tried to run your train script in the "models" folder, and it showed that it would take approximately 10 hours to train one epoch. After using the line_profiler tool to check the running-time breakdown of the train script, I found that this line of code takes 90% of the total running time:
for b, (time_per_batch, batch) in enumerate(time_batch(train_loader if args.no_tqdm else tqdm(train_loader), reset_every=ARGS_RESET_EVERY)):

I think this line basically calls the collate_fn() of the DataLoader object and the __getitem__() of the Dataset object. Do you have any idea why it takes so long to run one epoch?

I'm using 4 CPUs with 20GB memory in total and a Tesla V100 on a Google Cloud VM instance.

PS: I also tried to replace all the executions inside that loop with "pass",
for b, (time_per_batch, batch) in enumerate(time_batch(train_loader if args.no_tqdm else tqdm(train_loader), reset_every=ARGS_RESET_EVERY)): pass
and it still needed the same amount of time for one epoch.

eval for test data

Hello,
When I evaluate my model on the test data, could I submit an 'answer_preds.npy' and a 'ration_preds.npy', respectively?

Thank you

Question about adversarial Matching

In the paper, it's said that "each answer appears exactly four times in the dataset". I tried to verify this, but I cannot reach this conclusion from the train.jsonl and val.jsonl files. Can you explain more about this? Besides, does this mean all correct answers appear 4 times, or all answer candidates appear 4 times? Thanks.

cannot restore_checkpoint and resume training

The return values epoch_to_return, val_metric_per_epoch of the restore_checkpoint() function in utils/pytorch_misc.py

return epoch_to_return, val_metric_per_epoch

are always 0, [], even though I am restoring from an existing folder with the same training setup and the output says "Found folder! restoring".

r2c/models/train.py

Lines 102 to 105 in 71ee684

if os.path.exists(args.folder):
print("Found folder! restoring", flush=True)
start_epoch, val_metric_per_epoch = restore_checkpoint(model, optimizer, serialization_dir=args.folder,
learning_rate_scheduler=scheduler)

Cannot create a tensor proto whose content is larger than 2GB

For rationale mode, the repository provides code for calculating BERT embeddings of question and correct-answer pairs only for the train and validation sets, while for the test set, embeddings are calculated for all question/answer pairs. If we try to extend this functionality to the train set, the following error occurs:

Traceback (most recent call last):
  File "extract_features.py", line 245, in <module>
    for result in tqdm(estimator.predict(input_fn, yield_single_examples=True)):
  File "/usr/local/lib/python3.5/dist-packages/tqdm/_tqdm.py", line 1022, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2437, in predict
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2431, in predict
    yield_single_examples=yield_single_examples):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 549, in predict
    input_fn, model_fn_lib.ModeKeys.PREDICT)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1024, in _get_features_from_input_fn
    result = self._call_input_fn(input_fn, mode)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2354, in _call_input_fn
    return input_fn(**kwargs)
  File "/vol/vcr/r2c/data/get_bert_embeddings/vcr_loader.py", line 57, in input_fn
    dtype=tf.int32),
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/constant_op.py", line 207, in constant
    value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/tensor_util.py", line 506, in make_tensor_proto
    "Cannot create a tensor proto whose content is larger than 2GB.")

Issue with directory structure

Hi,

I think the main directory is missing a setup.py file to install this code as a module, which should probably look something like this:

from setuptools import setup, find_packages

setup(
    name='r2c',
    version='0.1',
    packages=find_packages()
)

Also, when I downloaded the data, train.jsonl lies in the vcr1/vcr1annots/ folder and cocoontology.json lies in the dataloaders folder, but this line and this line indicate that dataloaders should be inside the vcr1/vcr1annots/ folder, while the instructions on the website say that we can have a separate folder where the data has been downloaded. Can you please help clarify the confusion?

Thanks!

FileNotFoundError: [Errno 2] No such file or directory: '/disk4/libuwei/r2c-master/data/vcr1images/movieclips_S.W.A.T./[email protected]'

Hi Rowan,

When I run :
python train.py -params multiatt/default.json -folder saves/flagship_answer

There is an error:
FileNotFoundError: [Errno 2] No such file or directory: '/disk4/libuwei/r2c-master/data/vcr1images/movieclips_S.W.A.T./[email protected]'

environment information:
PyTorch version: 1.0.1.post2
CUDA used to build PyTorch: 8.0.44
OS:centos 7.3
GCC version: 5.5
Python version: 3.6
Is CUDA available: yes
GPU: Tesla K20c
Nvidia driver version: 375.66

No idea about params file or any other command line arguments

In the code, a params file is asked for as input, but I cannot see any params file, or even a demo params file, in the code base.

Even the file to be executed is just mentioned, without example command line arguments.

Please help me understand what the params file looks like, or what the path of a demo params file is, so that I can understand what parameters are being passed into the program.

Error

When I run your code, there is an error 'no module named torchvision.layers', and I cannot use the link

pip install git+git://github.com/pytorch/vision.git@24577864e92b72f7066e1ed16e978e873e19d13d.

RuntimeError: copy_if failed to synchronize: the launch timed out and was terminated

I have a problem like this:
RuntimeError: copy_if failed to synchronize: the launch timed out and was terminated

Traceback (most recent call last):
File "train.py", line 125, in
loss = output_dict['loss'].mean() + output_dict['cnn_regularization_loss'].mean()
File "/home/ailab/anaconda2/envs/r2c/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/ailab/r2c/models/multiatt/model.py", line 156, in forward
obj_reps = self.detector(images=images, boxes=boxes, box_mask=box_mask, classes=objects, segms=segms)
File "/home/ailab/anaconda2/envs/r2c/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/ailab/r2c/utils/detector.py", line 112, in forward
box_inds = box_mask.nonzero()
RuntimeError: copy_if failed to synchronize: the launch timed out and was terminated

When I was training this model, it stopped!
What can I do?

Environment:
Titan X
CUDA 9.0

segmentation fault error

I tried to run train.py, but the code fails on this line:
model = Model.from_params(vocab=train.vocab, params=params['model'])
The error information is:
01/10/2019 21:36:36 - INFO - allennlp.common.params - model.initializer = [['.*final_mlp.*weight', {'type': 'xavier_uniform'}], ['.*final_mlp.*bias', {'type': 'zero'}], ['.weight_ih.', {'type': 'xavier_uniform'}], ['.weight_hh.', {'type': 'orthogonal'}], ['.bias_ih.', {'type': 'zero'}], ['.bias_hh.', {'type': 'lstm_hidden_bias'}]]
01/10/2019 21:36:36 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.nn.initializers.Initializer'> from params {'type': 'xavier_uniform'} and extras {}
01/10/2019 21:36:36 - INFO - allennlp.common.params - model.initializer.list.list.type = xavier_uniform
01/10/2019 21:36:36 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
01/10/2019 21:36:36 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
01/10/2019 21:36:36 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.nn.initializers.Initializer'> from params {'type': 'zero'} and extras {}
01/10/2019 21:36:36 - INFO - allennlp.common.params - model.initializer.list.list.type = zero
01/10/2019 21:36:36 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
01/10/2019 21:36:36 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
01/10/2019 21:36:36 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.nn.initializers.Initializer'> from params {'type': 'xavier_uniform'} and extras {}
01/10/2019 21:36:36 - INFO - allennlp.common.params - model.initializer.list.list.type = xavier_uniform
01/10/2019 21:36:36 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
01/10/2019 21:36:36 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
01/10/2019 21:36:36 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.nn.initializers.Initializer'> from params {'type': 'orthogonal'} and extras {}
01/10/2019 21:36:36 - INFO - allennlp.common.params - model.initializer.list.list.type = orthogonal
01/10/2019 21:36:36 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
01/10/2019 21:36:36 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
01/10/2019 21:36:36 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.nn.initializers.Initializer'> from params {'type': 'zero'} and extras {}
01/10/2019 21:36:36 - INFO - allennlp.common.params - model.initializer.list.list.type = zero
01/10/2019 21:36:36 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
01/10/2019 21:36:36 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
01/10/2019 21:36:36 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.nn.initializers.Initializer'> from params {'type': 'lstm_hidden_bias'} and extras {}
01/10/2019 21:36:36 - INFO - allennlp.common.params - model.initializer.list.list.type = lstm_hidden_bias
01/10/2019 21:36:36 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
01/10/2019 21:36:36 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
01/10/2019 21:36:37 - INFO - allennlp.nn.initializers - Initializing parameters
01/10/2019 21:36:37 - INFO - allennlp.nn.initializers - Initializing span_encoder._module._module.weight_ih_l0 using .weight_ih. intitializer
01/10/2019 21:36:37 - INFO - allennlp.nn.initializers - Initializing span_encoder._module._module.weight_hh_l0 using .weight_hh. intitializer
Segmentation fault

I have a problem extracting vcr1images.zip

I downloaded your dataset file "vcr1images.zip" and tried to extract it,
but I got a problem like this:
"an error occurred while extracting files"

What can I do?! Help me please!
Thank you :)

Update: I figured out this problem!
It was the extract command I used.

no module named 'torchvision.layers'

When I run your code, there is an error 'no module named torchvision.layers', and I cannot use the link

pip install git+git://github.com/pytorch/vision.git@24577864e92b72f7066e1ed16e978e873e19d13d.

The solution of
sudo apt-get install git-core
pip install git+git://github.com/pytorch/vision.git@24577864e92b72f7066e1ed16e978e873e19d13d
doesn't work.

Low training accuracy

Hi @rowanz,

I was trying to replicate the training of the R2C model on my end, but training reaches an accuracy of only around 24.9%, as opposed to the upwards of 60% we get from the best checkpoint linked in the README.md file.

The model is being trained on 2 GPUs with all settings at their defaults. The standard output is attached as
train.txt. Could you please help explain what the issue could be and what can be done to get a comparable accuracy? Please let me know if you need any details from my side.

cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Hi, I met a problem like this:
File "train.py", line 131, in
output_dict = model(**batch)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "../models/multiatt/model.py", line 157, in forward
obj_reps = self.detector(images=images, boxes=boxes, box_mask=box_mask, classes=objects, segms=segms)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "../utils/detector.py", line 111, in forward
img_feats = self.backbone(images)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torchvision/models/resnet.py", line 98, in forward
out = self.conv2(out)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 338, in forward
self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

The environment I use:
Python 3.6.6
CUDA 9.0.176
cuDNN 7.5.1
torch 1.1.0
torchvision 0.3.0
I have tried several environment configs (like cuDNN 7.4, torch 1.0, etc.), but none of them works.
What should I do?
Thank you :)
